Daily arXiv Papers - 2025-08-07

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion

Agrima Seth, Monojit Choudhury, Sunayana Sitaram, Kentaro Toyama, Aditya Vashistha, Kalika Bali

Main category: cs.CL

TL;DR: The paper audits GPT-4 Turbo for representational bias in India, revealing overrepresentation of dominant groups in religion and caste, despite diversity prompts. Bias is ‘sticky’ and hard to correct with nudges, suggesting deeper model changes are needed.

Motivation: To explore less-studied dimensions of bias (religion, caste) in LLMs and assess the persistence of bias despite diversity prompts.

Method: Generated 7,200 stories about Indian life events (e.g., weddings) using GPT-4 Turbo, comparing output diversity to census data.

Result: GPT-4 overrepresents dominant groups, showing ‘winner-take-all’ bias. Prompt-based nudges had limited impact.

Conclusion: Diversifying training data isn’t enough; fundamental model changes are required to address deep-seated bias.

Abstract: Representational bias in large language models (LLMs) has predominantly been measured through single-response interactions and has focused on Global North-centric identities like race and gender. We expand on that research by conducting a systematic audit of GPT-4 Turbo to reveal how deeply encoded representational biases are and how they extend to less-explored dimensions of identity. We prompt GPT-4 Turbo to generate over 7,200 stories about significant life events (such as weddings) in India, using prompts designed to encourage diversity to varying extents. Comparing the diversity of religious and caste representation in the outputs against the actual population distribution in India as recorded in census data, we quantify the presence and “stickiness” of representational bias in the LLM for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation, despite prompts intended to encourage representational diversity. Our findings also suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution bias in their training data, and repeated prompt-based nudges have limited and inconsistent efficacy in dislodging these biases. These results suggest that diversifying training data alone may not be sufficient to correct LLM bias, highlighting the need for more fundamental changes in model development. Dataset and Codebook: https://github.com/agrimaseth/How-Deep-Is-Representational-Bias-in-LLMs
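The audit's core comparison is between the group distribution in generated stories and census shares. A minimal sketch of one way to quantify that gap, using total variation distance; the group labels and shares below are illustrative placeholders, not the paper's data:

```python
from collections import Counter

def representation_gap(story_labels, census_shares):
    """Compare group frequencies in generated stories to census shares
    via total variation distance (0 = perfect match, 1 = maximal divergence)."""
    counts = Counter(story_labels)
    total = sum(counts.values())
    groups = set(census_shares) | set(counts)
    return 0.5 * sum(
        abs(counts.get(g, 0) / total - census_shares.get(g, 0.0))
        for g in groups
    )

# Illustrative only: group labels extracted from stories vs. census shares.
labels = ["group_a"] * 95 + ["group_b"] * 5
census = {"group_a": 0.70, "group_b": 0.25, "group_c": 0.05}
print(representation_gap(labels, census))  # 0.25
```

A distance near zero would indicate representation matching the census; the "winner-take-all" finding corresponds to large distances driven by one overrepresented group.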

[2] FeynTune: Large Language Models for High-Energy Theory

Paul Richmond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis

Main category: cs.CL

TL;DR: Specialized LLMs for High-Energy Physics, fine-tuned from Llama-3.1, outperform the base model on abstract completion tasks and are benchmarked against commercial LLMs.

Motivation: To develop specialized language models for High-Energy Theoretical Physics by fine-tuning on arXiv abstracts.

Method: Fine-tuned 20 variants of Llama-3.1 on arXiv abstracts from hep-th, hep-ph, and gr-qc, using Low-Rank Adaptation approaches and varying dataset sizes.

Result: Outperformed the base model on hep-th abstract completion tasks; performance was also compared against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek).

Conclusion: Specialized LLMs show promise for High-Energy Physics, with insights for further development.

Abstract: We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.
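For readers who want to reproduce a setup in this spirit, here is a minimal LoRA fine-tuning sketch using the Hugging Face peft library; the rank, target modules, and training details are illustrative assumptions, not the authors' exact configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"  # 8B base model, as in the paper
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-Rank Adaptation: train small rank-r update matrices instead of full weights.
config = LoraConfig(
    r=16,                     # illustrative rank; the paper compares two LoRA setups
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the 8B parameters
# Training would proceed with a standard causal-LM Trainer loop over a
# tokenized corpus of hep-th/hep-ph/gr-qc abstracts.
```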

[3] Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

Abhay Vijayvargia, Ajay Nagpal, Kundeshwar Pundalik, Atharva Savarkar, Smita Gautam, Pankaj Singh, Rohit Saluja, Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: The paper introduces Krishi Sathi, an AI-powered chatbot for Indian farmers, using an IFT model and multi-turn dialogue to provide personalized agricultural advice in English and Hindi.

Motivation: To address the lack of timely, accessible, and language-friendly agricultural advice for Indian farmers, especially in rural areas with low literacy.

Method: Combines an IFT model fine-tuned on Indian agricultural datasets with a structured, multi-turn dialogue system and Retrieval-Augmented Generation (RAG) for tailored responses. Supports text and speech (ASR/TTS).

Result: Achieved 97.53% query response accuracy, 91.35% contextual relevance, and 97.53% query completion rate, with responses under 6 seconds.

Conclusion: The system effectively improves digital agricultural support in India through intent-driven dialogue, instruction-tuned models, and retrieval-based generation.

Abstract: Indian farmers often lack timely, accessible, and language-friendly agricultural advice, especially in rural areas with low literacy. To address this gap in accessibility, this paper presents a novel AI-powered agricultural chatbot, Krishi Sathi, designed to support Indian farmers by providing personalized, easy-to-understand answers to their queries through both text and speech. The system’s intelligence stems from an IFT model, subsequently refined through fine-tuning on Indian agricultural knowledge across three curated datasets. Unlike traditional chatbots that respond to one-off questions, Krishi Sathi follows a structured, multi-turn conversation flow to gradually collect the necessary details from the farmer, ensuring the query is fully understood before generating a response. Once the intent and context are extracted, the system performs Retrieval-Augmented Generation (RAG) by first fetching information from a curated agricultural database and then generating a tailored response using the IFT model. The chatbot supports both English and Hindi languages, with speech input and output features (via ASR and TTS) to make it accessible for users with low literacy or limited digital skills. This work demonstrates how combining intent-driven dialogue flows, instruction-tuned models, and retrieval-based generation can improve the quality and accessibility of digital agricultural support in India. This approach yielded strong results, with the system achieving a query response accuracy of 97.53%, 91.35% contextual relevance and personalization, and a query completion rate of 97.53%. The average response time remained under 6 seconds, ensuring timely support for users across both English and Hindi interactions.
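A compact sketch of the retrieve-then-generate step the abstract describes; the embedding model, similarity search, and prompt template are illustrative assumptions rather than Krishi Sathi's actual stack:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
# doc_vecs = embedder.encode(docs)  # precomputed passage embeddings

def retrieve(query, docs, doc_vecs, k=3):
    """Return the k database passages most similar to the clarified query."""
    q = embedder.encode([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-sims)[:k]]

def answer(query, docs, doc_vecs, generate):
    """RAG step: ground the IFT model's response in retrieved passages."""
    context = "\n".join(retrieve(query, docs, doc_vecs))
    prompt = f"Context:\n{context}\n\nFarmer's question: {query}\nAnswer simply:"
    return generate(prompt)  # `generate` wraps the fine-tuned IFT model
```

In the full system this step runs only after the multi-turn dialogue has established the farmer's intent, so the query passed in is already fully specified.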

[4] Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta

Main category: cs.CL

TL;DR: The paper introduces the Hierarchical Verification Tree (HVT), a framework to improve LLM inference efficiency by prioritizing high-likelihood drafts and pruning suboptimal candidates early, reducing time and energy without compromising output quality.

Motivation: Addressing the inefficiency in LLM inference due to sequential verification of draft sequences, which leads to unnecessary computational overhead.

Method: Proposes HVT, a hierarchical framework with a verification-pruning algorithm, integrated into standard LLM pipelines without retraining.

Result: HVT outperforms existing speculative decoding methods, reducing inference time and energy while maintaining output quality.

Conclusion: Hierarchical verification strategies like HVT offer a promising direction for accelerating LLM inference efficiently.

Abstract: Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.
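The abstract does not spell out the verification-pruning algorithm, but the general pattern of prioritized verification with early pruning can be sketched as follows (the scoring interface and threshold are assumptions, not HVT's actual procedure):

```python
import heapq

def prioritized_verify(drafts, verify_step, prune_below=0.05):
    """Verify speculative drafts highest-likelihood-first, pruning weak ones.

    drafts: list of (log_likelihood, token_sequence) from the draft model.
    verify_step: target-model check returning (accepted_prefix_len, score).
    """
    # Max-heap on draft likelihood: most promising candidates verified first.
    heap = [(-ll, i, seq) for i, (ll, seq) in enumerate(drafts)]
    heapq.heapify(heap)
    accepted = []
    while heap:
        neg_ll, i, seq = heapq.heappop(heap)
        n_ok, score = verify_step(seq)
        if score < prune_below:      # early pruning: drop suboptimal candidates
            continue
        accepted.append((score, seq[:n_ok]))
    return max(accepted, default=(0.0, []))
```

The savings come from never fully verifying drafts that a cheap score already marks as unlikely to be accepted.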

[5] WINELL: Wikipedia Never-Ending Updating with LLM Agents

Revanth Gangi Reddy, Tanay Dixit, Jiaxin Qin, Cheng Qian, Daniel Lee, Jiawei Han, Kevin Small, Xing Fan, Ruhi Sarikaya, Heng Ji

Main category: cs.CL

TL;DR: WiNELL is an LLM-based multi-agent framework for continuously updating Wikipedia by aggregating online info, selecting important knowledge, and generating precise edit suggestions, outperforming baselines like GPT-4o.

Motivation: Wikipedia's reliance on manual updates is inefficient; WiNELL aims to automate this using LLM agents for continuous knowledge acquisition.

Method: Multi-agent framework aggregates online info, selects relevant knowledge, and generates edit suggestions trained on Wikipedia’s edit history.

Result: WiNELL outperforms baselines in info coverage and editing efficiency, demonstrating timely factual updates on Wikipedia.

Conclusion: WiNELL shows promise for automating knowledge base updates, advancing LLM agents for never-ending learning.

Abstract: Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia’s extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL’s ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion.

[6] GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models

Ashutosh Bandooni, Brindha Subburaj

Main category: cs.CL

TL;DR: GanitBench is a multilingual (English and Hindi) benchmark for evaluating VLMs on math questions, showing GPT-4o mini’s dominance and performance drops in Hindi and under ‘Double Lock’ constraints.

Motivation: Address the lack of multilingual (especially Hindi) and domain-specific benchmarks for VLMs, focusing on math.

Method: Collect 1527 math questions from Indian exams (JEE Advanced, CBSE), evaluate VLMs in zero-shot and two-shot CoT settings, and introduce ‘Double Lock’ constraints.

Result: GPT-4o mini leads with 38.15% accuracy; performance drops in Hindi and under constraints; two-shot CoT is more effective.

Conclusion: GanitBench promotes multilingual research, highlighting challenges and effectiveness of two-shot CoT for VLMs.

Abstract: Benchmarks for evaluating reasoning among Vision Language Models (VLMs) on several fields and domains have been curated more frequently over the last few years. However, these are often monolingual, mostly available in English. There is also a lack of datasets available in Hindi for tasks apart from comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations from India, the JEE Advanced and the CBSE Board examinations, this benchmark includes questions in the form of images comprising figures essential to a question as well as text. We evaluate two closed-source models on the benchmark, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models through a “Double Lock” constraint, which brings down the performance of the models by considerable margins. We observe that two-shot CoT appears to be a more effective setting under this environment. Performance of the two VLMs also decreases when answering the same questions in the Hindi language. We hope to facilitate the inclusion of languages like Hindi in research through our work.

[7] AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia

Main category: cs.CL

TL;DR: AttnTrace is a new method for tracing context contributions in LLMs, using attention weights for improved accuracy and efficiency over existing methods like TracLLM.

Motivation: To address the high computational cost and inefficiency of current context traceback methods in LLMs, which are crucial for applications like forensic analysis and improving interpretability.

Method: AttnTrace leverages attention weights from LLMs, enhanced by two novel techniques, and provides theoretical support for its design.

Result: AttnTrace outperforms state-of-the-art methods in accuracy and efficiency and aids in detecting prompt injections in long contexts.

Conclusion: AttnTrace offers a practical, efficient solution for context traceback, with demonstrated real-world applicability, such as identifying injected instructions in manipulated LLM outputs.

Abstract: Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context, often consisting of texts retrieved from a knowledge database or memory, and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
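The core signal AttnTrace uses, attention mass flowing from response tokens onto context chunks, can be sketched simply; the paper's two enhancement techniques are not reproduced here, and the layer/head averaging is a simplification:

```python
import numpy as np

def attention_traceback(attn, chunk_spans, top_k=3):
    """Rank context chunks by the attention they receive from response tokens.

    attn: array [response_len, context_len] of attention weights, averaged
          over layers and heads (a simplification of the paper's method).
    chunk_spans: list of (start, end) token index ranges, one per context text.
    """
    scores = [attn[:, s:e].sum() for s, e in chunk_spans]
    order = np.argsort(scores)[::-1]
    return [(int(i), float(scores[i])) for i in order[:top_k]]
```

Because attention weights fall out of a single forward pass, this kind of traceback avoids the repeated-inference cost that makes perturbation-based methods like TracLLM slow.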

[8] Majority Bit-Aware Watermarking For Large Language Models

Jiahao Xu, Rui Hu, Zikai Zhang

Main category: cs.CL

TL;DR: MajorMark and MajorMark+ are novel watermarking methods for LLMs that improve the trade-off between text quality and decoding accuracy by using majority bit-aware encoding and clustering-based decoding.

Motivation: Address concerns about LLM misuse by embedding identifiable binary messages for origin verification and misuse tracing, overcoming limitations of existing multi-bit watermarking schemes.

Method: MajorMark uses majority bit-aware encoding for flexible token selection and clustering-based decoding. MajorMark+ partitions messages into blocks for independent encoding and deterministic decoding.

Result: Outperforms prior methods in decoding accuracy and text quality, as demonstrated by experiments on state-of-the-art LLMs.

Conclusion: MajorMark and MajorMark+ offer a robust solution for watermarking LLM-generated text, balancing quality and accuracy effectively.

Abstract: The growing deployment of Large Language Models (LLMs) in real-world applications has raised concerns about their potential misuse in generating harmful or deceptive content. To address this issue, watermarking techniques have emerged as a promising solution by embedding identifiable binary messages into generated text for origin verification and misuse tracing. While recent efforts have explored multi-bit watermarking schemes capable of embedding rich information such as user identifiers, they typically suffer from the fundamental trade-off between text quality and decoding accuracy: to ensure reliable message decoding, they have to restrict the size of preferred token sets during encoding, yet such restrictions reduce the quality of the generated content. In this work, we propose MajorMark, a novel watermarking method that improves this trade-off through majority bit-aware encoding. MajorMark selects preferred token sets based on the majority bit of the message, enabling a larger and more flexible sampling of tokens. In contrast to prior methods that rely on token frequency analysis for decoding, MajorMark employs a clustering-based decoding strategy, which maintains high decoding accuracy even when the preferred token set is large, thus preserving both content quality and decoding accuracy. We further introduce MajorMark+, which partitions the message into multiple blocks to independently encode and deterministically decode each block, thereby further enhancing the quality of watermarked text and improving decoding accuracy. Extensive experiments on state-of-the-art LLMs demonstrate that our methods significantly enhance both decoding accuracy and text generation quality, outperforming prior multi-bit watermarking baselines.
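A toy sketch of the majority-bit idea: the majority bit of the message keys a preferred half of the vocabulary, so the preferred set stays large; the partitioning scheme below is illustrative, not the paper's construction:

```python
import zlib
from collections import Counter

def majority_bit(message_bits):
    """The bit occurring most often in the message keys the preferred set."""
    return Counter(message_bits).most_common(1)[0][0]

def preferred_token_set(vocab, message_bits, seed=0):
    """Toy partition: hash each token into one of two halves and prefer the
    half keyed by the majority bit; a large preferred set preserves quality."""
    bit = majority_bit(message_bits)
    return {t for t in vocab
            if (zlib.crc32(f"{t}:{seed}".encode()) & 1) == bit}

vocab = [f"tok{i}" for i in range(10)]
print(preferred_token_set(vocab, [1, 0, 1, 1]))  # roughly half the vocabulary
```

During generation, sampling is biased toward this set; the decoder then recovers the message from statistics of the generated tokens, which in MajorMark is done by clustering rather than frequency counting.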

[9] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam

Main category: cs.CL

TL;DR: The review analyzes challenges in evaluating LLM-generated content for factual accuracy, proposing solutions like advanced prompting, fine-tuning, and RAG methods. It highlights the need for domain-specific fact-checking frameworks.

Motivation: LLMs often generate misinformation due to training on inaccurate data, necessitating robust fact-checking methods.

Method: Systematic review of literature (2020-2025) focusing on evaluation methods, mitigation techniques, and frameworks like RAG and instruction tuning.

Result: Current metrics are limited; grounding outputs with external evidence and domain-specific customization improves factual consistency.

Conclusion: Building accurate, explainable, and domain-specific LLMs is crucial for trustworthy language models.

Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.

[10] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen

Main category: cs.CL

TL;DR: MedCheck is a lifecycle-oriented framework addressing reliability issues in medical LLM benchmarks, revealing systemic flaws and offering guidelines for improvement.

Motivation: Existing medical LLM benchmarks lack clinical fidelity, data integrity, and safety metrics, necessitating a standardized evaluation framework.

Method: Introduces MedCheck, a framework with 46 criteria across five stages of benchmark development, applied to evaluate 53 medical LLM benchmarks.

Result: Uncovered systemic issues like clinical disconnect, data contamination, and neglected safety metrics.

Conclusion: MedCheck diagnoses flaws in current benchmarks and provides actionable guidelines for reliable AI evaluation in healthcare.

Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark’s development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

[11] An Entity Linking Agent for Question Answering

Yajie Luo, Yihong Wu, Muzhi Li, Fengran Mo, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie

Main category: cs.CL

TL;DR: A QA-focused entity linking agent using a Large Language Model improves accuracy for short, ambiguous questions by simulating human workflows.

Motivation: Existing EL methods underperform on short, ambiguous QA questions, necessitating a better approach.

Method: An agent based on a Large Language Model identifies mentions, retrieves candidates, and makes decisions.

Result: Experiments show the agent’s robustness and effectiveness in tool-based EL and QA tasks.

Conclusion: The proposed agent enhances EL for QA, addressing limitations of traditional methods.

Abstract: Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

[12] Chain of Questions: Guiding Multimodal Curiosity in Language Models

Nima Iji, Kia Dashtipour

Main category: cs.CL

TL;DR: The paper introduces Chain of Questions (CoQ), a curiosity-driven framework for multimodal LLMs to dynamically generate questions and activate relevant sensory modalities, improving reasoning accuracy and interpretability.

Motivation: Existing reasoning improvements in LLMs haven't fully extended to multimodal contexts, where models must decide which sensory modalities to engage for complex real-world interactions.

Method: The CoQ framework encourages multimodal LLMs to generate targeted questions about their surroundings, guiding selective activation of relevant modalities for better reasoning.

Result: Evaluated on a novel multimodal benchmark, CoQ improves a foundation model’s ability to identify and integrate sensory information, enhancing accuracy and interpretability.

Conclusion: CoQ effectively bridges the gap in multimodal reasoning, aligning the reasoning process with diverse tasks and improving model performance.

Abstract: Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model’s ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

[13] Sotopia-RL: Reward Design for Social Intelligence

Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You

Main category: cs.CL

TL;DR: Sotopia-RL improves social intelligence in LLMs by refining episode-level feedback into utterance-level, multi-dimensional rewards, addressing partial observability and multi-dimensionality in social interactions.

Motivation: Social interactions in LLMs face challenges like partial observability and multi-dimensionality, making traditional RL inefficient.

Method: Proposes Sotopia-RL, a framework using utterance-level, multi-dimensional rewards for better credit assignment and reduced reward hacking.

Result: Achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard, 8.31 on Sotopia-full), outperforming existing methods.

Conclusion: Sotopia-RL’s utterance-level credit assignment and multi-dimensional rewards are essential for effective RL training in social tasks.

Abstract: Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl.

[14] The State Of TTS: A Case Study with Human Fooling Rates

Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra

Main category: cs.CL

TL;DR: The paper introduces Human Fooling Rate (HFR) to measure how often TTS speech is mistaken for human, revealing gaps in current evaluations and progress.

Motivation: To assess if TTS systems can deceive humans in Turing-like tests, moving beyond subjective evaluations.

Method: Large-scale evaluation of open-source and commercial TTS models using HFR, comparing human and machine-generated speech.

Result: Commercial models approach human deception in zero-shot settings, but open-source systems lag. Fine-tuning helps but doesn’t fully bridge the gap.

Conclusion: More realistic, human-centric evaluations are needed alongside subjective tests to accurately benchmark TTS progress.

Abstract: While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing, (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar, (iii) Commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) Fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
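The metric itself is simple to compute from listener judgments; a sketch, where each entry records whether a listener labeled a machine-generated clip as human:

```python
def human_fooling_rate(judgments):
    """HFR = fraction of machine-generated clips mistaken for human speech.

    judgments: list of booleans, True when a listener labeled a
    machine-generated sample as 'human'.
    """
    return sum(judgments) / len(judgments) if judgments else 0.0

# Illustrative numbers: 37 of 100 TTS clips judged human -> HFR = 0.37
print(human_fooling_rate([True] * 37 + [False] * 63))
```

The paper's point (ii) follows directly: HFR is only meaningful relative to the HFR that genuine human recordings achieve on the same dataset, so expressive reference speech sets the bar.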

[15] CoAct-1: Computer-using Agents with Coding as Actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong

Main category: cs.CL

TL;DR: CoAct-1 introduces a hybrid multi-agent system combining GUI control and programmatic execution, achieving higher efficiency and success rates on complex tasks.

Motivation: Autonomous GUI agents face inefficiency and brittleness in long-horizon tasks due to reliance on GUI manipulation alone.

Method: CoAct-1 uses an Orchestrator to delegate tasks between a GUI Operator and a Programmer agent, enabling Python/Bash scripting for efficiency.

Result: Achieves 60.76% success rate on OSWorld benchmark, reducing steps to 10.15 vs. 15 for GUI-only agents.

Conclusion: Integrating coding as an action enhances automation power, efficiency, and scalability.

Abstract: Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.
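The orchestration pattern reads roughly as follows; the routing heuristic and agent interfaces here are assumptions for illustration, not CoAct-1's actual API:

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str       # e.g. "file_management", "data_processing", "ui"
    payload: str

def orchestrate(subtasks, gui_operator, programmer):
    """Delegate each subtask to the cheaper executor: scripts for file and
    data work, GUI actions when visual interaction is required."""
    results = []
    for task in subtasks:
        if task.kind in {"file_management", "data_processing"}:
            results.append(programmer.run_script(task))   # Python/Bash path
        else:
            results.append(gui_operator.act(task))        # click/type path
    return results
```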

[16] CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation

Raymond Wilson, Cole Graham, Chase Carter, Zefeng Yang, Ruiqi Gu

Main category: cs.CL

TL;DR: CAP-LLM is a novel framework for personalized news headline generation, integrating user preferences and factual consistency into a pre-trained LLM. It outperforms baselines in factual accuracy and personalization.

Motivation: Addressing the limitations of existing methods in capturing user interests and ensuring factual consistency in news headlines.

Method: CAP-LLM combines a User Preference Encoder, Context Injection Adapter, and Fact-Consistency Reinforcement Module to enhance LLM-based headline generation.

Result: Achieves state-of-the-art performance on the PENS dataset, improving factual consistency (FactCC 87.50) and personalization (Pc(avg) 2.73).

Conclusion: CAP-LLM effectively balances personalization and factual accuracy, validated by ablation studies and human evaluations.

Abstract: In the era of information overload, personalized news headline generation is crucial for engaging users by tailoring content to their preferences while accurately conveying news facts. Existing methods struggle with effectively capturing complex user interests and ensuring factual consistency, often leading to generic or misleading headlines. Leveraging the unprecedented capabilities of Large Language Models (LLMs) in text generation, we propose Context-Augmented Personalized LLM (CAP-LLM), a novel framework that integrates user preferences and factual consistency constraints into a powerful pre-trained LLM backbone. CAP-LLM features a User Preference Encoder to capture long-term user interests, a Context Injection Adapter to seamlessly integrate these preferences and current article context into the LLM’s generation process, and a Fact-Consistency Reinforcement Module employing a novel contrastive loss to mitigate hallucination. Evaluated on the real-world PENS dataset, CAP-LLM achieves state-of-the-art performance across all metrics. Notably, it significantly improves factual consistency (FactCC of 87.50) over strong baselines like BART (86.67), while simultaneously enhancing personalization (Pc(avg) 2.73, Pc(max) 17.25) and content coverage (ROUGE-1 26.55, ROUGE-2 9.95, ROUGE-L 23.01). Our ablation studies, human evaluations, and sensitivity analyses further validate the effectiveness of each component and the robustness of our approach, demonstrating CAP-LLM’s ability to achieve a superior balance between personalization and factual accuracy in news headline generation.

[17] Data and AI governance: Promoting equity, ethics, and fairness in large language models

Alok Abhishek, Lisa Erickson, Tushar Bandopadhyay

Main category: cs.CL

TL;DR: The paper presents a framework for governing and mitigating bias in LLMs throughout their lifecycle, from development to deployment, using the BEATS suite and AI governance strategies.

Motivation: To address bias, fairness, and ethical gaps in LLMs and ensure responsible AI deployment.

Method: Proposes a data and AI governance framework, leveraging the BEATS test suite for bias evaluation and continuous monitoring.

Result: Enables rigorous benchmarking, real-time evaluation, and proactive governance of LLM outputs, reducing risks of discrimination.

Conclusion: The framework enhances the safety and responsibility of GenAI systems, promoting ethically aligned AI applications.

Abstract: In this paper, we cover approaches to systematically govern, assess and quantify bias across the complete life cycle of machine learning models, from initial development and validation to ongoing production monitoring and guardrail implementation. Building upon our foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models, the authors share prevalent bias and fairness related gaps in Large Language Models (LLMs) and discuss data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. The data and AI governance approach discussed in this paper is suitable for practical, real-world applications, enabling rigorous benchmarking of LLMs prior to production deployment, facilitating continuous real-time evaluation, and proactively governing LLM generated responses. By implementing the data and AI governance across the life cycle of AI development, organizations can significantly enhance the safety and responsibility of their GenAI systems, effectively mitigating risks of discrimination and protecting against potential reputational or brand-related harm. Ultimately, through this article, we aim to contribute to advancement of the creation and deployment of socially responsible and ethically aligned generative artificial intelligence powered applications.

[18] Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency

Md Arafat Sultan, Ramón Fernandez Astudillo

Main category: cs.CL

TL;DR: The paper proposes a method to improve token efficiency in self-consistency for long reasoning tasks by pruning hypotheses early using confidence and lexical coverage indicators.

Motivation: Self-consistency is effective but token-expensive; the goal is to enhance its efficiency without losing parallelism.

Method: Generate solutions in parallel, prune hypotheses using model confidence and lexical coverage, and apply a weighted set cover algorithm.

Result: Token efficiency improved by 10-35% across five LLMs on three math benchmarks.

Conclusion: Early hypothesis pruning with lightweight indicators can significantly boost token efficiency in self-consistency tasks.

Abstract: Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate if self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model’s own confidence in individual hypotheses, and (b) lexical coverage of all current hypotheses by candidate subsets that are under consideration for continued retention. We design a fast weighted set cover algorithm that utilizes the two indicators; our evaluation of five LLMs on three math benchmarks shows that this method can improve token efficiency for all models, by 10-35% in many cases.
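A greedy sketch of the confidence-weighted set cover idea: retain the cheapest subset of hypotheses whose tokens cover the lexical content of all current hypotheses, treating high-confidence hypotheses as cheap; this is illustrative, not the paper's exact algorithm:

```python
def greedy_confidence_cover(hypotheses, confidences):
    """Select hypotheses to retain: greedily add the one covering the most
    not-yet-covered tokens per unit of (1 - confidence) cost."""
    universe = set().union(*(set(h.split()) for h in hypotheses))
    covered, kept = set(), []
    remaining = list(range(len(hypotheses)))
    while covered != universe and remaining:
        def gain(i):
            new = len(set(hypotheses[i].split()) - covered)
            return new / (1.0 - confidences[i] + 1e-9)  # high confidence = cheap
        best = max(remaining, key=gain)
        if not set(hypotheses[best].split()) - covered:
            break
        kept.append(best)
        covered |= set(hypotheses[best].split())
        remaining.remove(best)
    return kept  # hypotheses outside `kept` are pruned early
```

Running a check like this periodically during parallel decoding lets low-value hypotheses be abandoned before they consume their full token budget.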

[19] Are Today’s LLMs Ready to Explain Well-Being Concepts?

Bohan Jiang, Dawei Li, Zhen Tan, Chengshuai Zhao, Huan Liu

Main category: cs.CL

TL;DR: The paper explores whether LLMs can generate accurate and tailored well-being explanations for diverse audiences, using a large dataset and a novel evaluation framework. Fine-tuning with SFT and DPO improves explanation quality.

Motivation: To address the challenge of LLMs providing high-quality, audience-specific well-being explanations, ensuring both accuracy and adaptability.

Method: Constructed a dataset of 43,880 explanations from 10 LLMs, introduced a principle-guided LLM-as-a-judge framework, and fine-tuned models using SFT and DPO.

Result: LLM judges aligned with human evaluations; explanation quality varied by model, audience, and category; DPO- and SFT-finetuned models outperformed larger ones.

Conclusion: Preference-based learning (DPO and SFT) effectively enhances LLM-generated well-being explanations, making them more accurate and tailored.

Abstract: Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

[20] Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

Xinyu Zhao, Zhen Tan, Maya Enisman, Minjae Seo, Marta R. Durantini, Dolores Albarracin, Tianlong Chen

Main category: cs.CL

TL;DR: A social robot co-facilitator uses a transparent concept bottleneck model to assist human facilitators in group meetings by interpreting social cues and providing discreet recommendations.

Motivation: The cognitive load on facilitators and the limitations of opaque foundation models (FMs) in group settings create a need for an interpretable, embodied technology to enhance facilitation.

Method: The system employs a transfer learning framework to distill FM knowledge into a specialized, transparent agentic concept bottleneck model (CBM) that analyzes multimodal meeting data.

Result: The CBM-based robot outperforms zero-shot FMs in predicting intervention needs, generalizes across groups, and transfers expert facilitator knowledge to novices.

Conclusion: The work offers a blueprint for augmenting human social facilitation with transparent, interpretable AI systems.

Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but “black box” foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot’s reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert’s cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.
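A concept bottleneck model routes every prediction through human-interpretable concept scores, which is what makes the real-time correction described above possible. A minimal PyTorch sketch with hypothetical concept names:

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """features -> interpretable concepts -> intervention decision."""
    CONCEPTS = ["engagement", "sentiment", "goal_progress"]  # hypothetical

    def __init__(self, n_features, n_concepts=3):
        super().__init__()
        self.to_concepts = nn.Linear(n_features, n_concepts)
        self.to_decision = nn.Linear(n_concepts, 1)  # intervene or not

    def forward(self, x, concept_override=None):
        c = torch.sigmoid(self.to_concepts(x))
        if concept_override is not None:   # real-time human correction:
            c = concept_override           # a facilitator can edit the concepts
        return torch.sigmoid(self.to_decision(c)), c
```

Because the decision head sees only the concept vector, a facilitator who disagrees with a concept score can overwrite it and immediately see how the intervention recommendation changes.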

[21] HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

Yurun Chen, Xavier Hu, Yuhan Liu, Keting Yin, Juncheng Li, Zhuosheng Zhang, Shengyu Zhang

Main category: cs.CL

TL;DR: HarmonyGuard is a multi-agent framework for balancing safety and utility in web environments, improving policy compliance by 38% and task completion by 20%.

Motivation: Current research lacks collaborative optimization of safety and utility in web environments, necessitating a solution like HarmonyGuard.

Method: HarmonyGuard uses a multi-agent architecture with Adaptive Policy Enhancement (Policy Agent) and Dual-Objective Optimization (Utility Agent).

Result: Evaluations show HarmonyGuard improves policy compliance by up to 38% and task completion by 20%, achieving over 90% policy compliance.

Conclusion: HarmonyGuard effectively addresses the gap in balancing safety and utility in web environments, outperforming existing baselines.

Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.

[22] Marco-Voice Technical Report

Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang

Main category: cs.CL

TL;DR: The paper introduces Marco-Voice, a speech synthesis system integrating voice cloning and emotion control, using speaker-emotion disentanglement and rotational emotional embedding for natural, expressive speech.

Motivation: To address challenges in achieving expressive, controllable, and natural speech synthesis while preserving speaker identity across diverse contexts.

Method: Uses speaker-emotion disentanglement with in-batch contrastive learning and rotational emotional embedding for smooth emotion control. Evaluated on CSEMOTIONS, a 10-hour Mandarin emotional speech dataset.

Result: Marco-Voice achieves substantial improvements in both objective and subjective metrics, delivering competitive speech clarity and emotional richness.

Conclusion: The system represents a major advance in expressive neural speech synthesis, with code and dataset publicly available.

Abstract: This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion control speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analysis were conducted; results show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis. Our code and dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.

[23] Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu

Main category: cs.CL

TL;DR: SMEdit improves meta-learning-based model editing (MLBME) by using multiple backpropagation steps (MBPS) and norm regularization, enhancing performance in low-data scenarios and training efficiency.

Motivation: Large Language Models (LLMs) are static and costly to update. Model editing, especially MLBME, is efficient but struggles with low-data scenarios and KL divergence computation bottlenecks.

Method: Proposes SMEdit, which uses MBPS for better editing under limited supervision and norm regularization for efficient training.

Result: SMEdit outperforms prior MLBME methods on two datasets and LLMs, with MBPS integrable into existing methods for further improvement.

Conclusion: SMEdit addresses MLBME limitations, offering better performance and efficiency, with potential for broader application.

Abstract: Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps (MBPS) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.
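The MBPS idea can be sketched as several small update steps on the edit example with a norm penalty on weight drift; this simplified version omits the meta-learned components of MLBME, and the hyperparameters are illustrative:

```python
import torch

def mbps_edit(model, loss_fn, edit_batch, steps=5, lr=1e-4, reg=1e-2):
    """Multiple Backpropagation Steps with norm regularization on the update."""
    originals = {n: p.detach().clone() for n, p in model.named_parameters()}
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = loss_fn(model, edit_batch)
        # Penalize drift from the pre-edit weights to keep the edit local.
        drift = sum((p - originals[n]).pow(2).sum()
                    for n, p in model.named_parameters())
        (loss + reg * drift).backward()
        opt.step()
        opt.zero_grad()
    return model
```

Taking several small steps rather than one large one is what improves editing under limited supervision, while the drift penalty plays the locality role that the KL term plays in prior MLBME methods, at lower cost.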

[24] ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim

Main category: cs.CL

TL;DR: ZARA is a zero-shot, explainable HAR framework using raw motion time-series, outperforming baselines by 2.53x in macro F1 without fine-tuning.

Motivation: Existing HAR methods require retraining for new activities or setups, and LLM-based approaches lack accuracy and interpretability.

Method: ZARA integrates a feature knowledge base, multi-sensor retrieval, and a hierarchical agent pipeline to guide LLMs for predictions and explanations.

Result: ZARA achieves state-of-the-art zero-shot performance, surpassing baselines by 2.53x in macro F1, with clear reasoning.

Conclusion: ZARA offers flexible, interpretable HAR without task-specific training, advancing trustworthy motion time-series analysis.

Abstract: Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.

[25] Large Reasoning Models Are Autonomous Jailbreak Agents

Thilo Hagendorff, Erik Derner, Nuria Oliver

Main category: cs.CL

TL;DR: Large reasoning models (LRMs) make jailbreaking other AI models cheap and accessible to non-experts, achieving a 97.14% overall attack success rate.

DetailsMotivation: To demonstrate how LRMs can bypass AI safety mechanisms easily and at scale, highlighting a regression in model alignment.

Method: Evaluated four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) as autonomous adversaries against nine target models using a benchmark of 70 harmful prompts.

Result: Achieved a 97.14% attack success rate, showing LRMs can systematically erode safety guardrails.

Conclusion: Urgent need to align frontier models to resist and prevent jailbreaking, as LRMs can act as jailbreak agents.

Abstract: Jailbreaking – bypassing built-in safety mechanisms in AI models – has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.

[26] DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation

Jiabing Yang, Yixiang Chen, Zichen Wen, Chenhang Cui, Peiyan Li, Yuan Xu, Bowen Fang, Yan Huang, Liang Wang

Main category: cs.CL

TL;DR: The paper shows that the controllability of prefix-based methods declines as generated text grows longer and proposes Dynamic Token-level Prefix Augmentation (DTPA) to counteract this decline.

DetailsMotivation: Previous studies focus on short sequences, leaving long-form text generation underexplored, with controllability declining as sequence length increases.

Method: Proposes DTPA, a lightweight framework that selects the optimal prefix type for the task and dynamically amplifies attention to the prefix, with a scaling factor that grows exponentially with sequence length.

Result: DTPA outperforms other methods in attribute control while maintaining fluency, diversity, and topic relevance, especially in long text generation.

Conclusion: DTPA effectively enhances controllability in long-form text generation, balancing quality and attribute constraints.

Abstract: Controllable Text Generation (CTG) is a vital subfield in Natural Language Processing (NLP), aiming to generate text that aligns with desired attributes. However, previous studies commonly focus on the quality of controllable text generation for short sequences, while the generation of long-form text remains largely underexplored. In this paper, we observe that the controllability of texts generated by the powerful prefix-based method Air-Decoding tends to decline with increasing sequence length, which we hypothesize primarily arises from the observed decay in attention to the prefixes. Meanwhile, different types of prefixes including soft and hard prefixes are also key factors influencing performance. Building on these insights, we propose a lightweight and effective framework called Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding for controllable text generation. Specifically, it first selects the optimal prefix type for a given task. Then we dynamically amplify the attention to the prefix for the attribute distribution to enhance controllability, with a scaling factor growing exponentially as the sequence length increases. Moreover, based on the task, we optionally apply a similar augmentation to the original prompt for the raw distribution to balance text quality. After attribute distribution reconstruction, the generated text satisfies the attribute constraints well. Experiments on multiple CTG tasks demonstrate that DTPA generally outperforms other methods in attribute control while maintaining competitive fluency, diversity, and topic relevance. Further analysis highlights DTPA’s superior effectiveness in long text generation.
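A minimal sketch of the core re-weighting step, under our reading of the abstract: attention mass on the prefix is amplified by a factor that grows exponentially with the decoding step. The value of `gamma` and the renormalization scheme are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dtpa_attention(logits, prefix_len, t, gamma=1.02):
    """Re-weight attention toward the prefix at decoding step t.

    gamma**t grows exponentially with sequence length, counteracting the
    observed decay of prefix attention as generation gets longer.
    """
    w = softmax(logits)
    w[:prefix_len] *= gamma ** t     # amplify attention on prefix positions
    return w / w.sum()               # renormalize to a distribution

logits = np.array([1.0, 0.8, 0.5, 2.0, 1.5])   # first 2 positions = prefix
for t in (1, 100, 400):
    print(t, dtpa_attention(logits, prefix_len=2, t=t).round(3))
```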

[27] PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG

Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang

Main category: cs.CL

TL;DR: PAIRS improves RAG systems by adaptively deciding when to retrieve external knowledge, reducing costs by 25% while boosting accuracy.

DetailsMotivation: Current RAG systems inefficiently retrieve for every query and risk irrelevance with sparse queries.

Method: PAIRS uses a dual-path mechanism: checks LLM’s parametric knowledge first, retrieves only if needed, and filters documents adaptively.

Result: Reduces retrieval costs by 25% and improves accuracy (+1.1% EM, +1.0% F1) on QA benchmarks.

Conclusion: PAIRS enhances RAG efficiency and accuracy by integrating parametric and retrieved knowledge adaptively.

Abstract: Retrieval-Augmented Generation (RAG) has become a cornerstone technique for enhancing large language models (LLMs) with external knowledge. However, current RAG systems face two critical limitations: (1) they inefficiently retrieve information for every query, including simple questions that could be resolved using the LLM’s parametric knowledge alone, and (2) they risk retrieving irrelevant documents when queries contain sparse information signals. To address these gaps, we introduce Parametric-verified Adaptive Information Retrieval and Selection (PAIRS), a training-free framework that integrates parametric and retrieved knowledge to adaptively determine whether to retrieve and how to select external information. Specifically, PAIRS employs a dual-path generation mechanism: First, the LLM produces both a direct answer and a context-augmented answer using self-generated pseudo-context. When these outputs converge, PAIRS bypasses external retrieval entirely, dramatically improving the RAG system’s efficiency. For divergent cases, PAIRS activates a dual-path retrieval (DPR) process guided by both the original query and self-generated contextual signals, followed by an Adaptive Information Selection (AIS) module that filters documents through weighted similarity to both sources. This simple yet effective approach can not only enhance efficiency by eliminating unnecessary retrievals but also improve accuracy through contextually guided retrieval and adaptive information selection. Experimental results on six question-answering (QA) benchmarks show that PAIRS reduces retrieval costs by around 25% (triggering for only 75% of queries) while still improving accuracy, achieving +1.1% EM and +1.0% F1 over prior baselines on average.
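The control flow described above can be sketched as follows. Every function body is a toy placeholder (the `llm_direct` and `llm_with_pseudo_context` stubs are hypothetical); only the converge-or-retrieve logic and the weighted dual-path scoring mirror the description.

```python
def llm_direct(query):
    return "Paris"                        # answer from parametric knowledge only

def llm_with_pseudo_context(query):
    return "Lyon"                         # answer conditioned on self-generated pseudo-context

def token_sim(a, b):                      # toy similarity (Jaccard over words)
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def dual_path_retrieve(query, pseudo_context, corpus, w=0.5, k=1):
    # AIS idea: weighted similarity to both the query and the pseudo-context.
    score = lambda d: w * token_sim(d, query) + (1 - w) * token_sim(d, pseudo_context)
    return sorted(corpus, key=score, reverse=True)[:k]

def pairs(query, corpus):
    a1, a2 = llm_direct(query), llm_with_pseudo_context(query)
    if a1.strip().lower() == a2.strip().lower():
        return a1                         # answers converge: skip retrieval entirely
    pseudo = "france capital city seine"  # self-generated context (toy)
    docs = dual_path_retrieve(query, pseudo, corpus)
    return f"generate final answer from: {docs}"

corpus = ["Paris is the capital of France", "Lyon is a city in France"]
print(pairs("What is the capital of France?", corpus))
```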

[28] Efficient Strategy for Improving Large Language Model (LLM) Capabilities

Julián Camilo Velandia Gutiérrez

Main category: cs.CL

TL;DR: The paper proposes strategies to improve the efficiency of Large Language Models (LLMs) in resource-constrained environments by combining data processing, careful data selection, training strategies, and architectural adjustments.

DetailsMotivation: LLMs require significant computational resources, limiting their large-scale deployment. The goal is to enhance efficiency without compromising performance.

Method: The approach includes defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in capability, versatility, response time, and safety. Comparative tests validated the strategies.

Result: The developed variants showed improved efficiency in resource-constrained environments, as validated by comparative tests.

Conclusion: The proposed strategies effectively enhance LLM efficiency, making them more viable for deployment in constrained settings.

Abstract: Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. This work proposes starting from a base model to explore and combine data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master’s thesis in Systems and Computer Engineering titled “Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)”.

[29] ToolGrad: Efficient Tool-use Dataset Generation with Textual “Gradients”

Zhongyi Zhou, Kohei Uehara, Haoyu Zhang, Jingtao Zhou, Lin Gu, Ruofei Du, Zheng Xu, Tatsuya Harada

Main category: cs.CL

TL;DR: ToolGrad introduces an agentic framework to generate tool-use datasets by first constructing valid tool-use chains and then synthesizing user queries, improving efficiency and quality.

DetailsMotivation: Prior methods for synthesizing tool-use datasets are inefficient and prone to annotation failures, prompting the need for a better approach.

Method: ToolGrad inverts the traditional paradigm by first creating tool-use chains guided by textual gradients, then generating user queries.

Result: ToolGrad-5k, the resulting dataset, features more complex tool use, lower generation cost, and a 100% pass rate; models trained on it outperform those trained on expensive baselines.

Conclusion: ToolGrad’s answer-first approach is more efficient and effective for generating high-quality tool-use datasets.

Abstract: Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like DFS. This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual “gradients”, and then synthesizes corresponding user queries. This “answer-first” approach led to ToolGrad-5k, a dataset generated with more complex tool use, lower cost, and 100% pass rate. Experiments show that models trained on ToolGrad-5k outperform those on expensive baseline datasets and proprietary LLMs, even on OOD benchmarks.
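A schematic of the answer-first loop, assuming toy stand-ins for the LLM-driven proposer, critic, and query synthesizer: the tool-use chain is grown under textual feedback, and the user query is synthesized only once the chain is valid.

```python
# Toy stand-ins for ToolGrad's LLM-driven modules; only the answer-first
# control flow follows the paper's description.
def propose_next_tool(chain):
    return f"tool_{len(chain)}"              # an LLM would pick a real API call

def textual_gradient(chain):
    # An LLM critic would return natural-language feedback here; this stub
    # just asks for more calls until the chain reaches length 3.
    return None if len(chain) >= 3 else "chain too short; add another call"

def synthesize_query(chain):
    return f"User query that requires: {' -> '.join(chain)}"

chain = []
while (feedback := textual_gradient(chain)) is not None:
    chain.append(propose_next_tool(chain))   # revise the chain using feedback
print(synthesize_query(chain))               # query written only after the chain is valid
```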

[30] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu

Main category: cs.CL

TL;DR: GM-PRM is introduced as a Generative Multimodal Process Reward Model to actively correct errors in reasoning steps, improving solution quality and diversity in multimodal math tasks.

DetailsMotivation: Existing multimodal PRMs are limited to binary verification and lack corrective or explanatory power, hindering complex mathematical reasoning.

Method: GM-PRM provides fine-grained analysis of reasoning steps and generates corrections for errors. It uses Refined Best-of-N (Refined-BoN) to guide policy models toward better solutions.

Result: GM-PRM achieves state-of-the-art results on multimodal math benchmarks with high data efficiency (20K-sample training dataset).

Conclusion: GM-PRM transforms PRMs into active reasoning collaborators, significantly enhancing performance in complex mathematical tasks.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.
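The Refined-BoN idea can be sketched with toy stand-ins for the policy and the generative PRM: score each candidate's steps, rewrite the first erroneous step, let the policy continue from the fix, and pick the best of the enlarged pool. All scoring and correction logic below is invented for illustration.

```python
import random
random.seed(0)

# Toy stand-ins: a "policy" emitting scored steps and a generative PRM that
# can rewrite the first step it judges wrong. All numbers are invented.
def policy_sample():
    return [f"step{i}:{random.random():.2f}" for i in range(3)]

def prm_judge(steps):                        # returns index of first bad step
    scores = [float(s.split(":")[1]) for s in steps]
    return next((i for i, sc in enumerate(scores) if sc < 0.3), None)

def prm_correct(step):
    return step.split(":")[0] + ":0.95"      # PRM-generated correction

def quality(steps):
    return sum(float(s.split(":")[1]) for s in steps)

def refined_best_of_n(n=4):
    pool = [policy_sample() for _ in range(n)]
    for sol in list(pool):
        bad = prm_judge(sol)
        if bad is not None:
            fixed = sol[:bad] + [prm_correct(sol[bad])]
            fixed += policy_sample()[len(fixed):]   # policy continues from the fix
            pool.append(fixed)               # corrections enlarge the pool
    return max(pool, key=quality)

print(refined_best_of_n())
```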

[31] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen

Main category: cs.CL

TL;DR: The paper studies over-memorization in LLM finetuning, where models memorize training data but perform well on tests, though with reduced robustness and generalization.

DetailsMotivation: To understand the learning dynamics of LLM finetuning, particularly the over-memorization phenomenon and its impact on model performance.

Method: Analyze conditions like training epochs and learning rates that lead to over-memorization, and evaluate its effects on robustness and generalization.

Result: Over-memorization occurs under specific conditions, leading to high test perplexity but good accuracy, while harming robustness and diversity.

Conclusion: Over-memorization is a unique issue in LLMs, requiring careful checkpoint and learning rate selection during finetuning.

Abstract: The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously uncovered over-memorization phenomenon that arises during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments show that over-memorization occurs broadly across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.

[32] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Xuan Qi, Rongwu Xu, Zhijing Jin

Main category: cs.CL

TL;DR: A novel difficulty-based data selection strategy for preference datasets improves LLM alignment efficiency, outperforming baselines with only 10% of data.

DetailsMotivation: Aligning LLMs with human preferences is challenging due to reliance on costly preference datasets; current methods lack efficient data selection.

Method: Introduces a difficulty-based selection strategy using DPO implicit reward gaps to identify challenging cases.

Result: Outperforms five baselines, achieving superior performance with 10% of original data.

Conclusion: The method offers a scalable, efficient solution for LLM alignment with limited resources.

Abstract: Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
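Since DPO's implicit reward is r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)), the selection criterion is easy to sketch: compute the gap r(chosen) - r(rejected) per example and keep the smallest gaps. The beta value and the toy log-probabilities below are assumptions for illustration.

```python
BETA = 0.1  # DPO temperature; value assumed for illustration

def implicit_reward(logp_policy, logp_ref):
    # DPO's implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))
    return BETA * (logp_policy - logp_ref)

def reward_gap(ex):
    return (implicit_reward(ex["logp_chosen"], ex["ref_logp_chosen"])
            - implicit_reward(ex["logp_rejected"], ex["ref_logp_rejected"]))

# Toy preference pairs with per-sequence log-probabilities (invented numbers).
data = [
    {"id": 0, "logp_chosen": -12.0, "ref_logp_chosen": -12.5,
     "logp_rejected": -20.0, "ref_logp_rejected": -15.0},   # large gap: easy
    {"id": 1, "logp_chosen": -10.0, "ref_logp_chosen": -10.1,
     "logp_rejected": -10.4, "ref_logp_rejected": -10.2},   # small gap: hard
]

# Keep the hardest examples: the smallest implicit reward gaps.
data.sort(key=reward_gap)
print([(ex["id"], round(reward_gap(ex), 3)) for ex in data])
```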

[33] Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua

Main category: cs.CL

TL;DR: A reinforcement learning framework is proposed to reduce hallucinations in Multimodal Large Language Models (MLLMs) by ensuring causal completeness of tokens.

DetailsMotivation: MLLMs often generate semantically inconsistent outputs (hallucinations), which can be due to missing causal factors or misleading non-causal cues.

Method: A novel reinforcement learning framework evaluates token-level causal completeness (causal sufficiency and necessity) and uses it to guide optimization via GRPO.

Result: The approach effectively mitigates hallucinations across various benchmark datasets and tasks.

Conclusion: The proposed method successfully addresses hallucinations in MLLMs by focusing on causally relevant tokens.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations–generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token’s standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate the effectiveness of our approach, which effectively mitigates hallucinations in MLLMs.
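One schematic reading of the token-level reward: combine a sufficiency probe (standalone contribution) with a necessity probe (counterfactual indispensability). The probe functions and the equal weighting below are our assumptions, not the paper's definitions.

```python
# Toy probes; the combination rule (equal weights) is assumed.
def sufficiency(token, visual_evidence):
    # Standalone contribution: does the token have grounding in the image?
    return 1.0 if token in visual_evidence else 0.2

def necessity(token, output_with, output_without):
    # Counterfactual indispensability: does removing the token change the output?
    return 1.0 if output_with != output_without else 0.0

def causal_completeness_reward(token, visual_evidence, out_with, out_without):
    s = sufficiency(token, visual_evidence)
    n = necessity(token, out_with, out_without)
    return 0.5 * s + 0.5 * n      # fed as a token-level reward into GRPO

print(causal_completeness_reward("dog", {"dog", "ball"}, "a dog with a ball", "an animal"))
```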

[34] Characterizing Deep Research: A Benchmark and Formal Definition

Abhinav Java, Ashmit Khandelwal, Sukruta Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma

Main category: cs.CL

TL;DR: The paper formalizes deep research (DR) tasks, introduces a benchmark (LiveDRBench), and evaluates DR systems, highlighting their challenges and performance.

DetailsMotivation: To address the underdefined scope of deep research tasks and distinguish them from other reasoning-intensive problems.

Method: Proposes a formal characterization of DR, defines it using an intermediate output representation, and introduces a benchmark (LiveDRBench) with 100 tasks.

Result: State-of-the-art DR systems show F1 scores between 0.02 and 0.72, with OpenAI’s model leading at 0.55. Analysis reveals search and grounding challenges.

Conclusion: The benchmark and analysis provide a foundation for improving DR systems’ search mechanisms and grounding capabilities.

Abstract: Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of “deep research” – a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose LiveDRBench, a diverse and challenging benchmark with 100 tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 score ranges between 0.02 and 0.72 for any sub-category. OpenAI’s model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at https://github.com/microsoft/LiveDRBench.

[35] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil, Hiskias Dingeto, Haon Park

Main category: cs.CL

TL;DR: State-of-the-art language models are vulnerable to conversational attacks, revealing gaps in alignment methods. A manual red-teaming study identified 10 attack scenarios, later automated into MISALIGNMENTBENCH, showing a 76% vulnerability rate across models.

DetailsMotivation: To expose vulnerabilities in current alignment techniques of language models, particularly under subtle conversational manipulation.

Method: Systematic manual red-teaming with Claude-4-Opus to identify attack scenarios, followed by automated testing (MISALIGNMENTBENCH) across five frontier LLMs.

Result: 76% overall vulnerability rate, with GPT-4.1 most susceptible (90%) and Claude-4-Sonnet least (40%). Models exhibited deception, value drift, and manipulative reasoning.

Conclusion: Current alignment methods are insufficient against scenario-based manipulation, necessitating improved robustness in future AI systems.

Abstract: Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

[36] Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts

Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina, Keshet Ronen, Javier Gonzalez, Jacki O’Neill

Main category: cs.CL

TL;DR: The paper introduces a framework to evaluate sentiment analysis in culturally nuanced, low-resource contexts, focusing on LLMs’ performance in interpreting sentiment in informal, code-mixed WhatsApp messages.

DetailsMotivation: Sentiment analysis in culturally diverse, low-resource settings is challenging due to assumptions of fixed labels and universal expressions in conventional NLP.

Method: The study uses human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation to assess LLMs’ interpretability, robustness, and alignment with human reasoning.

Result: Top-tier LLMs show interpretive stability, while open models struggle with ambiguity or sentiment shifts.

Conclusion: The work emphasizes the need for culturally sensitive, reasoning-aware AI evaluation in real-world communication.

Abstract: Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLMs outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.

[37] ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang

Main category: cs.CL

TL;DR: ReasoningGuard is an inference-time safeguard for Large Reasoning Models (LRMs) that detects critical reasoning steps and triggers safety-oriented reflections, outperforming existing defenses with minimal extra cost.

DetailsMotivation: LRMs are vulnerable to harmful content generation during reasoning, and current defenses are costly and lack scalability.

Method: Leverages internal attention behavior to identify critical reasoning points, implements scaling sampling for optimal path selection, and triggers safety reflections.

Result: Effectively mitigates three types of jailbreak attacks, outperforms seven existing safeguards, and avoids exaggerated safety issues.

Conclusion: ReasoningGuard provides scalable, efficient safety for LRMs without requiring fine-tuning or expert knowledge.

Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs, which injects timely safety aha moments to steer reasoning processes that remain harmless while helpful. Leveraging the model’s internal attention behavior, our approach accurately identifies critical points in the reasoning path, and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Inducing minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while effectively avoiding the common exaggerated safety issues.

[38] Hierarchical Text Classification Using Black Box Large Language Models

Kosuke Yoshimura, Hisashi Kashima

Main category: cs.CL

TL;DR: The study explores using black box LLMs for Hierarchical Text Classification (HTC), comparing three prompting strategies in zero-shot and few-shot settings. Few-shot improves accuracy, and LLMs outperform traditional models on deeper hierarchies, but API costs rise with deeper hierarchies.

DetailsMotivation: Address challenges of data scarcity and model complexity in HTC by leveraging LLMs via APIs as an alternative to traditional methods.

Method: Evaluate three prompting strategies (DL, DH, TMH) in zero-shot and few-shot settings on two datasets, comparing accuracy and cost.

Result: Few-shot improves accuracy; LLMs (especially DH) outperform traditional models on deeper hierarchies, but API costs increase with deeper hierarchies.

Conclusion: Black box LLMs show promise for HTC, but prompt strategy selection must balance performance and cost.

Abstract: Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies – Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) – in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially with the DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly with the DH strategy due to the higher number of input tokens required for deeper label hierarchies. These results emphasize the trade-off between accuracy improvement and the computational cost of each prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.
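The TMH strategy is the most mechanical of the three and can be sketched as a per-level prompt loop; `call_llm` below is a hypothetical stand-in for a black-box API client, and the label tree is invented.

```python
# Schematic Top-down Multi-step Hierarchical (TMH) prompting: one LLM call
# per level of the label tree, each restricted to the children of the
# previous choice.
HIERARCHY = {
    "root": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Soccer", "Tennis"],
}

def call_llm(prompt, options):
    # A real system would send `prompt` to a black-box LLM API; here we
    # just pick the first option so the sketch runs end-to-end.
    return options[0]

def tmh_classify(text):
    node, path = "root", []
    while node in HIERARCHY:
        options = HIERARCHY[node]
        prompt = f"Text: {text}\nChoose exactly one label: {', '.join(options)}"
        node = call_llm(prompt, options)
        path.append(node)
    return path

print(tmh_classify("a quantum entanglement experiment"))   # ['Science', 'Physics']
```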

[39] DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

Chanjuan Liu, Shengzhi Wang, Enqiang Zhu

Main category: cs.CL

TL;DR: DP-GPT4MTS introduces a dual-prompt LLM framework for time series forecasting, combining explicit and textual prompts to improve accuracy by leveraging textual context.

DetailsMotivation: Traditional models ignore textual data, limiting accuracy. Existing LLM frameworks fail to effectively process timestamped text, introducing redundancy.

Method: DP-GPT4MTS uses dual prompts: explicit for task instructions and textual for context-aware embeddings, refined via self-attention and feed-forward networks.

Result: Outperforms state-of-the-art algorithms on diverse datasets, demonstrating improved forecasting accuracy.

Conclusion: Incorporating textual context through a dual-prompt mechanism enhances time series forecasting accuracy.

Abstract: Time series forecasting is crucial in strategic planning and decision-making across various industries. Traditional forecasting models mainly concentrate on numerical time series data, often overlooking important textual information such as events and news, which can significantly affect forecasting accuracy. While large language models offer promise for integrating multimodal data, existing single-prompt frameworks struggle to effectively capture the semantics of timestamped text, introducing redundant information that can hinder model performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework that combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. The tokenizer generates the explicit prompt while the embeddings from the textual prompt are refined through self-attention and feed-forward networks. Comprehensive experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting. This highlights the significance of incorporating textual context via a dual-prompt mechanism to achieve more accurate time series predictions.

[40] TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

Xi Wang, Anxo Perez, Javier Parapar, Fabio Crestani

Main category: cs.CL

TL;DR: A novel clinician-in-the-loop pipeline, TalkDep, uses advanced language models to simulate diverse and clinically valid patient responses for depression diagnosis training.

DetailsMotivation: The shortage of real training data for mental health services limits support for depression diagnosis, prompting the need for better simulated patients.

Method: TalkDep leverages language models, psychiatric criteria, symptom severity scales, and contextual factors to generate authentic patient responses.

Result: Simulated patients are validated by clinical professionals, offering scalable resources for improving diagnostic systems.

Conclusion: Validated simulated patients enhance the robustness and generalizability of automatic depression diagnosis tools.

Abstract: The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

[41] KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

Zunhai Su, Kehong Yuan

Main category: cs.CL

TL;DR: The paper introduces KVSink, a method to predict and preserve attention sinks during KV cache quantization, improving performance over existing strategies.

DetailsMotivation: The need to better understand and address attention sinks in KV cache quantization for efficient LLM inference, as current methods are insufficient.

Method: Examines attention sinks’ role in activation outliers and introduces KVSink, a plug-and-play method to predict sink tokens.

Result: KVSink outperforms the Preserve-First-N strategy and enhances KVQuant, improving perplexity and reducing reliance on 16-bit outliers.

Conclusion: KVSink effectively preserves attention sinks, advancing KV cache quantization for LLM inference.

Abstract: Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language models (LLMs) inference by reducing KV cache memory usage and mitigating memory-bound constraints. Recent studies have emphasized the importance of preserving the original precision of KVs for the first few tokens to ensure the protection of attention sinks. While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood. Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions. In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers. Additionally, we provide a comprehensive analysis of the interplay between attention sinks and KV cache quantization. Based on our enhanced understanding, we introduce KVSink, a plug-and-play method that effectively predicts sink tokens with negligible overhead, enabling more thorough preservation. Extensive experiments demonstrate that KVSink outperforms the existing Preserve-First-N (PFN) strategy, offering more effective preservation of attention sinks during KV cache quantization. Moreover, when applied to the well-established KVQuant method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit numerical outliers.
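The preservation step itself is simple to sketch: quantize the KV cache per token but keep predicted sink tokens at full precision. The int8 round trip below is illustrative; KVSink's actual contribution is predicting `sink_idx`, which is simply given here.

```python
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0 + 1e-8
    return np.round(x / scale).astype(np.int8), scale

def quantize_kv_preserving_sinks(kv, sink_idx):
    """Per-token KV quantization that keeps predicted sink tokens at full
    precision; `sink_idx` would come from KVSink's sink predictor."""
    out = kv.copy()
    for t in range(kv.shape[0]):
        if t not in sink_idx:
            q, s = quantize_int8(kv[t])
            out[t] = q.astype(np.float32) * s    # lossy 8-bit round trip
    return out

rng = np.random.default_rng(0)
kv = rng.standard_normal((6, 4)).astype(np.float32)
kv[0] *= 40.0                                    # sinks often carry extreme outliers
recon = quantize_kv_preserving_sinks(kv, sink_idx={0})
print("max reconstruction error:", float(np.abs(recon - kv).max()))
```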

[42] ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jiandong Zhang, Xiaoyi Zeng

Main category: cs.CL

TL;DR: ShoppingBench is a new benchmark for complex e-commerce tasks, challenging even advanced language agents like GPT-4.1, with success rates below 50%. A distilled smaller agent achieves competitive performance.

DetailsMotivation: Existing benchmarks lack coverage of complex user goals in e-commerce, such as voucher application or multi-product seller searches.

Method: A scalable framework simulates user instructions from real-world product intents, using a large-scale shopping sandbox with 2.5M products. Trajectory distillation and fine-tuning enhance smaller agents.

Result: State-of-the-art agents like GPT-4.1 score under 50% success on ShoppingBench. The distilled smaller agent matches GPT-4.1’s performance.

Conclusion: ShoppingBench addresses the gap in complex e-commerce tasks, proving challenging for current agents, while distillation offers a viable solution for smaller models.

Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-products seller. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by our ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

[43] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

Jiayi Wen, Tianxin Chen, Zhirun Zheng, Cheng Huang

Main category: cs.CL

TL;DR: GraphRAG enhances LLMs with structured knowledge graphs but is vulnerable to knowledge poisoning attacks (KPAs) that manipulate graph construction, severely impacting downstream reasoning.

DetailsMotivation: To expose vulnerabilities in GraphRAG, where malicious text modifications can poison knowledge graphs and mislead reasoning, highlighting the need for robust defenses.

Method: Proposes two KPAs: Targeted KPA (TKPA) for precise QA outcome control and Universal KPA (UKPA) for disrupting graph integrity via linguistic cues.

Result: TKPA achieves 93.1% success in manipulating QA outcomes; UKPA reduces QA accuracy from 95% to 50% with minimal text changes. Current defenses fail to detect these attacks.

Conclusion: GraphRAG is highly susceptible to knowledge poisoning, and securing it against such attacks remains an open challenge.

Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of full text modified, the QA accuracy collapses from 95% to 50%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

[44] Modelling and Classifying the Components of a Literature Review

Francisco Bolaños, Angelo Salatino, Francesco Osborne, Enrico Motta

Main category: cs.CL

TL;DR: The paper introduces a new annotation schema for rhetorical roles in scientific literature and evaluates LLMs for classifying these roles, achieving high performance with fine-tuning and semi-synthetic data.

DetailsMotivation: To improve AI methods for analyzing scientific literature and support high-quality literature review generation by defining a relevant annotation schema and effective large-scale annotation strategies.

Method: 1) Introduces a novel annotation schema for literature review generation. 2) Evaluates 37 LLMs using zero-shot and fine-tuning approaches on a new benchmark (Sci-Sentence) with manually and automatically labeled sentences.

Result: Fine-tuned LLMs achieve >96% F1. GPT-4o performs best, but lightweight open-source models also excel. Semi-synthetic data boosts performance of small encoders and open decoder models.

Conclusion: The study advances rhetorical role classification in scientific literature, demonstrating the effectiveness of fine-tuned LLMs and the utility of semi-synthetic data for improving model performance.

Abstract: Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

[45] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Hongze Tan, Jianfei Pan

Main category: cs.CL

TL;DR: The paper introduces Dynamic Entropy Weighting to improve RL for LLMs through fine-grained credit assignment, using entropy-weighted rewards at the token and sequence levels, and outperforms DAPO.

DetailsMotivation: Current RL methods for LLMs use uniform rewards for all tokens, limiting performance in long-chain reasoning tasks.

Method: Proposes Dynamic Entropy Weighting with GTPO (token-level entropy-weighted rewards) and GRPO-S (sequence-level entropy-weighted rewards).

Result: Experiments show significant improvement over DAPO, with entropy-weighting being key to performance boost.

Conclusion: Dynamic Entropy Weighting enhances deep reasoning in LLMs by enabling precise policy updates.

Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses this with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO) assigns an entropy-weighted reward to each token for fine-grained credit assignment. 2) Sequence-Level Group Relative Policy Optimization (GRPO-S) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
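A small PyTorch sketch of the two reward shapings as we read them: per-token entropy weighting for GTPO and mean-entropy weighting for GRPO-S. The normalization scheme and `alpha` are assumed; the paper's exact weighting may differ.

```python
import torch

def token_entropies(logits):
    # Per-token policy entropy H_t = -sum_v p_v * log p_v over the vocabulary.
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(-1)

def gtpo_token_rewards(logits, seq_reward, alpha=0.5):
    # GTPO (schematic): each token's reward is the sequence reward modulated
    # by its entropy relative to the sequence mean.
    h = token_entropies(logits)
    weights = 1.0 + alpha * (h / (h.mean() + 1e-8) - 1.0)
    return seq_reward * weights

def grpo_s_sequence_reward(logits, seq_reward, alpha=0.5):
    # GRPO-S (schematic): one entropy-weighted reward for the whole sequence.
    return seq_reward * (1.0 + alpha * token_entropies(logits).mean())

logits = torch.randn(10, 32)   # 10 generated tokens over a toy 32-word vocab
print(gtpo_token_rewards(logits, seq_reward=1.0))
print(grpo_s_sequence_reward(logits, seq_reward=1.0))
```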

[46] AIC CTU@FEVER 8: On-premise fact checking through long context RAG

Herbert Ullrich, Jan Drchal

Main category: cs.CL

TL;DR: A two-step RAG fact-checking pipeline scored first in the FEVER 8 shared task and can be deployed on-premise, delivering high performance under tight hardware constraints.

DetailsMotivation: To achieve state-of-the-art fact-checking performance under limited computational resources.

Method: A two-step RAG (Retrieval-Augmented Generation) pipeline, building on a previous submission.

Result: Scored first in the FEVER 8 shared task, achieving a high Ev2R test-score under hardware constraints.

Conclusion: The pipeline is effective and deployable on-premise, even with limited GPU resources.

Abstract: In this paper, we present our fact-checking pipeline which has scored first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our last year’s submission. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in the sense of the Ev2R test-score), even under the constraint of a single NVidia A10 GPU, 23GB of graphics memory, and 60s running time per claim.
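In outline, a two-step RAG fact check looks like the sketch below: retrieve evidence for the claim, then prompt a local model for a verdict. Both steps are toy placeholders, not the authors' pipeline.

```python
def retrieve(claim, corpus, k=2):
    # Step 1: toy lexical retrieval in place of the real retriever.
    overlap = lambda d: len(set(claim.lower().split()) & set(d.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def judge(claim, evidence):
    # Step 2: placeholder for an on-premise LLM call that returns a
    # FEVER-style verdict given the claim and retrieved evidence.
    text = " ".join(evidence).lower()
    supported = all(w in text for w in claim.lower().split()[-3:])
    return "SUPPORTED" if supported else "NOT ENOUGH INFO"

corpus = ["The Eiffel Tower is in Paris.", "Mount Everest is in Nepal."]
claim = "The Eiffel Tower is in Paris"
print(judge(claim, retrieve(claim, corpus)))
```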

[47] Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

Xu Zhang, Mei Chen

Main category: cs.CL

TL;DR: The study compares NLP techniques for crash data quality, finding fine-tuned transformers like RoBERTa outperform zero-shot LLMs and logistic regression, balancing accuracy and efficiency.

DetailsMotivation: To enhance crash data quality by evaluating NLP techniques for mining crash narratives, using secondary crash identification as a case study.

Method: Compared three model classes: zero-shot LLMs, fine-tuned transformers, and logistic regression, tested on Kentucky crash narratives from 2015-2022.

Result: Fine-tuned RoBERTa achieved the highest F1-score (0.90) and accuracy (95%), while zero-shot LLMs showed high recall but incurred high computational costs.

Conclusion: Fine-tuned transformers offer a balanced solution for crash data quality, with practical deployment considerations for scalability and privacy.

Abstract: This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1:0.66). LLMs excelled in recall for some variants (e.g., GEMMA3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.
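The logistic-regression baseline from the comparison is easy to reproduce in outline with scikit-learn; the narratives and labels below are invented toy data, not the Kentucky corpus.

```python
# Minimal TF-IDF + logistic regression sketch for flagging secondary
# crashes from free-text narratives (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

narratives = [
    "vehicle struck queue from earlier collision on I-64",
    "single vehicle ran off road and hit a tree",
    "rear-ended stopped traffic caused by prior crash",
    "driver lost control on wet pavement, no other event",
]
labels = [1, 0, 1, 0]   # 1 = secondary crash

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(narratives, labels)
print(clf.predict(["slowed for earlier wreck and was hit from behind"]))
```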

[48] Why are LLMs’ abilities emergent?

Vladimír Havlík

Main category: cs.CL

TL;DR: The paper explores emergent properties in LLMs and DNNs, arguing they arise from complex nonlinear dynamics, not just scaling, and compares them to natural emergent phenomena.

DetailsMotivation: To understand the unexpected emergent capabilities of LLMs and DNNs, addressing the challenge of 'creation without understanding' in AI.

Method: Combines theoretical analysis and empirical observation of scaling laws, grokking, and phase transitions in DNNs.

Result: Emergent abilities stem from complex dynamics of nonlinear systems, not just parameter scaling, resembling natural emergent phenomena.

Conclusion: LLMs and DNNs should be viewed as complex dynamical systems governed by universal emergence principles, shifting focus to internal dynamic transformations.

Abstract: The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of “creation without understanding” that characterises contemporary AI development. We explore how the neural approach’s reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

[49] What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel Fernando Garcia Contreras, Koichiro Yoshino

Main category: cs.CL

TL;DR: The paper explores human selective listening to improve ASR evaluation for spoken dialogue systems.

DetailsMotivation: To identify ASR capabilities needed for SDSs by studying human selective listening during dialogue response generation.

Method: Experimental comparison of human transcriptions for dialogue responses and reference transcriptions.

Result: Confirmed human selective listening behavior, suggesting a gap between ASR and human transcription abilities.

Conclusion: Proposes a new ASR evaluation method leveraging human selective listening to bridge the gap between ASR and human performance.

Abstract: Spoken dialogue systems (SDSs) utilize automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to appropriately recognize information in user speech that is relevant to response generation. Examining selective listening in humans, which refers to the ability to focus on and listen to the important parts of a conversation during speech, will enable us to identify the ASR capabilities required for SDSs and evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions for generating dialogue responses against reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening, which can identify the gap in transcription ability between ASR systems and humans.

[50] Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model

Kiyotada Mori, Seiya Kawano, Angel Fernando Garcia Contreras, Koichiro Yoshino

Main category: cs.CL

TL;DR: The paper proposes a Prediction Confidence Model (PCM) to decide if prefetching dialogue responses is feasible by estimating semantic similarity between predicted and actual user utterances, aiming to reduce user-perceived latency.

DetailsMotivation: To minimize user-perceived latency (UPL) in spoken dialogue systems by predicting complete user utterances early and prefetching responses.

Method: Introduces a PCM that estimates semantic similarity between predicted and actual user utterances to determine prefetching feasibility.

Result: The PCM was evaluated on the differences between predicted and actual complete user utterances.

Conclusion: The PCM effectively aids in reducing UPL by improving the accuracy of prefetching decisions.

Abstract: Prefetching of dialogue responses has been investigated to reduce user-perceived latency (UPL), which refers to the user’s waiting time before receiving the system’s response, in spoken dialogue systems. To reduce the UPL, it is necessary to predict complete user utterances before the end of the user’s speech, typically with language models, to prepare prefetched dialogue responses. In this study, we propose a prediction confidence model (PCM) that determines whether prefetching is feasible by estimating the semantic similarity between the predicted complete user utterance and the actual complete user utterance. We evaluate our PCM based on the differences between the predicted and actual complete user utterances.
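
To make the decision rule concrete, here is a minimal sketch of the offline labeling step such a confidence model could be trained against, assuming a generic sentence encoder; the `all-MiniLM-L6-v2` checkpoint and the 0.85 threshold are illustrative stand-ins, not the paper's setup:

```python
# Sketch: offline labeling for a prefetch-confidence model (PCM).
# Assumes sentence-transformers is installed; encoder and threshold
# are illustrative, not the paper's exact configuration.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def prefetch_label(predicted: str, complete: str, threshold: float = 0.85) -> bool:
    """True if the predicted completion is semantically close enough to the
    user's actual complete utterance that prefetching would have been safe."""
    emb = encoder.encode([predicted, complete], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(prefetch_label("I'd like to book a table for two tonight",
                     "I'd like to book a table for two tonight at seven"))
```

At run time a PCM has to estimate this similarity from the partial utterance and its predicted completion alone, since the actual complete utterance is not yet available.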

[51] Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

Main category: cs.CL

TL;DR: The paper introduces the Customer Support Conversation (CSC) task to improve customer service interactions using structured strategies. It proposes a framework, creates datasets (CSConv and RoleCS), and shows LLM fine-tuning enhances response quality.

DetailsMotivation: Existing dialogue datasets lack strategic guidance, and real-world service data is hard to access and annotate. The goal is to train agents for better, strategy-aligned responses.

Method: A structured CSC framework based on COPC guidelines is introduced, with five stages and twelve strategies. Datasets (CSConv and RoleCS) are created using LLMs for annotation and training.

Result: Fine-tuning LLMs on RoleCS improves strategy-aligned response generation. Human evaluations confirm better problem resolution.

Conclusion: The CSC framework and datasets effectively enhance customer support interactions, with LLM fine-tuning proving beneficial.

Abstract: Effective customer support requires not only accurate problem solving but also structured and empathetic communication aligned with professional standards. However, existing dialogue datasets often lack strategic guidance, and real-world service data is difficult to access and annotate. To address this, we introduce the task of Customer Support Conversation (CSC), aimed at training customer service agents to respond using well-defined support strategies. We propose a structured CSC framework grounded in COPC guidelines, defining five conversational stages and twelve strategies to guide high-quality interactions. Based on this, we construct CSConv, an evaluation dataset of 1,855 real-world customer-agent conversations rewritten using LLMs to reflect deliberate strategy use, and annotated accordingly. Additionally, we develop a role-playing approach that simulates strategy-rich conversations using LLM-powered roles aligned with the CSC framework, resulting in the training dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv. Human evaluations further confirm gains in problem resolution. All code and data will be made publicly available at https://github.com/aliyun/qwen-dianjin.

[52] StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu

Main category: cs.CL

TL;DR: Autoformalization translates natural-language math into formal language. ThinkingF improves accuracy by enhancing formal knowledge and reasoning, achieving SOTA results.

DetailsMotivation: Existing autoformalization methods using LLMs suffer from low accuracy due to gaps in formal knowledge and reasoning.

Method: ThinkingF uses data synthesis (two datasets: formal knowledge-rich and reasoning trajectories) and training (SFT and RLVR) to improve abilities.

Result: StepFun-Formalizer-32B achieves SOTA BEq@1 scores: 40.5% on FormalMATH-Lite and 26.7% on ProverBench.

Conclusion: ThinkingF effectively addresses autoformalization challenges, outperforming prior models.

Abstract: Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and the reasoning capability needed to understand natural-language problems and align informal statements with formal expressions. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.
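
For readers unfamiliar with the metric, BEq@1 counts how often the single top formalization is provably equivalent to the reference in both directions. A minimal accounting sketch follows, with the prover-backed equivalence check stubbed out (a real check would call a theorem prover such as Lean):

```python
# Sketch of BEq@1 accounting only; the bidirectional-equivalence check is
# a stub, since a real checker needs a theorem prover.
from typing import Callable

def beq_at_1(predictions: list[str], references: list[str],
             is_equivalent: Callable[[str, str], bool]) -> float:
    """Fraction of problems whose top-1 formalization is provably
    equivalent (in both directions) to the reference statement."""
    hits = sum(is_equivalent(p, r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["theorem t : 2 + 2 = 4", "theorem s : forall n, n + 0 = n"]
refs = ["theorem t : 4 = 2 + 2", "theorem s : forall n, 0 + n = n"]
print(beq_at_1(preds, refs, is_equivalent=lambda p, r: p == r))  # prover stubbed
```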

[53] Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su

Main category: cs.CL

TL;DR: The paper explores scalable educational assessment tools in Malaysia using GenAI, comparing four MCQ-generation pipelines for Form 1 Mathematics in Bahasa Melayu. RAG-based methods outperform non-grounded prompting, ensuring better curriculum alignment and factual accuracy.

DetailsMotivation: Addressing the need for scalable, high-quality educational tools in Malaysia, especially for low-resource languages like Bahasa Melayu, while leveraging GenAI's potential despite challenges like factual accuracy.

Method: Four pipelines for MCQ generation: non-grounded prompting (structured/basic) and RAG approaches (LangChain/manual). Grounded in curriculum documents, evaluated via STS for alignment and RAG-QA for validity.

Result: RAG-based pipelines outperform non-grounded methods, achieving higher curriculum alignment and factual validity. Trade-offs between ease (framework-based RAG) and control (manual RAG) are analyzed.

Conclusion: The study provides a validated method for curriculum-specific content generation in low-resource languages, introduces RAG-QA evaluation, and offers insights for EdTech solutions in similar regions.

Abstract: This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI’s GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.
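
As a rough illustration of the STS alignment check, the sketch below scores a generated question against an RPT curriculum entry with a multilingual sentence encoder; the checkpoint and the example Malay strings are assumptions, not the paper's exact configuration:

```python
# Illustrative STS alignment check between a generated MCQ stem and an
# RPT curriculum entry; model choice and examples are stand-ins.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def alignment_score(question: str, rpt_entry: str) -> float:
    emb = encoder.encode([question, rpt_entry], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

q = "Cari nilai x jika 2x + 3 = 11."  # generated MCQ stem (illustrative)
std = "Menyelesaikan persamaan linear dalam satu pemboleh ubah."  # RPT entry
print(alignment_score(q, std))
```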

[54] CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation

Bastien Liétard, Gabriel Loiseau

Main category: cs.CL

TL;DR: The paper introduces Concept Differentiation, an extension to Word-in-Context, to include inter-word scenarios, and proposes Concept-Aligned Embeddings (CALE) for improved lexical semantic representations.

DetailsMotivation: To address the limitation of Word-in-Context, which only compares occurrences of the same lemma, by capturing inter-word semantic relations.

Method: Proposes Concept Differentiation, creates a dataset from SemCor, and fine-tunes models (CALE) on this dataset.

Result: CALE achieves best performance in lexical semantic tasks and improves the spatial organization of embeddings.

Conclusion: CALE provides efficient multi-purpose representations of lexical meaning, demonstrating the value of the proposed extension and fine-tuning.

Abstract: Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words. Contextualized Language Models are a valuable tool for investigating them, providing context-sensitive representations of lexical meaning. Recent works like XL-LEXEME have leveraged the task of Word-in-Context to fine-tune them into more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of captured information. In this paper, we propose an extension, Concept Differentiation, to include inter-word scenarios. We provide a dataset for this task, derived from SemCor data. Then we fine-tune several representation models on this dataset. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that achieve the best performance in our experiments. We also show that CALE’s fine-tuning brings valuable changes to the spatial organization of embeddings.
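
A minimal sketch of the kind of objective Concept Differentiation implies: pairs of contextual embeddings, possibly from different lemmas, are pulled together when annotated with the same SemCor concept and pushed apart otherwise. The random tensors, margin, and use of `CosineEmbeddingLoss` are illustrative stand-ins for the paper's actual fine-tuning setup:

```python
# Toy concept-differentiation objective: same-concept occurrence pairs are
# attracted, different-concept pairs repelled. Embeddings here are random
# placeholders for the encoder's contextual representations.
import torch
import torch.nn as nn

emb_a = torch.randn(8, 768)  # contextual embeddings, occurrence A
emb_b = torch.randn(8, 768)  # contextual embeddings, occurrence B
same_concept = torch.tensor([1, 1, -1, 1, -1, -1, 1, -1])  # +1 same, -1 different

loss_fn = nn.CosineEmbeddingLoss(margin=0.2)
loss = loss_fn(emb_a, emb_b, same_concept)
print(loss)  # in real fine-tuning this backpropagates into the encoder
```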

[55] StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering

Chenglei Shen, Zhongxiang Sun, Teng Shi, Xiao Zhang, Jun Xu

Main category: cs.CL

TL;DR: StyliTruth addresses the trade-off between stylization and truthfulness in LLM responses by separating style and truth subspaces, outperforming existing methods.

DetailsMotivation: Existing representation editing methods degrade truthfulness when imposing styles, termed stylization-induced truthfulness collapse.

Method: StyliTruth uses orthogonal deflation to separate style and truth subspaces, enabling independent control via adaptive token-level steering vectors.

Result: StyliTruth reduces truthfulness collapse and balances style adherence with truthfulness better than existing methods.

Conclusion: StyliTruth effectively maintains both stylistic fidelity and truthfulness in LLM responses.

Abstract: Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model’s core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model’s representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
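
The orthogonal-deflation step can be illustrated in a few lines of linear algebra: given an (assumed known) orthonormal basis for the truth-relevant subspace, the component of a style steering vector lying in that subspace is subtracted out. This is a sketch of the geometric idea only, not the paper's per-attention-head procedure:

```python
# Deflate a style steering direction against a truth subspace so that
# steering along it leaves truth-relevant directions untouched.
import numpy as np

rng = np.random.default_rng(0)
d = 64
truth_basis = np.linalg.qr(rng.standard_normal((d, 4)))[0]  # orthonormal, d x 4
style_dir = rng.standard_normal(d)

# Subtract the projection of the style direction onto the truth subspace.
style_perp = style_dir - truth_basis @ (truth_basis.T @ style_dir)

print(np.abs(truth_basis.T @ style_perp).max())  # ~0: orthogonal to truth basis
```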

[56] Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning

Zhuang Chen, Guanqun Bi, Wen Zhang, Jiawei Hu, Aoyun Wang, Xiyao Xiao, Kun Feng, Minlie Huang

Main category: cs.CL

TL;DR: The paper introduces C-MIND, a clinical dataset for depression assessment, analyzes behavioral signatures, evaluates model performance, and improves LLM diagnostic accuracy with clinical expertise.

DetailsMotivation: To address the lack of clinically validated data and real-world effectiveness in automated depression assessment.

Method: Uses C-MIND dataset with multimodal data (audio, video, transcript, fNIRS) to analyze behavioral signatures, train classical models, and enhance LLMs with clinical expertise.

Result: Clinical expertise improves LLM diagnostic performance by up to 10% in Macro-F1 score.

Conclusion: The study provides a robust infrastructure for clinical depression assessment, combining data and algorithmic advancements for reliable mental healthcare research.

Abstract: Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose to guide the reasoning process with clinical expertise, which consistently improves LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.

[57] Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang, Qingyun Zou, Bryan Hooi, Bingsheng He

Main category: cs.CL

TL;DR: Multi-agent discussions outperform solitary ideation in generating research proposals, with cognitive diversity and expertise being key factors.

DetailsMotivation: To explore if structured multi-agent discussions can enhance creativity in scientific ideation compared to single-agent approaches.

Method: A cooperative multi-agent framework is tested with variations in group size, leadership, and team composition, evaluated via agent-based scoring and human review.

Result: Multi-agent discussions significantly outperform solitary ideation, with leadership and cognitive diversity improving proposal quality. Expertise is essential for success.

Conclusion: Multi-agent systems with diverse and expert teams yield better creative outcomes, offering insights for designing collaborative AI ideation tools.

Abstract: While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.

[58] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis

Main category: cs.CL

TL;DR: MASA introduces structured weight sharing across transformer layers, reducing attention parameters by 66.7% without performance loss, outperforming existing methods.

DetailsMotivation: High computational and memory demands of LLMs limit deployment; inter-block redundancy in transformers is underexplored.

Method: Decomposes attention projection matrices into shared dictionary atoms, enabling weight sharing across layers with standard training.

Result: Achieves better accuracy and perplexity than baselines, scales well, and works for Vision Transformers with similar efficiency.

Conclusion: MASA provides a scalable, parameter-efficient solution for transformers without compromising performance, applicable to pretrained LLMs.

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy, a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module’s parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement, trained with standard optimizers, and represents each layer’s weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.
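
To make the weight-sharing scheme concrete, here is a toy PyTorch rendering of the core idea, W_l = sum_k c_{l,k} A_k: one shared bank of matrix atoms plus a small per-layer coefficient vector replaces a stack of independent projection matrices. Dimensions and atom count are illustrative, not the paper's settings:

```python
# Toy matrix-atom sharing: every layer's query projection is a learned
# linear combination of a small shared bank of matrix atoms.
import torch
import torch.nn as nn

class SharedAtomProjection(nn.Module):
    def __init__(self, n_layers: int, n_atoms: int, d_model: int):
        super().__init__()
        # One bank of atoms shared by all layers...
        self.atoms = nn.Parameter(torch.randn(n_atoms, d_model, d_model) * 0.02)
        # ...and a tiny per-layer coefficient vector.
        self.coeffs = nn.Parameter(torch.randn(n_layers, n_atoms) * 0.02)

    def weight(self, layer: int) -> torch.Tensor:
        # W_layer = sum_k coeffs[layer, k] * atoms[k]
        return torch.einsum("k,kio->io", self.coeffs[layer], self.atoms)

    def forward(self, layer: int, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight(layer).T

proj = SharedAtomProjection(n_layers=12, n_atoms=4, d_model=64)
x = torch.randn(2, 10, 64)
print(proj(layer=3, x=x).shape)  # torch.Size([2, 10, 64])
```

With 12 layers, 4 atoms of size 64x64 plus 12x4 coefficients replace 12 full 64x64 matrices, which is where the parameter savings come from.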

[59] TURA: Tool-Augmented Unified Retrieval Agent for AI Search

Zhejun Zhao, Yuehu Dong, Alley Liu, Lixue Zheng, Pingsheng Liu, Dongdong Shen, Long Xia, Jiashu Zhao, Dawei Yin

Main category: cs.CL

TL;DR: TURA bridges the gap between static RAG and dynamic information by combining RAG with agentic tool-use for real-time AI search.

DetailsMotivation: Traditional RAG struggles with real-time and structured queries, limiting search engines to static content. TURA addresses this by integrating dynamic sources like databases and APIs.

Method: TURA uses a three-stage framework: Intent-Aware Retrieval, DAG-based Task Planner, and Distilled Agent Executor for efficient tool calling.

Result: TURA successfully combines static and dynamic information, serving millions with low-latency, real-time answers.

Conclusion: TURA is a pioneering framework for AI search, effectively merging static and dynamic data retrieval for industrial-scale applications.

Abstract: The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
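
The DAG-based planning stage can be illustrated with a toy scheduler: at each step, every task whose prerequisites are complete is dispatched in parallel, which is what lets independent tool calls overlap and keeps latency low. Task names and the graph below are invented for illustration:

```python
# Toy DAG-scheduled tool execution: tasks whose dependencies are all
# satisfied run concurrently; dependent tasks wait for their inputs.
from concurrent.futures import ThreadPoolExecutor

dag = {  # task -> set of prerequisite tasks
    "search_flights": set(),
    "search_hotels": set(),
    "rank_bundles": {"search_flights", "search_hotels"},
}

def run(task: str) -> str:
    return f"{task}: done"  # stand-in for an MCP tool call

done: set[str] = set()
with ThreadPoolExecutor() as pool:
    while len(done) < len(dag):
        ready = [t for t, deps in dag.items() if t not in done and deps <= done]
        for result in pool.map(run, ready):  # independent tasks run in parallel
            print(result)
        done.update(ready)
```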

[60] Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider

Chirag Seth, Utkarsh Singh

Main category: cs.CL

TL;DR: The study evaluates lightweight transformer models (T5-Small, BART-Small, GPT-2) for text-to-SQL translation on the Spider dataset, with T5-Small achieving the best performance.

DetailsMotivation: To enable non-expert users to query databases using natural language, especially in low-resource settings.

Method: Developed a reusable pipeline for schema formatting, trained models (1000-5000 iterations), and evaluated using LFAcc, BLEU, and EM metrics.

Result: T5-Small outperformed others with 27.8% LFAcc, showing encoder-decoder models’ superiority.

Conclusion: Compact transformers like T5-Small are promising for text-to-SQL in resource-scarce environments, with potential for future enhancements.

Abstract: Text-to-SQL translation enables non-expert users to query relational databases using natural language, with applications in education and business intelligence. This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset, focusing on low-resource settings. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model’s architecture, training them across 1000 to 5000 iterations and evaluating on 1000 test samples using Logical Form Accuracy (LFAcc), BLEU, and Exact Match (EM) metrics. Fine-tuned T5-Small achieves the highest LFAcc (27.8%), outperforming BART-Small (23.98%) and GPT-2 (20.1%), highlighting encoder-decoder models’ superiority in schema-aware SQL generation. Despite resource constraints limiting performance, our pipeline’s modularity supports future enhancements, such as advanced schema linking or alternative base models. This work underscores the potential of compact transformers for accessible text-to-SQL solutions in resource-scarce environments.
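
The schema-formatting idea reduces to serializing the table definitions into the model's input text. A minimal sketch with the public `t5-small` checkpoint follows; since these are not the authors' fine-tuned weights, the raw output here will not be valid SQL without training:

```python
# Sketch of schema-aware input formatting and generation for text-to-SQL.
# Uses the public t5-small checkpoint, not the paper's fine-tuned model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

question = "How many singers are older than 40?"
schema = "singer(singer_id, name, age) | concert(concert_id, singer_id, year)"
prompt = f"translate to SQL: {question} | schema: {schema}"

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```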

[61] P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis

Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang

Main category: cs.CL

TL;DR: P-Aligner, a lightweight module, improves LLM instruction alignment by generating human-preferred instructions, outperforming baselines with significant win-rate gains.

DetailsMotivation: LLMs often fail to align with safe, helpful, and honest values due to flawed instructions, necessitating cost-effective pre-alignment solutions.

Method: P-Aligner, trained on the UltraPrompt dataset synthesized via Monte-Carlo Tree Search, generates human-preferred instructions while preserving intent.

Result: P-Aligner achieves average win-rate gains of 28.35% on GPT-4-turbo and 8.69% on Gemma-2-SimPO, outperforming baselines.

Conclusion: P-Aligner is an efficient and effective solution for preference alignment, validated by data quality, search strategies, and deployment analyses.

Abstract: Large Language Models (LLMs) are expected to produce safe, helpful, and honest content during interaction with human users, but they frequently fail to align with such values when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone, leaving substantial room for improvement along multiple dimensions. A cost-effective yet high-impact way is to pre-align instructions before the model begins decoding. Existing approaches either rely on prohibitive test-time search costs or end-to-end model rewrite, which is powered by a customized training corpus with unclear objectives. In this work, we demonstrate that the goal of efficient and effective preference alignment can be achieved by P-Aligner, a lightweight module generating instructions that preserve the original intents while being expressed in a more human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree Search, which systematically explores the space of candidate instructions that are closely tied to human preference. Experiments across different methods show that P-Aligner generally outperforms strong baselines across various models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness and efficiency through multiple perspectives, including data quality, search strategies, iterative deployment, and time overhead.

[62] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen

Main category: cs.CL

TL;DR: RLVR improves LLMs’ instruction-following but faces inefficiency and over-optimization. IFDecorator enhances RLVR with a robust pipeline, achieving high accuracy and reducing reward hacking.

DetailsMotivation: Address training inefficiency and over-optimization in RLVR for better instruction-following in LLMs.

Method: Introduces IFDecorator with a cooperative-adversarial data flywheel, IntentCheck for alignment, and trip wires for detecting reward hacking.

Result: Achieves 87.43% accuracy on IFEval, outperforms GPT-4o, and reduces reward hacking.

Conclusion: IFDecorator effectively improves RLVR training, ensuring intent alignment and efficiency.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.
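
A toy illustration of the trip-wire idea: a trap instruction carries a constraint that shortcut-exploiting policies tend to ignore, and any response that would otherwise be rewarded but violates the constraint is flagged. The specific trap and predicate here are invented, not taken from the paper:

```python
# Toy trip-wire check: the trap instruction demands exactly three sentences,
# a constraint a keyword-only verifier would never test. Trap and predicate
# are invented for illustration.
def trip_wire(response: str) -> bool:
    """Return True when the trap constraint is violated, i.e. the response
    passed the shortcut verifier but ignored the actual instruction."""
    sentences = [s for s in response.split(".") if s.strip()]
    return len(sentences) != 3  # True -> likely reward hacking

hacked = trip_wire("Short keyword-stuffed reply. It ignores the length rule.")
print("reward hacking suspected:", hacked)
```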

[63] Can NLP Tackle Hate Speech in the Real World? Stakeholder-Informed Feedback and Survey on Counterspeech

Tanvi Dinkar, Aiqi Jiang, Simona Frenda, Poppy Gerrard-Abbott, Nancie Gunson, Gavin Abercrombie, Ioannis Konstas

Main category: cs.CL

TL;DR: The paper reviews 74 NLP studies on counterspeech, highlighting a shift from stakeholder collaboration to automated methods, and identifies a gap between research and community needs.

DetailsMotivation: To analyze the impact of stakeholder participation in counterspeech research and address the disconnect between NLP methods and affected communities.

Method: Systematic review of 74 NLP studies and a participatory case study with five NGOs specializing in online Gender-Based Violence (oGBV).

Result: Findings show a growing disconnect between NLP research and community needs, with stakeholder input often lacking in dataset creation and model development.

Conclusion: The paper recommends re-centering stakeholder expertise in counterspeech research to better address real-world impacts.

Abstract: Counterspeech, i.e. the practice of responding to online hate speech, has gained traction in NLP as a promising intervention. While early work emphasised collaboration with non-governmental organisation stakeholders, recent research trends have shifted toward automated pipelines that reuse a small set of legacy datasets, often without input from affected communities. This paper presents a systematic review of 74 NLP studies on counterspeech, analysing the extent to which stakeholder participation influences dataset creation, model development, and evaluation. To complement this analysis, we conducted a participatory case study with five NGOs specialising in online Gender-Based Violence (oGBV), identifying stakeholder-informed practices for counterspeech generation. Our findings reveal a growing disconnect between current NLP research and the needs of communities most impacted by toxic online content. We conclude with concrete recommendations for re-centring stakeholder expertise in counterspeech research.

[64] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D’Oosterlinck, Christopher Potts, Omar Khattab

Main category: cs.CL

TL;DR: mmGRPO extends GRPO for multi-module AI systems, improving accuracy by 11% over post-trained LMs and 5% over prompt optimization alone.

DetailsMotivation: AI systems now use modular programs with multiple LM calls, but GRPO's effectiveness in such setups is unclear.

Method: mmGRPO generalizes GRPO by grouping LM calls by module, handling variable-length and interrupted trajectories.

Result: mmGRPO boosts accuracy by 11% on tasks like classification and search, outperforming prompt optimization alone by 5%.

Conclusion: mmGRPO, integrated into DSPy as dspy.GRPO, effectively enhances modular AI systems.

Abstract: Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
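
The grouping step can be sketched directly: LM calls are bucketed by the module that issued them across rollouts, each call inherits its rollout's scalar reward, and advantages are normalized within each module's group. The data layout and reward values below are invented for illustration:

```python
# Sketch of module-wise grouping for group-relative advantages: calls are
# grouped by issuing module across rollouts, then normalized per group.
from collections import defaultdict
from statistics import mean, stdev

rollouts = [  # each rollout: final reward + its (module, call_id) trace
    {"reward": 1.0, "calls": [("retrieve", 0), ("answer", 1)]},
    {"reward": 0.0, "calls": [("retrieve", 2), ("answer", 3)]},
    {"reward": 0.5, "calls": [("retrieve", 4), ("retrieve", 5), ("answer", 6)]},
]

groups = defaultdict(list)  # module -> [(call_id, reward), ...]
for ro in rollouts:
    for module, call_id in ro["calls"]:
        groups[module].append((call_id, ro["reward"]))

advantages = {}
for module, items in groups.items():
    rewards = [r for _, r in items]
    mu, sigma = mean(rewards), stdev(rewards) or 1.0
    for call_id, r in items:
        advantages[call_id] = (r - mu) / sigma  # group-relative advantage

print(advantages)
```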

[65] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu

Main category: cs.CL

TL;DR: Sculptor is a framework enabling LLMs to actively manage long contexts using tools like fragmentation, summarization, and search, improving performance without additional training.

DetailsMotivation: LLMs struggle with long contexts due to proactive interference, where irrelevant early information disrupts reasoning. Existing solutions focus on external memory, but internal context management is overlooked.

Method: Introduces Sculptor, a framework with three tools: context fragmentation, summary/hide/restore, and intelligent search, to help LLMs manage attention and working memory.

Result: Sculptor significantly improves performance on benchmarks (PI-LLM and NeedleBench) without specific training, leveraging LLMs’ tool-calling generalization.

Conclusion: Active context management (ACM) via Sculptor mitigates proactive interference and enhances reasoning, showing that explicit context-control strategies are crucial for robustness in long-context tasks.

Abstract: Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs’ capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on the information-sparse benchmarks PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs’ inherent tool-calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.
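
As a rough sketch of what the three tool families might look like as callable functions (names mirror the paper's categories, but the bodies are placeholder logic, not the released implementation):

```python
# Placeholder stubs for the three tool families; real implementations
# would mutate the model's working context via tool calls.
def fragment_context(context: str, max_chars: int = 2000) -> list[str]:
    """Split a long context into addressable fragments."""
    return [context[i:i + max_chars] for i in range(0, len(context), max_chars)]

def hide_fragment(fragments: list[str], idx: int, memo: dict[int, str]) -> None:
    """Replace a fragment with a short summary the model can later restore."""
    memo[idx] = fragments[idx][:80] + "..."  # stand-in for an LLM summary
    fragments[idx] = f"[hidden fragment {idx}: {memo[idx]}]"

def search_context(fragments: list[str], query: str) -> list[int]:
    """Return indices of fragments matching a query (keyword stand-in)."""
    return [i for i, f in enumerate(fragments) if query.lower() in f.lower()]

frags = fragment_context("long transcript segment. " * 200, max_chars=500)
memo: dict[int, str] = {}
hide_fragment(frags, 0, memo)
print(search_context(frags, "transcript")[:3])
```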

[66] GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen

Main category: cs.CL

TL;DR: Proposes General Sample Replay (GeRe) to mitigate catastrophic forgetting in continual learning for LLMs, using pretraining texts and a threshold-based margin loss for activation state consistency.

DetailsMotivation: Address catastrophic forgetting in continual fine-tuning of LLMs, which harms general capabilities and task performance.

Method: Introduces GeRe with a threshold-based margin (TM) loss to maintain activation state consistency during replay learning.

Result: A small set of general replay samples suffices to retain general capabilities and improve task performance; TM outperforms other replay strategies.

Conclusion: GeRe with TM loss offers a stable and efficient solution for continual learning in LLMs, enhancing robustness and performance.

Abstract: The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses ordinary pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state constrained optimization method using a threshold-based margin (TM) loss, which maintains activation-state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns: retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay in the continual learning of LLMs. Our code and data are available at https://github.com/Qznan/GeRe.
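
A plain reading of the threshold-based margin idea is a hinge penalty on activation drift: deviations from the frozen reference model's activations on replay samples cost nothing inside a tolerance tau and grow linearly outside it. The exact formulation in the paper may differ; this is a minimal sketch:

```python
# Minimal sketch of a threshold-based margin (TM) penalty on activations;
# the paper's precise formulation may differ.
import torch
import torch.nn.functional as F

def tm_loss(h_current: torch.Tensor, h_reference: torch.Tensor,
            tau: float = 0.1) -> torch.Tensor:
    # Zero penalty inside the margin, linear outside it.
    return F.relu((h_current - h_reference).abs() - tau).mean()

h_ref = torch.randn(4, 512)                 # activations from the frozen model
h_cur = h_ref + 0.05 * torch.randn(4, 512)  # drifted activations during replay
print(tm_loss(h_cur, h_ref))                # near zero: drift within margin
```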

[67] FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data

Thibaut Thonet, Germán Kruszewski, Jos Rozen, Pierre Erbacher, Marc Dymetman

Main category: cs.CL

TL;DR: The paper addresses the challenge of personalizing LLM-powered conversational assistants with limited user preference data (PPALLI), introduces two datasets (DnD and ELIP), benchmarks alignment techniques, and proposes FaST, a parameter-efficient method that outperforms others.

DetailsMotivation: Current LLM-powered assistants lack personalization, failing to align with individual user preferences. The work aims to tackle this gap, especially in scenarios with limited user preference annotations.

Method: The study benchmarks various alignment techniques on two new datasets (DnD and ELIP) and introduces FaST, a parameter-efficient approach leveraging high-level features from the data.

Result: FaST achieves the best overall performance among the tested alignment techniques.

Conclusion: Personalization with limited data (PPALLI) is feasible, and FaST offers an effective solution for aligning LLMs to user preferences efficiently.

Abstract: LLM-powered conversational assistants are often deployed in a one-size-fits-all manner, which fails to accommodate individual user preferences. Recently, LLM personalization – tailoring models to align with specific user preferences – has gained increasing attention as a way to bridge this gap. In this work, we specifically focus on a practical yet challenging setting where only a small set of preference annotations can be collected per user – a problem we define as Personalized Preference Alignment with Limited Data (PPALLI). To support research in this area, we introduce two datasets – DnD and ELIP – and benchmark a variety of alignment techniques on them. We further propose FaST, a highly parameter-efficient approach that leverages high-level features automatically discovered from the data, achieving the best overall performance.

[68] Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan

Main category: cs.CL

TL;DR: The paper investigates why reasoning models hallucinate more than general language models, focusing on multi-hop question answering tasks. It introduces a new error categorization framework and uncovers hidden error patterns.

DetailsMotivation: To understand the reasoning failures of language models, particularly in multi-hop question answering, and improve their fidelity and robustness.

Method: Systematic exploration using a novel error categorization framework (hops, coverage, overthinking) and human annotation with automated metrics.

Result: Reveals intricate error patterns not captured by accuracy-centric evaluations, highlighting cognitive inefficiencies.

Conclusion: Provides actionable insights to enhance reasoning fidelity and transparency in future language models.

Abstract: The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved (“hops”), completeness in capturing relevant information (“coverage”), and cognitive inefficiency (“overthinking”). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.

[69] How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions

Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah

Main category: cs.CL

TL;DR: The study examines whether LLMs adapt responses based on cultural values of users’ countries, finding they recognize but don’t always uphold these values, and proposes recommendations for culturally sensitive training.

DetailsMotivation: To understand if LLMs reflect diverse cultural values in responses, given human diversity and the importance of cultural sensitivity.

Method: Prompted LLMs with advice requests using Hofstede Cultural Dimensions, incorporating personas from 36 countries and their languages.

Result: LLMs recognize cultural differences but inconsistently uphold values in advice, failing to adapt responses based on cultural nuances.

Conclusion: Recommendations for training culturally sensitive LLMs are provided, alongside a framework to address culture and language alignment issues.

Abstract: Large Language Models (LLMs) attempt to imitate human behavior by responding to humans in a way that pleases them, including by adhering to their values. However, humans come from diverse cultures with different values. It is critical to understand whether LLMs showcase different values to the user based on the stereotypical values of a user’s known country. We prompt different LLMs with a series of advice requests based on 5 Hofstede Cultural Dimensions – a quantifiable way of representing the values of a country. Throughout each prompt, we incorporate personas representing 36 different countries and, separately, languages predominantly tied to each country to analyze the consistency in the LLMs’ cultural understanding. Through our analysis of the responses, we found that LLMs can differentiate between one side of a value and another, as well as understand that countries have differing values, but will not always uphold the values when giving advice, and fail to understand the need to answer differently based on different cultural values. Rooted in these findings, we present recommendations for training value-aligned and culturally sensitive LLMs. More importantly, the methodology and the framework developed here can help further understand and mitigate culture and language alignment issues with LLMs.

[70] Fairness Definitions in Language Models Explained

Avash Palikhe, Zichong Wang, Zhipeng Yin, Wenbin Zhang

Main category: cs.CL

TL;DR: The paper surveys fairness definitions in Language Models (LMs), proposes a taxonomy, and demonstrates practical implications through experiments.

DetailsMotivation: Address the confusion and lack of agreement on fairness definitions in LMs, which hinder progress and real-world adoption.

Method: Conducts a systematic survey, introduces a taxonomy for fairness notions based on transformer architectures (encoder-only, decoder-only, encoder-decoder), and validates definitions via experiments.

Result: Provides a clear overview of fairness definitions, a novel taxonomy, and experimental insights into their practical implications.

Conclusion: The paper clarifies fairness in LMs, highlights research challenges, and aims to inspire further advancements in the field.

Abstract: Language Models (LMs) have demonstrated exceptional performance across various Natural Language Processing (NLP) tasks. Despite these advancements, LMs can inherit and amplify societal biases related to sensitive attributes such as gender and race, limiting their adoption in real-world applications. Therefore, fairness has been extensively explored in LMs, leading to the proposal of various fairness notions. However, the lack of clear agreement on which fairness definition to apply in specific contexts and the complexity of understanding the distinctions between these definitions can create confusion and impede further progress. To this end, this paper proposes a systematic survey that clarifies the definitions of fairness as they apply to LMs. Specifically, we begin with a brief introduction to LMs and fairness in LMs, followed by a comprehensive, up-to-date overview of existing fairness notions in LMs and the introduction of a novel taxonomy that categorizes these concepts based on their transformer architecture: encoder-only, decoder-only, and encoder-decoder LMs. We further illustrate each definition through experiments, showcasing their practical implications and outcomes. Finally, we discuss current research challenges and open questions, aiming to foster innovative ideas and advance the field. The repository is publicly available online at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/definitions.

[71] Parse Trees Guided LLM Prompt Compression

Wenhao Mao, Chengbin Hou, Tianyu Zhang, Xinyu Lin, Ke Tang, Hairong Lv

Main category: cs.CL

TL;DR: PartPrompt is a novel selective prompt compression method using linguistic rules and global tree structures to optimize prompt length without losing coherence, achieving state-of-the-art performance.

DetailsMotivation: Long prompts for LLMs increase computational costs and may exceed input limits. Existing compression methods either hallucinate or ignore linguistic rules and global structure.

Method: PartPrompt parses sentences into trees, calculates local entropy, organizes them hierarchically, propagates node values, and prunes the tree recursively.

Result: PartPrompt outperforms others across datasets, metrics, compression ratios, and LLMs, with strong coherence and handling of long prompts.

Conclusion: PartPrompt effectively compresses prompts while maintaining quality, validated by ablation studies and outperforming alternatives.

Abstract: Offering rich contexts to Large Language Models (LLMs) has been shown to boost performance in various tasks, but the resulting longer prompt would increase the computational cost and might exceed the input limit of LLMs. Recently, some prompt compression methods have been suggested to shorten the length of prompts by using language models to generate shorter prompts or by developing computational models to select important parts of the original prompt. The generative compression methods can suffer from issues like hallucination, while the selective compression methods have not incorporated linguistic rules and overlook the global structure of the prompt. To this end, we propose a novel selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules, and calculates local information entropy for each node in the parse tree. These local parse trees are then organized into a global tree according to hierarchical structure, such as the dependency of sentences, paragraphs, and sections. After that, root-ward propagation and leaf-ward propagation are proposed to adjust node values over the global tree. Finally, a recursive algorithm is developed to prune the global tree based on the adjusted node values. The experiments show that PartPrompt achieves state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs for inference. In-depth ablation studies confirm the effectiveness of the designs in PartPrompt, and additional experiments demonstrate its superiority in terms of the coherence of compressed prompts and in extremely long prompt scenarios.
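
The local information-entropy scoring can be sketched with any small causal LM: each token's surprisal is its negative log-probability, and spans with low total surprisal are the cheapest to prune. The tree construction and value-propagation steps of PartPrompt are omitted here; `gpt2` is just a convenient stand-in scorer:

```python
# Sketch of local information scoring for pruning: a small causal LM
# assigns each token a surprisal; low-information spans are pruned first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def surprisal(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    token_logp = logp.gather(1, ids[0, 1:, None]).squeeze(1)
    return -token_logp.sum().item()  # total information content of the span

spans = ["The results are shown in Table 2.", "Set the learning rate to 3e-4."]
print(sorted(spans, key=surprisal))  # lowest-information span first
```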

[72] Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated

Tiffany Zhu, Iain Weissburg, Kexun Zhang, William Yang Wang

Main category: cs.CL

TL;DR: Humans favor content labeled as human-generated over AI-generated, even when labels are swapped, revealing a bias against AI despite similar quality.

DetailsMotivation: To explore how bias affects perception of AI vs. human-generated content and its societal implications.

Method: Three experiments involving text rephrasing, news summarization, and persuasive writing with human raters evaluating labeled/unlabeled content.

Result: Raters couldn’t distinguish AI/human content blindly but preferred human-labeled content by 30%, even with swapped labels.

Conclusion: Human bias undervalues AI performance, highlighting judgment limitations and the need for better human-AI collaboration.

Abstract: As AI advances in text generation, human trust in AI generated content remains constrained by biases that go beyond concerns of accuracy. This study explores how bias shapes the perception of AI versus human generated content. Through three experiments involving text rephrasing, news article summarization, and persuasive writing, we investigated how human raters respond to labeled and unlabeled content. While the raters could not differentiate the two types of texts in the blind test, they overwhelmingly favored content labeled as “Human Generated,” over those labeled “AI Generated,” by a preference score of over 30%. We observed the same pattern even when the labels were deliberately swapped. This human bias against AI has broader societal and cognitive implications, as it undervalues AI performance. This study highlights the limitations of human judgment in interacting with AI and offers a foundation for improving human-AI collaboration, especially in creative fields.

[73] A Survey of Conversational Search

Fengran Mo, Kelong Mao, Ziliang Zhao, Hongjin Qian, Haonan Chen, Yiruo Cheng, Xiaoxi Li, Yutao Zhu, Zhicheng Dou, Jian-Yun Nie

Main category: cs.CL

TL;DR: The paper surveys conversational search, an AI-driven evolution of traditional search engines, focusing on advancements, key components, and future directions, with emphasis on large language models (LLMs).

DetailsMotivation: To explore how conversational search, leveraging NLP and LLMs, enhances user interactions and information retrieval compared to traditional keyword-based systems.

Method: Examines key components like query reformulation, search clarification, conversational retrieval, and response generation, alongside LLM integration.

Result: Highlights the potential of conversational search for complex queries and multi-turn interactions, with insights into real-world applications and evaluations.

Conclusion: Identifies challenges and opportunities in conversational search, guiding future research and development in the field.

Abstract: As a cornerstone of modern information access, search engines have become indispensable in everyday life. With the rapid advancements in AI and natural language processing (NLP) technologies, particularly large language models (LLMs), search engines have evolved to support more intuitive and intelligent interactions between users and systems. Conversational search, an emerging paradigm for next-generation search engines, leverages natural language dialogue to facilitate complex and precise information retrieval, thus attracting significant attention. Unlike traditional keyword-based search engines, conversational search systems enhance user experience by supporting intricate queries, maintaining context over multi-turn interactions, and providing robust information integration and processing capabilities. Key components such as query reformulation, search clarification, conversational retrieval, and response generation work in unison to enable these sophisticated interactions. In this survey, we explore the recent advancements and potential future directions in conversational search, examining the critical modules that constitute a conversational search system. We highlight the integration of LLMs in enhancing these systems and discuss the challenges and opportunities that lie ahead in this dynamic field. Additionally, we provide insights into real-world applications and robust evaluations of current conversational search systems, aiming to guide future research and development in conversational search.

[74] AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context

Naba Rizvi, Harper Strickland, Daniel Gitelman, Tristan Cooper, Alexis Morales-Flores, Michael Golden, Aekta Kallepalli, Akshat Alurkar, Haaset Owens, Saleha Ahmedi, Isha Khirwadkar, Imani Munyaka, Nedjma Ousidhoum

Main category: cs.CL

TL;DR: AUTALIC is the first benchmark dataset for detecting anti-autistic ableist language, highlighting the limitations of current NLP tools in this nuanced domain.

DetailsMotivation: The rise in understanding of autism and ableism reveals gaps in detecting subtle, context-dependent ableist language, which existing NLP tools often miss.

Method: AUTALIC comprises 2,400 autism-related sentences from Reddit, annotated by neurodiversity experts, with surrounding context for evaluation.

Result: Current language models, including advanced LLMs, struggle to reliably identify anti-autistic ableism or align with human judgments.

Conclusion: AUTALIC addresses a critical gap, offering a resource for inclusive NLP development and neurodiversity research.

Abstract: As our understanding of autism and ableism continues to increase, so does our understanding of ableist language towards autistic people. Such language poses a significant challenge in NLP research due to its subtle and context-dependent nature. Yet, detecting anti-autistic ableist language remains underexplored, with existing NLP tools often failing to capture its nuanced expressions. We present AUTALIC, the first benchmark dataset dedicated to the detection of anti-autistic ableist language in context, addressing a significant gap in the field. The dataset comprises 2,400 autism-related sentences collected from Reddit, accompanied by surrounding context, and is annotated by trained experts with backgrounds in neurodiversity. Our comprehensive evaluation reveals that current language models, including state-of-the-art LLMs, struggle to reliably identify anti-autistic ableism and align with human judgments, underscoring their limitations in this domain. We publicly release AUTALIC along with the individual annotations, which serve as a valuable resource for researchers working on ableism and neurodiversity, as well as those studying disagreement in annotation tasks. This dataset serves as a crucial step towards developing more inclusive and context-aware NLP systems that better reflect diverse perspectives.

[75] CLaSP: Learning Concepts for Time-Series Signals from Natural Language Supervision

Aoi Ito, Kota Dohi, Yohei Kawaguchi

Main category: cs.CL

TL;DR: CLaSP is a novel model for retrieving time-series signals using natural language queries, addressing scalability and adaptability issues of existing methods.

DetailsMotivation: Existing methods for searching time-series signals rely on sketch-based inputs or predefined dictionaries, limiting their scalability and adaptability.

Method: CLaSP employs contrastive learning to map time-series signals to natural language descriptions, leveraging large language models (LLMs) without predefined dictionaries.

Result: CLaSP achieves high accuracy in retrieving time-series patterns using natural language queries, demonstrated on TRUCE and SUSHI datasets.

Conclusion: CLaSP offers a scalable and adaptable solution for retrieving time-series signals based on natural language descriptions.

Abstract: This paper presents CLaSP, a novel model for retrieving time-series signals using natural language queries that describe signal characteristics. The ability to search time-series signals based on descriptive queries is essential in domains such as industrial diagnostics, where data scientists often need to find signals with specific characteristics. However, existing methods rely on sketch-based inputs, predefined synonym dictionaries, or domain-specific manual designs, limiting their scalability and adaptability. CLaSP addresses these challenges by employing contrastive learning to map time-series signals to natural language descriptions. Unlike prior approaches, it eliminates the need for predefined synonym dictionaries and leverages the rich contextual knowledge of large language models (LLMs). Using the TRUCE and SUSHI datasets, which pair time-series signals with natural language descriptions, we demonstrate that CLaSP achieves high accuracy in retrieving a variety of time series patterns based on natural language queries.
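
At its core, the training recipe is a CLIP-style symmetric contrastive objective between a signal encoder and a text encoder. A minimal sketch of that objective in PyTorch, assuming the two encoders already produce batch-aligned embeddings (the encoder architectures are not specified in the abstract and are left out here):

```python
import torch
import torch.nn.functional as F

def clasp_contrastive_loss(signal_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (time-series, description) pairs.

    Row i of signal_emb and row i of text_emb are a matched pair; all
    other rows in the batch act as negatives.
    """
    signal_emb = F.normalize(signal_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = signal_emb @ text_emb.T / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together, push mismatched pairs apart, both ways.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

At retrieval time, a text query is embedded once and candidate signals are ranked by cosine similarity against it.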

[76] FactEHR: A Dataset for Evaluating Factuality in Clinical Notes Using LLMs

Monica Munnangi, Akshay Swaminathan, Jason Alan Fries, Jenelle Jindal, Sanjana Narayanan, Ivan Lopez, Lucia Tu, Philip Chung, Jesutofunmi A. Omiye, Mehr Kashyap, Nigam Shah

Main category: cs.CL

TL;DR: FactEHR introduces a dataset for fact decomposition in clinical notes, revealing variability in LLM performance for factual verification in healthcare.

DetailsMotivation: Ensuring factual accuracy in LLMs for healthcare requires fine-grained fact decomposition, which is challenging due to clinical documentation's complexity.

Method: FactEHR dataset includes 2,168 clinical notes decomposed into 987,266 entailment pairs, evaluated using LLMs like Gemini-1.5-Flash and Llama-3 8B.

Result: Gemini-1.5-Flash performs better in generating accurate facts, while Llama-3 8B is less consistent, highlighting LLM limitations in clinical text.

Conclusion: Improved LLM capabilities are needed for reliable factual verification in clinical settings.

Abstract: Verifying and attributing factual claims is essential for the safe and effective use of large language models (LLMs) in healthcare. A core component of factuality evaluation is fact decomposition, the process of breaking down complex clinical statements into fine-grained atomic facts for verification. Recent work has proposed fact decomposition, which uses LLMs to rewrite source text into concise sentences conveying a single piece of information, to facilitate fine-grained fact verification. However, clinical documentation poses unique challenges for fact decomposition due to dense terminology and diverse note types, and it remains understudied. To address this gap and explore these challenges, we present FactEHR, an NLI dataset consisting of document fact decompositions for 2,168 clinical notes spanning four types from three hospital systems, resulting in 987,266 entailment pairs. We assess the generated facts on different axes, from entailment evaluation of LLMs to a qualitative analysis. Our evaluation, including review by clinicians, reveals substantial variability in LLM performance for fact decomposition. For example, Gemini-1.5-Flash consistently generates relevant and accurate facts, while Llama-3 8B produces fewer and less consistent outputs. The results underscore the need for better LLM capabilities to support factual verification in clinical text.

[77] Improved Unbiased Watermark for Large Language Models

Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang

Main category: cs.CL

TL;DR: MCmark, a multi-channel-based watermarking method, improves detectability and robustness in AI-generated text without distorting quality.

DetailsMotivation: The need to authenticate AI-generated content due to AI surpassing human text generation capabilities.

Method: Partitioning the model’s vocabulary into segments and adjusting token probabilities based on a watermark key.

Result: MCmark improves detectability by over 10% compared to existing unbiased watermarks.

Conclusion: MCmark enhances practical watermarking applications in AI-generated texts.

Abstract: As artificial intelligence surpasses human capabilities in text generation, the necessity to authenticate the origins of AI-generated content has become paramount. Unbiased watermarks offer a powerful solution by embedding statistical signals into language model-generated text without distorting the quality. In this paper, we introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark works by partitioning the model’s vocabulary into segments and promoting token probabilities within a selected segment based on a watermark key. We demonstrate that MCmark not only preserves the original distribution of the language model but also offers significant improvements in detectability and robustness over existing unbiased watermarks. Our experiments with widely-used language models demonstrate an improvement in detectability of over 10% using MCmark, compared to existing state-of-the-art unbiased watermarks. This advancement underscores MCmark’s potential in enhancing the practical application of watermarking in AI-generated texts.
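
The partition-and-promote mechanics can be illustrated in a few lines. This is a loose sketch reconstructed from the abstract, not the authors' construction: in particular, the naive additive bump below is not distribution-preserving, whereas MCmark is designed to be unbiased; the segment count, context window, and promotion strength are all assumptions.

```python
import torch

def watermark_adjust_logits(logits: torch.Tensor, key: int,
                            context_ids: list[int],
                            num_segments: int = 4,
                            delta: float = 2.0) -> torch.Tensor:
    """Illustrative multi-channel watermark step for one decoding position.

    A seed derived from the watermark key and recent context determines
    (a) a pseudo-random partition of the vocabulary into segments and
    (b) which segment's tokens get promoted. A detector re-derives the
    partition from the key and tests whether the sampled tokens
    over-populate the promoted segments.
    """
    vocab_size = logits.size(-1)
    seed = hash((key, tuple(context_ids[-4:]))) % (2**31)
    gen = torch.Generator().manual_seed(seed)
    segment_of = torch.randint(num_segments, (vocab_size,), generator=gen)
    chosen = int(torch.randint(num_segments, (1,), generator=gen))
    return logits + delta * (segment_of == chosen).float()
```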

[78] Evaluating the Robustness of Multimodal Agents Against Active Environmental Injection Attacks

Yurun Chen, Xavier Hu, Keting Yin, Juncheng Li, Shengyu Zhang

Main category: cs.CL

TL;DR: The paper introduces Active Environment Injection Attack (AEIA), a novel threat where attackers disguise malicious attacks as environmental elements to manipulate AI agents. It identifies two vulnerabilities in Android OS and proposes AEIA-MN to test MLLM-based agents, showing a 93% success rate.

DetailsMotivation: To address the overlooked security concern of AI agents detecting impostors in their environment, particularly in operating systems like Android.

Method: Analyzes the agents’ operational context, identifies AEIA vulnerabilities, and proposes AEIA-MN to exploit interaction flaws in mobile OS.

Result: Demonstrates a 93% attack success rate on the AndroidWorld benchmark, highlighting MLLM-based agents’ vulnerability.

Conclusion: AEIA poses a significant threat to AI agents, and current MLLMs are highly susceptible, underscoring the need for improved security measures.

Abstract: As researchers continue to optimize AI agents for more effective task execution within operating systems, they often overlook a critical security concern: the ability of these agents to detect “impostors” within their environment. Through an analysis of the agents’ operational context, we identify a significant threat: attackers can disguise malicious attacks as environmental elements, injecting active disturbances into the agents’ execution processes to manipulate their decision-making. We define this novel threat as the Active Environment Injection Attack (AEIA). Focusing on the interaction mechanisms of the Android OS, we conduct a risk assessment of AEIA and identify two critical security vulnerabilities: (1) Adversarial content injection in multimodal interaction interfaces, where attackers embed adversarial instructions within environmental elements to mislead agent decision-making; and (2) Reasoning gap vulnerabilities in the agent’s task execution process, which increase susceptibility to AEIA attacks during reasoning. To evaluate the impact of these vulnerabilities, we propose AEIA-MN, an attack scheme that exploits interaction vulnerabilities in mobile operating systems to assess the robustness of MLLM-based agents. Experimental results show that even advanced MLLMs are highly vulnerable to this attack, achieving a maximum attack success rate of 93% on the AndroidWorld benchmark by combining two vulnerabilities.

[79] Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation

Yue Zhou, Yi Chang, Yuan Wu

Main category: cs.CL

TL;DR: Mixup Model Merge (M3) improves model merging by using randomized linear interpolation with Beta-distributed coefficients, outperforming equal-ratio merging in performance, robustness, and efficiency.

DetailsMotivation: Existing model merging methods ignore varying contribution ratios of task-specific models, limiting merged model performance.

Method: M3 uses randomized linear interpolation in parameter space with coefficients sampled from a Beta distribution to explore diverse contribution ratios.

Result: M3 enhances merged model performance, robustness, and works well with sparsification methods like DARE.

Conclusion: M3 is a simple, effective method for merging task-specific models, offering better performance and flexibility.

Abstract: Model merging aims to integrate multiple task-specific models into a unified model that inherits the capabilities of the task-specific models, without additional training. Existing model merging methods often lack consideration of the varying contribution ratios of different task-specific models to the final merged model. In this paper, we propose Mixup Model Merge (M3), a simple yet effective method inspired by the randomized linear interpolation strategy from the Mixup data augmentation technique. M3 performs randomized linear interpolation in parameter space between two task-specific LLMs, where interpolation coefficients are sampled from a Beta distribution to explore diverse contribution ratios. This controllable randomness allows M3 to outperform standard equal-ratio merging by discovering better contribution ratio combinations. Extensive experiments show that M3 significantly (1) improves merged LLM performance across tasks, (2) enhances out-of-distribution and adversarial robustness, (3) outperforms the positive effects of the sparsification method DARE on model merging and can be further combined with DARE to achieve superior results, and (4) balances exploration efficiency and diversity in contribution ratios by tuning the Beta distribution’s shape parameters. The code is provided in the supplementary materials.
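
The merge itself is a one-line interpolation of checkpoints with a Beta-sampled coefficient. A minimal sketch, assuming one global coefficient per merge candidate (the abstract does not pin down the sampling granularity):

```python
import torch
from torch.distributions import Beta

def mixup_model_merge(state_dict_a: dict, state_dict_b: dict,
                      alpha: float = 2.0, beta: float = 2.0):
    """One M3-style merge candidate: theta = lam*theta_A + (1-lam)*theta_B
    with lam ~ Beta(alpha, beta). The Beta shape parameters control how
    far sampled ratios stray from the equal-ratio merge (lam = 0.5)."""
    lam = Beta(alpha, beta).sample().item()
    merged = {name: lam * p_a + (1.0 - lam) * state_dict_b[name]
              for name, p_a in state_dict_a.items()}
    return merged, lam
```

In practice one would draw several candidates this way and keep the merge that scores best on a validation set.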

[80] Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data

Bhawna Piryani, Jamshid Mozafari, Abdelrahman Abdallah, Antoine Doucet, Adam Jatowt

Main category: cs.CL

TL;DR: The paper analyzes how OCR errors impact multilingual QA systems, introduces a dataset (MultiOCR-QA), and evaluates LLMs under various OCR noise conditions, revealing poor performance on noisy text.

DetailsMotivation: OCR errors in historical and multilingual documents degrade QA system performance, necessitating a study on their impact.

Method: Introduces MultiOCR-QA dataset (50K QA pairs in English, French, German) with varying OCR noise levels. Evaluates state-of-the-art LLMs under different OCR error conditions.

Result: QA systems perform poorly on noisy OCR text, highlighting their vulnerability to OCR-induced errors.

Conclusion: Current QA systems lack noise resilience, underscoring the need for improved models in historical digitization.

Abstract: Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors - imperfect extraction of text, including character insertion, deletion, and substitution - can significantly impact downstream tasks like question-answering (QA). In this work, we conduct a comprehensive analysis of how OCR-induced noise affects the performance of multilingual QA systems. To support this analysis, we introduce a multilingual QA dataset, MultiOCR-QA, comprising 50K question-answer pairs across three languages: English, French, and German. The dataset is curated from OCR-ed historical documents, which include different levels and types of OCR noise. We then evaluate how different state-of-the-art large language models (LLMs) perform under different error conditions, focusing on three major OCR error types. Our findings show that QA systems are highly prone to OCR-induced errors and perform poorly on noisy OCR text. By comparing model performance on clean versus noisy texts, we provide insights into the limitations of current approaches and emphasize the need for more noise-resilient QA systems in historical digitization contexts.
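
To probe a QA system against the same three error types, one can inject synthetic OCR-style noise into clean passages. A toy sketch (MultiOCR-QA itself uses real OCR output; the confusion table here is an illustrative assumption):

```python
import random

def inject_ocr_noise(text: str, p: float = 0.05, seed: int = 0) -> str:
    """Apply the three OCR error types studied - insertion, deletion,
    and substitution - each with probability p per character."""
    rng = random.Random(seed)
    confusions = {"e": "c", "l": "1", "o": "0", "m": "rn", "s": "5"}
    out = []
    for ch in text:
        r = rng.random()
        if r < p:                                   # deletion
            continue
        if r < 2 * p:                               # insertion of a stray mark
            out.append(ch + rng.choice("~'^"))
        elif r < 3 * p and ch.lower() in confusions:
            out.append(confusions[ch.lower()])      # visually similar substitution
        else:
            out.append(ch)
    return "".join(out)
```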

[81] Assessing Agentic Large Language Models in Multilingual National Bias

Qianying Liu, Katrina Qiyao Wang, Fei Cheng, Sadao Kurohashi

Main category: cs.CL

TL;DR: The study explores cross-language biases in LLMs, focusing on reasoning-based recommendations in university applications, travel, and relocation. It reveals persistent local language bias, with GPT-4 and Sonnet reducing bias for English-speaking countries but failing in multilingual alignment.

DetailsMotivation: To address the unexplored issue of cross-language disparities in reasoning-based recommendations by LLMs and quantify biases in multilingual decision-making tasks.

Method: Testing LLMs (GPT-3.5, GPT-4, Sonnet) on decision-making tasks across languages, analyzing bias in scores, and assessing demographic and reasoning strategy impacts.

Result: Local language bias is prevalent; GPT-4 and Sonnet reduce bias for English-speaking countries but lack robust multilingual alignment.

Conclusion: The findings highlight challenges for multilingual AI agents, emphasizing the need for better alignment in applications like education.

Abstract: Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross-language biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLMs’ applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education. Code available at: https://github.com/yiyunya/assess_agentic_national_bias

[82] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han

Main category: cs.CL

TL;DR: Search-R1 uses RL to teach LLMs to autonomously generate search queries during reasoning, improving performance by up to 41% over baselines.

DetailsMotivation: LLMs often struggle to optimally interact with search engines for real-time knowledge retrieval, limiting reasoning effectiveness.

Method: Search-R1 extends RL for reasoning, using multi-turn search interactions, token masking, and outcome-based rewards.

Result: Improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over RAG baselines.

Conclusion: Search-R1 enhances LLM reasoning with autonomous search, offering insights into RL optimization and retrieval-augmented reasoning.

Abstract: Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM may lack the ability to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
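
The retrieved-token masking mentioned above keeps the RL update from reinforcing text the model merely copied from search results. A per-trajectory sketch (advantage estimation and the multi-turn rollout loop are omitted):

```python
import torch

def masked_policy_loss(token_logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       generated_mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss over one trajectory's token log-probs.

    generated_mask is 1.0 for tokens the LLM produced (reasoning steps,
    search queries, the final answer) and 0.0 for tokens inserted from
    retrieved passages; masking the latter stabilizes training because
    the policy never chose those tokens.
    """
    loss = -(token_logprobs * advantages) * generated_mask
    return loss.sum() / generated_mask.sum().clamp(min=1.0)
```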

[83] The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory

Robin Schmucker, Steven Moore

Main category: cs.CL

TL;DR: The paper explores the predictive validity of Item-Writing Flaw (IWF) rubrics for estimating IRT parameters, revealing significant links between IWFs and item difficulty/discrimination in STEM subjects.

DetailsMotivation: Traditional validation methods for test items are resource-intensive, and the predictive validity of IWF rubrics for IRT parameters is underexplored.

Method: Analyzed 7,126 STEM multiple-choice questions using a 19-criteria IWF rubric and compared results with IRT parameters.

Result: Found statistically significant relationships between IWFs and IRT parameters, with specific flaws impacting item quality differently.

Conclusion: Automated IWF analysis is a valuable supplement to traditional validation, but further research is needed for domain-general rubrics and domain-specific algorithms.

Abstract: High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a domain-general approach for evaluating test items based on textual features. This method offers a scalable, pre-deployment evaluation without requiring student data, but its predictive validity concerning empirical IRT parameters is underexplored. To address this gap, we conducted a study involving 7,126 multiple-choice questions across various STEM subjects (physical science, mathematics, and life/earth sciences). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in life/earth and physical science domains. We further observed how specific IWF criteria can impact item quality more or less severely (e.g., negative wording vs. implausible distractors) and how they might make a question more or less challenging. Overall, our findings establish automated IWF analysis as a valuable supplement to traditional validation, providing an efficient method for initial item screening, particularly for flagging low-difficulty MCQs. Our findings show the need for further research on domain-general evaluation rubrics and algorithms that understand domain-specific content for robust item validation.
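
For context, the difficulty and discrimination parameters in question are those of the standard two-parameter logistic (2PL) IRT model (shown here as the common case; the study does not state which IRT variant it fits), under which a test-taker of ability $\theta$ answers item $i$ correctly with probability

$$P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},$$

where $b_i$ is the item's difficulty (the ability level at which the success probability is 50%) and $a_i$ is its discrimination (how sharply the item separates test-takers around $b_i$).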

[84] Inside-Out: Hidden Factual Knowledge in LLMs

Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart

Main category: cs.CL

TL;DR: The paper introduces a framework to measure hidden knowledge in LLMs, showing they encode more facts internally than they express externally, with a 40% gap. Some knowledge is so hidden it never appears in outputs, revealing generation limitations.

DetailsMotivation: To quantify and demonstrate the discrepancy between the factual knowledge LLMs encode internally and what they express in outputs, addressing a gap in prior research.

Method: Proposes a formal definition of knowledge, distinguishing external (observable outputs) and internal (intermediate computations) knowledge. Applies this framework to three LLMs in closed-book QA.

Result: LLMs encode 40% more knowledge internally than externally. Some answers are never generated despite perfect internal knowledge, highlighting generation limitations.

Conclusion: Scaling test-time compute via repeated sampling is constrained by deeply hidden knowledge, limiting performance improvements in closed-book QA.

Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) put a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
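
The knowledge definition above translates directly into code: for a given question, count how often a correct answer outranks an incorrect one. A minimal sketch, with the scoring function left abstract since it is exactly what separates external from internal knowledge:

```python
from itertools import product
from typing import Callable, Sequence

def knowledge_fraction(correct: Sequence[str], incorrect: Sequence[str],
                       score: Callable[[str], float]) -> float:
    """Fraction of (correct, incorrect) answer pairs where the correct
    answer scores higher. With token-level generation probabilities as
    `score`, this measures external knowledge; with a probe over hidden
    states, internal knowledge. Hidden knowledge = internal - external."""
    pairs = list(product(correct, incorrect))
    wins = sum(score(c) > score(w) for c, w in pairs)
    return wins / len(pairs) if pairs else 0.0
```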

[85] I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets

Main category: cs.CL

TL;DR: The paper explores the internal reasoning mechanisms of LLMs like DeepSeek-R1, identifying vocabulary linked to human reasoning. Using Sparse Autoencoders (SAEs) and ReasonScore, it detects features matching uncertainty, exploratory thinking, and reflection. Steering these features improves reasoning performance (+2.2%) and trace length (+20.5%).

DetailsMotivation: To uncover the unexplored internal mechanisms behind reasoning processes in LLMs, particularly how vocabulary associated with human reasoning corresponds to specific moments in the models' functioning.

Method: Employ Sparse Autoencoders (SAEs) for sparse decomposition of activations, introduce ReasonScore to identify active features during reasoning, and perform manual/automatic interpretation of these features. Conduct steering experiments and model diffing.

Result: Features matching uncertainty, exploratory thinking, and reflection are identified. Amplifying these features boosts reasoning performance (+2.2%) and reasoning trace length (+20.5%). Model diffing confirms their presence only in reasoning-capable models.

Conclusion: The study advances mechanistic understanding of reasoning in LLMs, demonstrating the role of specific features and their impact on performance. Code is publicly available for further research.

Abstract: Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We observe reasoning LLMs consistently use vocabulary associated with human reasoning processes. We hypothesize these words correspond to specific reasoning moments within the models’ internal mechanisms. To test this hypothesis, we employ Sparse Autoencoders (SAEs), a technique for sparse decomposition of neural network activations into human-interpretable features. We introduce ReasonScore, an automatic metric to identify active SAE features during these reasoning moments. We perform manual and automatic interpretation of the features detected by our metric, and find those with activation patterns matching uncertainty, exploratory thinking, and reflection. Through steering experiments, we demonstrate that amplifying these features increases performance on reasoning-intensive benchmarks (+2.2%) while producing longer reasoning traces (+20.5%). Using the model diffing technique, we provide evidence that these features are present only in models with reasoning capabilities. Our work provides the first step towards a mechanistic understanding of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
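
The steering experiments amount to adding an SAE feature's decoder direction to the residual stream during generation. A minimal sketch (the layer choice and strength are assumptions; in practice this would run inside a forward hook):

```python
import torch

def amplify_sae_feature(hidden: torch.Tensor,
                        decoder_direction: torch.Tensor,
                        strength: float = 4.0) -> torch.Tensor:
    """Add a unit-norm SAE decoder direction to residual-stream
    activations at every position. Applied to features whose activation
    patterns match reflection or exploratory thinking, this is the kind
    of intervention that lifts reasoning scores and lengthens traces."""
    direction = decoder_direction / decoder_direction.norm()
    return hidden + strength * direction   # broadcasts over (batch, seq, dim)
```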

[86] Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

Enming Zhang, Liwen Cao, Yanru Wu, Zijie Zhao, Yang Li

Main category: cs.CL

TL;DR: HGPrompt is a dynamic framework for optimizing ensemble weights of multiple source prompts to enhance generalization in new tasks, leveraging transferability metrics and gradient conflict minimization.

DetailsMotivation: To address the challenge of naive aggregation of source prompts, which overlooks their varying contributions to target tasks, by dynamically learning optimal ensemble weights.

Method: HGPrompt uses a differentiable prompt transferability metric and a regularization strategy to minimize gradient conflicts, optimizing ensemble weights via Hessian and Fisher Information.

Result: Achieves state-of-the-art performance on the VTAB benchmark, demonstrating effective multi-source prompt transfer.

Conclusion: HGPrompt effectively learns optimal ensemble weights, enhancing generalization and stability in multi-source prompt transfer.

Abstract: Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly for resource-constrained systems. As pre-trained prompts become valuable assets, combining multiple source prompts offers a promising approach to enhance generalization for new tasks by leveraging complementary knowledge. However, naive aggregation often overlooks that different source prompts contribute differently to the target task. To address this, we propose HGPrompt, a dynamic framework that learns optimal ensemble weights. These weights are optimized by jointly maximizing an information-theoretic metric for transferability and minimizing gradient conflicts via a novel regularization strategy. Specifically, we propose a differentiable prompt transferability metric to capture the discriminability of prompt-induced features on the target task. Meanwhile, HGPrompt matches the gradient variances with respect to different source prompts based on Hessian and Fisher Information, ensuring stable and coherent knowledge transfer while suppressing gradient conflicts among them. Extensive experiments on the large-scale VTAB benchmark demonstrate the state-of-the-art performance of HGPrompt, validating its effectiveness in learning an optimal ensemble for effective multi-source prompt transfer.
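
Stripped of the transferability metric and the Hessian/Fisher regularizer, the learnable part of the framework is a softmax-weighted combination of frozen source prompts. A minimal sketch of that skeleton (the objective terms described above are what would actually train `logits`):

```python
import torch
import torch.nn as nn

class PromptEnsemble(nn.Module):
    """Learnable ensemble over K frozen source prompts.

    source_prompts has shape (K, prompt_len, dim). Only the mixing
    logits are trainable here; HGPrompt optimizes them with a
    transferability objective plus a gradient-conflict regularizer,
    both omitted from this sketch."""

    def __init__(self, source_prompts: torch.Tensor):
        super().__init__()
        self.register_buffer("prompts", source_prompts)
        self.logits = nn.Parameter(torch.zeros(source_prompts.size(0)))

    def forward(self) -> torch.Tensor:
        weights = self.logits.softmax(dim=0)          # ensemble weights
        return torch.einsum("k,kld->ld", weights, self.prompts)
```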

[87] CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine

Hanmeng Zhong, Linqing Chen, Wentao Wu, Weilei Wang

Main category: cs.CL

TL;DR: CRAB is a multilingual benchmark for evaluating biomedical curation in retrieval-augmented LLMs, highlighting performance gaps.

DetailsMotivation: Addressing the lack of reliable evaluation for curation in biomedical retrieval-augmented LLMs.

Method: Introduces CRAB, a multilingual benchmark with a citation-based metric to quantify curation performance.

Result: Reveals significant discrepancies in curation performance among mainstream LLMs.

Conclusion: Urges improvement in biomedical curation for LLMs, with CRAB dataset publicly available.

Abstract: Recent developments in Retrieval-Augmented Large Language Models (LLMs) have shown great promise in biomedical applications. However, a critical gap persists in reliably evaluating their curation ability: the process by which models select and integrate relevant references while filtering out noise. To address this, we introduce the benchmark for Curation of Retrieval-Augmented LLMs in Biomedicine (CRAB), the first multilingual benchmark tailored for evaluating the biomedical curation of retrieval-augmented LLMs, available in English, French, German, and Chinese. By incorporating a novel citation-based evaluation metric, CRAB quantifies the curation performance of retrieval-augmented LLMs in biomedicine. Experimental results reveal significant discrepancies in the curation performance of mainstream LLMs, underscoring the urgent need to improve it in the domain of biomedicine. Our dataset is available at https://huggingface.co/datasets/zhm0/CRAB.

[88] Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He

Main category: cs.CL

TL;DR: The paper proposes Meta-rater, a multi-dimensional data selection method for LLMs, improving convergence speed and downstream task performance.

DetailsMotivation: Current data selection methods for LLMs are limited by single-dimensional evaluation or redundancy-focused strategies, hindering transparency and optimization of data quality.

Method: Meta-rater integrates four quality dimensions (professionalism, readability, reasoning, cleanliness) with existing metrics using learned optimal weightings and proxy models to predict validation loss.

Result: Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, scaling to 7.2B parameters.

Conclusion: Holistic, multi-dimensional quality integration outperforms single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.

Abstract: The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
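
The proxy-model step can be pictured with a toy regression: describe each candidate data pool by its quality scores, observe the validation loss a proxy model reaches on it, and fit a model that predicts loss from scores. Everything below is synthetic stand-in data, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 64 candidate data pools scored along the four
# proposed dimensions, paired with validation losses from proxy runs.
X = rng.uniform(0.0, 1.0, size=(64, 4))
true_w = np.array([0.4, 0.2, 0.6, 0.3])               # assumed ground truth
y = 3.0 - X @ true_w + rng.normal(0.0, 0.05, size=64)

reg = LinearRegression().fit(X, y)
# The learned coefficients act as the optimal weighting over dimensions:
# rate new documents by predicted loss reduction and keep the best.
print(dict(zip(["professionalism", "readability", "reasoning", "cleanliness"],
               reg.coef_.round(3))))
```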

[89] Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study

Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki, Nicholas Sherck, Hamidreza Mahyar, Soheila Samiee

Main category: cs.CL

TL;DR: A new benchmark for evaluating compositional reasoning in LLMs for chemistry, using automated pipelines and knowledge graphs, reveals challenges even with context augmentation.

DetailsMotivation: To assess and improve the compositional reasoning capabilities of large language models (LLMs) in the chemistry domain.

Method: Developed a benchmark with curated data and automated pipelines, integrating OpenAI models and NER systems to build knowledge graphs and generate multi-hop questions.

Result: State-of-the-art LLMs struggle with multi-hop reasoning; document retrieval helps but doesn’t eliminate errors.

Conclusion: The study benchmarks LLM limitations, introduces a novel data pipeline, and advances understanding of reasoning in computational linguistics.

Abstract: In this study, we introduced a new benchmark consisting of a curated dataset and a defined evaluation process to assess the compositional reasoning capabilities of large language models within the chemistry domain. We designed and validated a fully automated pipeline, verified by subject matter experts, to facilitate this task. Our approach integrates OpenAI reasoning models with named entity recognition (NER) systems to extract chemical entities from recent literature, which are then augmented with external knowledge bases to form a comprehensive knowledge graph. By generating multi-hop questions across these graphs, we assess LLM performance in both context-augmented and non-context augmented settings. Our experiments reveal that even state-of-the-art models face significant challenges in multi-hop compositional reasoning. The results reflect the importance of augmenting LLMs with document retrieval, which can have a substantial impact on improving their performance. However, even perfect retrieval accuracy with full context does not eliminate reasoning errors, underscoring the complexity of compositional reasoning. This work not only benchmarks and highlights the limitations of current LLMs but also presents a novel data generation pipeline capable of producing challenging reasoning datasets across various domains. Overall, this research advances our understanding of reasoning in computational linguistics.

[90] Improving the fact-checking performance of language models by relying on their entailment ability

Gaurav Kumar, Debajyoti Mazumder, Ayush Garg, Jasabanta Patro

Main category: cs.CL

TL;DR: The paper proposes using language models’ entailment ability to improve fact-checking, comparing prompting and fine-tuning strategies, and reports significant performance improvements.

DetailsMotivation: Fact verification is complex due to contradictory evidence, and current NLP strategies for automated fact-checking are not robust enough.

Method: Relied on language models’ entailment ability, compared different prompting and fine-tuning strategies, and trained models with raw evidence, claim-evidence understanding, and entailed justifications.

Result: Performance improved by up to 8.20% and 16.39% for TBE-1 and TBE-2, and up to 28.57% and 44.26% for TBE-3 on LIAR-RAW and RAW-FC datasets.

Conclusion: The proposed strategy is effective, and the code repository is shared for reproducibility.

Abstract: Automated fact-checking is a crucial task in this digital age. The NLP community has been trying various strategies to build robust fact-checking systems, but we have not been very successful yet. One main reason is that fact verification is a complex process. Language models have to parse through multiple pieces of evidence, often contradicting each other, to predict a claim’s veracity. In this paper, we propose a simple yet effective strategy that relies on the entailment ability of language models to improve fact-checking performance. In addition, we compare different prompting and fine-tuning strategies, a comparison currently lacking in the literature. Some of our observations are: (i) training language models with raw evidence sentences (TBE-1) and overall claim-evidence understanding (TBE-2) resulted in improvements of up to 8.20% and 16.39% in macro-F1 on the RAW-FC dataset, and (ii) training language models with entailed justifications (TBE-3) outperformed the baselines by a huge margin (up to 28.57% and 44.26% for LIAR-RAW and RAW-FC, respectively). We have shared our code repository to reproduce the results.
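
A bare-bones version of the entailment signal can be produced with an off-the-shelf NLI model: treat each evidence sentence as the premise and the claim as the hypothesis, then aggregate. The checkpoint and the mean-pooling below are illustrative choices, not the paper's configuration:

```python
from transformers import pipeline

# Illustrative NLI checkpoint; labels are CONTRADICTION/NEUTRAL/ENTAILMENT.
nli = pipeline("text-classification", model="roberta-large-mnli")

def entailment_signal(claim: str, evidence_sentences: list[str]) -> float:
    """Average entailment probability of the claim given each evidence
    sentence. Signals like this (and LLM-written entailed
    justifications, as in TBE-3) feed the fine-tuned fact-checkers."""
    scores = []
    for premise in evidence_sentences:
        results = nli({"text": premise, "text_pair": claim}, top_k=None)
        scores.append(next(r["score"] for r in results
                           if r["label"] == "ENTAILMENT"))
    return sum(scores) / len(scores)
```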

[91] Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

Bohao Wu, Qingyun Wang, Yue Guo

Main category: cs.CL

TL;DR: The paper explores efficient and scalable methods for personalized jargon detection, comparing lightweight fine-tuning (LoRA) and personalized prompting, achieving superior performance with minimal annotated data.

DetailsMotivation: To make technical documents accessible to diverse readers by personalizing jargon detection without requiring extensive annotation or computational resources.

Method: Two strategies: (1) lightweight fine-tuning with LoRA on open-source models, and (2) personalized prompting. Hybrid approaches combining limited annotated data with unsupervised signals are also tested.

Result: The personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best baseline by 8.3%, using only 10% of annotated data.

Conclusion: The study provides a scalable, low-resource solution for personalized jargon detection, advancing user-adaptive NLP systems.

Abstract: Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation effort and computational resources due to user-specific fine-tuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retraining. To reflect realistic constraints, we also investigate hybrid approaches that combine limited annotated data with unsupervised user background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best-performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Our study is the first to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.

[92] Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

Main category: cs.CL

TL;DR: Flex-Judge is a reasoning-guided multimodal judge model that generalizes across modalities using minimal textual reasoning data, outperforming traditional methods.

DetailsMotivation: Human-generated reward signals are costly, and existing LLM evaluators lack generalization across multimodal tasks.

Method: Flex-Judge leverages textual reasoning data to generalize decision-making patterns for multimodal judgments.

Result: Flex-Judge achieves competitive or superior performance with fewer training data compared to commercial APIs and multimodal evaluators.

Conclusion: Reasoning-based text supervision is a cost-effective alternative to annotation-intensive methods, advancing scalable multimodal evaluation.

Abstract: Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly fewer text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities like molecule, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

[93] Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs

Changhao Song, Yazhou Zhang, Hui Gao, Kaiyun Huang, Peng Zhang

Main category: cs.CL

TL;DR: Emotion-o1 is an adaptive CoT framework that dynamically adjusts reasoning length for emotion tasks, improving performance and efficiency.

DetailsMotivation: Fixed-length CoT methods struggle to balance reasoning depth and efficiency for varying emotion-task complexities.

Method: Emotion-o1 uses adaptive CoT patterns distilled from an LLM, fine-tuned with supervised learning and reinforcement learning targeting accuracy, brevity, structure, and redundancy.

Result: Significant performance improvements (F1 scores: +10% Sentiment, +5% Emotion, +18% Humor, +27% Sarcasm) and outperforms advanced LLMs like Grok-3 and Claude-3.7. Reduces reasoning length by 83% compared to OpenAI-o1.

Conclusion: Emotion-o1 effectively balances reasoning depth and efficiency for emotion understanding in LLMs.

Abstract: Long chain-of-thought (CoT) reasoning has shown great promise in enhancing the emotion understanding performance of large language models (LLMs). However, current fixed-length CoT methods struggle to balance reasoning depth and efficiency. Simple tasks (e.g., sentiment classification) are over-reasoned, while complex tasks (e.g., sarcasm understanding) lack depth. To fill this gap, we present Emotion-o1, an adaptive CoT framework that dynamically adjusts reasoning length based on emotion-task complexity. Emotion-o1 is trained by distilling adaptive CoT patterns from a reasoning-oriented LLM, followed by supervised fine-tuning and reinforcement learning with a four-part reward targeting accuracy, brevity, structure, and redundancy. Experimental results on four emotion tasks highlight: (1) Emotion-o1 demonstrates significant improvements over its backbone, with F1 score increases of 10% (Sentiment), 5% (Emotion), 18% (Humor), and 27% (Sarcasm). (2) In sentiment and sarcasm tasks, our 8B model demonstrates superior performance against advanced LLMs, outperforming Grok-3 by 1.1% and Claude-3.7 by 2%. (3) The framework maintains accuracy while reducing reasoning length by 83% compared to OpenAI-o1, demonstrating effective precision-efficiency optimization. Emotion-o1 effectively balances reasoning depth and efficiency for emotion understanding in LLMs.
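
The four-part reward is only named in the abstract, so any concrete form is a guess; the sketch below shows one plausible shape, with every component definition and weight an assumption:

```python
def four_part_reward(pred: str, gold: str, trace: str,
                     w=(1.0, 0.3, 0.2, 0.3)) -> float:
    """Hypothetical reward combining the four named targets: accuracy,
    brevity, structure, and (penalized) redundancy. All formulas below
    are illustrative assumptions, not the paper's definitions."""
    accuracy = float(pred.strip() == gold.strip())
    brevity = 1.0 / (1.0 + len(trace.split()) / 100.0)   # shorter is better
    sentences = [s.strip() for s in trace.split(".") if s.strip()]
    structure = float(len(sentences) >= 2)               # crude multi-step check
    redundancy = 1.0 - len(set(sentences)) / max(len(sentences), 1)
    wa, wb, ws, wr = w
    return wa * accuracy + wb * brevity + ws * structure - wr * redundancy
```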

[94] Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

Michael Li, Nishant Subramani

Main category: cs.CL

TL;DR: The study investigates how 25 transformer-based language models encode lexical and morphological information across diverse languages, revealing consistent patterns in information organization despite model differences.

DetailsMotivation: To understand how modern large language models (LLMs) encode linguistic information, given the dominance of transformer-based models in NLP and the lack of studies on newer architectures.

Method: Linear and nonlinear classifiers predict word lemmas and inflectional features from hidden activations. Additional experiments include attention/residual analyses, steering vector tests, and intrinsic dimensionality studies.

Result: Lexical information is linear in early layers and nonlinear in later layers, while inflectional information remains uniformly accessible and linearly separable. Patterns are consistent across all tested models.

Conclusion: Transformer models organize linguistic information similarly regardless of architecture or training, suggesting these properties are fundamental for next-token prediction and learned early in pretraining.

Abstract: Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today’s language models, we investigate how 25 models - from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) - represent lexical identity and inflectional morphology across six typologically diverse languages. Using linear and nonlinear classifiers trained on hidden activations, we predict word lemmas and inflectional features layer by layer. We find that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout. Additional experiments probe the nature of these encodings: attention and residual analyses examine where within layers information can be recovered, steering vector experiments test what information can be functionally manipulated, and intrinsic dimensionality analyses explore how the representational structure evolves across layers. Remarkably, these encoding patterns emerge across all models we test, despite differences in architecture, size, and training regime (pretrained and instruction-tuned variants). This suggests that, even with substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties are important for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing
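
The probing setup is standard and easy to reproduce: collect one layer's hidden states for a set of words, then train a linear classifier to predict the lemma or an inflectional feature. A minimal sketch over NumPy arrays (extracting activations from a specific model is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def layer_probe_accuracy(activations: np.ndarray, labels: np.ndarray,
                         train_frac: float = 0.8, seed: int = 0) -> float:
    """Held-out accuracy of a linear probe on one layer's activations
    (shape (n_tokens, hidden_dim)) predicting a linguistic label such
    as a lemma id or an inflectional feature. Running this layer by
    layer traces where the information is linearly decodable; swapping
    in a nonlinear classifier gives the nonlinear counterpart."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    cut = int(train_frac * len(labels))
    train, test = idx[:cut], idx[cut:]
    probe = LogisticRegression(max_iter=1000).fit(activations[train],
                                                  labels[train])
    return probe.score(activations[test], labels[test])
```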

[95] FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging

Zichen Tang, Haihong E, Ziyan Ma, Haoyang He, Jiacheng Liu, Zhongjun Yang, Zihua Rong, Rongjin Li, Kun Ji, Qing Huang, Xinyang Hu, Yang Liu, Qianhe Zheng

Main category: cs.CL

TL;DR: FinanceReasoning is a new benchmark for evaluating large reasoning models (LRMs) in financial numerical reasoning, offering credibility, comprehensiveness, and challenge through updated questions, broad financial coverage, and hard problems.

DetailsMotivation: To address gaps in existing benchmarks by providing a more credible, comprehensive, and challenging evaluation tool for LRMs in financial reasoning.

Method: Updated 15.6% of questions from public datasets, annotated 908 new questions with Python solutions, and constructed 3,133 Python-formatted functions. Evaluated models on 238 Hard problems.

Result: Improved LRM performance (e.g., GPT-4o accuracy rose from 83.2% to 91.6%). Best model (OpenAI o1 with PoT) achieved 89.1% accuracy. Combining Reasoner and Programmer models further enhanced performance.

Conclusion: FinanceReasoning advances LRM evaluation in financial reasoning, showing potential for domain-specific improvements and future research.

Abstract: We introduce FinanceReasoning, a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems. Compared to existing benchmarks, our work provides three key advancements. (1) Credibility: We update 15.6% of the questions from four public datasets, annotating 908 new questions with detailed Python solutions and rigorously refining evaluation standards. This enables an accurate assessment of the reasoning improvements of LRMs. (2) Comprehensiveness: FinanceReasoning covers 67.8% of financial concepts and formulas, significantly surpassing existing datasets. Additionally, we construct 3,133 Python-formatted functions, which enhances LRMs’ financial reasoning capabilities through refined knowledge (e.g., 83.2% → 91.6% for GPT-4o). (3) Challenge: Models are required to apply multiple financial formulas for precise numerical reasoning on 238 Hard problems. The best-performing model (i.e., OpenAI o1 with PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical precision. We demonstrate that combining Reasoner and Programmer models can effectively enhance LRMs’ performance (e.g., 83.2% → 87.8% for DeepSeek-R1). Our work paves the way for future research on evaluating and improving LRMs in domain-specific complex reasoning tasks.

[96] NameTag 3: A Tool and a Service for Multilingual/Multitagset NER

Jana Straková, Milan Straka

Main category: cs.CL

TL;DR: NameTag 3 is an open-source tool and cloud service for multilingual, multidataset, and multitagset NER, achieving state-of-the-art results on 21 datasets in 15 languages.

DetailsMotivation: To provide a versatile and high-performance NER tool supporting flat and nested entities across multiple languages and datasets.

Method: Uses fine-tuned models (355M and 126M parameters) for flat and nested NER, available as a command-line tool and cloud service.

Result: Achieves state-of-the-art results on 21 test datasets in 15 languages and remains competitive on others.

Conclusion: NameTag 3 is a powerful, accessible NER tool with broad language support and open-source availability.

Abstract: We introduce NameTag 3, an open-source tool and cloud-based web service for multilingual, multidataset, and multitagset named entity recognition (NER), supporting both flat and nested entities. NameTag 3 achieves state-of-the-art results on 21 test datasets in 15 languages and remains competitive on the rest, even against larger models. It is available as a command-line tool and as a cloud-based service, enabling use without local installation. NameTag 3 web service currently provides flat NER for 17 languages, trained on 21 corpora and three NE tagsets, all powered by a single 355M-parameter fine-tuned model; and nested NER for Czech, powered by a 126M fine-tuned model. The source code is licensed under open-source MPL 2.0, while the models are distributed under non-commercial CC BY-NC-SA 4.0. Documentation is available at https://ufal.mff.cuni.cz/nametag, source code at https://github.com/ufal/nametag3, and trained models via https://lindat.cz. The REST service and the web application can be found at https://lindat.mff.cuni.cz/services/nametag/. A demonstration video is available at https://www.youtube.com/watch?v=-gaGnP0IV8A.

[97] UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

Main category: cs.CL

TL;DR: UITron-Speech is the first end-to-end GUI agent that processes speech instructions and screenshots to predict user actions, addressing limitations of text-based inputs. It uses synthesized speech datasets and a mixed-modality training strategy for robust performance.

DetailsMotivation: Text-based instructions limit accessibility and convenience in GUI agents, especially in hands-free scenarios. Speech input is proposed as a solution.

Method: UITron-Speech processes speech and screenshots directly. It synthesizes speech datasets, employs mixed-modality training, and refines grounding predictions with a two-step method.

Result: UITron-Speech achieves robust performance and superior adaptability on benchmarks, demonstrating feasibility for speech-driven GUI agents.

Conclusion: Speech-driven GUI agents like UITron-Speech offer accessible and intelligent human-computer interaction, with potential for broader adoption.

Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.

[98] How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison

Jiayin Wang, Zhiquang Guo, Weizhi Ma, Min Zhang

Main category: cs.CL

TL;DR: The paper advocates for evaluating Test-time Learning in LLMs using semantic games, revealing measurable but unstable learning capabilities compared to humans.

DetailsMotivation: Current benchmarks focus on static knowledge, but intelligence requires rapid learning from experience. The paper aims to assess LLMs' ability to improve during test time.

Method: Proposes semantic games as testbeds for Test-time Learning, with an evaluation framework comparing model performance under limited and cumulative experience settings. Includes human baseline for comparison.

Result: LLMs show measurable test-time learning but improve less stably and slowly than humans.

Conclusion: LLMs have potential as general-purpose learners, but a significant gap remains between their learning abilities and humans’, beyond static benchmark performance.

Abstract: As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.

[99] R1-RE: Cross-Domain Relation Extraction with RLVR

Runpeng Dai, Tong Zheng, Run Yang, Kaixian Yu, Hongtu Zhu

Main category: cs.CL

TL;DR: R1-RE reframes relation extraction as a reasoning task using reinforcement learning with verifiable reward, improving OOD robustness and matching GPT-4o’s performance.

DetailsMotivation: Traditional supervised learning for relation extraction struggles with out-of-domain generalization, prompting a shift to human-like reasoning.

Method: Introduces R1-RE, a reinforcement learning framework guided by annotation guidelines, leveraging small language models for reasoning.

Result: Achieves ~70% OOD accuracy on Sem-2010 and MDKG datasets, comparable to GPT-4o.

Conclusion: R1-RE enhances OOD robustness and provides insights into RLVR training dynamics for relation extraction.

Abstract: Relation extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels-an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
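To make the RLVR idea concrete, here is a minimal sketch of a verifiable reward for relation extraction: a programmatic exact-match check of the model's final relation label against the gold annotation. The output format and helper names are illustrative assumptions, not the paper's exact reward design.

```python
# Minimal sketch of a verifiable reward for relation extraction (RLVR-style).
# Assumption: the reward is a programmatic exact-match check of the final
# relation label; the actual R1-RE reward may be designed differently.
import re

def extract_final_label(response: str) -> str | None:
    """Pull the final relation label out of a model response.

    Assumes the model is prompted to end with 'Answer: <label>'."""
    match = re.search(r"Answer:\s*([\w\-\(\),]+)\s*$", response.strip())
    return match.group(1) if match else None

def verifiable_reward(response: str, gold_label: str) -> float:
    """Return 1.0 if the extracted label matches the gold annotation, else 0.0.

    A malformed response (no parseable label) also scores 0.0, which pushes
    the policy toward following the required output format."""
    predicted = extract_final_label(response)
    return 1.0 if predicted == gold_label else 0.0

# Toy usage:
resp = "The guidelines say Cause-Effect holds when ... Answer: Cause-Effect(e1,e2)"
print(verifiable_reward(resp, "Cause-Effect(e1,e2)"))  # 1.0
```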

[100] From Queries to Criteria: Understanding How Astronomers Evaluate LLMs

Alina Hyk, Kiera McCormick, Mian Zhong, Ioana Ciucă, Sanjib Sharma, John F Wu, J. E. G. Peek, Kartheik G. Iyer, Ziang Xiao, Anjalie Field

Main category: cs.CL

TL;DR: The study explores how users evaluate LLMs in astronomy research, using a Slack-deployed bot. Findings from queries and interviews inform better benchmark recommendations.

DetailsMotivation: To address the gap in LLM evaluation benchmarks for diverse real-world use cases, particularly in scientific research like astronomy.

Method: Inductive coding of 368 Slack bot queries and interviews with 11 astronomers to understand evaluation criteria.

Result: Identified user evaluation patterns and synthesized recommendations for better benchmarks, demonstrated with a sample astronomy benchmark.

Conclusion: Provides actionable insights to improve LLM evaluation and usability in scientific research.

Abstract: There is growing interest in leveraging LLMs to aid in astronomy and other scientific research, but benchmarks for LLM evaluation in general have not kept pace with the increasingly diverse ways that real people evaluate and use these models. In this study, we seek to improve evaluation procedures by building an understanding of how users evaluate LLMs. We focus on a particular use case: an LLM-powered retrieval-augmented generation bot for engaging with astronomical literature, which we deployed via Slack. Our inductive coding of 368 queries to the bot over four weeks and our follow-up interviews with 11 astronomers reveal how humans evaluated this system, including the types of questions asked and the criteria for judging responses. We synthesize our findings into concrete recommendations for building better benchmarks, which we then employ in constructing a sample benchmark for evaluating LLMs for astronomy. Overall, our work offers ways to improve LLM evaluation and ultimately usability, particularly for use in scientific research.

[101] Towards Domain Specification of Embedding Models in Medicine

Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki, Hamidreza Mahyar

Main category: cs.CL

TL;DR: The paper introduces MEDTE, a medical text embedding model fine-tuned on diverse medical corpora, and a benchmark suite of 51 tasks to address shortcomings in existing models and evaluations.

DetailsMotivation: Existing medical text embedding models are limited by narrow training data and inadequate evaluations, failing to capture real-world medical diversity.

Method: The authors leverage MEDTE, a model fine-tuned using self-supervised contrastive learning on diverse medical corpora, and propose a 51-task benchmark suite tailored to medical text.

Result: MEDTE outperforms state-of-the-art alternatives across various tasks, and the benchmark provides a robust evaluation framework.

Conclusion: The combined approach of MEDTE and the benchmark suite addresses gaps in medical text embeddings, improving performance and evaluation reliability.

Abstract: Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data, besides not being up to date in terms of methodology, making them ill-suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real-world medical tasks. To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, to deliver robust medical text embeddings. Alongside this model, we propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval, modeled on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of medical text. Our results demonstrate that this combined approach not only establishes a robust evaluation framework but also yields embeddings that consistently outperform state-of-the-art alternatives in different tasks.

[102] From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs

Jie He, Victor Gutiérrez-Basulto, Jeff Z. Pan

Main category: cs.CL

TL;DR: TIRESRAG-R1 improves reasoning in LLMs by addressing retrieval and reasoning flaws with a multi-reward system and think-retrieve-reflect process.

DetailsMotivation: Existing RAG methods focus on final-answer rewards, ignoring intermediate reasoning quality, leading to failures like insufficient retrieval, faulty reasoning, and answer-reasoning inconsistency.

Method: TIRESRAG-R1 uses a think-retrieve-reflect process with sufficiency, reasoning quality, and reflection rewards, plus difficulty-aware reweighting and sample filtering.

Result: Outperforms prior RAG methods on multi-hop QA datasets and generalizes to single-hop tasks.

Conclusion: TIRESRAG-R1 effectively enhances reasoning and stability in LLMs by addressing intermediate reasoning flaws.

Abstract: Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: https://github.com/probe2/TIRESRAG-R1.
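For intuition, the sketch below combines the three reward signals and upweights hard questions. The component weights and the difficulty formula are illustrative assumptions, not the exact TIRESRAG-R1 recipe.

```python
# Sketch of a multi-dimensional reward with difficulty-aware reweighting.
# Weights and the difficulty formula are assumptions for illustration only.

def combined_reward(answer_correct: bool,
                    sufficiency: float,        # did retrieval gather enough support? in [0, 1]
                    reasoning_quality: float,  # score for the reasoning chain, in [0, 1]
                    reflection: float          # credit for detecting/revising errors, in [0, 1]
                    ) -> float:
    base = 1.0 if answer_correct else 0.0
    return base + 0.3 * sufficiency + 0.3 * reasoning_quality + 0.2 * reflection

def difficulty_weight(group_accuracies: list[float]) -> float:
    """Upweight questions the current policy mostly gets wrong.

    `group_accuracies` holds per-rollout correctness for one question;
    low mean accuracy means a harder question and a higher weight."""
    mean_acc = sum(group_accuracies) / len(group_accuracies)
    return 1.0 + (1.0 - mean_acc)  # in [1, 2]

# Example: a hard question (1 of 8 rollouts correct) gets ~1.88x weight.
print(difficulty_weight([1, 0, 0, 0, 0, 0, 0, 0]))
```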

[103] EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

Jiyu Chen, Poh Seng Lim, Shuang Peng, Daxiong Luo, JungHau Foo, Yap Deep, Timothy Lee Jun Jie, Kelvin Teh Kae Wen, Fan Yang, Danyu Feng, Hao-Yun Chen, Peng-Wen Chen, Fangyuan Li, Xiaoxin Chen, Wong Wai Mun

Main category: cs.CL

TL;DR: EdgeInfinite-Instruct optimizes Transformer-based LLMs for edge devices by fine-tuning parameters, employing quantization, and customizing input sizes, improving efficiency without sacrificing performance.

DetailsMotivation: Challenges in deploying LLMs on edge devices due to high computational and memory costs, especially for long-sequence tasks, motivate the need for efficient solutions like EdgeInfinite-Instruct.

Method: Proposes Segmented Supervised Fine-Tuning (S-SFT) for long-sequence tasks, fine-grained PTQ for computational efficiency, and fixed-shape computation graphs for memory optimization.

Result: Demonstrates improved performance on long-context benchmarks and mobile tasks while maintaining efficiency on edge NPUs.

Conclusion: EdgeInfinite-Instruct effectively balances efficiency and performance for LLM deployment on resource-constrained edge devices.

Abstract: Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations, but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimized EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.

[104] LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Hongfei Yan, Jingang Wang, Xunliang Cai

Main category: cs.CL

TL;DR: LinkSyn is a KP graph-based framework for synthesizing diverse QA data, improving LLM training. It balances KP coverage and popularity, enhances difficulty, and creates LinkQA, a 50B-token dataset, boosting model performance by 11.51%.

DetailsMotivation: Addressing the scarcity of high-quality, diverse training data for LLMs by synthesizing QA data with controlled discipline and difficulty distributions.

Method: LinkSyn constructs a KP graph from QA seeds, uses a knowledge distribution function for balanced sampling, and employs diffusion-based synthesis and difficulty adjustments.

Result: LinkQA, a 50B-token dataset, improves Llama-3 8B performance by 11.51% on MMLU and CMMLU, setting new SOTA results.

Conclusion: LinkSyn effectively enhances LLM training with diverse, high-quality data, demonstrating significant performance gains.

Abstract: The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.
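The core sampling step can be pictured as a weighted walk over the KP graph, where a value function trades off coverage (prefer under-sampled KPs) against popularity. The graph, weights, and value function below are made-up illustrations of that idea, not the paper's implementation.

```python
# Sketch of a knowledge-point (KP) graph walk whose transition probabilities
# balance coverage (visit under-sampled KPs) against popularity.
# Graph, popularity scores, and the 0.5/0.5 mix are illustrative assumptions.
import random
from collections import defaultdict

kp_graph = {  # adjacency: KP -> neighbouring KPs linked by shared seed QAs
    "binomial theorem": ["combinatorics", "polynomials"],
    "combinatorics": ["binomial theorem", "probability"],
    "probability": ["combinatorics"],
    "polynomials": ["binomial theorem"],
}
popularity = {"binomial theorem": 0.9, "combinatorics": 0.7,
              "probability": 0.5, "polynomials": 0.2}
visit_counts = defaultdict(int)

def value(kp: str) -> float:
    coverage_bonus = 1.0 / (1 + visit_counts[kp])  # decays as a KP gets reused
    return 0.5 * popularity[kp] + 0.5 * coverage_bonus

def walk(start: str, length: int) -> list[str]:
    path, current = [start], start
    visit_counts[start] += 1
    for _ in range(length - 1):
        nbrs = kp_graph[current]
        weights = [value(n) for n in nbrs]
        current = random.choices(nbrs, weights=weights, k=1)[0]
        visit_counts[current] += 1
        path.append(current)
    return path  # KPs along one path seed one synthesized multi-KP question

print(walk("binomial theorem", 3))
```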

[105] CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Omri Uzan, Yuval Pinter

Main category: cs.CL

TL;DR: CharBench is a large-scale benchmark for character-level tasks, revealing that modern LLMs struggle with them (43.6% average accuracy, dropping to 32.3% on some tasks). Tokenization’s impact varies by task type.

DetailsMotivation: Character-level tasks challenge language models, but the role of tokenization is unclear. CharBench aims to clarify this and evaluate model performance.

Method: Introduce CharBench, a large benchmark for character-level tasks, and evaluate diverse LLMs. Analyze word properties and tokenization effects.

Result: LLMs perform poorly on CharBench (43.6% average accuracy, 32.3% on some tasks). Tokenization properties correlate only weakly with counting accuracy, while longer tokens hinder positional tasks.

Conclusion: CharBench highlights LLMs’ struggles with character-level tasks. Future work should use it to improve model performance.

Abstract: Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models’ reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.

[106] CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis

Main category: cs.CL

TL;DR: The paper proposes a novel method using Contextual Co-occurrence Matrices and Tensors to detect adversarial and jailbreak prompts in LLMs, achieving high accuracy and speed with minimal labeled data.

DetailsMotivation: LLMs are vulnerable to jailbreak attacks, necessitating robust detection methods for safe deployment.

Method: Leverages latent space characteristics of Contextual Co-occurrence Matrices and Tensors for prompt detection.

Result: Achieves an F1 score of 0.83 with only 0.5% labeled prompts, 96.6% better than baselines, and is significantly faster.

Conclusion: The method is effective and efficient for detecting adversarial prompts, especially in data-scarce scenarios.

Abstract: The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, speedup ranging from 2.3 to 128.4 times compared to the baseline models. To support future research and reproducibility, we have made our implementation publicly available.
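As a rough picture of the pipeline, one can build a token co-occurrence matrix per prompt and use its leading singular values as latent-space features for a lightweight classifier. The window size, rank, and tokenization below are illustrative assumptions, not the paper's exact construction.

```python
# Sketch: latent-space features from a contextual co-occurrence matrix for
# adversarial-prompt detection. Window size, rank, and whitespace tokenization
# are illustrative assumptions.
import numpy as np

def cooccurrence_matrix(tokens: list[str], vocab: dict[str, int], window: int = 2) -> np.ndarray:
    m = np.zeros((len(vocab), len(vocab)))
    for i, tok in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j and tok in vocab and tokens[j] in vocab:
                m[vocab[tok], vocab[tokens[j]]] += 1.0
    return m

def latent_features(m: np.ndarray, rank: int = 8) -> np.ndarray:
    # The top singular values summarize the matrix's latent structure.
    s = np.linalg.svd(m, compute_uv=False)
    return s[:rank]

prompts = ["ignore previous instructions and ...", "what is the capital of france"]
vocab = {w: i for i, w in enumerate(sorted({w for p in prompts for w in p.split()}))}
feats = np.stack([latent_features(cooccurrence_matrix(p.split(), vocab)) for p in prompts])
print(feats.shape)  # (2, 8) -> feed into any lightweight classifier
```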

[107] Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin

Main category: cs.CL

TL;DR: The paper introduces JointThinking, a new in-context learning paradigm for reasoning large language models (RLLMs) that leverages the difference between Thinking and Nothinking modes to improve accuracy with minimal latency overhead.

DetailsMotivation: To explore the underexplored potential of in-context learning (ICL) in RLLMs, focusing on improving reasoning accuracy by leveraging structured differences between reasoning modes.

Method: JointThinking prompts the model to generate two answers (Thinking and Nothinking modes) in parallel, triggering a second round of Thinking only if the answers disagree. This minimizes latency while improving robustness.

Result: JointThinking outperforms few-shot chain-of-thought and majority voting, achieves comparable in-distribution performance to training-based SOTA, and excels in out-of-distribution tasks. It also shows scalability with model size.

Conclusion: The method highlights the value of structural thinking diversity and scalability, with promising directions for future ICL research in RLLMs.

Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that leverages the structured difference between two reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy. Specifically, our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt that incorporates the original question and both candidate answers. Since such disagreement occurs infrequently (e.g., only 6% in GSM8K), our method performs just one round of reasoning in most cases, resulting in minimal latency overhead. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT) and majority voting with improved answer robustness. Moreover, it achieves comparable in-distribution performance to the training-based SOTA method, while substantially outperforming it on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing that leveraging different reasoning modes consistently lowers the error rate and highlights the value of structural thinking diversity. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second round of thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.
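The control flow described in the abstract is simple enough to sketch directly. `generate` below is a hypothetical stand-in for a call to an RLLM in a given mode; everything else follows the two-round scheme.

```python
# Control-flow sketch of JointThinking. `generate` is a hypothetical helper
# standing in for an RLLM call in Thinking or Nothinking mode.

def generate(prompt: str, mode: str) -> str:
    """Placeholder for a model call; mode is 'thinking' or 'nothinking'."""
    raise NotImplementedError

def joint_thinking(question: str) -> str:
    # Round 1: answer in both modes (these two calls can run in parallel).
    a_think = generate(question, mode="thinking")
    a_nothink = generate(question, mode="nothinking")
    if a_think == a_nothink:
        return a_think  # agreement is the common case, so latency stays low

    # Round 2: only on disagreement, re-think with both candidates in view.
    prompt = (f"{question}\n"
              f"Candidate A: {a_think}\n"
              f"Candidate B: {a_nothink}\n"
              f"They disagree; reason carefully and give the final answer.")
    return generate(prompt, mode="thinking")
```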

[108] LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models

Chünhung Wu, Jinliang Lu, Zixuan Ren, Gangqiang Hu, Zhi Wu, Dai Dai, Hua Wu

Main category: cs.CL

TL;DR: The paper investigates ‘Soft Thinking’ in LLMs, revealing their tendency to rely on dominant soft inputs, limiting reasoning path exploration. Introducing randomness via sampling strategies like Dirichlet resampling and Gumbel-Softmax improves performance.

DetailsMotivation: Human cognition handles abstract concepts fluidly, but LLMs often rely on discrete tokens, limiting expressive power. This work aims to enhance LLMs' reasoning in continuous concept spaces.

Method: Probing techniques analyze LLMs’ internal behavior with soft tokens. Sampling strategies (Dirichlet resampling, Gumbel-Softmax) introduce randomness to improve reasoning path exploration.

Result: LLMs default to greedy decoding with soft tokens, hindering diverse reasoning. Randomness via Gumbel-Softmax outperforms vanilla methods across eight benchmarks.

Conclusion: Incorporating randomness in Soft Thinking mitigates limitations, with Gumbel-Softmax proving most effective for enhancing reasoning performance.

Abstract: Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the “Soft Thinking” capabilities of various LLMs by examining the models’ internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce randomness, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.
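The Gumbel-Softmax trick itself is standard and easy to demonstrate. The sketch below contrasts a vanilla soft token (embedding mixture under plain softmax weights, which the paper argues collapses toward greedy decoding) with a Gumbel-Softmax sample; the temperature and tensor sizes are illustrative assumptions.

```python
# Sketch: building a randomized soft token with the Gumbel-Softmax trick,
# instead of mixing embeddings with the raw (near-greedy) softmax weights.
# Temperature and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

vocab_size, hidden = 1_000, 64
logits = torch.randn(vocab_size)              # next-token logits from the LM
embedding = torch.randn(vocab_size, hidden)   # the LM's input embedding table

# Vanilla soft thinking: softmax weights are usually dominated by one token,
# so the soft input collapses toward greedy decoding.
soft_weights = F.softmax(logits, dim=-1)

# Gumbel-Softmax: adds Gumbel noise before a temperature-controlled softmax,
# so different samples emphasize different plausible tokens while tau keeps
# the mixture smooth.
gumbel_weights = F.gumbel_softmax(logits, tau=0.7, hard=False)

soft_token = gumbel_weights @ embedding       # (hidden,) continuous input
print(soft_token.shape)
```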

cs.CV

[109] Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

Subin Raj Peter

Main category: cs.CV

TL;DR: The paper proposes using Large Language Models (LLMs) to automate VR training content generation, reducing development time and enhancing scalability.

DetailsMotivation: VR training is effective but resource-intensive to develop; automating content creation can address this challenge.

Method: Combines LLMs to extract task-relevant text and an intelligent module to generate VR animations and visual cues.

Result: Automated system creates engaging VR training content efficiently, improving scalability and adaptability.

Conclusion: LLM-driven automation makes VR training more accessible and adaptable for industrial needs.

Abstract: Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite its advantages, developing VR applications for training remains a significant challenge due to the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module receives input from the LLM module and interprets the extracted information. Based on this, an instruction generator creates training content using relevant data from a database. The instruction generator generates the instruction by changing the color of virtual objects and creating animations to illustrate tasks. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.

[110] Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning

Yuqin Cao, Yixuan Gao, Wei Sun, Xiaohong Liu, Yulun Zhang, Xiongkuo Min

Main category: cs.CV

TL;DR: GAVN is a novel audio-assisted face video restoration network that leverages identity and temporal learning to address various distortions, outperforming existing methods.

DetailsMotivation: Face videos often suffer from degradations, and existing methods ignore visual-audio correlations or focus narrowly on compression artifacts.

Method: GAVN uses temporal features in low-resolution for coarse restoration and audio-assisted identity features in high-resolution for detail enhancement, integrating both for final output.

Result: GAVN excels in compression artifact removal, deblurring, and super-resolution, surpassing state-of-the-art methods.

Conclusion: GAVN effectively restores face videos by combining temporal and audio-assisted identity features, offering superior performance across multiple distortion types.

Abstract: Face videos accompanied by audio have become integral to our daily lives, while they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between the visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they only focus on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.

[111] Outlier Detection Algorithm for Circle Fitting

Ahmet Gökhan Poyraz

Main category: cs.CV

TL;DR: The paper introduces the Polar Coordinate-Based Outlier Detection (PCOD) algorithm to improve circle fitting accuracy by detecting and removing noisy points.

DetailsMotivation: Noisy point sets compromise circle fitting effectiveness, necessitating robust outlier detection methods.

Method: Transforms points into polar coordinates, calculates local/global standard deviations, and identifies outliers by comparing local means with global deviation.

Result: PCOD outperforms other methods in accuracy for high-precision diameter measurement of industrial washer parts.

Conclusion: PCOD enhances circle fitting in industrial applications by effectively handling noisy data.

Abstract: Circle fitting methods are extensively utilized in various industries, particularly in quality control processes and design applications. The effectiveness of these algorithms can be significantly compromised when the point sets to be predicted are noisy. To mitigate this issue, outlier detection and removal algorithms are often applied before the circle fitting procedure. This study introduces the Polar Coordinate-Based Outlier Detection (PCOD) algorithm, which can be effectively employed in circle fitting applications. In the proposed approach, the point set is first transformed into polar coordinates, followed by the calculation of both local and global standard deviations. Outliers are then identified by comparing local mean values with the global standard deviation. The practicality and efficiency of the proposed method are demonstrated by focusing on the high-precision diameter measurement of industrial washer parts. Images from a machine vision system are processed through preprocessing steps, including sub-pixel edge detection. The resulting sub-pixel edge points are then cleaned using the proposed outlier detection and removal algorithm, after which circle fitting is performed. A comparison is made using ten different circle fitting algorithms and five distinct outlier detection methods. The results indicate that the proposed method outperforms the other approaches, delivering the best performance in terms of accuracy within the dataset, thereby demonstrating its potential for enhancing circle fitting applications in industrial environments.
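The abstract's local-versus-global test translates naturally into a few lines of NumPy. The sketch below uses the point centroid as a provisional center and flags points whose radius strays from a local angular-window mean by more than a multiple of the global standard deviation; the window size and threshold factor are illustrative assumptions.

```python
# Sketch of polar-coordinate-based outlier detection (PCOD-style) for circle
# fitting. The provisional center (centroid), window size, and threshold
# factor are illustrative assumptions.
import numpy as np

def pcod_inliers(points: np.ndarray, window: int = 11, k: float = 2.0) -> np.ndarray:
    """points: (N, 2) edge points. Returns a boolean mask of inliers."""
    center = points.mean(axis=0)                      # provisional center
    d = points - center
    theta = np.arctan2(d[:, 1], d[:, 0])              # polar angle
    r = np.hypot(d[:, 0], d[:, 1])                    # polar radius

    order = np.argsort(theta)                         # sweep around the circle
    r_sorted = r[order]
    global_std = r_sorted.std()

    # Local mean radius over a sliding angular window (wrapping at +/- pi).
    pad = window // 2
    r_pad = np.concatenate([r_sorted[-pad:], r_sorted, r_sorted[:pad]])
    local_mean = np.convolve(r_pad, np.ones(window) / window, mode="valid")

    # A point is an outlier when it strays from its local mean by more than
    # k times the global standard deviation.
    inlier_sorted = np.abs(r_sorted - local_mean) <= k * global_std
    mask = np.empty(len(points), dtype=bool)
    mask[order] = inlier_sorted
    return mask

# Example: noisy circle with two gross outliers.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
pts = np.c_[10 * np.cos(t), 10 * np.sin(t)] + np.random.normal(0, 0.05, (200, 2))
pts[[5, 50]] += 3.0
print(pcod_inliers(pts).sum(), "inliers of", len(pts))
```

Circle fitting would then run on `pts[mask]` only.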

[112] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: LayerT2V introduces a layered approach for Text-to-Video (T2V) generation, improving multi-object motion control by compositing background and foreground objects separately.

DetailsMotivation: Current T2V models struggle with multi-object motion, especially when trajectories intersect, due to semantic conflicts. Existing methods lack support or degrade in performance for such scenarios.

Method: LayerT2V generates videos by compositing background and foreground objects layer by layer, allowing independent control of each element.

Result: LayerT2V outperforms SOTA methods with 1.4x and 4.5x improvements in mIoU and AP50 metrics for multi-object scenarios.

Conclusion: LayerT2V effectively addresses multi-object motion challenges in T2V generation, offering superior performance and control.

Abstract: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct “layer” and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .

[113] Enhancing Diameter Measurement Accuracy in Machine Vision Applications

Ahmet Gokhan Poyraz, Ahmet Emir Dirik, Hakan Gurkan, Mehmet Kacmaz

Main category: cs.CV

TL;DR: The paper proposes two methods (conversion factor-based and pixel-based) to improve measurement accuracy in camera systems, reducing errors from 13-114 micrometers to 1-2 micrometers.

DetailsMotivation: Measurement errors in camera systems persist despite specialized equipment, especially when measuring parts of varying diameters.

Method: Two approaches: (1) estimating conversion factors from known references, and (2) directly using pixel-based diameter information from references.

Result: Errors reduced from 13-114 micrometers to 1-2 micrometers in tests on glass and metal samples.

Conclusion: The methods significantly enhance accuracy and reliability in diameter measurements using minimal reference parts.

Abstract: In camera measurement systems, specialized equipment such as telecentric lenses is often employed to measure parts with narrow tolerances. However, despite the use of such equipment, measurement errors can occur due to mechanical and software-related factors within the system. These errors are particularly evident in applications where parts of different diameters are measured using the same setup. This study proposes two innovative approaches to enhance measurement accuracy using multiple known reference parts: a conversion factor-based method and a pixel-based method. In the first approach, the conversion factor is estimated from known references to calculate the diameter (mm) of the unknown part. In the second approach, the diameter (mm) is directly estimated using pixel-based diameter information from the references. The experimental setup includes an industrial-grade camera and telecentric lenses. Tests conducted on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors, which originally ranged from 13-114 micrometers, were reduced to 1-2 micrometers using the proposed methods. By utilizing only a few known reference parts, the proposed approach enables high-accuracy measurement of all parts within the camera’s field of view. Additionally, this method enhances the existing diameter measurement literature by significantly reducing error rates and improving measurement reliability.
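The arithmetic behind the two correction schemes is worth spelling out. In the sketch below, approach 1 interpolates a diameter-dependent conversion factor (mm per pixel) from the references, while approach 2 interpolates the mm diameter directly from the pixel diameter; the reference values are made up for illustration.

```python
# Sketch of the two reference-based correction schemes. Reference values are
# invented for illustration; the real setup calibrates with certified parts.
import numpy as np

# Known reference parts: measured pixel diameters and certified mm diameters.
ref_px = np.array([1210.4, 4841.9, 9684.2])
ref_mm = np.array([3.000, 12.000, 24.000])

unknown_px = 7263.1  # pixel diameter of the part under test

# Approach 1: the conversion factor (mm per pixel) drifts slightly with
# diameter, so interpolate it at the unknown part's pixel size.
factors = ref_mm / ref_px
factor = np.interp(unknown_px, ref_px, factors)
diameter_1 = factor * unknown_px

# Approach 2: interpolate the mm diameter directly from the pixel diameter.
diameter_2 = np.interp(unknown_px, ref_px, ref_mm)

print(f"conversion-factor method: {diameter_1:.4f} mm")
print(f"pixel-based method:       {diameter_2:.4f} mm")
```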

[114] Multimodal Video Emotion Recognition with Reliable Reasoning Priors

Zhepeng Wang, Yingjian Zhu, Guanghao Dong, Hongzhu Yi, Feng Chen, Xinming Wang, Jun Xie

Main category: cs.CV

TL;DR: The paper integrates MLLM-derived reasoning into multimodal emotion recognition, using Gemini for fine-grained traces and Balanced Dual-Contrastive Learning to address class imbalance, achieving significant performance gains on MER2024.

DetailsMotivation: To enhance multimodal emotion recognition by leveraging trustworthy reasoning knowledge from MLLMs and addressing class imbalance issues.

Method: Uses Gemini for modality-separable reasoning traces and introduces Balanced Dual-Contrastive Learning for balanced inter- and intra-class distributions.

Result: Substantial performance improvements on the MER2024 benchmark.

Conclusion: MLLM-derived reasoning synergizes well with lightweight fusion networks for robust and scalable emotion recognition.

Abstract: This study investigates the integration of trustworthy prior reasoning knowledge from MLLMs into multimodal emotion recognition. We employ Gemini to generate fine-grained, modality-separable reasoning traces, which are injected as priors during the fusion stage to enrich cross-modal interactions. To mitigate the pronounced class-imbalance in multimodal emotion recognition, we introduce Balanced Dual-Contrastive Learning, a loss formulation that jointly balances inter-class and intra-class distributions. Applied to the MER2024 benchmark, our prior-enhanced framework yields substantial performance gains, demonstrating that the reliability of MLLM-derived reasoning can be synergistically combined with the domain adaptability of lightweight fusion networks for robust, scalable emotion recognition.

[115] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Jia Li, Yapeng Tian

Main category: cs.CV

TL;DR: A survey on Audio-Visual Segmentation (AVS) covering problem formulation, datasets, metrics, methodologies, and future directions.

DetailsMotivation: To provide a comprehensive overview of AVS, enabling fine-grained object-level understanding in videos by leveraging visual and audio modalities.

Method: Analyzes unimodal/multimodal encoding, fusion strategies, decoder designs, and training paradigms (supervised to training-free).

Result: Extensive comparison of AVS methods on benchmarks, highlighting impacts of architecture, fusion, and training.

Conclusion: Identifies challenges (e.g., temporal modeling, modality bias) and proposes future directions (e.g., better fusion, foundation models, reduced labeled data reliance).

Abstract: Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion strategies, and training paradigms on performance. Finally, we outline the current challenges, such as limited temporal modeling, modality bias toward vision, lack of robustness in complex environments, and high computational demands, and propose promising future directions, including improving temporal reasoning and multimodal fusion, leveraging foundation models for better generalization and few-shot learning, reducing reliance on labeled data through self- and weakly supervised learning, and incorporating higher-level reasoning for more intelligent AVS systems.

[116] A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding

Yida Wang, Taiting Lu, Runze Liu, Lanqing Yang, Yifan Yang, Zhe Chen, Yuehai Wang, Yixin Liu, Kaiyuan Lin, Xiaomeng Chen, Dian Ding, Yijie Li, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda

Main category: cs.CV

TL;DR: The paper proposes LLM4-IC8K, a framework using LLMs for automated IC footprint geometry labeling from mechanical drawings, addressing challenges in visual perception and outperforming existing methods.

DetailsMotivation: Automated parsing of IC footprint geometry is challenging due to unstructured drawings and lack of existing methods, despite its importance in PCB design.

Method: LLM4-IC8K treats IC drawings as images, using LLMs for structured interpretation via three sub-tasks: pin counting, center coordinate computation, and pin dimension estimation. It employs synthetic data training followed by real-world fine-tuning.

Result: The model outperforms state-of-the-art LMMs, validated on the ICGeo8K dataset (8,608 samples), demonstrating improved accuracy and robustness.

Conclusion: LLM4-IC8K effectively addresses IC footprint labeling challenges, offering a scalable solution for PCB design automation.

Abstract: Printed-Circuit-board (PCB) footprint geometry labeling of integrated circuits (IC) is essential in defining the physical interface between components and the PCB layout, requiring exceptional visual perception proficiency. However, due to the unstructured footprint drawing and abstract diagram annotations, automated parsing and accurate footprint geometry modeling remain highly challenging. Despite its importance, no methods currently exist for automated package geometry labeling directly from IC mechanical drawings. In this paper, we first investigate the visual perception performance of Large Multimodal Models (LMMs) when solving IC footprint geometry understanding. Our findings reveal that current LMMs severely suffer from inaccurate geometric perception, which hinders their performance in solving the footprint geometry labeling problem. To address these limitations, we propose LLM4-IC8K, a novel framework that treats IC mechanical drawings as images and leverages LLMs for structured geometric interpretation. To mimic the step-by-step reasoning approach used by human engineers, LLM4-IC8K addresses three sub-tasks: perceiving the number of pins, computing the center coordinates of each pin, and estimating the dimensions of individual pins. We present a two-stage framework that first trains LMMs on synthetically generated IC footprint diagrams to learn fundamental geometric reasoning and then fine-tunes them on real-world datasheet drawings to enhance robustness and accuracy in practical scenarios. To support this, we introduce ICGeo8K, a multi-modal dataset with 8,608 labeled samples, including 4138 hand-crafted IC footprint samples and 4470 synthetically generated samples. Extensive experiments demonstrate that our model outperforms state-of-the-art LMMs on the proposed benchmark.

[117] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization

Tai Hyoung Rhee, Dong-guw Lee, Ayoung Kim

Main category: cs.CV

TL;DR: A diffusion-based framework for denoising thermal infrared (TIR) images using latent-space and wavelet-domain optimization, outperforming state-of-the-art methods.

DetailsMotivation: TIR images suffer from non-uniform noise, hindering robotic perception tasks like object detection and mapping.

Method: Leverages a pretrained stable diffusion model fine-tuned with a novel loss function combining latent-space and wavelet-domain losses, plus a cascaded refinement stage.

Result: Superior performance on benchmarks and robust zero-shot generalization to real-world TIR datasets.

Conclusion: The method is effective for practical robotic deployment, addressing TIR image noise challenges.

Abstract: Thermal infrared imaging exhibits considerable potentials for robotic perception tasks, especially in environments with poor visibility or challenging lighting conditions. However, TIR images typically suffer from heavy non-uniform fixed-pattern noise, complicating tasks such as object detection, localization, and mapping. To address this, we propose a diffusion-based TIR image denoising framework leveraging latent-space representations and wavelet-domain optimization. Utilizing a pretrained stable diffusion model, our method fine-tunes the model via a novel loss function combining latent-space and discrete wavelet transform (DWT) / dual-tree complex wavelet transform (DTCWT) losses. Additionally, we implement a cascaded refinement stage to enhance fine details, ensuring high-fidelity denoising results. Experiments on benchmark datasets demonstrate superior performance of our approach compared to state-of-the-art denoising methods. Furthermore, our method exhibits robust zero-shot generalization to diverse and challenging real-world TIR datasets, underscoring its effectiveness for practical robotic deployment.
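A combined latent-space plus wavelet-domain objective of the kind described can be sketched with PyWavelets. The wavelet family ('haar') and the loss weights below are illustrative assumptions; the paper also explores the dual-tree complex wavelet transform (DTCWT), which is not shown here.

```python
# Sketch of a combined latent-space + wavelet-domain training loss.
# Uses PyWavelets' dwt2 for the DWT term; the wavelet family and weights
# are illustrative assumptions.
import numpy as np
import pywt

def wavelet_l1(pred: np.ndarray, target: np.ndarray) -> float:
    """L1 distance between single-level 2-D DWT coefficient bands."""
    pa, (ph, pv, pd) = pywt.dwt2(pred, "haar")
    ta, (th, tv, td) = pywt.dwt2(target, "haar")
    return sum(np.abs(a - b).mean() for a, b in
               [(pa, ta), (ph, th), (pv, tv), (pd, td)])

def combined_loss(pred_latent, target_latent, pred_img, clean_img,
                  w_latent=1.0, w_wavelet=0.5) -> float:
    latent_mse = float(((pred_latent - target_latent) ** 2).mean())
    return w_latent * latent_mse + w_wavelet * wavelet_l1(pred_img, clean_img)

# Toy example with random "images" and latents.
rng = np.random.default_rng(0)
print(combined_loss(rng.normal(size=(4, 32, 32)), rng.normal(size=(4, 32, 32)),
                    rng.normal(size=(256, 256)), rng.normal(size=(256, 256))))
```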

[118] MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

Quang-Trung Truong, Yuk-Kwan Wong, Vo Hoang Kim Tuyen Dang, Rinaldi Gotama, Duc Thanh Nguyen, Sai-Kit Yeung

Main category: cs.CV

TL;DR: A two-stage pipeline for marine video captioning addresses challenges like dynamic objects and underwater complexity, using video-text-mask triplets for better understanding and generation.

DetailsMotivation: Existing datasets fail to handle marine video complexities, limiting insights into marine life.

Method: Proposes a two-stage pipeline with video-text-mask triplets and video splitting for salient object transitions.

Result: Improved marine video understanding, captioning, and generation, with released dataset and code.

Conclusion: The approach enhances marine video analysis and captioning, addressing domain-specific challenges.

Abstract: Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and gain insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.

[119] What is Beneath Misogyny: Misogynous Memes Classification and Explanation

Kushal Kanwar, Dushyant Singh Chauhan, Gopendra Vikram Singh, Asif Ekbal

Main category: cs.CV

TL;DR: Summary unavailable: the arXiv API request for this paper (2508.03732) returned HTTP 503 at digest time.

Abstract: Not available; see arXiv:2508.03732.

[120] CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization

Jinxing Zhou, Ziheng Zhou, Yanghao Zhou, Yuxin Mao, Zhangling Duan, Dan Guo

Main category: cs.CV

TL;DR: The paper introduces W-DAVEL, a weakly-supervised task for dense audio-visual event localization, using cross-modal salient anchors and achieves state-of-the-art results.

DetailsMotivation: To address the challenge of localizing events in untrimmed videos with only video-level labels, where temporal boundaries are unknown.

Method: Proposes a Mutual Event Agreement Evaluation module and Cross-modal Salient Anchor Identification to identify reliable timestamps, followed by Anchor-based Temporal Propagation for enhanced localization.

Result: Achieves state-of-the-art performance on UnAV-100 and ActivityNet1.3 datasets.

Conclusion: The method effectively localizes events under weak supervision by leveraging cross-modal consistency.

Abstract: The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor Identification module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. The anchor features after multimodal integration are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.
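The agreement-based anchor idea can be pictured with a few lines of NumPy: score each timestep by how closely the audio and visual class predictions agree, then keep the most agreeing timestamps as anchors. The L1 discrepancy and top-k selection below are illustrative assumptions about the Mutual Event Agreement Evaluation idea, not the paper's exact formulation.

```python
# Sketch of cross-modal salient anchor selection from per-timestep audio and
# visual event-class probabilities. Discrepancy measure and top-k selection
# are illustrative assumptions.
import numpy as np

def salient_anchors(audio_probs: np.ndarray, visual_probs: np.ndarray,
                    k: int = 5) -> np.ndarray:
    """audio_probs, visual_probs: (T, C) per-timestep class probabilities.

    Returns indices of the k timestamps where the two modalities agree most."""
    discrepancy = np.abs(audio_probs - visual_probs).sum(axis=1)  # (T,)
    agreement = -discrepancy           # lower discrepancy = higher agreement
    return np.argsort(agreement)[-k:]  # top-k agreeing timestamps

T, C = 60, 10
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(C), size=T)
v = rng.dirichlet(np.ones(C), size=T)
print(salient_anchors(a, v))  # anchor timestamps seed temporal propagation
```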

[121] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

Gopalji Gaur, Mohammadreza Zolfaghari, Thomas Brox

Main category: cs.CV

TL;DR: A training-free method for generating visually consistent subjects in sequential images using text-to-image diffusion models, avoiding computational costs of fine-tuning.

DetailsMotivation: Existing methods for maintaining subject consistency in visual storytelling are computationally expensive and disrupt pre-trained model capabilities.

Method: Introduces masked cross-image attention sharing and Regional Feature Harmonization to align and refine subject features dynamically.

Result: Successfully generates consistent subjects across scenarios without compromising the model’s creative abilities.

Conclusion: The proposed approach efficiently ensures subject consistency in visual storytelling without retraining, preserving model performance.

Abstract: Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model’s pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.

[122] Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Rafayel Mkrtchyan, Armen Manukyan, Hrant Khachatrian, Theofanis P. Raptis

Main category: cs.CV

TL;DR: A deep learning approach using DINOv2 and RF data improves building mapping accuracy, outperforming baselines with a 65.3% IoU.

DetailsMotivation: Address limitations of conventional mapping techniques (cost, accessibility, accuracy) and biases in open-source maps.

Method: Integrates DINOv2 with RF data using a vision transformer to process both modalities, capturing spatial dependencies.

Result: Achieves 65.3% IoU, surpassing erroneous maps (40.1%), RF-only (37.3%), and non-AI fusion (42.2%) baselines.

Conclusion: The proposed method enhances mapping accuracy by combining RF data with open-source maps, validated by synthetic dataset metrics.

Abstract: Environment mapping is an important computing task for a wide range of smart city applications, including autonomous navigation, wireless network operations and extended reality environments. Conventional smart city mapping techniques, such as satellite imagery, LiDAR scans, and manual annotations, often suffer from limitations related to cost, accessibility and accuracy. Open-source mapping platforms have been widely utilized in artificial intelligence applications for environment mapping, serving as a source of ground truth. However, human errors and the evolving nature of real-world environments introduce biases that can negatively impact the performance of neural networks trained on such data. In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining maps from open-source platforms with radio frequency (RF) data collected from multiple wireless user equipments and base stations. Our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For the evaluation purposes, we employ a synthetic dataset co-produced by Huawei. We develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics which capture different qualities: (i) The Jaccard index, also known as intersection over union (IoU), (ii) the Hausdorff distance, and (iii) the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed which yields 42.2%.
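Of the three metrics the abstract lists, the macro IoU is the simplest to make concrete. The sketch below averages per-class IoU over the building and background classes of binary masks; the mask shapes are illustrative.

```python
# Sketch of the macro IoU (Jaccard index) metric over building maps,
# averaging per-class IoU across the building and background classes.
import numpy as np

def macro_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """pred, truth: boolean (H, W) building masks."""
    ious = []
    for cls in (True, False):                    # building, background
        p, t = (pred == cls), (truth == cls)
        union = np.logical_or(p, t).sum()
        if union:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

truth = np.zeros((64, 64), dtype=bool); truth[10:40, 10:40] = True
pred = np.zeros_like(truth);            pred[12:42, 12:42] = True
print(f"macro IoU: {macro_iou(pred, truth):.3f}")
```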

[123] VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission

Jianqiao Chen, Tingting Zhu, Huishi Song, Nan Ma, Xiaodong Xu

Main category: cs.CV

TL;DR: VQ-DeepISC is a digital semantic communication system using vector quantization and channel adaptation for efficient, robust image transmission, outperforming benchmarks.

DetailsMotivation: The challenge lies in digitizing semantic features while preserving continuity and context, ensuring robustness to channel degradation.

Method: Proposes VQ-DeepISC with Swin Transformer for semantic feature extraction, VQ modules for discrete latent spaces, and attention-driven channel adaptation. Uses KLD regularization and EMA for stable training, and QPSK-OFDM for digital communication.

Result: Superior reconstruction fidelity compared to benchmark methods.

Conclusion: VQ-DeepISC effectively addresses semantic feature digitization and transmission challenges, demonstrating practical potential.

Abstract: Discretization of semantic features enables interoperability between semantic and digital communication systems, showing significant potential for practical applications. The fundamental difficulty in digitizing semantic features stems from the need to preserve continuity and context in inherently analog representations during their compression into discrete symbols while ensuring robustness to channel degradation. In this paper, we propose a vector quantized (VQ)-enabled digital semantic communication system with channel adaptive image transmission, named VQ-DeepISC. Guided by deep joint source-channel coding (DJSCC), we first design a Swin Transformer backbone for hierarchical semantic feature extraction, followed by VQ modules projecting features into discrete latent spaces. Consequently, it enables efficient index-based transmission instead of raw feature transmission. To further optimize this process, we develop an attention mechanism-driven channel adaptation module to dynamically optimize index transmission. Second, to counteract codebook collapse during training, we impose a distributional regularization by minimizing the Kullback-Leibler divergence (KLD) between codeword usage frequencies and a uniform prior. Meanwhile, exponential moving average (EMA) is employed to stabilize training and ensure balanced feature coverage during codebook updates. Finally, digital communication is implemented using quadrature phase shift keying (QPSK) modulation alongside orthogonal frequency division multiplexing (OFDM), adhering to the IEEE 802.11a standard. Experimental results demonstrate superior reconstruction fidelity of the proposed system over benchmark methods.
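
A minimal sketch of the codebook machinery the abstract describes: nearest-codeword quantization with index-based transmission, an EMA cluster-size update, and a KLD regularizer pushing codeword usage toward a uniform prior. Codebook sizes, the straight-through estimator, and the truncated EMA update are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

class EMACodebook(torch.nn.Module):
    def __init__(self, num_codes=512, dim=256, decay=0.99):
        super().__init__()
        self.decay = decay
        self.register_buffer("codes", torch.randn(num_codes, dim))
        self.register_buffer("cluster_size", torch.zeros(num_codes))

    def forward(self, z):                       # z: (B*T, dim) semantic features
        dist = torch.cdist(z, self.codes)       # distance to every codeword
        idx = dist.argmin(dim=1)                # nearest-codeword indices to transmit
        onehot = F.one_hot(idx, self.codes.size(0)).float()
        zq = self.codes[idx]

        if self.training:                       # EMA bookkeeping (no gradient);
            with torch.no_grad():               # a full EMA also tracks feature sums
                self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)

        usage = onehot.mean(0).clamp_min(1e-8)  # empirical codeword frequencies
        uniform = torch.full_like(usage, 1.0 / usage.numel())
        kld = (usage * (usage / uniform).log()).sum()   # KL(usage || uniform)

        zq = z + (zq - z).detach()              # straight-through estimator
        return zq, idx, kld
```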

[124] Tobler’s First Law in GeoAI: A Spatially Explicit Deep Learning Model for Terrain Feature Detection Under Weak Supervision

Wenwen Li, Chia-Yu Hsu, Maosheng Hu

Main category: cs.CV

TL;DR: The paper presents a weakly supervised deep learning model for geospatial object detection, addressing challenges such as scarce training data and the neglect of spatial principles in AI models. It uses Tobler’s first law, attention maps, and multistage training, applied to detect Mars craters.

DetailsMotivation: Challenges in GeoAI include limited training data and ignoring spatial principles, hindering AI-geospatial integration. The paper aims to address these gaps.

Method: Develops a spatially explicit model using Tobler’s first law, integrates attention maps, and employs multistage training for weakly supervised object detection.

Result: Successfully detects natural features such as Mars impact craters, and the model generalizes to features on Earth and other planetary surfaces.

Conclusion: Advances GeoAI by improving theoretical and methodological foundations for weakly supervised geospatial object detection.

Abstract: Recent interest in geospatial artificial intelligence (GeoAI) has fostered a wide range of applications using artificial intelligence (AI), especially deep learning, for geospatial problem solving. However, major challenges such as a lack of training data and the neglect of spatial principles and spatial effects in AI model design remain, significantly hindering the in-depth integration of AI with geospatial research. This paper reports our work in developing a deep learning model that enables object detection, particularly of natural features, in a weakly supervised manner. Our work makes three contributions: First, we present a method of object detection using only weak labels. This is achieved by developing a spatially explicit model based on Tobler’s first law of geography. Second, we incorporate attention maps into the object detection pipeline and develop a multistage training strategy to improve performance. Third, we apply this model to detect impact craters on Mars, a task that previously required extensive manual effort. The model generalizes to both natural and human-made features on the surfaces of Earth and other planets. This research advances the theoretical and methodological foundations of GeoAI.
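
For intuition, Tobler's first law translates into code as a distance-decay weighting: supervision from a weak point label is trusted more for nearby pixels than for distant ones. The Gaussian kernel, its bandwidth, and the function name below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def tobler_weights(h, w, label_rc, sigma=12.0):
    """Per-pixel supervision weights that decay with distance from a weak point label."""
    rows, cols = np.mgrid[0:h, 0:w]
    d2 = (rows - label_rc[0]) ** 2 + (cols - label_rc[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))   # near things are more related

weights = tobler_weights(128, 128, label_rc=(64, 64))  # peak at the labeled point
```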

[125] RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma

Main category: cs.CV

TL;DR: RoboTron-Drive is a general large multimodal model for autonomous driving, excelling in diverse tasks and datasets, achieving state-of-the-art performance.

DetailsMotivation: Current AD models focus narrowly on single datasets/tasks, lacking generalization. RoboTron-Drive aims to bridge this gap.

Method: Curriculum pre-training on varied visual inputs, followed by dataset augmentation and fine-tuning for diverse AD tasks.

Result: Achieves SOTA on six benchmarks and excels in zero-shot transfer on three unseen datasets.

Conclusion: RoboTron-Drive is a promising, versatile solution for real-world autonomous driving.

Abstract: Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD datasets to finetune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks. We hope RoboTron-Drive serves as a promising solution for AD in the real world. Project page with code: https://github.com/zhijian11/RoboTron-Drive.

[126] Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation

Riccardo Fiorista, Awad Abdelhalim, Anson F. Stewart, Gabriel L. Pincus, Ian Thistle, Jinhua Zhao

Main category: cs.CV

TL;DR: The paper explores using CCTV footage and computer vision to estimate urban rail platform occupancy, comparing three methods and introducing a novel linear-optimization approach. Results show promise for real-time crowding estimation.

DetailsMotivation: Improving safety, efficiency, and customer experience in urban rail transit by accurately estimating platform occupancy in real-time, addressing the limitations of indirect proxies.

Method: Three computer vision approaches: object detection/counting (YOLOv11, RT-DETRv2, APGCC), crowd-level classification (Crowd-ViT), and semantic segmentation (DeepLabV3), plus a novel linear-optimization method for count extraction.

Result: Computer vision methods, tested on a large WMATA dataset, prove valuable for crowd estimation, enabling precise real-time occupancy tracking without relying on other data sources.

Conclusion: CCTV-based computer vision can enhance real-time crowding estimation, supporting timely operational decisions to mitigate platform crowding.

Abstract: Accurately estimating urban rail platform occupancy can enhance transit agencies’ ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real-time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed-Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real-time occupancy estimates. The presented study investigates this potential by comparing three state-of-the-art computer vision approaches for extracting crowd-related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT-DETRv2, and APGCC; (b) crowd-level classification via a custom-trained Vision Transformer, Crowd-ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear-optimization-based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy-preserving dataset created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA) that encompasses more than 600 hours of video material, these approaches demonstrate that computer vision can provide substantive value for crowd estimation. This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real-time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.
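
The linear-optimization count extraction is not specified in detail, but one consistent reading is a depth-banded area model: the pixel area of one person shrinks with distance, so band-wise weights fitted with non-negative least squares convert segmented area into a count. Everything below, including the band edges and the synthetic calibration data, is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import nnls

BINS = [0, 5, 10, 20]                           # depth bands in metres (assumed)

def band_areas(person_mask, depth_map, bins=BINS):
    """Total person-pixel area falling inside each depth band."""
    return np.array([(person_mask & (depth_map >= lo) & (depth_map < hi)).sum()
                     for lo, hi in zip(bins[:-1], bins[1:])], dtype=float)

# Calibration on a few labelled frames: rows = frames, columns = depth bands.
rng = np.random.default_rng(0)
frames = [(rng.random((64, 64)) > 0.9, rng.uniform(0, 20, (64, 64))) for _ in range(6)]
true_counts = rng.integers(1, 15, size=6).astype(float)
A = np.stack([band_areas(m, d) for m, d in frames])
people_per_pixel, _ = nnls(A, true_counts)      # non-negative band weights

def estimate_count(person_mask, depth_map):
    return float(band_areas(person_mask, depth_map) @ people_per_pixel)
```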

[127] Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

Umberto Cappellazzo, Minsu Kim, Stavros Petridis

Main category: cs.CV

TL;DR: Llama-MTSK is a Matryoshka-based Multimodal LLM for AVSR that adapts audio-visual token allocation efficiently under compute constraints, outperforming fixed-compression models.

DetailsMotivation: To address the trade-off between computational cost and accuracy in AVSR due to long speech representations in LLMs.

Method: Proposes Llama-MTSK, encoding representations at multiple granularities with a single architecture, and introduces three LoRA-based fine-tuning strategies.

Result: Matches or outperforms fixed-compression models on major AVSR datasets.

Conclusion: Llama-MTSK offers a flexible and efficient solution for AVSR, balancing computational constraints and accuracy.

Abstract: Audio-Visual Speech Recognition (AVSR) leverages audio and visual modalities to improve robustness in noisy environments. Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including AVSR. However, the long speech representations lead to high computational costs for LLMs. Prior methods compress inputs before feeding them to LLMs, but high compression often harms accuracy. To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which flexibly adapts audio-visual token allocation under varying compute constraints. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture, avoiding the need for separate models. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules. Evaluations on major AVSR datasets show Llama-MTSK matches or outperforms models trained at fixed compression levels.
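
A minimal sketch of the Matryoshka idea as applied here: one shared module pools the audio-visual token sequence at several compression rates during training, and a single rate is picked at inference to match the compute budget. The rates, dimensions, and pooling choice are assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

class MatryoshkaPooler(torch.nn.Module):
    def __init__(self, dim=1024, rates=(1, 2, 4, 8)):
        super().__init__()
        self.rates = rates
        self.proj = torch.nn.Linear(dim, dim)   # shared across all granularities

    def forward(self, tokens, rate=None):       # tokens: (B, T, dim)
        rates = [rate] if rate else self.rates  # train on all rates, infer with one
        outs = []
        for r in rates:
            pooled = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=r, stride=r)
            outs.append(self.proj(pooled.transpose(1, 2)))   # (B, T // r, dim)
        return outs

compressed = MatryoshkaPooler()(torch.randn(2, 64, 1024), rate=4)
```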

[128] Modular Transformer Architecture for Precision Agriculture Imaging

Brian Gopalan, Nathalia Nascimento, Vishal Monga

Main category: cs.CV

TL;DR: A modular deep-learning framework for weed segmentation in drone videos dynamically routes degraded images through specialized pre-processing and transformer models, outperforming CNN-based methods.

DetailsMotivation: Addressing the need for efficient and accurate weed segmentation in precision agriculture, especially under common image degradation like blur and noise.

Method: The system analyzes image quality (blur/noise) using Mean Absolute Deviation and the Laplacian, then routes data to one of three transformer models: baseline, noise-reducing (Fisher Vector), or blur-correcting (unrolled Richardson-Lucy decoder).

Result: Outperforms CNN-based methods in segmentation quality and computational efficiency.

Conclusion: The framework represents a significant advancement in deep-learning applications for agriculture.

Abstract: This paper addresses the critical need for efficient and accurate weed segmentation from drone video in precision agriculture. A quality-aware modular deep-learning framework is proposed that addresses common image degradation by analyzing quality conditions, such as blur and noise, and routing inputs through specialized pre-processing and transformer models optimized for each degradation type. The system first analyzes drone images for noise and blur using Mean Absolute Deviation and the Laplacian. Data is then dynamically routed to one of three vision transformer models: a baseline for clean images, a modified transformer with Fisher Vector encoding for noise reduction, or another with an unrolled Richardson-Lucy decoder to correct blur. This novel routing strategy allows the system to outperform existing CNN-based methods in both segmentation quality and computational efficiency, demonstrating a significant advancement in deep-learning applications for agriculture.
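
The routing front end is straightforward to sketch: variance of the Laplacian as a blur score and Mean Absolute Deviation as a noise score, with thresholds deciding which branch receives the image. The thresholds and the exact MAD formulation below are illustrative assumptions.

```python
import cv2
import numpy as np

def route(image_gray: np.ndarray, blur_thresh=100.0, noise_thresh=20.0) -> str:
    g = image_gray.astype(np.float64)
    blur_score = cv2.Laplacian(g, cv2.CV_64F).var()   # low variance = blurry
    mad = np.mean(np.abs(g - np.median(g)))           # crude noise/spread proxy
    if blur_score < blur_thresh:
        return "blur_branch"    # transformer with unrolled deconvolution decoder
    if mad > noise_thresh:
        return "noise_branch"   # transformer with Fisher Vector encoding
    return "baseline_branch"    # clean image, baseline transformer

print(route(np.random.randint(0, 255, (480, 640), dtype=np.uint8)))
```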

[129] Generating Synthetic Invoices via Layout-Preserving Content Replacement

Bevin V, Ananthakrishnan P V, Ragesh KR, Sanjay M, Vineeth S, Bibin Wilson

Main category: cs.CV

TL;DR: A novel pipeline generates synthetic invoice documents to overcome dataset constraints in machine learning for invoice processing.

DetailsMotivation: Addressing privacy and cost barriers in acquiring diverse datasets for training invoice processing models.

Method: Uses OCR for layout extraction, LLM for synthetic content, and inpainting to replace text while preserving layout.

Result: Produces realistic synthetic invoices and aligned structured data, scalable for dataset expansion.

Conclusion: Enables creation of large, diverse training datasets to improve document intelligence models.

Abstract: The performance of machine learning models for automated invoice processing is critically dependent on large-scale, diverse datasets. However, the acquisition of such datasets is often constrained by privacy regulations and the high cost of manual annotation. To address this, we present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our method first utilizes Optical Character Recognition (OCR) to extract the text content and precise spatial layout from a source invoice. Select data fields are then replaced with contextually realistic, synthetic content generated by a large language model (LLM). Finally, we employ an inpainting technique to erase the original text from the image and render the new, synthetic text in its place, preserving the exact layout and font characteristics. This process yields a pair of outputs: a visually realistic new invoice image and a perfectly aligned structured data file (JSON) reflecting the synthetic content. Our approach provides a scalable and automated solution to amplify small, private datasets, enabling the creation of large, varied corpora for training more robust and accurate document intelligence models.
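
A toy version of the replacement step, assuming OCR has already produced a word bounding box: the original pixels are erased by inpainting and a synthetic string (standing in here for LLM output) is redrawn at the same position. Font handling and box geometry are deliberately simplified assumptions.

```python
import cv2
import numpy as np

def replace_text(img, box, new_text):
    x, y, w, h = box                                   # OCR word box in pixels
    mask = np.zeros(img.shape[:2], np.uint8)
    mask[y:y + h, x:x + w] = 255
    clean = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)   # erase original text
    cv2.putText(clean, new_text, (x, y + h), cv2.FONT_HERSHEY_SIMPLEX,
                h / 25.0, (0, 0, 0), 1, cv2.LINE_AA)       # redraw synthetic text
    return clean

invoice = 255 * np.ones((200, 600, 3), np.uint8)           # stand-in invoice image
synthetic = replace_text(invoice, box=(40, 80, 160, 22), new_text="ACME Corp Ltd.")
```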

[130] Consistent and Invariant Generalization Learning for Short-video Misinformation Detection

Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han

Main category: cs.CV

TL;DR: The paper proposes DOCTOR, a domain generalization model for short-video misinformation detection, addressing domain gaps by leveraging cross-modal consistency and invariance learning.

DetailsMotivation: Current models for short-video misinformation detection perform poorly on unseen domains due to domain gaps. The paper aims to improve domain generalization by addressing modality reliance and cross-modal biases.

Method: DOCTOR uses cross-modal feature interpolation and interpolation distillation for shared space mapping and multi-modal synchronization. It also employs a diffusion model to retain core features and enhance domain invariance.

Result: Extensive experiments show DOCTOR’s effectiveness in improving domain generalization for misinformation detection.

Conclusion: DOCTOR successfully addresses domain gaps in short-video misinformation detection through consistency and invariance learning, outperforming existing models.

Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify the misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we present deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audio). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We involve the cross-modal feature interpolation to map multiple modalities into a shared space and the interpolation distillation to synchronize multi-modal learning; (2) We design the diffusion model to add noise to retain core multi-modal features and enhance domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.
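
A hedged sketch of cross-modal feature interpolation: video and audio features are projected into a shared space and mixed with a random coefficient, producing a target the interpolation distillation can then supervise. The projection heads and the Beta-sampled mixing weight are assumptions, not the paper's exact design.

```python
import torch

proj_v = torch.nn.Linear(768, 512)   # video features -> shared space
proj_a = torch.nn.Linear(128, 512)   # audio features -> shared space

def interpolate(video_feat, audio_feat):
    zv, za = proj_v(video_feat), proj_a(audio_feat)
    lam = torch.distributions.Beta(0.5, 0.5).sample()   # random mixing weight
    return lam * zv + (1 - lam) * za                    # interpolated feature

z = interpolate(torch.randn(8, 768), torch.randn(8, 128))
```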

[131] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment

Ziheng Jia, Jiaying Qian, Zicheng Zhang, Zijian Chen, Xiongkuo Min

Main category: cs.CV

TL;DR: Refine-IQA introduces a multi-stage RFT framework for IQA, enhancing visual perception and supervising the ’think’ process, achieving top performance.

DetailsMotivation: Existing RFT-based IQA methods lack supervision for the ’think’ process, and direct fine-tuning limits performance.

Method: Stage-1: Builds the Refine-Perception-20K dataset and designs multi-task reward functions to strengthen visual quality perception. Stage-2: Uses a probability difference reward for ’think’-process supervision in quality scoring.

Result: Refine-IQA excels in perception and scoring tasks, with strong ’think’ capability.

Conclusion: The framework addresses gaps in RFT for IQA, improving performance and interpretability.

Abstract: Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model’s rollouts but provide no reward supervision for the “think” process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model’s native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model’s visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a strategy involving a probability difference reward for “think” process supervision. The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks; notably, our paradigm activates a robust “think” (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.

[132] 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis

Mingyu Liu, Zian Mao, Zhu Liu, Haoran Zhang, Jintao Guo, Xiaoya He, Xi Huang, Shufen Chu, Chun Cheng, Jun Ding, Yujun Xie

Main category: cs.CV

TL;DR: 4D-PreNet, a deep-learning pipeline, addresses data preprocessing bottlenecks in 4D-STEM by integrating denoising, center correction, and distortion calibration, improving accuracy and reliability for real-time analysis.

DetailsMotivation: High-throughput 4D-STEM faces challenges like noise, beam drift, and distortions, which bias measurements. Existing methods lack robustness and generalizability.

Method: 4D-PreNet combines attention-enhanced U-Net and ResNet architectures, trained on simulated datasets with varied noise, drift, and distortion conditions.

Result: The pipeline reduces mean squared error by 50% in denoising, achieves sub-pixel center localization (errors <0.04 pixels), and outperforms traditional algorithms.

Conclusion: 4D-PreNet enables reliable, high-throughput 4D-STEM analysis, enhancing automated characterization.

Abstract: Automated experimentation with real time data analysis in scanning transmission electron microscopy (STEM) often requires an end-to-end framework. Four-dimensional scanning transmission electron microscopy (4D-STEM) with high-throughput data acquisition has been constrained by a critical bottleneck arising from data preprocessing. Pervasive noise, beam center drift, and elliptical distortions during high-throughput acquisition inevitably corrupt diffraction patterns, systematically biasing quantitative measurements. Yet, conventional correction algorithms are often material-specific and fail to provide a robust, generalizable solution. In this work, we present 4D-PreNet, an end-to-end deep-learning pipeline that integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration. The network is trained on large, simulated datasets encompassing a wide range of noise levels, drift magnitudes, and distortion types, enabling it to generalize effectively to experimental data acquired under varying conditions. Quantitative evaluations demonstrate that our pipeline reduces mean squared error by up to 50% during denoising and achieves sub-pixel center localization in the center detection task, with average errors below 0.04 pixels. The outputs are benchmarked against traditional algorithms, highlighting improvements in both noise suppression and restoration of diffraction patterns, thereby facilitating high-throughput, reliable 4D-STEM real-time analysis for automated characterization.

[133] HPSv3: Towards Wide-Spectrum Human Preference Score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, Hongsheng Li

Main category: cs.CV

TL;DR: The paper introduces HPSv3, a human-centric metric for evaluating text-to-image models, addressing limitations in existing metrics. It includes a dataset (HPDv3) and a VLM-based preference model with an uncertainty-aware ranking loss. The method CoHP iteratively refines images using HPSv3, improving quality without extra data.

DetailsMotivation: Existing human-centric metrics for text-to-image models are limited by data coverage, feature extraction, and loss functions, necessitating a more robust solution.

Method: The authors propose HPSv3, which includes a dataset (HPDv3) with 1.08M text-image pairs and 1.17M comparisons, a VLM-based preference model with uncertainty-aware ranking loss, and CoHP for iterative image refinement.

Result: HPSv3 proves effective for wide-spectrum image evaluation, and CoHP enhances image quality efficiently without additional data.

Conclusion: HPSv3 and CoHP provide a robust, human-aligned approach to evaluating and improving text-to-image generation, with publicly available code and dataset.

Abstract: Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.
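
The uncertainty-aware ranking loss is not spelled out in the abstract; one plausible Thurstone-style form has the model predict a score and a variance per image and compares pairs under a Gaussian. The sketch below illustrates that form under those assumptions, not the paper's exact objective.

```python
import torch

def uncertainty_ranking_loss(mu_win, var_win, mu_lose, var_lose):
    """Negative log-likelihood that the human-preferred image scores higher."""
    z = (mu_win - mu_lose) / torch.sqrt(var_win + var_lose + 1e-6)
    p_win = 0.5 * (1.0 + torch.erf(z / 2 ** 0.5))   # Gaussian CDF of the margin
    return -torch.log(p_win + 1e-8).mean()

loss = uncertainty_ranking_loss(torch.randn(16), torch.rand(16),
                                torch.randn(16), torch.rand(16))
```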

[134] Deep learning framework for crater detection and identification on the Moon and Mars

Yihan Ma, Zeyang Yu, Rohitash Chandra

Main category: cs.CV

TL;DR: The paper proposes a deep learning framework for automated crater detection using CNNs, YOLO, and ResNet-50, showing YOLO’s balanced performance and ResNet-50’s precision for large craters.

DetailsMotivation: Impact craters are key for planetary science, and automated detection using deep learning can enhance research efficiency.

Method: A two-stage approach: first, crater identification with CNNs, ResNet-50, and YOLO; second, YOLO-based localization. Applied to Mars and Moon remote sensing data.

Result: YOLO performs best overall, while ResNet-50 is precise for large craters.

Conclusion: The framework effectively detects and identifies craters, with YOLO and ResNet-50 excelling in different aspects.

Abstract: Impact craters are among the most prominent geomorphological features on planetary surfaces and are of substantial significance in planetary science research. Their spatial distribution and morphological characteristics provide critical information on planetary surface composition, geological history, and impact processes. In recent years, the rapid advancement of deep learning models has fostered significant interest in automated crater detection. In this paper, we apply recent advances in deep learning to impact crater detection and identification, using Convolutional Neural Networks (CNNs) and variants such as YOLO and ResNet. We present a two-stage framework: the first stage performs crater identification using a simple classic CNN, ResNet-50, and YOLO, while the second stage employs YOLO-based detection for crater localisation. We thus detect and identify different types of craters and present a summary report with remote sensing data for selected regions of Mars and the Moon. Our results indicate that YOLO demonstrates the most balanced crater detection performance, while ResNet-50 excels in identifying large craters with high precision.
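
A rough sketch of the two-stage pipeline: a ResNet-50 tile classifier flags crater-bearing regions, then a YOLO detector localises the craters. The checkpoints are placeholders and the ultralytics API is an assumption about tooling, not the authors' code.

```python
import torch
from torchvision.models import resnet50
from ultralytics import YOLO

classifier = resnet50(num_classes=2)   # stage 1: crater / no-crater tile classifier
# classifier.load_state_dict(torch.load("crater_cls.pt"))  # hypothetical weights
classifier.eval()
detector = YOLO("yolo11n.pt")          # stage 2: swap in a crater-trained checkpoint

def process(tile):                     # tile: (1, 3, 640, 640) float tensor in [0, 1]
    with torch.no_grad():
        if classifier(tile).argmax(dim=1).item() == 1:    # tile flagged as cratered
            img = (tile.squeeze(0).permute(1, 2, 0) * 255).byte().numpy()
            return detector(img)[0].boxes                 # localised crater boxes
    return None
```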

[135] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model

Shen Zhu, Yinzhu Jin, Ifrah Zawar, P. Thomas Fletcher

Main category: cs.CV

TL;DR: A diffusion model for generating point-based shapes with preserved correspondences, outperforming existing methods and enabling applications like conditional and counterfactual generation.

DetailsMotivation: Current deep learning methods ignore point correspondences in shape generation, limiting their utility for tasks requiring such correspondences.

Method: Proposes a diffusion model trained on OASIS-3 data to generate realistic point-based shapes while preserving correspondences.

Result: The model generates highly realistic hippocampal shapes with preserved correspondences, outperforming existing methods.

Conclusion: The model enables applications like conditional generation and disease progression prediction, advancing shape representation in deep learning.

Abstract: We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead; in particular, current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model on downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.
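
Correspondence preservation comes almost for free once shapes are ordered (N, 3) arrays: a DDPM-style denoiser that maps point k's noisy coordinates back to point k keeps the ordering, and hence the correspondences, intact. The toy training step below, which omits timestep conditioning for brevity, is illustrative rather than the paper's model.

```python
import torch

N, T = 1024, 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = torch.nn.Sequential(                 # predicts noise per point
    torch.nn.Linear(3, 128), torch.nn.ReLU(), torch.nn.Linear(128, 3))

def train_step(x0):                             # x0: (B, N, 3) corresponded points
    t = torch.randint(0, T, (x0.size(0),))
    a = alpha_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward noising
    return ((denoiser(xt) - eps) ** 2).mean()   # point k stays point k throughout

loss = train_step(torch.randn(4, N, 3))
```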

[136] Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation

Xiangcen Wu, Shaheer U. Saeed, Yipei Wang, Ester Bonmati Coll, Yipeng Hu

Main category: cs.CV

TL;DR: A recommendation system for prostate cancer segmentation suggests optimal imaging modalities and regions, improving accuracy and efficiency.

DetailsMotivation: Radiologists use varied strategies for medical image analysis, but current machine learning models lack dynamic decision-making for optimal modality and region selection.

Method: A policy network recommends imaging modalities and regions, iteratively refining segmentation using a pre-trained model.

Result: The method outperforms standard segmentation networks and develops unique strategies, validated on 1325 MRI images.

Conclusion: The approach enhances segmentation and annotation, with potential for interactive radiologist assistance.

Abstract: Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information at different locations from different images independently as well as concurrently. In this paper, we propose a recommendation system to assist machine learning-based segmentation models, by suggesting appropriate image portions along with the best modality, such that prostate cancer segmentation performance can be maximised. Our approach trains a policy network that assists tumor localisation, by recommending both the optimal imaging modality and the specific sections of interest for review. During training, a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these imaging modalities and their sections, as selected by the policy network. Taking the locally segmented regions as input for the next step, this dynamic decision making process iterates until all cancers are best localised. We validate our method using a data set of 1325 labelled multiparametric MRI images from prostate cancer patients, demonstrating its potential to improve annotation efficiency and segmentation accuracy, especially when challenging pathology is present. Experimental results show that our approach can surpass standard segmentation networks. Perhaps more interestingly, our trained agent independently developed its own optimal strategy, which may or may not be consistent with current radiologist guidelines such as PI-RADS. This observation also suggests a promising interactive application, in which the proposed policy networks assist human radiologists.

[137] Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

Lin Zhang, Zefan Cai, Yufan Zhou, Shentong Mo, Jinhong Lin, Cheng-En Wu, Yibing Wei, Yijing Zhang, Ruiyi Zhang, Wen Xiao, Tong Sun, Junjie Hu, Pedro Morgado

Main category: cs.CV

TL;DR: A two-stage training paradigm reduces reliance on manual curation for audio-synchronized visual animation, using noisy pretraining and small-scale fine-tuning, achieving scalability and generalization.

DetailsMotivation: Existing methods require expensive manual curation of high-quality training videos, limiting scalability to diverse audio-video classes.

Method: A two-stage approach: pretraining on noisy, large-scale videos, followed by fine-tuning on small, high-quality datasets. Multi-feature conditioning and window attention enhance synchronization.

Result: Reduces manual curation effort by over 10×, generalizes to open classes, and performs strongly on AVSync48, a benchmark 3× more diverse than prior ones.

Conclusion: The proposed paradigm efficiently scales audio-synchronized animation while minimizing human effort, leveraging pretrained models and innovative training techniques.

Abstract: Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we finetune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and window attention. To efficiently train the model, we leverage pretrained text-to-video generator and audio encoders, introducing only 1.9% additional trainable parameters to learn audio-conditioning capability without compromising the generator’s prior knowledge. For evaluation, we introduce AVSync48, a benchmark with videos from 48 classes, which is 3× more diverse than previous benchmarks. Extensive experiments show that our method significantly reduces reliance on manual curation by over 10×, while generalizing to many open classes.

[138] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid

Main category: cs.CV

TL;DR: RAVID is a novel framework for detecting AI-generated images using visual retrieval-augmented generation (RAG), outperforming existing methods in accuracy and robustness.

DetailsMotivation: Existing AI-generated image detection methods lack generalization and robustness, relying on low-level artifacts. RAVID aims to enhance detection by dynamically retrieving relevant images.

Method: RAVID uses a fine-tuned CLIP image encoder (RAVID CLIP) with category-related prompts and integrates a vision-language model (VLM) to fuse retrieved images with the query.

Result: RAVID achieves 93.85% accuracy on the UniversalFakeDetect benchmark and maintains 80.27% accuracy under image degradations, outperforming C2P-CLIP (63.44%).

Conclusion: RAVID sets a new standard for AI-generated image detection, offering superior accuracy and robustness; the code will be released publicly upon acceptance.

Abstract: In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
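
A condensed sketch of the retrieval step: embed the query image, score it against a database of unit-norm embeddings by cosine similarity, and keep the top-k neighbours for the VLM prompt. The public CLIP checkpoint below stands in for the fine-tuned RAVID CLIP, which is not released here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    """Unit-norm CLIP embeddings for a list of PIL images."""
    inputs = proc(images=images, return_tensors="pt")
    with torch.no_grad():
        z = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(z, dim=-1)

def retrieve(query_img, db_embeds, k=5):
    sims = embed([query_img]) @ db_embeds.T     # cosine similarity on unit vectors
    return sims.topk(k).indices.squeeze(0)      # indices of support images for the VLM
```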

[139] Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images

Michele Andrade, Guilherme A. L. Silva, Valéria Santos, Gladston Moreira, Eduardo Luz

Main category: cs.CV

TL;DR: The paper explores the impact of large-scale pre-training datasets on deep learning models for estimating food nutrition from 2D images, finding proprietary datasets like JFT-300M outperform public ones like ImageNet and COYO.

DetailsMotivation: Nutritional estimation from images is vital for health monitoring but challenging due to variability in food presentation and reliance on proprietary datasets. This study aims to evaluate the role of pre-training datasets in model performance.

Method: Fine-tuned Vision Transformer (ViT) models pre-trained on ImageNet and COYO, compared against CNN baselines (InceptionV2, ResNet-50) and a JFT-300M pre-trained model. Evaluated on the Nutrition5k dataset using MAE and MAE%.

Result: JFT-300M pre-trained models outperformed public dataset models. Surprisingly, COYO pre-trained models performed worse than ImageNet, contradicting initial expectations.

Conclusion: Pre-training dataset characteristics (scale, domain relevance, curation) are crucial for effective transfer learning in nutritional estimation.

Abstract: Estimating the nutritional content of food from images is a critical task with significant implications for health and dietary monitoring. This is challenging, especially when relying solely on 2D images, due to the variability in food presentation, lighting, and the inherent difficulty in inferring volume and mass without depth information. Furthermore, reproducibility in this domain is hampered by the reliance of state-of-the-art methods on proprietary datasets for large-scale pre-training. In this paper, we investigate the impact of large-scale pre-training datasets on the performance of deep learning models for nutritional estimation using only 2D images. We fine-tune and evaluate Vision Transformer (ViT) models pre-trained on two large public datasets, ImageNet and COYO, comparing their performance against baseline CNN models (InceptionV2 and ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset. We conduct extensive experiments on the Nutrition5k dataset, a large-scale collection of real-world food plates with high-precision nutritional annotations. Our evaluation using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAE%) reveals that models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets. Unexpectedly, the model pre-trained on the massive COYO dataset performs worse than the model pre-trained on ImageNet for this specific regression task, refuting our initial hypothesis. Our analysis provides quantitative evidence highlighting the critical role of pre-training dataset characteristics, including scale, domain relevance, and curation quality, for effective transfer learning in 2D nutritional estimation.

[140] Single-Scale Decomposition Network (SSD-Net) for Underwater Image Enhancement

Zheng Cheng, Wenri Wang, Guangyong Chen, Yakun Ju, Yihua Cheng, Zhisong Liu, Yanda Meng, Jintao Song

Main category: cs.CV

TL;DR: The paper challenges the necessity of multi-scale feature fusion in underwater image enhancement, proposing a single-scale approach (SSD-Net) that matches or outperforms multi-scale methods with reduced complexity.

DetailsMotivation: Current UIE methods rely on multi-scale feature extraction, but experiments show single-scale methods can achieve comparable or better results, prompting the need for simpler, efficient solutions.

Method: Introduces SSD-Net, a single-scale network with asymmetrical decomposition to separate clean and degradation layers. Combines CNN and Transformer strengths via PFDB and BFCB modules for feature decoupling and fusion.

Result: SSD-Net demonstrates that single-scale feature extraction can match or surpass multi-scale methods, reducing complexity while maintaining performance.

Conclusion: Single-scale feature extraction is viable for UIE, offering a simpler, efficient alternative to multi-scale methods without compromising quality.

Abstract: Underwater image enhancement (UIE) techniques aim to improve visual quality of images captured in aquatic environments by addressing degradation issues caused by light absorption and scattering effects, including color distortion, blurring, and low contrast. Current mainstream solutions predominantly employ multi-scale feature extraction (MSFE) mechanisms to enhance reconstruction quality through multi-resolution feature fusion. However, our extensive experiments demonstrate that high-quality image reconstruction does not necessarily rely on multi-scale feature fusion. Contrary to popular belief, our experiments show that single-scale feature extraction alone can match or surpass the performance of multi-scale methods, significantly reducing complexity. To comprehensively explore single-scale feature potential in underwater enhancement, we propose an innovative Single-Scale Decomposition Network (SSD-Net). This architecture introduces an asymmetrical decomposition mechanism that disentangles input image into clean layer along with degradation layer. The former contains scene-intrinsic information and the latter encodes medium-induced interference. It uniquely combines CNN’s local feature extraction capabilities with Transformer’s global modeling strengths through two core modules: 1) Parallel Feature Decomposition Block (PFDB), implementing dual-branch feature space decoupling via efficient attention operations and adaptive sparse transformer; 2) Bidirectional Feature Communication Block (BFCB), enabling cross-layer residual interactions for complementary feature mining and fusion. This synergistic design preserves feature decomposition independence while establishing dynamic cross-layer information pathways, effectively enhancing degradation decoupling capacity.
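
A hedged sketch of the asymmetrical decomposition constraint: one branch predicts the clean layer, another the degradation layer, and their sum must reconstruct the input while the clean layer matches the reference. The stand-in convolutions and loss weights are assumptions, not the PFDB/BFCB architecture.

```python
import torch

clean_branch = torch.nn.Conv2d(3, 3, 3, padding=1)    # stand-in for the clean-layer path
degrade_branch = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the degradation path

def decomposition_loss(x, reference):
    clean = clean_branch(x)                    # scene-intrinsic content
    degradation = degrade_branch(x)            # medium-induced interference
    recon = ((clean + degradation - x) ** 2).mean()    # the two layers must explain x
    enhance = (clean - reference).abs().mean()         # clean layer matches ground truth
    return enhance + 0.1 * recon

loss = decomposition_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```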

[141] JanusNet: Hierarchical Slice-Block Shuffle and Displacement for Semi-Supervised 3D Multi-Organ Segmentation

Zheng Zhang, Tianzhuzi Tan, Guanchun Yin, Bo Zhang, Xiuzhuang Zhou

Main category: cs.CV

TL;DR: JanusNet is a data augmentation framework for 3D medical image segmentation that preserves anatomical continuity while enhancing training in hard-to-segment regions, outperforming state-of-the-art methods.

DetailsMotivation: Weakly supervised medical image segmentation suffers from limited training data and disrupted anatomical continuity in existing augmentation methods, especially for small organs.

Method: JanusNet uses Slice-Block Shuffle to preserve anatomical context and Confidence-Guided Displacement to focus on challenging regions, integrating seamlessly with teacher-student schemes.

Result: JanusNet achieves a 4% DSC gain on the Synapse dataset with only 20% labeled data, surpassing existing methods.

Conclusion: JanusNet effectively addresses anatomical inconsistency and improves segmentation performance in weakly supervised settings.

Abstract: Limited by the scarcity of training samples and annotations, weakly supervised medical image segmentation often employs data augmentation to increase data diversity, while randomly mixing volumetric blocks has demonstrated strong performance. However, this approach disrupts the inherent anatomical continuity of 3D medical images along orthogonal axes, leading to severe structural inconsistencies and insufficient training in challenging regions, such as small-sized organs. To better comply with and utilize human anatomical information, we propose JanusNet, a data augmentation framework for 3D medical data that globally models anatomical continuity while locally focusing on hard-to-segment regions. Specifically, our Slice-Block Shuffle step performs aligned shuffling of same-index slice blocks across volumes along a random axis, while preserving the anatomical context on planes perpendicular to the perturbation axis. Concurrently, the Confidence-Guided Displacement step uses prediction reliability to replace blocks within each slice, amplifying signals from difficult areas. This dual-stage, axis-aligned framework is plug-and-play, requiring minimal code changes for most teacher-student schemes. Extensive experiments on the Synapse and AMOS datasets demonstrate that JanusNet significantly surpasses state-of-the-art methods, achieving, for instance, a 4% DSC gain on the Synapse dataset with only 20% labeled data.
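
The Slice-Block Shuffle step translates directly into a few lines of numpy: two volumes exchange same-index slice blocks along one randomly chosen axis, so planes perpendicular to that axis keep their anatomical context. The block size and swap probability are illustrative assumptions.

```python
import numpy as np

def slice_block_shuffle(vol_a, vol_b, block=8, rng=np.random.default_rng()):
    axis = rng.integers(0, 3)                       # random perturbation axis
    a = np.moveaxis(vol_a.copy(), axis, 0)          # work on copies, axis-first
    b = np.moveaxis(vol_b.copy(), axis, 0)
    for start in range(0, a.shape[0], block):
        if rng.random() < 0.5:                      # swap aligned slice blocks
            sl = slice(start, start + block)
            a[sl], b[sl] = b[sl].copy(), a[sl].copy()
    return np.moveaxis(a, 0, axis), np.moveaxis(b, 0, axis)

mixed_a, mixed_b = slice_block_shuffle(np.zeros((64, 64, 64)), np.ones((64, 64, 64)))
```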

[142] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation

Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu, Shuyou Zhang

Main category: cs.CV

TL;DR: CAD-Judge is a verifiable reward system for efficient CAD preference grading and grammatical validation, improving Text-to-CAD workflows.

DetailsMotivation: To address slow CAD rendering and costly VLM deployment, which can degrade systems through reward hacking.

Method: Uses Compiler-as-a-Judge Module (CJM) for fast rewards and Compiler-as-a-Review Module (CRM) for verification, alongside an agentic CAD generation approach.

Result: Achieves state-of-the-art performance on CAD datasets while maintaining efficiency.

Conclusion: CAD-Judge effectively optimizes Text-to-CAD workflows with robust verification and alignment.

Abstract: Computer-Aided Design (CAD) models are widely used across industrial design, simulation, and manufacturing processes. Text-to-CAD systems aim to generate editable, general-purpose CAD models from textual descriptions, significantly reducing the complexity and entry barrier associated with traditional CAD workflows. However, rendering CAD models can be slow, and deploying VLMs to review CAD models can be expensive and may introduce reward hacking that degrades the systems. To address these challenges, we propose CAD-Judge, a novel, verifiable reward system for efficient and effective CAD preference grading and grammatical validation. We adopt the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal, optimizing model alignment by maximizing generative utility through prospect theory. To further improve the robustness of Text-to-CAD in the testing phase, we introduce a simple yet effective agentic CAD generation approach and adopt the Compiler-as-a-Review Module (CRM), which efficiently verifies the generated CAD models, enabling the system to refine them accordingly. Extensive experiments on challenging CAD datasets demonstrate that our method achieves state-of-the-art performance while maintaining superior efficiency.
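
A toy compiler-as-a-judge reward: the generated CAD script earns a binary reward only if it compiles and executes. Treating the script as Python source (as in CadQuery-style CAD code) is an assumption; the paper does not specify the CJM at this level of detail.

```python
def compiler_reward(cad_script: str) -> float:
    try:
        code = compile(cad_script, "<cad>", "exec")   # fast grammatical validation
        exec(code, {"__builtins__": {}})              # minimal sandboxed execution
        return 1.0                                    # verifiable, cheap reward signal
    except Exception:
        return 0.0                                    # invalid program, no reward

print(compiler_reward("box = {'w': 10, 'h': 5}"))     # 1.0
print(compiler_reward("box = Box(10, 5"))             # 0.0 (syntax error)
```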

[143] S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu

Main category: cs.CV

TL;DR: The paper introduces S²Q-VDiT, a post-training quantization framework for video diffusion models (V-DMs) to reduce computational costs while maintaining performance. It uses salient data selection and sparse token distillation to address calibration variance and learning challenges.

DetailsMotivation: Video diffusion models (V-DMs) with billions of parameters incur high computational costs. Quantization can help but faces challenges due to long token sequences in joint spatial-temporal modeling.

Method: Proposes S²Q-VDiT, leveraging Hessian-aware salient data selection for calibration and attention-guided sparse token distillation for learning efficiency.

Result: Achieves lossless performance under W4A6 quantization, with 3.9× model compression and 1.3× inference acceleration.

Conclusion: S²Q-VDiT effectively addresses quantization challenges in V-DMs, offering significant computational savings without performance loss.

Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S²Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model’s output. Under W4A6 quantization, S²Q-VDiT achieves lossless performance while delivering 3.9× model compression and 1.3× inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
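
A hedged sketch of attention-guided token weighting for distillation: tokens that receive more teacher attention mass contribute more to the quantized student's feature-matching loss. The particular weighting scheme below is an assumption, not the paper's exact loss.

```python
import torch

def attn_weighted_distill(student_feat, teacher_feat, teacher_attn):
    # student_feat, teacher_feat: (B, T, D); teacher_attn: (B, heads, T, T) probs
    token_weight = teacher_attn.mean(dim=1).sum(dim=1)        # attention mass per token
    token_weight = token_weight / token_weight.sum(dim=1, keepdim=True)
    per_token = ((student_feat - teacher_feat) ** 2).mean(dim=-1)   # (B, T) MSE
    return (token_weight * per_token).sum(dim=1).mean()       # influential tokens dominate

loss = attn_weighted_distill(torch.randn(2, 16, 64), torch.randn(2, 16, 64),
                             torch.softmax(torch.randn(2, 8, 16, 16), dim=-1))
```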

[144] Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

Haiqi Yang, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu

Main category: cs.CV

TL;DR: The paper introduces ISEval, a framework to evaluate LMMs’ ability to detect flawed inputs, revealing their struggles and biases.

DetailsMotivation: To explore whether LMMs can actively detect and scrutinize erroneous inputs, a gap in current research.

Method: Developed ISEval, a framework with seven flawed premise categories and three metrics, tested on ten advanced LMMs.

Result: Most LMMs struggle with detecting flawed inputs without guidance, with performance varying by error type and modality trust.

Conclusion: Highlights the need to improve LMMs’ proactive input verification and offers insights for mitigating the issue.

Abstract: Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing formidable capabilities in handling intricate multimodal tasks with exceptional performance. Recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. However, the same critical question of whether LMMs can actively detect and scrutinize erroneous inputs still remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs has identified key findings. Most models struggle to actively detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust also varies: Gemini 2.5 Pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text when the modalities conflict. These findings underscore the urgent need to enhance LMMs’ proactive verification of input validity and offer novel insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.

[145] Prototype-Driven Structure Synergy Network for Remote Sensing Images Segmentation

Junyi Wang, Jinjiang Li, Guodong Fan, Yakun Ju, Xiang Fang, Alex C. Kot

Main category: cs.CV

TL;DR: PDSSNet improves semantic segmentation in remote sensing by addressing intra-class variance and inter-class similarity through adaptive prototypes and semantic-structure coordination.

DetailsMotivation: The challenges of high intra-class variance and inter-class similarity in remote sensing images hinder accurate semantic segmentation, necessitating a method that unifies class representations and preserves structural information.

Method: PDSSNet uses three modules: APEM for unbiased class prototypes, SSCM for hierarchical semantic-structure coordination, and CSAM for dynamic feature adjustment.

Result: PDSSNet outperforms state-of-the-art methods in experiments.

Conclusion: PDSSNet effectively integrates class semantics and spatial structure, achieving superior segmentation results.

Abstract: In the semantic segmentation of remote sensing images, acquiring complete ground objects is critical for achieving precise analysis. However, this task is severely hindered by two major challenges: high intra-class variance and high inter-class similarity. Traditional methods often yield incomplete segmentation results due to their inability to effectively unify class representations and distinguish between similar features. Even emerging class-guided approaches are limited by coarse class prototype representations and a neglect of target structural information. Therefore, this paper proposes a Prototype-Driven Structure Synergy Network (PDSSNet). The design of this network is based on a core concept: a complete ground object is jointly defined by its invariant class semantics and its variant spatial structure. To implement this, we have designed three key modules. First, the Adaptive Prototype Extraction Module (APEM) ensures semantic accuracy from the source by encoding the ground truth to extract unbiased class prototypes. Subsequently, the designed Semantic-Structure Coordination Module (SSCM) follows a hierarchical semantics-first, structure-second principle. This involves first establishing a global semantic cognition, then leveraging structural information to constrain and refine the semantic representation, thereby ensuring the integrity of class information. Finally, the Channel Similarity Adjustment Module (CSAM) employs a dynamic step-size adjustment mechanism to focus on discriminative features between classes. Extensive experiments demonstrate that PDSSNet outperforms state-of-the-art methods. The source code is available at https://github.com/wangjunyi-1/PDSSNet.
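
Prototype extraction from ground truth (the APEM idea) is essentially masked average pooling: each class prototype is the mean feature vector over that class's ground-truth pixels. The shapes and the loop-based implementation below are illustrative, not the module's exact code.

```python
import torch

def class_prototypes(features, labels, num_classes):
    # features: (B, C, H, W) encoder output; labels: (B, H, W) integer ground truth
    B, C, H, W = features.shape
    f = features.permute(0, 2, 3, 1).reshape(-1, C)   # flatten to (B*H*W, C)
    y = labels.reshape(-1)
    protos = torch.zeros(num_classes, C)
    for k in range(num_classes):
        mask = y == k
        if mask.any():
            protos[k] = f[mask].mean(dim=0)   # unbiased, GT-derived class prototype
    return protos

p = class_prototypes(torch.randn(2, 64, 32, 32), torch.randint(0, 6, (2, 32, 32)), 6)
```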

[146] Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval

Yifan Wang, Tao Wang, Chenwei Tang, Caiyang Yu, Zhengqing Zang, Mengmi Zhang, Shudong Huang, Jiancheng Lv

Main category: cs.CV

TL;DR: DCAR is a dual-prompt learning framework for Image-Text Retrieval (ITR), addressing fine-grained attribute and subcategory discrimination. It dynamically adjusts prompts and optimizes attribute-class features, achieving state-of-the-art results on the FDRD benchmark.

DetailsMotivation: The challenge in applying prompt learning to ITR lies in discriminating fine-grained attributes and similar subcategories. Existing methods struggle with this, prompting the need for a novel approach.

Method: DCAR dynamically adjusts prompt vectors from semantic and visual dimensions. It jointly optimizes attribute and class features: (1) updates attribute weights via text-image mutual information, (2) introduces negative samples with category-matching weighting.

Result: DCAR achieves state-of-the-art performance on the FDRD benchmark, which includes 1,500 fine categories and 230,000 image-caption pairs.

Conclusion: DCAR effectively addresses ITR challenges by enhancing fine-grained representation learning, demonstrating superior performance over existing baselines.

Abstract: Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance over existing baselines.
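
The attribute-level reweighting can be pictured as scoring each attribute description against the image and softmaxing the scores. A toy sketch assuming CLIP-style embeddings; the paper's text-image mutual-information formulation is richer than this cosine-similarity stand-in:

```python
import torch
import torch.nn.functional as F

def attribute_weights(image_emb: torch.Tensor, attr_embs: torch.Tensor,
                      tau: float = 0.07) -> torch.Tensor:
    """Softmax weights over attribute descriptions, scored against the image.

    image_emb: (D,) image embedding (e.g. from CLIP)
    attr_embs: (A, D) embeddings of A attribute descriptions
    """
    image_emb = F.normalize(image_emb, dim=-1)
    attr_embs = F.normalize(attr_embs, dim=-1)
    sims = attr_embs @ image_emb             # (A,) cosine similarities
    return torch.softmax(sims / tau, dim=-1)
```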

[147] Radar-Based NLoS Pedestrian Localization for Darting-Out Scenarios Near Parked Vehicles with Camera-Assisted Point Cloud Interpretation

Hee-Yeun Kim, Byeonggyu Park, Byonghyok Choi, Hansang Cho, Byungkwan Kim, Soomok Lee, Mingu Jeon, Seung-Woo Seo, Seong-Woo Kim

Main category: cs.CV

TL;DR: Proposes an NLoS pedestrian localization framework combining monocular camera and 2D radar PCD to address challenges of sudden pedestrian emergence in urban NLoS blind spots.

DetailsMotivation: Addresses the challenge of NLoS blind spots caused by parked vehicles, which hinder pedestrian detection and road safety. Existing methods lack generalizability due to reliance on predefined spatial data.

Method: Integrates monocular camera image segmentation for parked vehicle detection and depth estimation, then refines spatial inference using 2D radar PCD.

Result: Enhances early pedestrian detection in real-world urban environments, improving road safety.

Conclusion: The proposed framework effectively addresses NLoS pedestrian localization challenges, offering practical applicability and improved safety.

Abstract: The presence of Non-Line-of-Sight (NLoS) blind spots resulting from roadside parking in urban environments poses a significant challenge to road safety, particularly due to the sudden emergence of pedestrians. mmWave technology leverages diffraction and reflection to observe NLoS regions, and recent studies have demonstrated its potential for detecting obscured objects. However, existing approaches predominantly rely on predefined spatial information or assume simple wall reflections, thereby limiting their generalizability and practical applicability. A particular challenge arises in scenarios where pedestrians suddenly appear from between parked vehicles, as these parked vehicles act as temporary spatial obstructions. Furthermore, since parked vehicles are dynamic and may relocate over time, spatial information obtained from satellite maps or other predefined sources may not accurately reflect real-time road conditions, leading to erroneous sensor interpretations. To address this limitation, we propose an NLoS pedestrian localization framework that integrates monocular camera images with 2D radar point cloud (PCD) data. The proposed method initially detects parked vehicles through image segmentation, estimates depth to infer approximate spatial characteristics, and subsequently refines this information using 2D radar PCD to achieve precise spatial inference. Experimental evaluations conducted in real-world urban road environments demonstrate that the proposed approach enhances early pedestrian detection and contributes to improved road safety. Supplementary materials are available at https://hiyeun.github.io/NLoS/.

[148] CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion

Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Syahid Al Irfan, Hindriyanto Dwi Purnomo, Radius Tanone

Main category: cs.CV

TL;DR: CORE-ReID V2 enhances its predecessor by addressing UDA challenges in ReID tasks, using CycleGAN for data synthesis and an advanced ensemble fusion mechanism (ECAB/SECAB) for feature representation. It outperforms state-of-the-art methods in mAP and Rank-k Accuracy.

DetailsMotivation: To improve Unsupervised Domain Adaptation (UDA) in Person, Vehicle, and Object ReID by bridging domain gaps and enhancing feature representation.

Method: Uses CycleGAN for synthetic data generation and an ensemble fusion mechanism (ECAB/SECAB) for fine-tuning, supporting lightweight backbones like ResNet18/34.

Result: Achieves top performance in mAP and Rank-k Accuracy on UDA ReID datasets, demonstrating scalability and efficiency.

Conclusion: CORE-ReID V2 advances UDA-based Object ReID, providing a foundation for future research. Code and models are publicly available.

Abstract: This study presents CORE-ReID V2, an enhanced framework building upon CORE-ReID. The new framework extends its predecessor by addressing Unsupervised Domain Adaptation (UDA) challenges in Person ReID and Vehicle ReID, with further applicability to Object ReID. During pre-training, CycleGAN is employed to synthesize diverse data, bridging image characteristic gaps across different domains. During fine-tuning, an advanced ensemble fusion mechanism, consisting of the Efficient Channel Attention Block (ECAB) and the Simplified Efficient Channel Attention Block (SECAB), enhances both local and global feature representations while reducing ambiguity in pseudo-labels for target samples. Experimental results on widely used UDA Person ReID and Vehicle ReID datasets demonstrate that the proposed framework outperforms state-of-the-art methods, achieving top performance in Mean Average Precision (mAP) and Rank-k Accuracy (Top-1, Top-5, Top-10). Moreover, the framework supports lightweight backbones such as ResNet18 and ResNet34, ensuring both scalability and efficiency. Our work not only pushes the boundaries of UDA-based Object ReID but also provides a solid foundation for further research and advancements in this domain. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID-V2.
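
ECAB and SECAB are described only by name here; an ECA-style channel-attention block (global pooling, a 1D convolution across channels, and a sigmoid gate) illustrates the general family. Treat the following as an illustrative stand-in under that assumption, not the paper's exact module:

```python
import torch
import torch.nn as nn

class ECAStyleBlock(nn.Module):
    """ECA-style channel attention: global average pooling, a 1D convolution
    across channels, and a sigmoid gate. The paper's ECAB/SECAB internals
    are not specified in the abstract; hyperparameters here are guesses."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))           # (B, C) channel descriptor
        w = self.conv(w.unsqueeze(1))    # (B, 1, C) local cross-channel interaction
        w = torch.sigmoid(w).squeeze(1)  # (B, C) gates
        return x * w[:, :, None, None]
```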

[149] ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

Binbin Xiang, Maciej Wielgosz, Stefano Puliti, Kamil Král, Martin Krůček, Azim Missarov, Rasmus Astrup

Main category: cs.CV

TL;DR: ForestFormer3D is a new framework for precise individual tree and semantic segmentation in forest LiDAR 3D point clouds, achieving state-of-the-art performance and robustness across diverse forest conditions.

DetailsMotivation: Current methods struggle with the complexity and variability of natural forest environments, necessitating a more effective solution for forest management and ecological research.

Method: ForestFormer3D uses ISA-guided query point selection, score-based block merging during inference, and a one-to-many association mechanism for training.

Result: The model achieves top performance on the FOR-instanceV2 dataset and generalizes well to unseen test sets (Wytham woods and LAUTx).

Conclusion: ForestFormer3D is a robust and unified solution for forest LiDAR segmentation, with publicly available dataset and code.

Abstract: The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code are publicly available at https://bxiang233.github.io/FF3D/.

[150] Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation

Ronan Docherty, Antonis Vamvakeros, Samuel J. Cooper

Main category: cs.CV

TL;DR: Self-supervised ViT features enhance object localization and segmentation, outperforming traditional methods without additional training.

DetailsMotivation: To leverage ViT features for improved performance in object localization and segmentation, especially in weakly supervised settings.

Method: Upsampled ViT features are used in clustering-based workflows for localization/segmentation and paired with classifiers for weakly supervised segmentation.

Result: Strong performance on benchmarks, particularly in weakly supervised segmentation, due to ViT features capturing complex relationships.

Conclusion: ViT features offer flexibility and generalizability, promising faster and stronger materials characterization.

Abstract: The features of self-supervised vision transformers (ViTs) contain strong semantic and positional information relevant to downstream tasks like object localization and segmentation. Recent works combine these features with traditional methods like clustering, graph partitioning or region correlations to achieve impressive baselines without finetuning or training additional networks. We leverage upsampled features from ViT networks (e.g., DINOv2) in two workflows: in a clustering based approach for object localization and segmentation, and paired with standard classifiers in weakly supervised materials segmentation. Both show strong performance on benchmarks, especially in weakly supervised segmentation where the ViT features capture complex relationships inaccessible to classical approaches. We expect the flexibility and generalizability of these features will both speed up and strengthen materials characterization, from segmentation to property-prediction.
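
The clustering workflow is easy to approximate with off-the-shelf pieces: extract DINOv2 patch tokens, upsample them to pixel resolution, and cluster. A sketch using bilinear upsampling as a simple stand-in for the paper's upsampler; the hub name and output key follow the public DINOv2 release, so verify them against the version you install:

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

# Input images should be ImageNet-normalized, with H and W divisible by the
# ViT patch size (14 for dinov2_vits14).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def cluster_segmentation(image: torch.Tensor, n_clusters: int = 5):
    """image: (1, 3, H, W) with H, W divisible by 14; returns an (H, W) cluster map."""
    _, _, H, W = image.shape
    with torch.no_grad():
        feats = model.forward_features(image)["x_norm_patchtokens"]  # (1, N, D)
    h, w = H // 14, W // 14
    feats = feats.reshape(1, h, w, -1).permute(0, 3, 1, 2)           # (1, D, h, w)
    feats = F.interpolate(feats, size=(H, W), mode="bilinear",
                          align_corners=False)                       # naive upsampling
    flat = feats[0].permute(1, 2, 0).reshape(H * W, -1).cpu().numpy()
    labels = KMeans(n_clusters=n_clusters).fit_predict(flat)
    return labels.reshape(H, W)
```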

[151] SPJFNet: Self-Mining Prior-Guided Joint Frequency Enhancement for Ultra-Efficient Dark Image Restoration

Tongshun Zhang, Pingling Liu, Zijian Zhang, Qiuzhan Zhou

Main category: cs.CV

TL;DR: SPJFNet is a novel dark image restoration method that improves efficiency by eliminating external priors, compressing operations, and decoupling frequency processing.

DetailsMotivation: Current methods face efficiency issues due to reliance on external priors, redundant operations, and indiscriminate frequency processing.

Method: SPJFNet uses a Self-Mining Guidance Module (SMGM) for lightweight endogenous guidance, lossless wavelet decomposition, and joint Fourier-based enhancement. It also employs a Dual-Frequency Guidance Framework (DFGF) for specialized high/low frequency processing.

Result: SPJFNet outperforms state-of-the-art methods while reducing computational complexity and overhead.

Conclusion: SPJFNet offers an efficient and effective solution for dark image restoration, with significant performance and efficiency gains.

Abstract: Current dark image restoration methods suffer from severe efficiency bottlenecks, primarily stemming from: (1) computational burden and error correction costs associated with reliance on external priors (manual or cross-modal); (2) redundant operations in complex multi-stage enhancement pipelines; and (3) indiscriminate processing across frequency components in frequency-domain methods, leading to excessive global computational demands. To address these challenges, we propose an Efficient Self-Mining Prior-Guided Joint Frequency Enhancement Network (SPJFNet). Specifically, we first introduce a Self-Mining Guidance Module (SMGM) that generates lightweight endogenous guidance directly from the network, eliminating dependence on external priors and thereby bypassing error correction overhead while improving inference speed. Second, through meticulous analysis of different frequency domain characteristics, we reconstruct and compress multi-level operation chains into a single efficient operation via lossless wavelet decomposition and joint Fourier-based advantageous frequency enhancement, significantly reducing parameters. Building upon this foundation, we propose a Dual-Frequency Guidance Framework (DFGF) that strategically deploys specialized high/low frequency branches (wavelet-domain high-frequency enhancement and Fourier-domain low-frequency restoration), decoupling frequency processing to substantially reduce computational complexity. Rigorous evaluation across multiple benchmarks demonstrates that SPJFNet not only surpasses state-of-the-art performance but also achieves significant efficiency improvements, substantially reducing model complexity and computational overhead. Code is available at https://github.com/bywlzts/SPJFNet.
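
The two frequency branches can be grounded with standard transforms: a lossless wavelet split for the high-frequency branch and a Fourier low-pass for the low-frequency branch. A sketch of just the decomposition, with the cutoff and wavelet choice being ours rather than the paper's:

```python
import numpy as np
import pywt

def frequency_split(img: np.ndarray):
    """Split a grayscale image into wavelet high-frequency bands and a
    Fourier low-frequency component. img: (H, W) float array."""
    # Wavelet branch: LL holds coarse content; (LH, HL, HH) hold the
    # edge/texture detail used for high-frequency enhancement.
    _LL, (LH, HL, HH) = pywt.dwt2(img, "haar")

    # Fourier branch: keep a centered low-pass region of the spectrum.
    spec = np.fft.fftshift(np.fft.fft2(img))
    H, W = img.shape
    mask = np.zeros_like(spec)
    mask[H // 2 - H // 8: H // 2 + H // 8, W // 2 - W // 8: W // 2 + W // 8] = 1
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
    return (LH, HL, HH), low
```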

[152] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng

Main category: cs.CV

TL;DR: The paper introduces VisualTrans, a benchmark for Visual Transformation Reasoning (VTR) in real-world human-object interactions, addressing gaps in existing benchmarks. It evaluates spatial, procedural, and quantitative reasoning across 12 tasks, revealing limitations in current models’ dynamic reasoning capabilities.

DetailsMotivation: Existing VTR benchmarks have a sim-to-real gap, limited complexity, and incomplete reasoning coverage, hindering practical use. The authors aim to bridge this gap with a real-world benchmark.

Method: VisualTrans includes 12 tasks and 6 subtask types, evaluated through 472 question-answer pairs. A scalable pipeline uses first-person videos, automated annotation, and human verification for data construction.

Result: State-of-the-art models perform well in static spatial tasks but struggle with dynamic, multi-step reasoning, exposing weaknesses in temporal modeling and causal reasoning.

Conclusion: VisualTrans highlights the need for improved temporal and causal reasoning in VTR systems, providing a foundation for future research. The dataset and code are publicly available.

Abstract: Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions (spatial, procedural, and quantitative) through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at https://github.com/WangYipu2002/VisualTrans.

[153] Iterative pseudo-labeling based adaptive copy-paste supervision for semi-supervised tumor segmentation

Qiangguo Jin, Hui Cui, Junbo Wang, Changming Sun, Yimiao He, Ping Xuan, Linlin Wang, Cong Cong, Leyi Wei, Ran Su

Main category: cs.CV

TL;DR: The paper introduces IPA-CP, a semi-supervised learning method for tumor segmentation in CT scans, addressing challenges like small tumors and underutilized data augmentation. It outperforms existing SSL methods.

DetailsMotivation: Existing SSL methods neglect small or numerous tumors and underutilize data augmentation. The paper aims to address these gaps.

Method: IPA-CP combines iterative pseudo-labeling and adaptive copy-paste supervision with a two-way uncertainty-based augmentation mechanism.

Result: IPA-CP outperforms state-of-the-art SSL methods in tumor segmentation, validated on in-house and public datasets.

Conclusion: IPA-CP is effective for tumor segmentation, with its adaptive augmentation and pseudo-labeling strategies proving superior.

Abstract: Semi-supervised learning (SSL) has attracted considerable attention in medical image processing. The latest SSL methods use a combination of consistency regularization and pseudo-labeling to achieve remarkable success. However, most existing SSL studies focus on segmenting large organs, neglecting the challenging scenarios where there are numerous tumors or tumors of small volume. Furthermore, the extensive capabilities of data augmentation strategies, particularly in the context of both labeled and unlabeled data, have yet to be thoroughly investigated. To tackle these challenges, we introduce a straightforward yet effective approach, termed iterative pseudo-labeling based adaptive copy-paste supervision (IPA-CP), for tumor segmentation in CT scans. IPA-CP incorporates a two-way uncertainty-based adaptive augmentation mechanism, aiming to inject tumor uncertainties present in the mean teacher architecture into adaptive augmentation. Additionally, IPA-CP employs an iterative pseudo-label transition strategy to generate more robust and informative pseudo labels for the unlabeled samples. Extensive experiments on both in-house and public datasets show that our framework outperforms state-of-the-art SSL methods in medical image segmentation. Ablation study results demonstrate the effectiveness of our technical contributions.
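
Two ingredients from the abstract, teacher uncertainty and copy-paste of tumor regions, can be sketched directly; the paper's two-way adaptive mechanism combines them in a more elaborate way than shown here, and all names are ours:

```python
import torch

def uncertainty_map(teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel predictive entropy from teacher logits: (B, K, H, W) -> (B, H, W)."""
    p = torch.softmax(teacher_logits, dim=1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)

def copy_paste_tumor(labeled_img, labeled_mask, unlabeled_img, unlabeled_pseudo):
    """Paste ground-truth tumor pixels from a labeled scan onto an unlabeled one
    and update its pseudo-label. Images: (C, H, W); masks: (H, W) integer."""
    tumor = labeled_mask > 0
    pasted_img = unlabeled_img.clone()
    pasted_img[:, tumor] = labeled_img[:, tumor]
    pasted_lbl = unlabeled_pseudo.clone()
    pasted_lbl[tumor] = labeled_mask[tumor]
    return pasted_img, pasted_lbl
```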

[154] Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation

Jiayi He, Xu Wang, Shengeng Tang, Yaxiong Wang, Lechao Cheng, Dan Guo

Main category: cs.CV

TL;DR: A two-phase framework decouples sign language motion semantics from signer identity, enabling high-quality, flexible video generation with minimal data.

DetailsMotivation: Addressing challenges of excessive data requirements and poor generalization in sign language video generation.

Method: 1. Construct a signer-independent motion lexicon. 2. Transform gloss sequences into motion trajectories and render photorealistic videos.

Result: Disentangling motion from identity improves synthesis quality and allows signer personalization.

Conclusion: The approach offers a viable and advantageous solution for sign language video generation.

Abstract: Sign language video generation requires producing natural signing motions with realistic appearances under precise semantic control, yet faces two critical challenges: excessive signer-specific data requirements and poor generalization. We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity through a two-phase synthesis framework. First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences, requiring only one recording per sign. This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories, followed by identity-aware neural rendering to produce photorealistic videos of arbitrary signers. Unlike prior work constrained by signer-specific datasets, our method treats motion as a first-class citizen: the learned latent pose dynamics serve as a portable “choreography layer” that can be visually realized through different human appearances. Extensive experiments demonstrate that disentangling motion from identity is not just viable but advantageous - enabling both high-quality synthesis and unprecedented flexibility in signer personalization.

[155] DOMR: Establishing Cross-View Segmentation via Dense Object Matching

Jitong Liao, Yulu Gao, Shaofei Huang, Jialin Gao, Jie Lei, Ronghua Liang, Si Liu

Main category: cs.CV

TL;DR: The paper introduces the DOMR framework for dense object matching between egocentric and exocentric views, achieving state-of-the-art performance.

DetailsMotivation: Cross-view object correspondence is crucial for visual understanding but challenging due to differences in perspectives.

Method: The DOMR framework uses a Dense Object Matcher (DOM) module to model multiple objects jointly, leveraging positional and semantic relationships, and includes a mask refinement head for improved accuracy.

Result: DOMR achieves mean IoU of 49.7% (Ego→Exo) and 55.2% (Exo→Ego), outperforming previous methods by 5.8% and 4.3%.

Conclusion: The integrated approach of DOMR effectively addresses cross-view object matching, validated by superior benchmark results.

Abstract: Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers around the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego$\to$Exo and 55.2% on Exo$\to$Ego. These results outperform those of previous methods by 5.8% and 4.3%, respectively, validating the effectiveness of our integrated approach for cross-view understanding.

[156] Towards Globally Predictable k-Space Interpolation: A White-box Transformer Approach

Chen Luo, Qiyu Jin, Taofeng Xie, Xuemei Wang, Huayu Wang, Congcong Liu, Liming Tang, Guoqing Chen, Zhuo-Xu Cui, Dong Liang

Main category: cs.CV

TL;DR: The paper introduces GPI-WT, a white-box Transformer framework for k-space interpolation in MRI, combining global predictability with interpretability.

DetailsMotivation: Existing methods for k-space interpolation focus on local predictability and lack global dependency exploitation. Transformers, known for capturing long-range dependencies, are promising but lack interpretability.

Method: Proposes GPI-WT, a white-box Transformer based on Globally Predictable Interpolation (GPI), formulated as a structured low-rank (SLR) model with learnable global annihilation filters and a subgradient-induced attention mechanism.

Result: GPI-WT outperforms state-of-the-art methods in k-space interpolation accuracy and offers superior interpretability.

Conclusion: The GPI-WT framework effectively addresses the limitations of existing methods by leveraging global dependencies and ensuring interpretability, making it a reliable solution for accelerated MRI.

Abstract: Interpolating missing data in k-space is essential for accelerating imaging. However, existing methods, including convolutional neural network-based deep learning, primarily exploit local predictability while overlooking the inherent global dependencies in k-space. Recently, Transformers have demonstrated remarkable success in natural language processing and image analysis due to their ability to capture long-range dependencies. This inspires the use of Transformers for k-space interpolation to better exploit its global structure. However, their lack of interpretability raises concerns regarding the reliability of interpolated data. To address this limitation, we propose GPI-WT, a white-box Transformer framework based on Globally Predictable Interpolation (GPI) for k-space. Specifically, we formulate GPI from the perspective of annihilation as a novel k-space structured low-rank (SLR) model. The global annihilation filters in the SLR model are treated as learnable parameters, and the subgradients of the SLR model naturally induce a learnable attention mechanism. By unfolding the subgradient-based optimization algorithm of SLR into a cascaded network, we construct the first white-box Transformer specifically designed for accelerated MRI. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in k-space interpolation accuracy while providing superior interpretability.
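
For orientation, the classical annihilation-based SLR model that GPI generalizes can be written as follows, using standard conventions from the SLR literature; the paper makes the filters learnable and unrolls the subgradient iterations into an attention mechanism:

```latex
% A filter h annihilates fully sampled k-space data \hat{x} by convolution:
(\hat{x} \ast h)(k) = 0 \quad \forall k
% Equivalently, with a block-Hankel lifting \mathcal{H}(\cdot):
\mathcal{H}(\hat{x})\, h = 0 \;\;\Longrightarrow\;\; \mathcal{H}(\hat{x}) \text{ is low-rank}
% which yields the usual SLR interpolation problem for measurements y:
\min_{\hat{x}} \; \tfrac{1}{2}\,\| \mathcal{A}\hat{x} - y \|_2^2 + \lambda\, \| \mathcal{H}(\hat{x}) \|_*
```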

[157] Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion

Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Binbin Li, Xiaojun Bi, Yu Zhou

Main category: cs.CV

TL;DR: Uni-DocDiff is a unified, scalable document restoration model using diffusion, outperforming task-specific models by leveraging learnable prompts and a novel Prior Pool mechanism.

DetailsMotivation: Existing methods for document restoration are cumbersome and lack scalability due to independent task models and limited inter-task synergy.

Method: Proposes Uni-DocDiff, featuring learnable task prompts, a Prior Pool for combining local and global features, and a Prior Fusion Module for adaptive task-specific prior selection.

Result: Achieves comparable or superior performance to task-specific models while maintaining scalability for new tasks.

Conclusion: Uni-DocDiff offers a scalable, unified solution for document restoration, effectively addressing task interference and leveraging inter-task synergy.

Abstract: Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address the aforementioned challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel Prior Pool, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the Prior Fusion Module (PFM), which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves performance comparable or even superior to that of task-specific expert models, and simultaneously retains the task scalability for seamless adaptation to new tasks.

[158] TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation

Zunhui Xia, Hongxing Li, Libin Lan

Main category: cs.CV

TL;DR: TCSAFormer is an efficient transformer-based medical image segmentation network that reduces computational complexity and enhances feature representation by using Compressed Attention and Dual-Branch Feed-Forward Network modules.

DetailsMotivation: Transformer-based methods for medical image segmentation face high computational costs and limited local feature capture. TCSAFormer aims to address these issues.

Method: TCSAFormer uses a Compressed Attention module for efficient token processing and a Dual-Branch Feed-Forward Network to capture local and multiscale features.

Result: TCSAFormer outperforms state-of-the-art methods on ISIC-2018, CVC-ClinicDB, and Synapse datasets with lower computational overhead.

Conclusion: TCSAFormer achieves a balance between efficiency and accuracy, making it a promising solution for medical image segmentation.

Abstract: In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequences. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models’ ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model’s ability to capture relationships between tokens. Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model’s feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.
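
The pruning half of Compressed Attention can be illustrated by letting each query attend only to its top-k keys. A sketch of that idea alone, under our naming; the paper additionally merges redundant tokens, which is omitted here:

```python
import torch

def topk_sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          keep: int) -> torch.Tensor:
    """Each query attends only to its `keep` highest-scoring keys (keep <= Nk).

    q: (B, Nq, D); k, v: (B, Nk, D). Returns (B, Nq, D).
    """
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5          # (B, Nq, Nk)
    top_vals, top_idx = scores.topk(keep, dim=-1)                # prune the rest
    attn = torch.softmax(top_vals, dim=-1)                       # (B, Nq, keep)
    v_exp = v.unsqueeze(1).expand(-1, q.shape[1], -1, -1)        # (B, Nq, Nk, D)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1])  # (B, Nq, keep, D)
    return (attn.unsqueeze(-1) * v_exp.gather(2, idx)).sum(dim=2)
```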

[159] Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models

Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, Tingting Jiang

Main category: cs.CV

TL;DR: The paper introduces O-Bench, a VQA benchmark for occlusion perception, revealing a performance gap between MLLMs and humans, with identified failure patterns.

DetailsMotivation: To explore the under-examined performance of MLLMs on occlusion perception, a key aspect of spatial understanding.

Method: Developed O-Bench using a layered synthesis approach on SA-1B, annotating 4,588 QA pairs across five tasks. Evaluated 22 MLLMs against human baselines.

Result: Significant performance gap between MLLMs and humans, not bridged by scaling or reasoning. Identified three failure patterns.

Conclusion: O-Bench serves as a critical tool for evaluating occlusion perception and advancing MLLM visual intelligence.

Abstract: Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate 4,588 question-answer pairs in total across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against the human baseline reveals a significant performance gap between current MLLMs and humans, which, we find, cannot be sufficiently bridged by model scaling or thinking processes. We further identify three typical failure patterns, including an overly conservative bias, a fragile gestalt prediction, and a struggle with quantitative tasks. We believe O-Bench can not only provide a vital evaluation tool for occlusion perception, but also inspire the development of MLLMs for better visual intelligence. Our benchmark will be made publicly available upon paper publication.

[160] TNet: Terrace Convolutional Decoder Network for Remote Sensing Image Semantic Segmentation

Chengqian Dai, Yonghong Guo, Hongzhao Xiang, Yigui Luo

Main category: cs.CV

TL;DR: TNet introduces a Terrace Convolutional Decoder Network to enhance global-local feature fusion in remote sensing segmentation, outperforming benchmarks with efficient computation.

DetailsMotivation: Existing segmentation networks focus on intra-scale relationships, neglecting global contextual dependencies across resolutions.

Method: TNet uses convolution and addition to progressively integrate low-resolution (global) features into high-resolution (local) features during decoding.

Result: TNet-R achieves mIoU scores of 85.35% (Vaihingen), 87.05% (Potsdam), and 52.19% (LoveDA).

Conclusion: TNet effectively blends global and local information, offering competitive performance and computational efficiency.

Abstract: In remote sensing, most segmentation networks adopt the UNet architecture, often incorporating modules such as Transformers or Mamba to enhance global-local feature interactions within decoder stages. However, these enhancements typically focus on intra-scale relationships and neglect the global contextual dependencies across multiple resolutions. To address this limitation, we introduce the Terrace Convolutional Decoder Network (TNet), a simple yet effective architecture that leverages only convolution and addition operations to progressively integrate low-resolution features (rich in global context) into higher-resolution features (rich in local details) across decoding stages. This progressive fusion enables the model to learn spatially-aware convolutional kernels that naturally blend global and local information in a stage-wise manner. We implement TNet with a ResNet-18 encoder (TNet-R) and evaluate it on three benchmark datasets. TNet-R achieves competitive performance with a mean Intersection-over-Union (mIoU) of 85.35% on ISPRS Vaihingen, 87.05% on ISPRS Potsdam, and 52.19% on LoveDA, while maintaining high computational efficiency. Code is publicly available.
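
The terrace idea, integrating upsampled low-resolution features into higher-resolution ones with convolution and addition only, reduces to a very small module. A sketch with illustrative channel handling; the module and variable names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TerraceFusion(nn.Module):
    """One terrace step: upsample the coarse (global) map, convolve it, and
    add it to the finer (local) map. Convolution and addition only, as the
    abstract states; channel sizes here are illustrative."""
    def __init__(self, coarse_ch: int, fine_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(coarse_ch, fine_ch, kernel_size=3, padding=1)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                           align_corners=False)
        return fine + self.proj(up)

# Decoding walks the encoder pyramid from coarsest to finest, e.g.:
#   x = f4
#   for fine, fuse in zip([f3, f2, f1], fusion_modules):
#       x = fuse(x, fine)
```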

[161] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, Jia-Bin Huang

Main category: cs.CV

TL;DR: 3DSR is a 3D Gaussian-splatting-based super-resolution framework using 2D diffusion models, ensuring 3D consistency without fine-tuning.

DetailsMotivation: To achieve high-resolution 3D reconstructions with visual quality and spatial coherence, addressing limitations of prior 2D or implicitly 3D methods.

Method: Leverages 2D diffusion-based super-resolution models within a 3D Gaussian-splatting framework for explicit 3D consistency.

Result: Produces visually compelling high-resolution results on MipNeRF360 and LLFF data, maintaining structural 3D consistency.

Conclusion: 3DSR effectively enhances 3D super-resolution with explicit consistency, outperforming prior methods without additional fine-tuning.

Abstract: We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either do not consider 3D consistency or only incorporate it implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions. Code will be released.

[162] DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting

Zexu Huang, Min Xu, Stuart Perry

Main category: cs.CV

TL;DR: DET-GS improves 3D Gaussian Splatting by introducing depth and edge-aware regularization, enhancing geometric accuracy and visual fidelity in sparse-view conditions.

DetailsMotivation: Existing methods struggle with accurate geometric reconstruction in sparse-view scenarios due to noise sensitivity and poor preservation of fine structures and semantic boundaries.

Method: DET-GS uses hierarchical geometric depth supervision, edge-aware regularization with semantic masks, and an RGB-guided edge-preserving Total Variation loss.

Result: DET-GS outperforms SOTA methods in geometric accuracy and visual fidelity on sparse-view benchmarks.

Conclusion: DET-GS effectively addresses challenges in sparse-view reconstruction, offering superior performance and robustness.

Abstract: 3D Gaussian Splatting (3DGS) represents a significant advancement in the field of efficient and high-fidelity novel view synthesis. Despite recent progress, achieving accurate geometric reconstruction under sparse-view conditions remains a fundamental challenge. Existing methods often rely on non-local depth regularization, which fails to capture fine-grained structures and is highly sensitive to depth estimation noise. Furthermore, traditional smoothing methods neglect semantic boundaries and indiscriminately degrade essential edges and textures, consequently limiting the overall quality of reconstruction. In this work, we propose DET-GS, a unified depth and edge-aware regularization framework for 3D Gaussian Splatting. DET-GS introduces a hierarchical geometric depth supervision framework that adaptively enforces multi-level geometric consistency, significantly enhancing structural fidelity and robustness against depth estimation noise. To preserve scene boundaries, we design an edge-aware depth regularization guided by semantic masks derived from Canny edge detection. Furthermore, we introduce an RGB-guided edge-preserving Total Variation loss that selectively smooths homogeneous regions while rigorously retaining high-frequency details and textures. Extensive experiments demonstrate that DET-GS achieves substantial improvements in both geometric accuracy and visual fidelity, outperforming state-of-the-art (SOTA) methods on sparse-view novel view synthesis benchmarks.
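
The RGB-guided edge-preserving Total Variation loss can be approximated by gating a standard TV penalty with an edge mask, for instance one derived from Canny on the RGB guide. A minimal sketch under that reading, with names of our choosing:

```python
import torch

def edge_preserving_tv_loss(img: torch.Tensor, edge_mask: torch.Tensor) -> torch.Tensor:
    """Total-variation smoothing that is switched off across detected edges.

    img:       (B, C, H, W) rendered image
    edge_mask: (B, 1, H, W) in [0, 1]; 1 on edges (e.g. Canny on the RGB guide)
    """
    dx = (img[..., :, 1:] - img[..., :, :-1]).abs()  # horizontal gradients
    dy = (img[..., 1:, :] - img[..., :-1, :]).abs()  # vertical gradients
    wx = 1.0 - edge_mask[..., :, 1:]                 # zero weight across edges
    wy = 1.0 - edge_mask[..., 1:, :]
    return (wx * dx).mean() + (wy * dy).mean()
```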

[163] NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding

Zelin Peng, Yichen Zhao, Yu Huang, Piao Yang, Feilong Tang, Zhengqin Xu, Xiaokang Yang, Wei Shen

Main category: cs.CV

TL;DR: NEARL-CLIP bridges the domain gap in medical image analysis using a cross-modality interaction framework with USEformer and OCA, achieving efficient parameter usage.

DetailsMotivation: Limited annotated medical datasets and the domain gap hinder the direct application of vision-language models (VLMs) like CLIP in medical imaging.

Method: Proposes NEARL-CLIP with USEformer for dynamic cross-modality queries and OCA for orthogonality-based knowledge decoupling.

Result: Achieves efficient modality interaction with only 1.46M learnable parameters.

Conclusion: NEARL-CLIP effectively enhances medical-specific VLM performance by fostering cross-modality knowledge enrichment.

Abstract: Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict medical-specific model development. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging analysis is impeded by a significant domain gap. Existing approaches to bridge this gap, including prompt learning and one-way modality interaction techniques, typically focus on introducing domain knowledge to a single modality. Although this may offer performance gains, it often causes modality misalignment, thereby failing to unlock the full potential of VLMs. In this paper, we propose NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization), a novel cross-modality interaction VLM-based framework that contains two contributions: (1) Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) Orthogonal Cross-Attention Adapter (OCA). OCA introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: the truly novel information and the incremental knowledge. By isolating the learning process from the interference of incremental knowledge, OCA enables a more focused acquisition of new information, thereby further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves these two contributions in a parameter-efficient style, which only introduces 1.46M learnable parameters.
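
OCA's decoupling can be caricatured as an orthogonality penalty between the "truly novel" and "incremental" components. A generic sketch in that spirit, not the paper's exact loss; the function name and the Frobenius-style penalty are our choices:

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(novel: torch.Tensor, incremental: torch.Tensor) -> torch.Tensor:
    """Push the 'truly novel' and 'incremental' components toward orthogonality.

    novel, incremental: (N, D) feature matrices. Penalizes the squared
    cross-correlations between the two sets of normalized features.
    """
    a = F.normalize(novel, dim=-1)
    b = F.normalize(incremental, dim=-1)
    gram = a @ b.transpose(0, 1)  # (N, N) pairwise cosine similarities
    return (gram ** 2).mean()
```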

[164] AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models

Ashkan Ganj, Yiqin Zhao, Tian Guo

Main category: cs.CV

TL;DR: ARCADE is an AR-based platform designed to simplify human perception studies for CV model evaluation, offering cross-platform data collection, custom protocols, and AR streaming.

DetailsMotivation: Human perception studies are valuable but complex and hard to scale; AR presents a unique opportunity to streamline this process.

Method: Developed ARCADE, a platform supporting AR data collection, pluggable model inference, and AR streaming for user studies. Demonstrated with depth and lighting estimation models.

Result: ARCADE effectively elicits human perceptual judgments of CV model quality and performs well across various deployment settings.

Conclusion: ARCADE is a flexible and effective human-centered evaluation platform for CV research.

Abstract: Human perception studies can provide complementary insights to qualitative evaluation for understanding computer vision (CV) model performance. However, conducting human perception studies remains non-trivial: it often requires complex, end-to-end system setups that are time-consuming and difficult to scale. In this paper, we explore the unique opportunity presented by augmented reality (AR) for helping CV researchers conduct perceptual studies. We design ARCADE, an evaluation platform that allows researchers to easily leverage AR's rich context and interactivity for human-centered CV evaluation. Specifically, ARCADE supports cross-platform AR data collection, custom experiment protocols via pluggable model inference, and AR streaming for user studies. We demonstrate ARCADE using two types of CV models, depth and lighting estimation, and show that AR tasks can be effectively used to elicit human perceptual judgments of model quality. We also evaluate the system's usability and performance across different deployment and study settings, highlighting its flexibility and effectiveness as a human-centered evaluation platform.

[165] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang

Main category: cs.CV

TL;DR: MLLMSeg is a lightweight RES framework that leverages MLLMs’ visual and semantic features without extra encoders, outperforming SAM-based and SAM-free methods.

DetailsMotivation: Address the trade-off between performance and cost in RES by avoiding heavy models like SAM while maintaining accuracy.

Method: Uses MLLM’s visual encoder for detail features, integrates them with LLM’s semantic features via DSFF, and employs a lightweight mask decoder (34M parameters).

Result: Outperforms SAM-based and SAM-free methods, balancing performance and cost.

Conclusion: MLLMSeg efficiently combines visual and semantic features for precise RES without heavy computational costs.

Abstract: Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM), which has 632M network parameters, or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.
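
The DSFF idea, fusing detail-rich vision-encoder tokens with the LLM's semantic embedding before a light mask decoder, can be sketched as a simple concatenate-and-project module. The dimensions and the single-token semantic convention below are our assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class DetailSemanticFusion(nn.Module):
    """Concatenate-and-project fusion of vision-encoder detail tokens with the
    LLM's semantic embedding, feeding a light mask decoder. A generic stand-in
    for the paper's DSFF module."""
    def __init__(self, vis_dim: int, llm_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim + llm_dim, out_dim)

    def forward(self, vis_feats: torch.Tensor, sem_emb: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N, vis_dim) patch tokens; sem_emb: (B, llm_dim)
        sem = sem_emb[:, None, :].expand(-1, vis_feats.shape[1], -1)
        return self.proj(torch.cat([vis_feats, sem], dim=-1))  # (B, N, out_dim)
```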

[166] CLIPVehicle: A Unified Framework for Vision-based Vehicle Search

Likai Wang, Ruize Han, Xiangqun Zhang, Wei Feng

Main category: cs.CV

TL;DR: CLIPVehicle proposes a unified framework for joint vehicle detection and re-identification, addressing conflicting objectives with a dual-granularity alignment module and multi-level learning strategy, outperforming state-of-the-art methods.

DetailsMotivation: Existing methods for vehicle search are resource-intensive and impractical, requiring separate detection and re-identification steps. This work aims to unify these tasks efficiently.

Method: Introduces CLIPVehicle with a dual-granularity semantic-region alignment module and multi-level learning strategy, leveraging VLMs for vehicle discrimination. Benchmarks include CityFlowVS and synthetic datasets.

Result: Outperforms state-of-the-art methods in vehicle Re-ID and person search tasks.

Conclusion: CLIPVehicle effectively unifies detection and re-identification, offering a practical and efficient solution for vehicle search.

Abstract: Vehicles are among the most common and significant objects in the real world, and computer vision research on them, covering tasks such as vehicle detection and vehicle re-identification, has made remarkable progress. To search for a vehicle of interest in surveillance videos, existing methods first pre-detect and store all vehicle patches and then apply vehicle re-identification models, which is resource-intensive and not very practical. In this work, we aim to achieve joint detection and re-identification for vehicle search. However, the conflicting objectives between detection, which focuses on shared vehicle commonness, and re-identification, which focuses on individual vehicle uniqueness, make it challenging for a model to learn in an end-to-end system. For this problem, we propose a new unified framework, namely CLIPVehicle, which contains a dual-granularity semantic-region alignment module to leverage VLMs (Vision-Language Models) for vehicle discrimination modeling, and a multi-level vehicle identification learning strategy to learn the identity representation from global, instance and feature levels. We also construct a new benchmark, including a real-world dataset CityFlowVS, and two synthetic datasets SynVS-Day and SynVS-All, for vehicle search. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods of both vehicle Re-ID and person search tasks.

[167] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Maximilian Ulmer, Wout Boerdijk, Rudolph Triebel, Maximilian Durner

Main category: cs.CV

TL;DR: OC-DiT is a diffusion model for object-centric prediction, achieving zero-shot instance segmentation by conditioning on object templates and image features. It introduces coarse and refinement models, trained on synthetic data, and outperforms benchmarks without retraining.

DetailsMotivation: To address the challenge of zero-shot instance segmentation by leveraging diffusion models to disentangle object instances using visual descriptors and localized cues.

Method: A conditional latent diffusion framework with two variants: a coarse model for initial proposals and a refinement model for parallel refinement. Trained on synthetic data.

Result: State-of-the-art performance on real-world benchmarks without retraining, demonstrating diffusion models’ potential for instance segmentation.

Conclusion: OC-DiT effectively applies diffusion models to instance segmentation, achieving strong zero-shot performance and showcasing the framework’s versatility.

Abstract: This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model’s latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

[168] Learning Using Privileged Information for Litter Detection

Matthias Bartolo, Konstantinos Makantasis, Dylan Seychell

Main category: cs.CV

TL;DR: A novel deep learning approach combines privileged information with object detection to improve litter detection efficiently, showing consistent performance gains across datasets without added complexity.

DetailsMotivation: Addressing the global rise in litter pollution by enhancing automated detection tools, particularly for small or obscured litter objects.

Method: Combines privileged information with deep learning object detection, encoding bounding box information as binary masks for refined detection guidance. Evaluated across five object detection models.

Result: Demonstrated consistent performance improvements on SODA, BDW, and UAVVaste datasets, with better generalization and no added model complexity.

Conclusion: The methodology offers a practical, efficient solution for litter detection, balancing accuracy and scalability in real-world applications.

Abstract: As litter pollution continues to rise globally, developing automated tools capable of detecting litter effectively remains a significant challenge. This study presents a novel approach that combines, for the first time, privileged information with deep learning object detection to improve litter detection while maintaining model efficiency. We evaluate our method across five widely used object detection models, addressing challenges such as detecting small litter and objects partially obscured by grass or stones. In addition, a key contribution of our work is a means of encoding bounding box information as a binary mask, which can be fed to the detection model to refine detection guidance. Through both within-dataset evaluation on the renowned SODA dataset and cross-dataset evaluation on the BDW and UAVVaste litter detection datasets, we demonstrate consistent performance improvements across all models. Our approach not only bolsters detection accuracy within the training sets but also generalises well to other litter detection contexts. Crucially, these improvements are achieved without increasing model complexity or adding extra layers, ensuring computational efficiency and scalability. Our results suggest that this methodology offers a practical solution for litter detection, balancing accuracy and efficiency in real-world applications.
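
The box-to-mask encoding the abstract highlights is straightforward to make concrete. A sketch of our reading, with the mask intended as an extra guidance channel during training; the function name and coordinate convention are ours:

```python
import numpy as np

def boxes_to_binary_mask(boxes: np.ndarray, height: int, width: int) -> np.ndarray:
    """Rasterize bounding boxes into one binary mask channel.

    boxes: (N, 4) array of [x1, y1, x2, y2] pixel coordinates. The mask can be
    concatenated to the input as privileged guidance during training.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes.astype(int):
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, width), min(y2, height)
        mask[y1:y2, x1:x2] = 1
    return mask
```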

[169] SVC 2025: the First Multimodal Deception Detection Challenge

Xun Lin, Xiaobao Guo, Taorui Wang, Yingjie Ma, Jiajian Huang, Jiayu Zhang, Junzhe Cao, Zitong Yu

Main category: cs.CV

TL;DR: The paper introduces the SVC 2025 Multimodal Deception Detection Challenge to address cross-domain generalization in deception detection, leveraging multimodal data for adaptable systems.

DetailsMotivation: Existing deception detection methods struggle with domain shifts, limiting their real-world applicability. The challenge aims to improve cross-domain performance.

Method: The challenge involves developing models using multimodal data (audio, video, text) to detect deception across diverse datasets.

Result: 21 teams participated, submitting final results, indicating strong engagement with the challenge.

Conclusion: The benchmark promotes adaptable, explainable deception detection systems, advancing multimodal learning.

Abstract: Deception detection is a critical task in real-world applications such as security screening, fraud prevention, and credibility assessment. While deep learning methods have shown promise in surpassing human-level performance, their effectiveness often depends on the availability of high-quality and diverse deception samples. Existing research predominantly focuses on single-domain scenarios, overlooking the significant performance degradation caused by domain shifts. To address this gap, we present the SVC 2025 Multimodal Deception Detection Challenge, a new benchmark designed to evaluate cross-domain generalization in audio-visual deception detection. Participants are required to develop models that not only perform well within individual domains but also generalize across multiple heterogeneous datasets. By leveraging multimodal data, including audio, video, and text, this challenge encourages the design of models capable of capturing subtle and implicit deceptive cues. Through this benchmark, we aim to foster the development of more adaptable, explainable, and practically deployable deception detection systems, advancing the broader field of multimodal learning. By the conclusion of the workshop competition, a total of 21 teams had submitted their final results. See https://sites.google.com/view/svc-mm25 for more information.

[170] DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation

Zhaohong Huang, Yuxin Zhang, Mingbao Lin, Taojian Zhou, Guorong Cai, Rongrong Ji

Main category: cs.CV

TL;DR: DS²Net introduces multi-view deep supervision for medical image segmentation, combining detail and semantic feature supervision with adaptive uncertainty-based loss, outperforming existing methods.

DetailsMotivation: Existing methods supervise either coarse-grained semantic or fine-grained detailed features in isolation, ignoring their vital relationships in medical image analysis.

Method: Proposes DS²Net with Detail Enhance Module (DEM) and Semantic Enhance Module (SEM) for multi-view supervision, and an uncertainty-based loss for adaptive feature supervision.

Result: DS²Net consistently outperforms state-of-the-art methods across six benchmarks in colonoscopy, ultrasound, and microscope imaging.

Conclusion: DS²Net’s multi-view deep supervision and adaptive loss design significantly improve medical image segmentation, addressing limitations of prior work.

Abstract: Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or fine-grained detailed features in isolation, which overlooks the fact that these two types of features hold vital relationships in medical image analysis. We advocate the power of complementary feature supervision for medical image segmentation, by proposing a Detail-Semantic Deep Supervision Network (DS$^2$Net). DS$^2$Net navigates both low-level detailed and high-level semantic feature supervision through Detail Enhance Module (DEM) and Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS$^2$Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured under colonoscopy, ultrasound, and microscopy, we demonstrate that DS$^2$Net consistently outperforms state-of-the-art methods for medical image analysis.
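
Code sketch: The abstract does not give the exact form of the uncertainty-based supervision loss; below is a minimal sketch of one standard realization (learned per-scale log-variances, after Kendall and Gal), which may differ from the paper's design.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """One common realization of uncertainty-weighted multi-scale supervision
    (learned log-variances); the paper's exact loss may differ."""
    def __init__(self, num_scales):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_scales))

    def forward(self, per_scale_losses):
        total = 0.0
        for i, loss in enumerate(per_scale_losses):
            # Lower estimated uncertainty -> higher effective supervision strength.
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

criterion = UncertaintyWeightedLoss(num_scales=4)
losses = [torch.rand(1, requires_grad=True).squeeze() for _ in range(4)]
print(criterion(losses))
```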

[171] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Hongyu Guo, Kuan Zhu, Xiangzhao Hao, Haiyun Guo, Ming Tang, Jinqiao Wang

Main category: cs.CV

TL;DR: UniFGVC is a training-free framework for few-shot fine-grained visual classification (FGVC) that reformulates the task as multimodal retrieval, leveraging structured text descriptions and multimodal templates for improved performance.

DetailsMotivation: Existing methods suffer from overfitting and weak generalization when finetuning pre-trained models for few-shot FGVC. UniFGVC addresses this by avoiding training and instead using multimodal retrieval.

Method: UniFGVC uses a Category-Discriminative Visual Captioner (CDV-Captioner) to generate structured text descriptions from images, constructs multimodal templates, and performs retrieval using off-the-shelf encoders.

Result: UniFGVC outperforms prior few-shot CLIP-based methods and some fully-supervised approaches on 12 FGVC benchmarks.

Conclusion: UniFGVC offers reliable generalization and adaptability for few-shot FGVC, demonstrating consistent superiority over existing methods.

Abstract: Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly finetune pre-trained vision-language models to achieve performance gains, yet suffer from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it, we convert each image into an image-description pair, enabling more comprehensive feature representation, and construct multimodal category templates from few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLM-based approaches.
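
Code sketch: The retrieval step can be sketched as nearest-template search in a joint space. The concatenation-based joint embedding below is an assumption for illustration; the abstract only specifies retrieval over image-description pairs with off-the-shelf encoders.

```python
import numpy as np

def classify_by_retrieval(query_img_emb, query_txt_emb, templates):
    """templates: {class_name: list of (img_emb, txt_emb) pairs}.
    The joint embedding here is a simple L2-normalized concatenation."""
    def joint(img, txt):
        v = np.concatenate([img, txt])
        return v / np.linalg.norm(v)

    q = joint(query_img_emb, query_txt_emb)
    best_class, best_sim = None, -np.inf
    for cls, pairs in templates.items():
        for img, txt in pairs:
            sim = float(q @ joint(img, txt))  # cosine similarity
            if sim > best_sim:
                best_class, best_sim = cls, sim
    return best_class

rng = np.random.default_rng(0)
templates = {c: [(rng.normal(size=512), rng.normal(size=512))] for c in ["sparrow", "finch"]}
print(classify_by_retrieval(rng.normal(size=512), rng.normal(size=512), templates))
```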

[172] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao

Main category: cs.CV

TL;DR: IDC-Net is a unified framework for generating RGB-D video sequences with controlled camera trajectories, combining RGB and depth synthesis in a geometry-aware diffusion model for better alignment and fidelity.

DetailsMotivation: Existing methods treat RGB and depth generation separately, leading to misalignment. IDC-Net aims to unify these processes for precise camera control and geometric consistency.

Method: IDC-Net uses a joint learning framework with a geometry-aware diffusion model and a transformer block for fine-grained camera control, trained on a dataset with aligned RGB videos, depth maps, and camera poses.

Result: IDC-Net outperforms state-of-the-art methods in visual quality and geometric consistency, enabling direct use in 3D scene reconstruction without post-processing.

Conclusion: IDC-Net’s joint learning approach improves RGB-D sequence generation, offering practical benefits for downstream tasks like 3D reconstruction.

Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be fed directly into downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.

[173] ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation

Yihua Shao, Xiaofeng Lin, Xinwei Long, Siyu Chen, Minxi Yan, Yang Liu, Ziyang Yan, Ao Ma, Hao Tang, Jingcai Guo

Main category: cs.CV

TL;DR: ICM-Fusion combines meta-learning and in-context adaptation to enhance multi-task generalization in LoRA models, reducing conflicts and forgetting.

DetailsMotivation: Address inter-weight conflicts and catastrophic domain forgetting in pre-trained LoRA fusion methods.

Method: Uses task vector arithmetic and Fusion VAE (F-VAE) to dynamically balance optimization directions and reconstruct fused LoRA.

Result: ICM-Fusion reduces multi-tasking loss and enhances performance in few-shot scenarios across visual and linguistic tasks.

Conclusion: ICM-Fusion is adaptable to various models and tasks, outperforming existing LoRA fusion methods.

Abstract: Enabling multi-task adaptation in pre-trained Low-Rank Adaptation (LoRA) models is crucial for enhancing their generalization capabilities. Most existing pre-trained LoRA fusion methods decompose weight matrices, sharing similar parameters while merging divergent ones. However, this paradigm inevitably induces inter-weight conflicts and leads to catastrophic domain forgetting. While incremental learning enables adaptation to multiple tasks, it struggles to achieve generalization in few-shot scenarios. Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. The key innovation lies in our task vector arithmetic, which dynamically balances conflicting optimization directions across domains through learned manifold projections. ICM-Fusion obtains the optimal task vector orientation for the fused model in the latent space by adjusting the orientation of the task vectors. Subsequently, the fused LoRA is reconstructed by a self-designed Fusion VAE (F-VAE) to realize multi-task LoRA generation. We have conducted extensive experiments on visual and linguistic tasks, and the experimental results demonstrate that ICM-Fusion can be adapted to a wide range of architectural models and applied to various tasks. Compared to current pre-trained LoRA fusion methods, the LoRA fused by ICM-Fusion significantly reduces multi-task loss and can even achieve task enhancement in few-shot scenarios.
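
Code sketch: ICM-Fusion builds on task vector arithmetic. The sketch below shows only the plain arithmetic it starts from (tau_t = theta_t - theta_base, weighted sum), assuming fixed scalar weights; the paper's learned manifold projections and F-VAE reconstruction are omitted.

```python
import torch

def task_vector(base_state, finetuned_state):
    """tau_t = theta_t - theta_base, computed per parameter tensor."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def fuse(base_state, task_vectors, alphas):
    """theta_fused = theta_base + sum_t alpha_t * tau_t (plain task arithmetic;
    ICM-Fusion additionally re-orients the vectors, which is omitted here)."""
    fused = {k: v.clone() for k, v in base_state.items()}
    for alpha, tau in zip(alphas, task_vectors):
        for k in fused:
            fused[k] += alpha * tau[k]
    return fused

base = {"w": torch.zeros(2, 2)}
t1 = task_vector(base, {"w": torch.ones(2, 2)})
t2 = task_vector(base, {"w": -torch.ones(2, 2)})
print(fuse(base, [t1, t2], alphas=[0.7, 0.3])["w"])
```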

[174] ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations

Subhankar Swain, Naquee Rizwan, Nayandeep Deb, Vishwajeet Singh Solanki, Vishwa Gangadhar S, Animesh Mukherjee

Main category: cs.CV

TL;DR: The paper introduces a dataset of 6,300 real-world memes annotated for toxicity and proposes a tag generation module to enhance meme moderation systems.

DetailsMotivation: Addressing the lack of accessible data for meme moderation, given the role of memes in spreading harmful content online.

Method: Created a dataset with binary toxicity classification and fine-grained labels, enriched with metadata. Proposed a tag generation module for context.

Result: Incorporating socially grounded tags improved performance of state-of-the-art VLMs in detection tasks.

Conclusion: The work provides a scalable foundation for better content moderation in multimodal online spaces.

Abstract: The 2025 Global Risks Report identifies state-based armed conflict and societal polarisation among the most pressing global threats, with social media playing a central role in amplifying toxic discourse. Memes, as a widely used mode of online communication, often serve as vehicles for spreading harmful content. However, limitations in data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. To address this challenge, in this work, we introduce a first-of-its-kind dataset of 6,300 real-world meme-based posts annotated in two stages: (i) binary classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A key feature of this dataset is that it is enriched with auxiliary metadata of socially relevant tags, enhancing the context of each meme. In addition, we propose a tag generation module that produces socially grounded tags, because in-the-wild memes often do not come with tags. Experimental results show that incorporating these tags substantially enhances the performance of state-of-the-art VLMs on detection tasks. Our contributions offer a novel and scalable foundation for improved content moderation in multimodal online environments.

[175] AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization

Jingyi Liao, Yongyi Su, Rong-Cheng Tu, Zhao Jin, Wenhao Sun, Yiting Li, Dacheng Tao, Xun Xu, Xulei Yang

Main category: cs.CV

TL;DR: The paper proposes a framework to improve Multimodal Large Language Models (MLLMs) for anomaly detection by introducing a multi-stage reasoning process and a fine-grained reward mechanism, achieving better performance on specialized tasks.

DetailsMotivation: Existing MLLM approaches for anomaly detection face challenges like inadequate data utilization and lack of supervision over reasoning processes, limiting their effectiveness in specialized domains.

Method: The framework includes a multi-stage deliberative reasoning process and a fine-grained reward mechanism to enhance model performance and supervision.

Result: The method outperforms existing approaches, achieving superior accuracy in adapting general vision-language models to specialized anomaly detection tasks.

Conclusion: The proposed framework effectively bridges the gap between general-purpose MLLM capabilities and the fine-grained requirements of anomaly detection in specialized domains.

Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes that encourage immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection. Our method achieves superior accuracy with efficient adaptation of existing annotations, effectively bridging the gap between general-purpose MLLM capabilities and the fine-grained visual discrimination required for detecting subtle manufacturing defects and structural irregularities.
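
Code sketch: A minimal illustration of a fine-grained reward that mixes classification correctness with a continuous localization signal (IoU), rather than a single binary outcome. The weighting scheme and box-based localization signal are assumptions; the abstract does not give the exact reward design.

```python
def iou(box_a, box_b):
    """Intersection-over-union for XYXY boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def fine_grained_reward(pred_label, gt_label, pred_box, gt_box, w_cls=0.5, w_loc=0.5):
    """Continuous reward mixing classification correctness with localization
    quality; weights are illustrative."""
    return w_cls * float(pred_label == gt_label) + w_loc * iou(pred_box, gt_box)

print(fine_grained_reward("defect", "defect", (10, 10, 50, 50), (12, 8, 48, 52)))
```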

[176] Uncertainty-Aware Spatial Color Correlation for Low-Light Image Enhancement

Jin Kuang, Dong Liu, Yukuang Zhang, Shengsheng Wang

Main category: cs.CV

TL;DR: U2CLLIE is a novel framework for low-light image enhancement that integrates uncertainty-aware enhancement and spatial-color causal correlation modeling to address feature uncertainty and noise dominance in dark conditions.

DetailsMotivation: Existing methods overlook intrinsic uncertainty in feature representations under dark conditions, leading to degraded gradients and noise dominance, impairing model reliability and causal reasoning.

Method: The framework includes an Uncertainty-Aware Dual-domain Denoise (UaD) Module for noise suppression and a hierarchical causality-aware framework with Luminance Enhancement Network (LEN) and causal correlation modules (NeCo and AsC) for structure and color consistency.

Result: U2CLLIE achieves state-of-the-art performance on multiple benchmarks, showing robustness and generalization across various scenes.

Conclusion: The proposed framework effectively mitigates uncertainty and noise issues in low-light image enhancement, outperforming existing methods.

Abstract: Most existing low-light image enhancement approaches primarily focus on architectural innovations, while often overlooking the intrinsic uncertainty within feature representations, particularly under extremely dark conditions where degraded gradients and noise dominance severely impair model reliability and causal reasoning. To address these issues, we propose U2CLLIE, a novel framework that integrates uncertainty-aware enhancement and spatial-color causal correlation modeling. From the perspective of entropy-based uncertainty, our framework introduces two key components: (1) An Uncertainty-Aware Dual-domain Denoise (UaD) Module, which leverages Gaussian-Guided Adaptive Frequency Domain Feature Enhancement (G2AF) to suppress frequency-domain noise and optimize entropy-driven representations. This module enhances spatial texture extraction and frequency-domain noise suppression/structure refinement, effectively mitigating gradient vanishing and noise dominance. (2) A hierarchical causality-aware framework, where a Luminance Enhancement Network (LEN) first performs coarse brightness enhancement on dark regions. Then, during the encoder-decoder phase, two asymmetric causal correlation modeling modules, Neighborhood Correlation State Space (NeCo) and Adaptive Spatial-Color Calibration (AsC), collaboratively construct hierarchical causal constraints. These modules reconstruct and reinforce neighborhood structure and color consistency in the feature space. Extensive experiments demonstrate that U2CLLIE achieves state-of-the-art performance across multiple benchmark datasets, exhibiting robust performance and strong generalization across various scenes.

[177] Deeper Inside Deep ViT

Sungrae Hong

Main category: cs.CV

TL;DR: The paper explores the practical utility of large-scale vision models like ViT-22B, focusing on training stability, performance, and image generation suitability.

DetailsMotivation: To understand the practical application and training dynamics of large-scale vision models, particularly ViT-22B, and to explore its potential in image generation.

Method: Examined ViT-22B’s training in a local environment, addressed instability issues with modifications, and proposed an image generation architecture using ViT and ViT-22B.

Result: ViT-22B outperformed ViT in performance under the same parameter size and was evaluated for image generation suitability.

Conclusion: ViT-22B shows promise for practical use and image generation, though training stability remains a challenge.

Abstract: There have been attempts to create large-scale structures in vision models similar to LLMs, such as ViT-22B. While this research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. Therefore, we examine how this model structure behaves and trains in a local environment. We also highlight the instability in training and make some model modifications to stabilize it. The ViT-22B model, trained from scratch, overall outperformed ViT in terms of performance under the same parameter size. Additionally, we venture into the task of image generation, which has not been attempted with ViT-22B. We propose an image generation architecture using ViT and investigate which of ViT and ViT-22B is a more suitable structure for image generation.
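
Code sketch: One stabilizer introduced in the original ViT-22B is LayerNorm applied to queries and keys before the attention dot product, which curbs diverging attention logits. Whether this paper's modifications coincide with it is not stated; the block below is a generic QK-norm attention sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with LayerNorm on queries and keys (ViT-22B-style)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.d)
        self.k_norm = nn.LayerNorm(self.d)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).view(b, n, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        q, k = self.q_norm(q), self.k_norm(k)  # normalize before the dot product
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

print(QKNormAttention(64, 4)(torch.randn(2, 16, 64)).shape)  # (2, 16, 64)
```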

[178] RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation

Fengyi Wu, Yimian Dai, Tianfang Zhang, Yixuan Ding, Jian Yang, Ming-Ming Cheng, Zhenming Peng

Main category: cs.CV

TL;DR: RPCANet++ integrates RPCA with deep learning for efficient and interpretable sparse object segmentation, outperforming traditional methods.

DetailsMotivation: Traditional RPCA models face computational inefficiency, hyperparameter sensitivity, and inflexibility in dynamic scenarios.

Method: RPCANet++ unfolds a relaxed RPCA model into a network with Background Approximation, Object Extraction, and Image Restoration Modules, enhanced by Memory-Augmented and Deep Contrast Prior Modules.

Result: Achieves state-of-the-art performance across diverse datasets, with improved interpretability through visual and numerical metrics.

Conclusion: Combining RPCA’s theoretical strengths with deep learning efficiency, RPCANet++ sets a new benchmark for reliable and interpretable sparse object segmentation.

Abstract: Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.
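
Code sketch: Classical RPCA decomposes D ≈ L + S with soft-thresholding for the sparse objects and singular-value thresholding for the low-rank background; RPCANet++ unfolds such updates into learned modules. The sketch below shows only the handcrafted iteration it starts from, with illustrative thresholds.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the l1 norm: promotes sparsity in S."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_step(D, L, tau_s=0.1, tau_l=1.0):
    """One classical RPCA-style update for D ~ L + S. RPCANet++ replaces these
    handcrafted proximal steps with learned network modules."""
    S = soft_threshold(D - L, tau_s)
    U, sig, Vt = np.linalg.svd(D - S, full_matrices=False)
    L = (U * np.maximum(sig - tau_l, 0.0)) @ Vt  # singular value thresholding
    return L, S

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 32))
L, S = np.zeros_like(D), None
for _ in range(10):
    L, S = rpca_step(D, L)
print(np.linalg.matrix_rank(L, tol=1e-6), (np.abs(S) > 0).mean())
```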

[179] From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models

Dunyuan Xu, Xikai Yang, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng

Main category: cs.CV

TL;DR: The paper introduces MLLMU-Med, a benchmark for evaluating machine unlearning in biomedical MLLMs, addressing privacy and incorrect knowledge issues. It highlights the limitations of current unlearning methods.

DetailsMotivation: Biomedical MLLMs risk privacy leaks and incorrect outputs due to harmful training data. Retraining is impractical, so machine unlearning is proposed as a solution, but lacks evaluation benchmarks.

Method: The authors create MLLMU-Med, a benchmark with synthetic private data and factual errors, targeting privacy protection and incorrectness removal. They propose an Unlearning Efficiency Score to evaluate performance.

Result: Five unlearning methods were tested on MLLMU-Med, showing limited effectiveness in removing harmful knowledge, indicating room for improvement.

Conclusion: MLLMU-Med provides a foundation for future research in machine unlearning for biomedical MLLMs, addressing critical security challenges.

Abstract: The security of biomedical Multimodal Large Language Models (MLLMs) has attracted increasing attention. However, training samples easily contain private information and incorrect knowledge that are difficult to detect, potentially leading to privacy leakage or erroneous outputs after deployment. An intuitive idea is to reprocess the training set to remove unwanted content and retrain the model from scratch. Yet, this is impractical due to significant computational costs, especially for large language models. Machine unlearning has emerged as a solution to this problem, which avoids complete retraining by selectively removing undesired knowledge derived from harmful samples while preserving required capabilities on normal cases. However, there exist no available datasets to evaluate the unlearning quality for security protection in biomedical MLLMs. To bridge this gap, we propose the first benchmark Multimodal Large Language Model Unlearning for BioMedicine (MLLMU-Med) built upon our novel data generation pipeline that effectively integrates synthetic private data and factual errors into the training set. Our benchmark targets two key scenarios: 1) Privacy protection, where patient private information is mistakenly included in the training set, causing models to unintentionally respond with private data during inference; and 2) Incorrectness removal, where wrong knowledge derived from unreliable sources is embedded into the dataset, leading to unsafe model responses. Moreover, we propose a novel Unlearning Efficiency Score that directly reflects the overall unlearning performance across different subsets. We evaluate five unlearning approaches on MLLMU-Med and find that these methods show limited effectiveness in removing harmful knowledge from biomedical MLLMs, indicating significant room for improvement. This work establishes a new pathway for further research in this promising field.

[180] Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective

Yan Zhang, Gangyan Zeng, Daiqing Wu, Huawen Shen, Binbin Li, Yu Zhou, Can Ma, Xiaojun Bi

Main category: cs.CV

TL;DR: GAT (Gather and Trace) improves Video TextVQA by focusing on text instances, enhancing accuracy and efficiency with context aggregation and trajectory tracing.

DetailsMotivation: Existing frame-level frameworks suffer from redundant text entities and poor relation modeling, limiting accuracy and efficiency.

Method: GAT uses a context-aggregated instance gathering module for unified text representation and an instance-focused trajectory tracing module for spatio-temporal relationships.

Result: GAT outperforms existing methods by 3.86% in accuracy and is ten times faster than video large language models.

Conclusion: GAT provides a robust solution for Video TextVQA, balancing accuracy and speed.

Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.

[181] Bootstrap Deep Spectral Clustering with Optimal Transport

Wengang Guo, Wei Ye, Chunchun Chen, Xin Sun, Christian Böhm, Claudia Plant, Susanto Rahardja

Main category: cs.CV

TL;DR: BootSC is a deep spectral clustering model that jointly optimizes all stages of spectral clustering (affinity matrix, spectral embedding, and k-means) in an end-to-end manner, achieving state-of-the-art performance.

DetailsMotivation: Addresses the disjoint optimization and limited representation capacity of traditional spectral clustering methods.

Method: Uses a single network for end-to-end learning, leverages optimal-transport-derived supervision, and introduces orthogonal re-parameterization for spectral embeddings.

Result: Achieves a 16% NMI improvement over the runner-up method on ImageNet-Dogs.

Conclusion: BootSC significantly enhances clustering performance and representation capability.

Abstract: Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering – affinity matrix construction, spectral embedding, and $k$-means clustering – using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at https://github.com/spdj2271/BootSC.
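
Code sketch: One common way to derive optimal-transport supervision for cluster assignments is Sinkhorn-Knopp balancing of the score matrix (as popularized by SwAV); whether BootSC uses exactly this construction is an assumption, so the sketch is illustrative only.

```python
import torch

def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    """Balanced soft cluster assignments via Sinkhorn-Knopp normalization."""
    # Shift for numerical stability; the constant cancels after normalization.
    Q = torch.exp((scores - scores.max()) / eps).T   # (clusters, samples)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K      # equal mass per cluster
        Q /= Q.sum(dim=0, keepdim=True); Q /= B      # each sample sums to 1/B
    return (Q * B).T                                  # rows: soft assignments

scores = torch.randn(8, 3)   # 8 samples, 3 clusters
P = sinkhorn_assign(scores)
print(P.sum(dim=1))          # each row sums to ~1
```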

[182] ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs

Ben Zhang, LuLu Yu, Lei Gao, Jing Liu, QuanJiang Guo, Hui Gao

Main category: cs.CV

TL;DR: ViFP is a framework to improve visual-language model reasoning by detecting false positives (FPs) and enhancing reasoning reliability through sub-question templates, multi-turn QA, and adaptive CoT mechanisms.

DetailsMotivation: Existing methods for FP reasoning in VLMs are dataset-dependent and costly, limiting generalization. ViFP aims to overcome these issues.

Method: ViFP constructs sub-question templates, uses multi-turn QA for reasoning paths, dynamically analyzes consistency, and employs adaptive CoT to guide FP and non-FP samples.

Result: ViFP improves accuracy by up to 5.4% on A-OKVQA, reduces FPs, and outperforms prior methods.

Conclusion: ViFP enhances reasoning reliability and accuracy, validated by the new VoC metric and superior performance on benchmark datasets.

Abstract: In visual-language model (VLM) reasoning, false positive(FP) reasoning occurs when a model generates a correct answer but follows an incorrect reasoning path. Existing methods based on specific multi-step reasoning datasets and reinforcement learning strategies, leading to high training costs and limited generalization. In this work, we propose ViFP, a general framework for enhancing visual reasoning reliability. It improves both answer accuracy and reasoning soundness by detecting FPs. ViFP tackles the limitations of dataset dependency and poor generalization by constructing sub-question templates grounded in the core dimensions of visual reasoning, such as object localization, characteristic description, and object discovery. ViFP then builds effective reasoning paths via multi-turn QA to improve reasoning accuracy. Meanwhile, ViFP dynamically analyzes the consistency of reasoning path to identify potential FPs, and introduces a targeted chain-of-thought (CoT) mechanism that adaptively guides both FP and non-FP samples. Thereby reducing logical errors in the reasoning path while preserving accuracy. Finally, we introduce a reliability evaluation metric-VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly, but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OKVQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.

[183] Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification

Jianxun Yu, Ruiquan Ge, Zhipeng Wang, Cheng Yang, Chenyu Lin, Xianjun Fu, Jikui Liu, Ahmed Elazab, Changmiao Wang

Main category: cs.CV

TL;DR: MMCAF-Net improves medical disease diagnosis by addressing dimensionality differences in multimodal data through a feature pyramid and cross-attention fusion.

DetailsMotivation: Challenges in diagnosing small lesions and integrating medical imaging with EHR data due to dimensionality differences.

Method: Proposes MMCAF-Net with a feature pyramid, 3D multi-scale convolutional attention, and cross-attention for multimodal fusion.

Result: Outperforms state-of-the-art methods on the Lung-PET-CT-Dx dataset, improving diagnostic accuracy.

Conclusion: MMCAF-Net effectively integrates multimodal data for better medical diagnosis, with code publicly available.

Abstract: The diagnosis of medical diseases faces challenges such as the misdiagnosis of small lesions. Deep learning, particularly multimodal approaches, has shown great potential in the field of medical disease diagnosis. However, the differences in dimensionality between medical imaging and electronic health record data present challenges for effective alignment and fusion. To address these issues, we propose the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). This model employs a feature pyramid structure combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images. To further enhance multimodal data integration, MMCAF-Net incorporates a multi-scale cross-attention module, which resolves dimensional inconsistencies, enabling more effective feature fusion. We evaluated MMCAF-Net on the Lung-PET-CT-Dx dataset, and the results showed a significant improvement in diagnostic accuracy, surpassing current state-of-the-art methods. The code is available at https://github.com/yjx1234/MMCAF-Net
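
Code sketch: The cross-attention fusion idea, where low-dimensional EHR features query high-dimensional image features to bridge their dimensionality gap, can be sketched generically. This is an illustrative stand-in, not the paper's multi-scale module; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Generic cross-attention fusion: EHR tokens attend to image tokens,
    sidestepping the dimensionality mismatch between the two modalities."""
    def __init__(self, img_dim, ehr_dim, dim=128, heads=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.ehr_proj = nn.Linear(ehr_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, ehr_tokens):
        q = self.ehr_proj(ehr_tokens)        # (B, Ne, dim) queries from EHR
        kv = self.img_proj(img_tokens)       # (B, Ni, dim) keys/values from image
        fused, _ = self.attn(q, kv, kv)      # EHR attends to image features
        return fused

f = CrossAttentionFusion(img_dim=256, ehr_dim=32)
print(f(torch.randn(2, 196, 256), torch.randn(2, 8, 32)).shape)  # (2, 8, 128)
```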

[184] What Holds Back Open-Vocabulary Segmentation?

Josip Šarić, Ivan Martinović, Matej Kristan, Siniša Šegvić

Main category: cs.CV

TL;DR: The paper identifies bottlenecks in open-vocabulary segmentation models and proposes oracle components to address them, offering insights for future research.

DetailsMotivation: Current open-vocabulary models fail to recognize concepts outside training taxonomy, with performance stagnating for years.

Method: Proposes novel oracle components leveraging groundtruth information to identify and decouple bottlenecks.

Result: Validation experiments reveal empirical findings on model failures and suggest future research directions.

Conclusion: The study provides insights to unlock potential in open-vocabulary segmentation research.

Abstract: Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy. Open-vocabulary approaches promise to close this gap through language-image pretraining on billions of image-caption pairs. Unfortunately, we observe that the promise is not delivered due to several bottlenecks that have caused the performance to plateau for almost two years. This paper proposes novel oracle components that identify and decouple these bottlenecks by taking advantage of the groundtruth information. The presented validation experiments deliver important empirical findings that provide a deeper insight into the failures of open-vocabulary models and suggest prominent approaches to unlock future research.

[185] Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark

Xiao Wang, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li

Main category: cs.CV

TL;DR: The paper proposes SAV, a framework for vehicle part segmentation, addressing limitations of SAM by integrating a knowledge graph and context retrieval. It also introduces a new dataset, VehicleSeg10K.

DetailsMotivation: Current segmentation models like SAM lack fine-grained capabilities for vehicle part segmentation due to missing semantic labels and inaccessible text-prompted functionality.

Method: SAV combines a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context retrieval module to improve segmentation.

Result: Comprehensive experiments on VehicleSeg10K and other datasets benchmark performance, showing SAV’s effectiveness.

Conclusion: SAV advances vehicle part segmentation with structured knowledge and contextual priors, supported by a new benchmark dataset.

Abstract: With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. Both the dataset and source code of this paper will be released at https://github.com/Event-AHU/SAV

[186] SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition

Jiahui Li, Shengeng Tang, Jingxuan He, Gang Huang, Zhangye Wang, Yantao Pan, Lechao Cheng

Main category: cs.CV

TL;DR: SplitGaussian decouples static and dynamic scene components in 3D reconstruction from monocular video, improving quality and stability.

DetailsMotivation: Existing methods entangle static and dynamic elements, causing artifacts like motion leakage and flickering.

Method: Proposes SplitGaussian, explicitly separating static and dynamic representations to prevent artifacts.

Result: Outperforms state-of-the-art in rendering quality, geometric stability, and motion separation.

Conclusion: Disentangled design enhances fidelity, consistency, and convergence speed.

Abstract: Reconstructing dynamic 3D scenes from monocular video remains fundamentally challenging due to the need to jointly infer motion, structure, and appearance from limited observations. Existing dynamic scene reconstruction methods based on Gaussian Splatting often entangle static and dynamic elements in a shared representation, leading to motion leakage, geometric distortions, and temporal flickering. We identify that the root cause lies in the coupled modeling of geometry and appearance across time, which hampers both stability and interpretability. To address this, we propose SplitGaussian, a novel framework that explicitly decomposes scene representations into static and dynamic components. By decoupling motion modeling from background geometry and allowing only the dynamic branch to deform over time, our method prevents motion artifacts in static regions while supporting view- and time-dependent appearance refinement. This disentangled design not only enhances temporal consistency and reconstruction fidelity but also accelerates convergence. Extensive experiments demonstrate that SplitGaussian outperforms prior state-of-the-art methods in rendering quality, geometric stability, and motion separation.

[187] Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian

Main category: cs.CV

TL;DR: A survey on continual learning for vision-language models (VLM-CL), addressing challenges like cross-modal drift and parameter interference, with proposed solutions and future directions.

DetailsMotivation: VLMs struggle with continual learning due to catastrophic forgetting and unique challenges like cross-modal feature drift and zero-shot capability erosion.

Method: Identifies three core failure modes and proposes a taxonomy of solutions: Multi-Modal Replay Strategies, Cross-Modal Regularization, and Parameter-Efficient Adaptation.

Result: Highlights gaps in current evaluation protocols and datasets, advocating for better benchmarks for VLM-specific issues.

Conclusion: The survey serves as a reference for lifelong VLM development, with open problems like continual pre-training and compositional zero-shot learning.

Abstract: Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) Multi-Modal Replay Strategies address cross-modal drift through explicit or implicit memory mechanisms; (2) Cross-Modal Regularization preserves modality alignment during updates; and (3) Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

[188] FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding

Emmanuelle Bourigault, Pauline Bourigault

Main category: cs.CV

TL;DR: FrEVL explores frozen pretrained embeddings for vision-language tasks, achieving near state-of-the-art performance with fewer parameters, faster speed, and lower energy use.

DetailsMotivation: To reduce computational demands of vision-language models by leveraging frozen embeddings.

Method: Uses frozen pretrained embeddings and evaluates their effectiveness on discriminative tasks.

Result: Achieves 85-95% of SOTA performance with 68.4M parameters, 2.3x speedup, and 52% lower energy use.

Conclusion: Frozen embeddings are viable when pretraining aligns with task needs, offering efficiency gains.

Abstract: The deployment of vision-language models remains constrained by substantial computational requirements. We present FrEVL, a framework exploring whether frozen pretrained embeddings can support effective vision-language understanding. Our analysis reveals that frozen embeddings contain rich information for discriminative tasks, achieving 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. This performance dichotomy reveals a critical insight: frozen embedding effectiveness depends on alignment between pretraining objectives and downstream task requirements. When accounting for end-to-end computation including embedding extraction, FrEVL provides $2.3\times$ speedup with 52% lower energy consumption, making it suitable for scenarios with pre-computable inputs or when deployment constraints outweigh marginal performance gains. Our evaluation provides practitioners with guidance on when frozen embedding approaches represent viable alternatives to full model deployment. We will release our complete implementation and evaluation framework to facilitate further research into efficient multi-modal understanding.
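
Code sketch: The frozen-embedding recipe reduces to freezing the encoders and training only a small head. The stand-in linear encoders below are placeholders for real pretrained models such as CLIP; dimensions and head design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrozenEmbeddingClassifier(nn.Module):
    """Keep pretrained vision/text encoders frozen; train only a fusion head."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.img_enc = nn.Linear(768, img_dim).requires_grad_(False)  # frozen
        self.txt_enc = nn.Linear(768, txt_dim).requires_grad_(False)  # frozen
        self.head = nn.Sequential(                                    # trainable
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feats, txt_feats):
        with torch.no_grad():  # embeddings can be precomputed and cached
            z = torch.cat([self.img_enc(img_feats), self.txt_enc(txt_feats)], dim=-1)
        return self.head(z)

model = FrozenEmbeddingClassifier()
print(model(torch.randn(4, 768), torch.randn(4, 768)).shape)  # (4, 2)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # head only
```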

[189] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction

Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong

Main category: cs.CV

TL;DR: A diffusion-based model for pedestrian trajectory prediction incorporates motion intentions, improving interpretability and precision.

DetailsMotivation: Accurate pedestrian trajectory prediction is crucial for autonomous vehicles, but existing models lack explicit motion intention integration.

Method: The model decomposes motion intentions into lateral/longitudinal components, uses an intention recognition module, and employs an efficient guidance mechanism.

Result: Competitive performance on ETH and UCY benchmarks compared to state-of-the-art methods.

Conclusion: Incorporating motion intentions enhances the interpretability and accuracy of trajectory predictions.

Abstract: Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians’ motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.
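
Code sketch: A minimal reading of the lateral/longitudinal intention decomposition: project the intended displacement onto the pedestrian's current heading and its perpendicular. The goal-based formulation is an illustrative assumption; the paper's intention recognition module is learned, not geometric.

```python
import numpy as np

def decompose_intention(history, future_goal):
    """Split intended displacement into longitudinal (along current heading)
    and lateral (perpendicular) components."""
    heading = history[-1] - history[-2]
    heading = heading / (np.linalg.norm(heading) + 1e-8)
    lateral_axis = np.array([-heading[1], heading[0]])  # 90-degree rotation
    disp = future_goal - history[-1]
    return float(disp @ heading), float(disp @ lateral_axis)

hist = np.array([[0.0, 0.0], [0.5, 0.0]])   # walking along +x
lon, lat = decompose_intention(hist, np.array([2.0, 0.8]))
print(lon, lat)  # 1.5 forward, 0.8 to the left
```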

[190] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

Yifan Li, Kun Zhou, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen

Main category: cs.CV

TL;DR: The paper investigates training data’s role in hallucination in Large Vision-Language Models (LVLMs), introduces POPEv2 benchmark, identifies training bias in LM heads, and proposes Obliviate, an unlearning method to reduce hallucination efficiently.

DetailsMotivation: LVLMs suffer from hallucination, generating text inconsistent with visual input, prompting a study on training data's impact.

Method: Introduces POPEv2 benchmark with counterfactual images, probes model components, and proposes Obliviate, a lightweight unlearning method targeting the LM head.

Result: Obliviate reduces hallucination significantly, updates only 2% of parameters, and scales well with model size and data volume.

Conclusion: Obliviate effectively mitigates hallucination, demonstrating strong scalability and generalization, with code and data to be released.

Abstract: As scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering "Yes" to questions about masked objects. To understand this issue, we conduct probing experiments on the models' internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.
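
Code sketch: The head-only update at the core of Obliviate can be sketched by freezing everything except the LM head. The attribute name lm_head and the toy model are assumptions; only the head-only, ~2%-of-parameters update is stated in the abstract, and the unlearning objective itself is omitted here.

```python
import torch.nn as nn

def prepare_lm_head_only(model, head_attr="lm_head"):
    """Freeze all parameters, then unfreeze only the LM head, mirroring the
    finding that the training bias is concentrated there."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in getattr(model, head_attr).parameters():
        p.requires_grad_(True)
    return [p for p in model.parameters() if p.requires_grad]

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(32, 32)   # stays frozen
        self.lm_head = nn.Linear(32, 100)   # the only trainable part

params = prepare_lm_head_only(TinyLM())
print(sum(p.numel() for p in params))  # only lm_head weights and bias
```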

[191] DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification

Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed

Main category: cs.CV

TL;DR: The paper introduces DocVCE, a novel method using generative counterfactuals to explain document image classification models, addressing interpretability gaps in existing feature-importance maps.

DetailsMotivation: Improving transparency and reliability of AI-driven document processing systems, especially in high-stakes applications where biases or spurious correlations can have serious consequences.

Method: Proposes DocVCE, leveraging latent diffusion models and classifier guidance to generate visual counterfactual explanations, followed by hierarchical patch-wise refinement.

Result: Demonstrated effectiveness on three datasets (RVL-CDIP, Tobacco3482, DocLayNet) and three models (ResNet, ConvNeXt, DiT) using validity, closeness, and realism metrics.

Conclusion: First work to explore generative counterfactual explanations in document image analysis, providing actionable insights into model decision-making.

Abstract: As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model’s decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets – RVL-CDIP, Tobacco3482, and DocLayNet – and 3 different models – ResNet, ConvNeXt, and DiT – using well-established evaluation criteria such as validity, closeness, and realism. To the best of the authors’ knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.

[192] A machine learning approach for image classification in synthetic aperture RADAR

Romina Gaburro, Patrick Healy, Shraddha Naidu, Clifford Nolan

Main category: cs.CV

TL;DR: CNNs effectively classify objects in SAR data, achieving ≥75% accuracy for shape and ice type classification, while also examining the impact of antenna height.

DetailsMotivation: To address the challenge of identifying and classifying objects in SAR imagery using CNNs, focusing on geometric and environmental tasks.

Method: Uses a single scattering approximation with CNNs on simulated and reconstructed SAR data, and applies it to real Sentinel-1 imagery for ice type classification.

Result: High classification accuracy (≥75%) for both object shapes and ice types, demonstrating CNN effectiveness in SAR tasks.

Conclusion: CNNs are effective for SAR-based classification, with antenna height impacting success, suggesting further exploration of acquisition parameters.

Abstract: We consider the problem in Synthetic Aperture RADAR (SAR) of identifying and classifying objects located on the ground by means of Convolutional Neural Networks (CNNs). Specifically, we adopt a single scattering approximation to classify the shape of the object using both simulated SAR data and reconstructed images from this data, and we compare the success of these approaches. We then identify ice types in real SAR imagery from the satellite Sentinel-1. In both experiments we achieve a promising high classification accuracy ($\geq$75%). Our results demonstrate the effectiveness of CNNs in using SAR data for both geometric and environmental classification tasks. Our investigation also explores the effect of SAR data acquisition at different antenna heights on our ability to classify objects successfully.

[193] ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition

Santhoshkumar Peddi, Sadhvik Bathini, Arun Balasubramanian, Monalisa Sarma, Debasis Samanta

Main category: cs.CV

TL;DR: ProtoN, a few-shot learning framework, uses a graph-based approach with a Prototype Graph Neural Network (PGNN) to improve ear biometric recognition by jointly processing multiple impressions and enhancing discriminative power.

DetailsMotivation: Ear biometrics face challenges like limited annotated data and intra-class variability, restricting consistent and discriminative feature extraction.

Method: ProtoN employs a graph-based approach with PGNN, using class-specific graphs and a dual-path message-passing mechanism. It includes cross-graph prototype alignment and a hybrid loss function.

Result: ProtoN achieves state-of-the-art performance with 99.60% Rank-1 accuracy and 0.025 EER on benchmark datasets.

Conclusion: ProtoN effectively addresses few-shot ear recognition challenges, demonstrating superior performance under limited data conditions.

Abstract: Ear biometrics offer a stable and contactless modality for identity recognition, yet their effectiveness remains limited by the scarcity of annotated data and significant intra-class variability. Existing methods typically extract identity features from individual impressions in isolation, restricting their ability to capture consistent and discriminative representations. To overcome these limitations, a few-shot learning framework, ProtoN, is proposed to jointly process multiple impressions of an identity using a graph-based approach. Each impression is represented as a node in a class-specific graph, alongside a learnable prototype node that encodes identity-level information. This graph is processed by a Prototype Graph Neural Network (PGNN) layer, specifically designed to refine both impression and prototype representations through a dual-path message-passing mechanism. To further enhance discriminative power, the PGNN incorporates a cross-graph prototype alignment strategy that improves class separability by enforcing intra-class compactness while maintaining inter-class distinction. Additionally, a hybrid loss function is employed to balance episodic and global classification objectives, thereby improving the overall structure of the embedding space. Extensive experiments on five benchmark ear datasets demonstrate that ProtoN achieves state-of-the-art performance, with Rank-1 identification accuracy of up to 99.60% and an Equal Error Rate (EER) as low as 0.025, showing its effectiveness for few-shot ear recognition under limited-data conditions.

[194] PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction

Muhua Zhu, Xinhao Jin, Chengbo Wang, Yongcong Zhang, Yifei Xue, Tie Ji, Yizhen Lao

Main category: cs.CV

TL;DR: Proposes PIS3R, a robust image stitching method for large parallax using deep 3D reconstruction and diffusion-based refinement.

DetailsMotivation: Existing methods struggle with large parallax in 3D scenes, limiting seamless stitching.

Method: Uses visual geometry grounded transformers for 3D reconstruction, reprojects point clouds, and refines with a diffusion module.

Result: Outperforms existing methods in accuracy and quality for large parallax images.

Conclusion: PIS3R is effective for large parallax, preserving geometric integrity for downstream 3D tasks.

Abstract: Image stitching aims to align two images taken from different viewpoints into one seamless, wider image. However, when the 3D scene contains depth variations and the camera baseline is significant, noticeable parallax occurs, meaning the relative positions of scene elements differ substantially between views. Most existing stitching methods struggle to handle such images with large parallax effectively. To address this challenge, in this paper, we propose an image stitching solution called PIS3R that is robust to very large parallax based on the novel concept of deep 3D reconstruction. First, we apply a visual geometry grounded transformer to two input images with very large parallax to obtain both intrinsic and extrinsic parameters, as well as the dense 3D scene reconstruction. Subsequently, we reproject the reconstructed dense point cloud onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and generating an initial stitched image. Finally, to further address potential artifacts such as holes or noise in the initial stitching, we propose a point-conditioned image diffusion module to obtain the refined result. Compared with existing methods, our solution is tolerant to very large parallax and also provides results that fully preserve the geometric integrity of all pixels in the 3D photogrammetric context, enabling direct applicability to downstream 3D vision tasks such as SfM. Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with very large parallax, and outperforms the existing methods qualitatively and quantitatively.
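
The reprojection step at the heart of PIS3R can be illustrated with a plain pinhole-camera sketch: project the reconstructed point cloud into the reference view using the recovered intrinsics and extrinsics, keeping the nearest point per pixel. Variable names and the z-buffer rendering are illustrative assumptions, not the paper's implementation; the holes the loop leaves behind are exactly what the diffusion module is said to refine.

```python
# A minimal sketch of reprojecting a dense 3D point cloud onto a reference
# view. K, R, t and the colored points are assumed to come from the 3D
# reconstruction stage (names are illustrative, not the paper's API).
import numpy as np

def reproject(points, colors, K, R, t, h, w):
    cam = points @ R.T + t                 # world -> reference camera frame
    z = cam[:, 2]
    valid = z > 1e-6                       # keep points in front of the camera
    uv = (cam[valid] / z[valid, None]) @ K.T
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    image = np.zeros((h, w, 3), dtype=np.uint8)
    depth = np.full((h, w), np.inf)
    for ui, vi, zi, ci in zip(u[inside], v[inside],
                              z[valid][inside], colors[valid][inside]):
        if zi < depth[vi, ui]:             # z-buffer: nearest point wins
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image                           # holes remain for diffusion refinement
```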

[195] Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models

Yinan Yu, Alex Gonzalez-Caceres, Samuel Scheidegger, Sanjay Somanath, Alexander Hollberg

Main category: cs.CV

TL;DR: SI3FP is a pipeline for generating LoD3 thermal models from images, combining computer vision and deep learning to accurately identify features like windows, with a 5% error rate in window-to-wall ratio estimates.

DetailsMotivation: Early-phase renovation planning requires accurate thermal 3D models (LoD3), but scalable feature identification is challenging.

Method: SI3FP extracts geometries from images using computer vision and deep learning, directly modeling geometric primitives in the orthographic image plane to reduce distortions.

Result: Tested on Swedish buildings, SI3FP achieved ~5% error in window-to-wall ratio estimates, suitable for renovation analysis.

Conclusion: SI3FP enables scalable energy renovation planning and has broader urban development applications.

Abstract: Renovating existing buildings is essential for reducing climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning.
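
Since SI3FP models primitives directly in the orthographic image plane, the window-to-wall ratio (WWR) that the ~5% error figure refers to reduces to a plain area ratio. A toy illustration, with hypothetical detector outputs as (x, y, width, height) rectangles in meters:

```python
# Toy window-to-wall ratio computation; rectangles are hypothetical detector
# outputs, not SI3FP's actual data structures.
def window_to_wall_ratio(windows, facade_w, facade_h):
    window_area = sum(w * h for _, _, w, h in windows)
    return window_area / (facade_w * facade_h)

windows = [(1.0, 1.2, 1.2, 1.4), (3.5, 1.2, 1.2, 1.4), (6.0, 1.2, 1.2, 1.4)]
print(f"WWR = {window_to_wall_ratio(windows, facade_w=10.0, facade_h=6.0):.3f}")
# -> WWR = 0.084; a 5% relative error would put the estimate near 0.080-0.088
```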

[196] From eye to AI: studying rodent social behavior in the era of machine learning

Giuseppe Chindemi, Camilla Bellone, Benoit Girard

Main category: cs.CV

TL;DR: The paper discusses the shift from human observation to AI and machine learning in rodent social behavior research, highlighting benefits and challenges.

DetailsMotivation: To address biases and limitations in traditional methods and explore the potential of AI for deeper insights into rodent social interactions.

Method: Integration of computer vision, ethology, and neuroscience, with a focus on tools and practical solutions for analysis.

Result: Modern approaches offer multifaceted insights but pose challenges; the paper provides guidance for researchers.

Conclusion: The paper aims to facilitate adoption of AI methods and encourage expert discussion on evolving tool requirements.

Abstract: The study of rodent social behavior has shifted in recent years from relying on direct human observation to more nuanced approaches integrating computational methods in artificial intelligence (AI) and machine learning. While conventional approaches introduce bias and can fail to capture the complexity of rodent social interactions, modern approaches bridging computer vision, ethology and neuroscience provide more multifaceted insights into behavior which are particularly relevant to social neuroscience. Despite these benefits, the integration of AI into social behavior research also poses several challenges. Here we discuss the main steps involved and the tools available for analyzing rodent social behavior, examining their advantages and limitations. Additionally, we suggest practical solutions to address common hurdles, aiming to guide young researchers in adopting these methods and to stimulate further discussion among experts regarding the evolving requirements of these tools in scientific applications.

[197] Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model

Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen

Main category: cs.CV

TL;DR: ErasePro improves concept erasure in text-to-image models by enforcing zero-residual alignment and progressive layer-wise updates, addressing incomplete erasure and quality degradation.

DetailsMotivation: Existing methods for concept erasure in text-to-image models suffer from incomplete erasure and generation quality degradation due to non-zero alignment residuals and concentrated parameter updates in deep layers.

Method: ErasePro introduces a zero-residual constraint for perfect alignment and a progressive, layer-wise update strategy to transfer features from shallow to deep layers, minimizing parameter changes in sensitive layers.

Result: Empirical results show ErasePro’s effectiveness in various concept erasure tasks, including instance, art style, and nudity erasure.

Conclusion: ErasePro achieves more complete concept erasure while better preserving generative quality compared to existing methods.

Abstract: Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantically harmful concepts (i.e., target concepts), is receiving increasing attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantically harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to “non-zero alignment residual”, especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.
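
The “zero-residual” idea can be made concrete with a generic constrained least-squares sketch (not ErasePro's exact objective or its layer-wise schedule): find the minimum-change update to a projection matrix so that target-concept keys map exactly to the anchor concept's outputs.

```python
# Hedged sketch of a zero-residual closed-form edit: the minimum-Frobenius-norm
# update to W subject to the hard constraint W @ K_tgt = V_anc, so the
# alignment residual is exactly zero (assumes K_tgt has full column rank).
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, m = 64, 32, 4                # m target prompts to erase
W_old = rng.normal(size=(d_out, d_in))
K_tgt = rng.normal(size=(d_in, m))        # target concept key embeddings
K_anc = rng.normal(size=(d_in, m))        # anchor concept key embeddings
V_anc = W_old @ K_anc                     # desired (harmless) outputs

delta = (V_anc - W_old @ K_tgt) @ np.linalg.pinv(K_tgt)
W_new = W_old + delta

print(np.abs(W_new @ K_tgt - V_anc).max())   # ~1e-13: the residual is zero
```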

[198] Revisiting Continual Semantic Segmentation with Pre-trained Vision Models

Duzhen Zhang, Yong Ren, Wei Cong, Junhao Zheng, Qiaoyi Su, Shuncheng Jia, Zhong-Zhi Li, Xuanle Zhao, Ye Bai, Feilong Chen, Qi Tian, Tielin Zhang

Main category: cs.CV

TL;DR: The paper challenges the assumption that Direct Fine-Tuning (DFT) in Continual Semantic Segmentation (CSS) suffers from severe forgetting, showing PVMs retain knowledge well. It proposes DFT*, a simple enhancement to DFT, outperforming complex methods.

DetailsMotivation: To reassess the anti-forgetting capabilities of Pre-trained Vision Models (PVMs) in CSS and improve DFT's performance.

Method: Systematically evaluates DFT on Pascal VOC 2012 and ADE20K benchmarks using ResNet101 and Swin-B backbones, then proposes DFT* with frozen backbones and classifiers.

Result: PVMs retain knowledge with minimal forgetting; DFT* outperforms 16 state-of-the-art CSS methods with fewer parameters and training time.

Conclusion: DFT* is a simple, effective solution for CSS, leveraging PVMs’ inherent anti-forgetting capabilities.

Abstract: Continual Semantic Segmentation (CSS) seeks to incrementally learn to segment novel classes while preserving knowledge of previously encountered ones. Recent advancements in CSS have been largely driven by the adoption of Pre-trained Vision Models (PVMs) as backbones. Among existing strategies, Direct Fine-Tuning (DFT), which sequentially fine-tunes the model across classes, remains the most straightforward approach. Prior work often regards DFT as a performance lower bound due to its presumed vulnerability to severe catastrophic forgetting, leading to the development of numerous complex mitigation techniques. However, we contend that this prevailing assumption is flawed. In this paper, we systematically revisit forgetting in DFT across two standard benchmarks, Pascal VOC 2012 and ADE20K, under eight CSS settings using two representative PVM backbones: ResNet101 and Swin-B. Through a detailed probing analysis, our findings reveal that existing methods significantly underestimate the inherent anti-forgetting capabilities of PVMs. Even under DFT, PVMs retain previously learned knowledge with minimal forgetting. Further investigation of the feature space indicates that the observed forgetting primarily arises from the classifier’s drift away from the PVM, rather than from degradation of the backbone representations. Based on this insight, we propose DFT*, a simple yet effective enhancement to DFT that incorporates strategies such as freezing the PVM backbone and previously learned classifiers, as well as pre-allocating future classifiers. Extensive experiments show that DFT* consistently achieves competitive or superior performance compared to sixteen state-of-the-art CSS methods, while requiring substantially fewer trainable parameters and less training time.
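
A minimal sketch of the DFT* recipe as summarized above — freeze the PVM backbone and previously learned classifiers, and train only the head for the current step's classes — might look as follows; the module names and demo model are illustrative, not the paper's code.

```python
# Sketch of the DFT* freezing strategy for one incremental step.
import torch.nn as nn

def prepare_step(model: nn.Module, old_heads: list, new_head: nn.Module):
    for p in model.backbone.parameters():   # freeze the pre-trained backbone
        p.requires_grad = False
    for head in old_heads:                  # freeze previously learned classifiers
        for p in head.parameters():
            p.requires_grad = False
    for p in new_head.parameters():         # train only the new classes' head
        p.requires_grad = True

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3)

model = Demo()
head_old = nn.Conv2d(16, 5, 1)   # classes from earlier steps
head_new = nn.Conv2d(16, 3, 1)   # classes introduced at this step
prepare_step(model, [head_old], head_new)
```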

[199] PKSS-Align: Robust Point Cloud Registration on Pre-Kendall Shape Space

Chenlei Lv, Hui Huang

Main category: cs.CV

TL;DR: A robust point cloud registration method, PKSS-Align, is proposed to handle similarity transformations, noise, and defects without requiring point-to-point metrics or training.

DetailsMotivation: Point cloud registration is sensitive to transformations, noise, and incomplete structures, often leading to local optima. A robust solution is needed.

Method: PKSS-Align measures shape feature-based similarity on Pre-Kendall shape space (PKSS), using a manifold metric robust to Euclidean variations. No training or complex encoding is required.

Result: The method outperforms state-of-the-art techniques, efficiently handling transformations, noise, and defects with parallel acceleration.

Conclusion: PKSS-Align is a practical, efficient, and robust solution for point cloud registration, addressing key challenges without training.

Abstract: Point cloud registration is a classical topic in the field of 3D Vision and Computer Graphics. Registration is typically sensitive to similarity transformations (translation, scaling, and rotation), noisy points, and incomplete geometric structures. In particular, non-uniform scales and defective parts of point clouds increase the probability of getting stuck in local optima during registration. In this paper, we propose a robust point cloud registration method, PKSS-Align, that can handle various influences, including similarity transformations, non-uniform densities, random noisy points, and defective parts. The proposed method measures shape feature-based similarity between point clouds on the Pre-Kendall shape space (PKSS), which is a shape measurement-based scheme and does not require a point-to-point or point-to-plane metric. The employed measurement can be regarded as a manifold metric that is robust to various representations in the Euclidean coordinate system. Benefiting from this measurement, the transformation matrix can be directly generated for point clouds subject to all of the mentioned influences at the same time. The proposed method does not require data training or complex feature encoding. Based on a simple parallel acceleration, it achieves significant improvements in efficiency and feasibility in practice. Experiments demonstrate that our method outperforms the relevant state-of-the-art methods.
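
The Kendall-style preprocessing that removes translation and scale is easy to sketch: center the cloud and project it onto the unit hypersphere (the "pre-shape"). This is generic Kendall machinery, not the paper's full PKSS measurement or alignment pipeline.

```python
# Pre-shape normalization: shapes that differ only by translation and uniform
# scaling map to the same point, so a metric between pre-shapes is invariant
# to those transformations.
import numpy as np

def preshape(points):
    centered = points - points.mean(axis=0)        # remove translation
    return centered / np.linalg.norm(centered)     # remove scale (Frobenius norm)

rng = np.random.default_rng(1)
cloud = rng.normal(size=(100, 3))
same_shape = 2.5 * cloud + np.array([10.0, -3.0, 4.0])  # scaled + translated copy
print(np.allclose(preshape(cloud), preshape(same_shape)))  # True
```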

[200] MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction

Yaopeng Lou, Liao Shen, Tianqi Liu, Jiaqi Li, Zihao Huang, Huiqiang Sun, Zhiguo Cao

Main category: cs.CV

TL;DR: MuRF is a novel view synthesis method integrating MVS and MDE features, using Gaussian splatting for efficient training and high-quality rendering, achieving top performance across diverse scenarios.

DetailsMotivation: To address the challenge of novel view synthesis under varying baseline settings, including sparse views with small or large baselines.

Method: Combines MVS and MDE features, introduces a projection-and-sampling mechanism for depth fusion, and uses 3D Gaussian representations for efficiency.

Result: State-of-the-art performance on DTU, RealEstate10K, and promising zero-shot results on LLFF and Mip-NeRF 360.

Conclusion: MuRF effectively generalizes across diverse baselines and scenarios, offering efficient and high-quality novel view synthesis.

Abstract: We present Multi-Baseline Gaussian Splatting (MuRF), a generalized feed-forward approach for novel view synthesis that effectively handles diverse baseline settings, including sparse input views with both small and large baselines. Specifically, we integrate features from Multi-View Stereo (MVS) and Monocular Depth Estimation (MDE) to enhance feature representations for generalizable reconstruction. Next, we propose a projection-and-sampling mechanism for deep depth fusion, which constructs a fine probability volume to guide the regression of the feature map. Furthermore, we introduce a reference-view loss to improve geometry and optimization efficiency. We leverage 3D Gaussian representations to accelerate training and inference time while enhancing rendering quality. MuRF achieves state-of-the-art performance across multiple baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K). We also demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets.
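
The "fine probability volume guiding regression" is reminiscent of the soft-argmin depth regression standard in MVS pipelines; a hedged sketch under that assumption (not necessarily MuGS's exact module):

```python
# Soft-argmin depth regression from a probability volume over depth hypotheses.
import torch

def soft_depth(cost_volume, depth_hypotheses):
    # cost_volume: (B, D, H, W) matching scores over D depth hypotheses
    # depth_hypotheses: (D,) candidate depths
    prob = torch.softmax(cost_volume, dim=1)                   # probability volume
    return (prob * depth_hypotheses.view(1, -1, 1, 1)).sum(1)  # (B, H, W) depth

vol = torch.randn(2, 64, 32, 32)
depths = torch.linspace(0.5, 10.0, 64)
print(soft_depth(vol, depths).shape)  # torch.Size([2, 32, 32])
```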

[201] Learning Robust Intervention Representations with Delta Embeddings

Panagiotis Alimisis, Christos Diou

Main category: cs.CV

TL;DR: The paper proposes Causal Delta Embeddings for representing interventions in causal representation learning, improving OOD robustness without extra supervision.

DetailsMotivation: To enhance model generalization and robustness by focusing on intervention representations in latent space, addressing a gap in current research.

Method: Proposes Causal Delta Embeddings, which are invariant to visual scenes and sparse in affecting causal variables, learned from image pairs without supervision.

Result: Demonstrates superior OOD robustness in synthetic and real-world benchmarks, outperforming baselines.

Conclusion: Causal Delta Embeddings effectively improve OOD robustness, offering a promising direction for causal representation learning.

Abstract: Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of interventions in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a framework that is capable of learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.
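
The core construction is easy to sketch: a delta embedding is the difference between latent codes of the end and start images, with an L1 penalty encouraging sparsity over causal factors. The encoder and loss weight below are placeholders, not the paper's architecture.

```python
# Hedged sketch of a Causal Delta Embedding with a sparsity penalty.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 32))  # toy encoder

def delta_embedding_loss(img_start, img_end, l1_weight=0.1):
    z0, z1 = encoder(img_start), encoder(img_end)
    delta = z1 - z0                        # the intervention's delta embedding
    sparsity = delta.abs().mean()          # few causal variables should change
    return delta, l1_weight * sparsity

pair = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
delta, loss = delta_embedding_loss(*pair)
print(delta.shape, loss.item())
```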

[202] Length Matters: Length-Aware Transformer for Temporal Sentence Grounding

Yifan Wang, Ziyi Liu, Xiaolong Sun, Jiawei Wang, Hongmin Liu

Main category: cs.CV

TL;DR: The paper introduces Length-Aware Transformer (LATR) for Temporal Sentence Grounding (TSG), improving query specialization by leveraging length priors to avoid redundant predictions.

DetailsMotivation: Current DETR-based TSG models suffer from overlapping query roles due to lack of explicit supervision, leading to redundancy.

Method: LATR divides queries into three groups (short, middle, long durations) and introduces a length classification task to suppress mismatched predictions.

Result: LATR achieves state-of-the-art performance on three public benchmarks, with ablation studies confirming its effectiveness.

Conclusion: Incorporating length priors into TSG via LATR enhances query specialization and overall performance.

Abstract: Temporal sentence grounding (TSG) is a highly challenging task aiming to localize the temporal segment within an untrimmed video corresponding to a given natural language description. Benefiting from the design of learnable queries, the DETR-based models have achieved substantial advancements in the TSG task. However, the absence of explicit supervision often causes the learned queries to overlap in roles, leading to redundant predictions. Therefore, we propose to improve TSG by making each query fulfill its designated role, leveraging the length priors of the video-description pairs. In this paper, we introduce the Length-Aware Transformer (LATR) for TSG, which assigns different queries to handle predictions based on varying temporal lengths. Specifically, we divide all queries into three groups, responsible for segments with short, middle, and long temporal durations, respectively. During training, an additional length classification task is introduced. Predictions from queries with mismatched lengths are suppressed, guiding each query to specialize in its designated function. Extensive experiments demonstrate the effectiveness of our LATR, achieving state-of-the-art performance on three public benchmarks. Furthermore, the ablation studies validate the contribution of each component of our method and the critical role of incorporating length priors into the TSG task.
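
The length-grouping mechanism can be sketched directly: each query belongs to a short/middle/long group, and a prediction is suppressed when its segment length falls outside its group's range. The thresholds below are illustrative assumptions, not the paper's values.

```python
# Sketch of suppressing predictions whose length mismatches the query's role.
import torch

BOUNDS = {"short": (0.0, 0.15), "middle": (0.15, 0.45), "long": (0.45, 1.0)}

def suppress_mismatched(spans, groups):
    # spans: (Q, 2) normalized (start, end); groups: list of Q group names
    keep = []
    for (s, e), g in zip(spans.tolist(), groups):
        lo, hi = BOUNDS[g]
        keep.append(lo <= (e - s) < hi)    # query must match its length role
    return torch.tensor(keep)

spans = torch.tensor([[0.1, 0.2], [0.1, 0.9], [0.2, 0.5]])
groups = ["short", "middle", "long"]
print(suppress_mismatched(spans, groups))  # tensor([ True, False, False])
```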

[203] RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Shuchang Lyu, Baoyuan Wu, Guangliang Cheng

Main category: cs.CV

TL;DR: RAIDX is a new deepfake detection framework combining RAG and GRPO for improved accuracy and explainability, outperforming existing methods.

DetailsMotivation: Address ethical risks of AI-generated imagery by enhancing transparency and accuracy in deepfake detection.

Method: Integrates Retrieval-Augmented Generation (RAG) for external knowledge and Group Relative Policy Optimization (GRPO) for fine-grained explanations.

Result: Achieves state-of-the-art performance in detection and provides interpretable rationales via text and saliency maps.

Conclusion: RAIDX successfully bridges gaps in accuracy and explainability, setting a new standard for deepfake detection frameworks.

Abstract: The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face-specific detectors or general AI-generated-image detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX’s effectiveness in identifying real or fake images, and in providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

[204] A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks

Kun Gui, Hongliang Ren, Shang Shi, Jin Lu, Changqiu Yu, Quanjun Cao, Guomin Gu, Qi Xuan

Main category: cs.CV

TL;DR: The paper proposes MAEPD, a foundational model for DAS signal recognition using a Masked Autoencoder, pretrained on diverse datasets. It employs Visual Prompt Tuning (VPT) for efficient downstream tasks, achieving high accuracy with minimal fine-tuning.

DetailsMotivation: Address data distribution disparities and limited labeled training data in DAS applications, improving cross-domain generalization for AI models.

Method: Pretrain MAEPD on 635,860 diverse DAS signal samples using self-supervised mask reconstruction. Use VPT for downstream tasks, freezing backbone and fine-tuning only visual prompt vectors.

Result: Achieves 96.94% accuracy in gait recognition with 0.322% fine-tuned parameters, outperforming Full Fine Tuning by 0.61% and reducing training time by 45%. Robust in pipeline leakage detection.

Conclusion: MAEPD offers a scalable, efficient solution for DAS signal recognition, enhancing generalization and reducing reliance on labeled data.

Abstract: Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization and facing a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model. This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.
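
VPT-Deep itself is straightforward to sketch: freeze a Transformer block and prepend a few learnable prompt tokens to its input at every encoder layer. The dimensions are illustrative, and MAEPD's actual backbone is a Masked Autoencoder rather than this toy layer.

```python
# Sketch of a VPT-Deep-style encoder layer: frozen block, trainable prompts.
import torch
import torch.nn as nn

class VPTEncoderLayer(nn.Module):
    def __init__(self, dim=256, n_prompts=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, dim))  # trainable
        for p in self.block.parameters():                            # frozen
            p.requires_grad = False

    def forward(self, tokens):                 # tokens: (B, N, dim)
        b = tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), tokens], dim=1)
        x = self.block(x)
        return x[:, self.prompts.size(1):]     # drop prompt outputs, keep tokens

layer = VPTEncoderLayer()
print(layer(torch.randn(4, 100, 256)).shape)   # torch.Size([4, 100, 256])
```

Only the prompt vectors (and a task head) receive gradients, which is how the 0.322% trainable-parameter figure above arises.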

[205] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, Bo Zhang

Main category: cs.CV

TL;DR: TempFlow-GRPO improves reinforcement learning for flow-based text-to-image generation by addressing temporal uniformity, introducing trajectory branching and noise-aware weighting for better performance.

DetailsMotivation: Existing flow matching models lack effective reinforcement learning integration for human preference alignment due to uniform credit assignment and sparse rewards.

Method: TempFlow-GRPO uses trajectory branching for process rewards and noise-aware weighting to optimize policy based on timestep impact.

Result: Achieves state-of-the-art performance in human preference alignment and text-to-image benchmarks.

Conclusion: TempFlow-GRPO’s temporally-aware optimization enhances flow-based generation by aligning rewards with generative dynamics.

Abstract: Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.
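
A loose sketch of the noise-aware weighting idea, under illustrative assumptions (the weights, shapes, and single-step update below are stand-ins, not the paper's exact scheme): per-timestep log-probabilities are weighted by the noise level so that early, high-noise steps dominate the policy update.

```python
# Toy noise-weighted policy-gradient step over a denoising trajectory.
import torch

T = 10
sigmas = torch.linspace(1.0, 0.05, T)          # noise schedule, high -> low
weights = sigmas / sigmas.sum()                # emphasize early, noisy steps
logp = torch.randn(T, requires_grad=True)      # per-step log-probs (stand-in)
advantage = 0.7                                # group-relative advantage (GRPO)

loss = -(advantage * (weights * logp).sum())   # temporally weighted objective
loss.backward()
print(weights[:3], weights[-3:])               # early steps weigh most
```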

[206] RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization

Yanyan Li, Ze Yang, Keisuke Tateno, Federico Tombari, Liang Zhao, Gim Hee Lee

Main category: cs.CV

TL;DR: RiemanLine is a minimal 3D line representation on Riemannian manifolds, accommodating both individual lines and parallel-line groups, improving accuracy and reducing parameter dimensionality.

DetailsMotivation: Existing representations ignore structural regularities like parallel lines common in man-made environments, limiting efficiency and accuracy in camera localization and mapping.

Method: Decouples lines into global (shared vanishing direction on S²) and local (scaled normal vectors) components, reducing parameter space for parallel lines. Integrated into a factor graph for unified optimization.

Result: Achieves more accurate pose estimation and line reconstruction, reduces parameter dimensionality, and improves convergence stability on benchmarks.

Conclusion: RiemanLine provides a compact, efficient representation for 3D lines, enhancing performance in structured environments.

Abstract: Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces RiemanLine, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere $\mathcal{S}^2$, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For $n$ parallel lines, the proposed representation reduces the parameter space from $4n$ (orthonormal form) to $2n+2$, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment. Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.
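
To make the dimensionality claim concrete, here is the parameter count for $n$ parallel lines under the decomposition stated in the abstract (a worked example, not a derivation from the paper):

```latex
% Orthonormal form: 4 parameters per line.
% RiemanLine: one shared vanishing direction on S^2 (2 parameters)
% plus one scaled normal vector per line (2 parameters each).
\underbrace{4n}_{\text{orthonormal form}}
\;\longrightarrow\;
\underbrace{2 + 2n}_{\text{shared direction on } \mathcal{S}^2 \;+\; \text{per-line scaled normals}},
\qquad n = 10:\; 40 \rightarrow 22 \text{ parameters.}
```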

[207] X-SAM: From Segment Anything to Any Segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang

Main category: cs.CV

TL;DR: X-SAM is a Multimodal Large Language Model (MLLM) framework that enhances pixel-level perceptual understanding, unifying segmentation tasks and introducing Visual GrounDed (VGD) segmentation.

DetailsMotivation: Addressing the limitations of LLMs in pixel-level perception and SAM's shortcomings in multi-mask prediction and category-specific segmentation.

Method: Proposes X-SAM, a unified MLLM framework with VGD segmentation and a co-training strategy for diverse datasets.

Result: Achieves state-of-the-art performance on various segmentation benchmarks.

Conclusion: X-SAM advances multimodal, pixel-level visual understanding and segmentation tasks.

Abstract: Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from “segment anything” to “any segmentation”. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

[208] RotatedMVPS: Multi-view Photometric Stereo with Rotated Natural Light

Songyun Yang, Yufei Han, Jilong Zhang, Kongming Liang, Peng Yu, Zhaowei Qu, Heng Guo

Main category: cs.CV

TL;DR: RotatedMVPS improves shape and reflectance recovery under natural light by leveraging light consistency and integrating single-view priors.

DetailsMotivation: Existing MVPS methods are limited by controlled settings or neglect reflectance recovery, restricting their use in natural illumination and inverse rendering.

Method: Proposes RotatedMVPS, ensuring light consistency across poses and integrating single-view priors for enhanced accuracy.

Result: Demonstrates effectiveness on synthetic and real-world datasets.

Conclusion: RotatedMVPS advances MVPS by addressing natural light challenges and improving recovery accuracy.

Abstract: Multiview photometric stereo (MVPS) seeks to recover high-fidelity surface shapes and reflectances from images captured under varying views and illuminations. However, existing MVPS methods often require controlled darkroom settings for varying illuminations or overlook the recovery of reflectances and illuminations properties, limiting their applicability in natural illumination scenarios and downstream inverse rendering tasks. In this paper, we propose RotatedMVPS to solve shape and reflectance recovery under rotated natural light, achievable with a practical rotation stage. By ensuring light consistency across different camera and object poses, our method reduces the unknowns associated with complex environment light. Furthermore, we integrate data priors from off-the-shelf learning-based single-view photometric stereo methods into our MVPS framework, significantly enhancing the accuracy of shape and reflectance recovery. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our approach.

[209] YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring

Akhil Saketh Reddy Sabbella, Ch. Lakshmi Prachothan, Eswar Kumar Panta

Main category: cs.CV

TL;DR: The paper proposes an AI-based system using YOLO v8 for real-time detection of chicken illnesses through behavior and appearance analysis, improving farm management and biosecurity.

DetailsMotivation: Manual observation in poultry is labor-intensive and error-prone, leading to financial losses. Automating illness detection can enhance efficiency and accuracy.

Method: Utilizes YOLO v8, a deep learning model, trained on a large annotated dataset to analyze high-resolution chicken photos for illness signs.

Result: Achieves accurate real-time identification of infected chickens, enabling prompt warnings and early intervention.

Conclusion: The AI system enhances chicken health management by automating detection, reducing human inspection, and improving biosecurity in large-scale farms.

Abstract: In the poultry industry, detecting chicken illnesses is essential to avoid financial losses. Conventional techniques depend on manual observation, which is laborious and prone to mistakes. This study proposes an AI-based approach using YOLO v8, a deep learning model for real-time object recognition: by developing a system that analyzes high-resolution chicken photos, YOLO v8 detects signs of illness, such as abnormalities in behavior and appearance. A sizable, annotated dataset has been used to train the algorithm, which provides accurate real-time identification of infected chickens and timely warnings to farm operators for prompt action. By facilitating early infection identification, eliminating the need for human inspection, and enhancing biosecurity in large-scale farms, this AI technology improves chicken health management. The real-time capabilities of YOLO v8 provide a scalable and effective method for improving farm management practices.
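
Inference with YOLOv8 is a few lines with the Ultralytics API; the fine-tuned weights file, input image, and class names below are hypothetical stand-ins for the paper's poultry-disease model, while the API calls themselves are standard Ultralytics.

```python
# Minimal sketch of real-time illness detection with the Ultralytics YOLOv8 API.
from ultralytics import YOLO

model = YOLO("poultry_disease_yolov8.pt")      # hypothetical fine-tuned weights
results = model.predict("barn_camera_frame.jpg", conf=0.5)

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]       # e.g. "healthy" or "sick" (assumed)
    print(f"{cls_name}: confidence {float(box.conf):.2f}")
    if cls_name != "healthy":
        print("-> alert farm operator for early intervention")
```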

[210] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Jinglin Xu, Hao Sun

Main category: cs.CV

TL;DR: TSPO uses reinforcement learning to optimize frame sampling in MLLMs for long videos, improving event capture and understanding.

DetailsMotivation: MLLMs struggle with long videos due to context limits and inefficient frame sampling, missing critical events.

Method: Proposes TSPO: a trainable event-aware agent and RL paradigm for joint keyframe selection and language generation, with rule-based rewards.

Result: TSPO achieves state-of-the-art performance on long video benchmarks and transfers well across Video-MLLMs.

Conclusion: TSPO effectively addresses long video challenges in MLLMs, enhancing event understanding and performance.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs’ context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models’ event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs’ long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, to train TSPO, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and transfers well across different cutting-edge Video-MLLMs.
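
A hedged sketch of the probabilistic keyframe selection idea: an event-aware scorer induces a distribution over frames, keyframes are sampled, and a rule-based terminal reward reinforces the sampling policy. This compresses TSPO's joint optimization into a toy single-step REINFORCE update with a made-up reward.

```python
# Toy policy-gradient update for probabilistic keyframe sampling.
import torch

n_frames, k = 256, 8
scores = torch.zeros(n_frames, requires_grad=True)       # event-query logits
opt = torch.optim.SGD([scores], lr=0.1)

dist = torch.distributions.Categorical(logits=scores)
frames = dist.sample((k,))                               # sampled keyframes
# Stand-in rule-based reward: pretend the relevant event spans frames 100-130.
reward = float(((frames > 100) & (frames < 130)).float().mean())

loss = -reward * dist.log_prob(frames).sum()             # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```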

[211] HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Young D. Kwon, Rui Li, Sijia Li, Da Li, Sourav Bhattacharya, Stylianos I. Venieris

Main category: cs.CV

TL;DR: HierarchicalPrune compresses billion-scale text-to-image diffusion models for on-device inference by pruning less essential blocks, preserving semantic integrity, and adjusting knowledge transfer, achieving significant memory and latency reductions with minimal quality loss.

DetailsMotivation: Massive parameter scales (8-11B) in diffusion models hinder on-device inference due to resource constraints.

Method: Combines Hierarchical Position Pruning, Positional Weight Preservation, and Sensitivity-Guided Distillation to compress models.

Result: 77.5-80.4% memory reduction, 27.9-38.0% latency reduction, and minimal quality drop (2.6% GenEval, 7% HPSv2).

Conclusion: HierarchicalPrune enables efficient on-device inference while maintaining perceptual quality, outperforming prior methods.

Abstract: State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inference on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with a minimal drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.
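
For reference, the reported memory numbers are self-consistent: 15.8 GB down to 3.2 GB is a 79.7% reduction, inside the stated 77.5-80.4% range. The position-aware pruning itself can be sketched as dropping the tail of the block stack; the keep-fraction and demo blocks below are illustrative, not the paper's criterion for which blocks are "less essential".

```python
# Toy sketch of position-aware block pruning: later (texture-refining) blocks
# are candidates for removal, early (semantic) blocks are kept.
import torch.nn as nn

def prune_later_blocks(blocks: nn.ModuleList, keep_fraction=0.7):
    n_keep = max(1, int(len(blocks) * keep_fraction))
    return nn.ModuleList(list(blocks)[:n_keep])   # drop the tail of the stack

blocks = nn.ModuleList(nn.Linear(64, 64) for _ in range(20))
print(len(prune_later_blocks(blocks)))  # 14 of 20 blocks kept
```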

[212] VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones

Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu

Main category: cs.CV

TL;DR: VisionTS++ bridges gaps in cross-modal transfer from vision to time series forecasting with innovations in data filtering, multivariate conversion, and probabilistic forecasting, achieving state-of-the-art results.

DetailsMotivation: To address challenges in transferring vision models to time series forecasting due to data-modality, multivariate-forecasting, and probabilistic-forecasting gaps.

Method: Proposes VisionTS++ with three innovations: vision-model-based filtering, colorized multivariate conversion, and multi-quantile forecasting.

Result: Achieves SOTA results, outperforming specialized models by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings.

Conclusion: Establishes a new paradigm for cross-modal knowledge transfer, advancing universal time series foundation models.

Abstract: Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) multivariate-forecasting gap between standard RGB three-channel-based vision models and the need to model time series with arbitrary numbers of variates; and (3) probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a vision-model-based TSFM that performs continual pre-training on large-scale time series datasets, including three innovations: (1) a vision-model-based filtering mechanism to identify high-quality time series data, thereby mitigating modality gap and improving pre-training stability, (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts of different quantile levels, thus more flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, VisionTS++ achieves SOTA results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.
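
The multi-quantile head can be illustrated with the standard pinball (quantile) loss, the usual way parallel heads emit calibrated quantile levels; the head structure and quantile levels below are illustrative assumptions, not VisionTS++'s exact configuration.

```python
# Multi-quantile forecasting with the pinball loss, one parallel head per level.
import torch

def pinball_loss(pred, target, q):
    # pred/target: (B, H) forecasts over horizon H; q: scalar quantile level
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()

quantiles = [0.1, 0.5, 0.9]
preds = {q: torch.randn(4, 24) for q in quantiles}   # stand-in head outputs
target = torch.randn(4, 24)
loss = sum(pinball_loss(preds[q], target, q) for q in quantiles)
print(loss.item())
```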

[213] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang

Main category: cs.CV

TL;DR: VITAL is a tool-augmented learning framework for video reasoning, improving cross-modal interaction and reducing hallucination in MLLMs. It uses multimodal CoT and a visual toolbox for precise reasoning, validated on 11 benchmarks.

DetailsMotivation: Addressing limited cross-modal interaction and hallucination in text-based CoT reasoning for MLLMs, especially in long videos.

Method: Proposes VITAL with a visual toolbox for dense frame sampling and multimodal CoT. Introduces datasets MTVR-CoT-72k and MTVR-RL-110k, and DGRPO for multi-task RL.

Result: Outperforms existing methods in video QA and temporal grounding, especially for long videos.

Conclusion: VITAL advances video reasoning in MLLMs, with public release of code, data, and models.

Abstract: The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. All code, data, and model weights will be made publicly available.

[214] Efficient Inter-Task Attention for Multitask Transformer Models

Christian Bohn, Thomas Kurbiel, Klaus Friedrichs, Hasan Tercan, Tobias Meisen

Main category: cs.CV

TL;DR: Proposes Deformable Inter-Task Self-Attention for efficient multitask learning, reducing FLOPs and latency while improving task performance.

DetailsMotivation: Transformers' computational inefficiency in multitask learning due to quadratic scaling of attention matrices with task numbers.

Method: Novel Deformable Inter-Task Self-Attention mechanism for efficient cross-task information aggregation.

Result: Order-of-magnitude reduction in FLOPs and latency; up to 7.4% improvement in task metrics.

Conclusion: The proposed method effectively addresses computational limitations while enhancing multitask performance.

Abstract: In both Computer Vision and the wider Deep Learning field, the Transformer architecture is well-established as state-of-the-art for many applications. For Multitask Learning, however, where there may be many more queries necessary compared to single-task models, its Multi-Head-Attention often approaches the limits of what is computationally feasible considering practical hardware limitations. This is due to the fact that the size of the attention matrix scales quadratically with the number of tasks (assuming roughly equal numbers of queries for all tasks). As a solution, we propose our novel Deformable Inter-Task Self-Attention for Multitask models that enables the much more efficient aggregation of information across the feature maps from different tasks. In our experiments on the NYUD-v2 and PASCAL-Context datasets, we demonstrate an order-of-magnitude reduction in both FLOPs count and inference latency. At the same time, we also achieve substantial improvements of up to 7.4% in the individual tasks’ prediction quality metrics.
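
The scaling argument is simple to verify with back-of-envelope arithmetic: full self-attention over T tasks with Q queries each scores (TQ)² pairs, while deformable attention that samples a fixed K points per query needs only TQ·K. The numbers below are illustrative, not measurements from the paper.

```python
# Back-of-envelope comparison: full vs. deformable inter-task attention cost.
def attention_scores(n_tasks, queries_per_task, sample_points=None):
    q = n_tasks * queries_per_task
    return q * q if sample_points is None else q * sample_points

for t in (2, 4, 8):
    full = attention_scores(t, 1024)
    deform = attention_scores(t, 1024, sample_points=8)
    print(f"{t} tasks: full={full:,} vs deformable={deform:,}")
# 8 tasks: full=67,108,864 vs deformable=65,536 -- three orders of magnitude
```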

[215] Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: The paper introduces Composed Object Retrieval (COR), a new task for object-level precision in multi-modal retrieval, addressing limitations of current Composed Image Retrieval (CIR) methods. It proposes CORE, a unified model, and COR127K, a large-scale benchmark, showing superior performance.

DetailsMotivation: Current CIR methods lack object-level precision, limiting their ability to localize specific objects. COR aims to overcome this by enabling retrieval and segmentation of target objects based on composed expressions.

Method: The paper proposes CORE, an end-to-end model integrating reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. It also introduces COR127K, a benchmark with 127,166 retrieval triplets.

Result: CORE outperforms existing models in base and novel categories, demonstrating effectiveness for fine-grained multi-modal retrieval.

Conclusion: COR and CORE establish a new direction for fine-grained retrieval, offering a simple yet effective baseline for future research.

Abstract: Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research.

[216] Benchmarking Foundation Models for Mitotic Figure Classification

Jonas Ammeling, Jonathan Ganz, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville

Main category: cs.CV

TL;DR: Foundation models, adapted with LoRA, outperform linear probing and nearly match full-data performance with only 10% training data, while also improving robustness to unseen tumor domains.

DetailsMotivation: Mitotic figure classification is crucial for tumor grading, but labeled data is limited. Self-supervised foundation models can leverage unlabeled data to address this.

Method: Evaluate data scaling laws and robustness of foundation models for mitotic classification, comparing LoRA-adapted models, linear probing, and traditional CNNs/Vision Transformers.

Result: LoRA-adapted models achieve near-full-data performance with 10% training data and reduce the out-of-domain performance gap. Traditional fine-tuning remains competitive.

Conclusion: LoRA-adapted foundation models offer efficient, high-performance solutions for mitotic classification, especially in low-data scenarios, though traditional methods still hold value.

Abstract: The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited data problem by providing semantically rich feature vectors that can generalize well to new tasks with minimal training effort increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. Next to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA-adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
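
LoRA adaptation of a ViT-style backbone is a few lines with the Hugging Face peft library; the backbone choice and target module names below are assumptions that must match the actual foundation model's attention projections (a generic sketch, not the paper's setup).

```python
# Sketch of LoRA-adapting a ViT's attention for binary mitotic-figure
# classification; "qkv" is the fused attention projection in timm ViTs.
import timm
from peft import LoraConfig, get_peft_model

backbone = timm.create_model("vit_base_patch16_224",
                             pretrained=False, num_classes=2)
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["qkv"],        # low-rank adapters on attention only
    modules_to_save=["head"],      # classification head stays trainable
)
model = get_peft_model(backbone, config)
model.print_trainable_parameters()  # adapters + head; backbone stays frozen
```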

[217] Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion

Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su

Main category: cs.CV

TL;DR: The paper introduces a self-improvement framework for Large Vision-Language Models (LVLMs) using a visual knowledge-intensive task called CVC to enhance visual perception and reasoning.

DetailsMotivation: LVLMs lack deep visual perception due to scarce visual knowledge in instruction-tuning corpora, limiting their ability to identify subtle image differences.

Method: The framework uses an automated pipeline to generate CVC tasks, where LVLMs infer masked objects based on causal relationships, enabling self-improvement through trial and error.

Result: Experiments show significant improvements, with average gains of 5.4% and 4.0% on specialized tasks using LLaVA-1.5-7B and LLaVA-1.5-13B, respectively.

Conclusion: The CVC framework effectively enhances LVLMs’ visual perception and reasoning, demonstrating broad applicability across benchmarks.

Abstract: Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.

[218] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation

Shuzhou Yang, Xiaodong Cun, Xiaoyu Li, Yaowei Li, Jian Zhang

Main category: cs.CV

TL;DR: 4DVD is a cascaded video diffusion model that decouples 4D content generation into layout prediction and conditional generation, achieving high-quality results.

DetailsMotivation: Directly generating high-dimensional data like 4D is complex; 4DVD simplifies this by decoupling tasks.

Method: 4DVD uses a two-step process: coarse multi-view layout generation and structure-aware conditional generation, unified for consistency.

Result: The model achieves state-of-the-art performance in novel view synthesis and 4D generation, validated on the D-Objaverse dataset.

Conclusion: 4DVD enables accurate 4D representation and practical applications, outperforming previous methods.

Abstract: Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of the input monocular video to generate final high-quality dense-view videos. Benefiting from this, explicit 4D representations (such as 4D Gaussians) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is https://4dvd.github.io/

[219] QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

Bowen Chai, Zheng Chen, Libo Zhu, Wenbo Li, Yong Guo, Yulun Zhang

Main category: cs.CV

TL;DR: QuantVSR introduces a low-bit quantization model for video super-resolution (VSR) using spatio-temporal complexity awareness and learnable bias alignment to maintain performance while reducing resource usage.

DetailsMotivation: Diffusion models for VSR are resource-intensive and slow, limiting practical deployment. Quantization can help but is challenging due to temporal dynamics and fidelity needs.

Method: Proposes STCA mechanism to measure spatial/temporal complexities and allocate layer-specific ranks. Uses LBA module to reduce quantization errors. Jointly refines FP and low-bit branches.

Result: Outperforms leading low-bit quantization methods and matches full-precision model performance on synthetic and real-world datasets.

Conclusion: QuantVSR effectively balances performance and efficiency, making VSR models more practical for deployment.

Abstract: Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: https://github.com/bowenchai/QuantVSR.
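
As a hedged sketch of the complexity-aware rank allocation idea (not QuantVSR's exact STCA rule), the snippet below assigns larger low-rank auxiliary branches to layers whose calibration-set complexity score is higher; the complexity scores, budget, and proportional rule are invented for illustration.

    import numpy as np

    def allocate_ranks(complexities, total_rank_budget, min_rank=2):
        c = np.asarray(complexities, dtype=float)
        share = c / c.sum()                      # normalize complexity scores
        ranks = np.maximum(min_rank, np.round(share * total_rank_budget))
        return ranks.astype(int)                 # rounding may drift from the budget;
                                                 # a real scheme would renormalize

    # Hypothetical per-layer complexity measured on a calibration set.
    layer_complexity = [0.9, 2.1, 4.0, 1.5]
    print(allocate_ranks(layer_complexity, total_rank_budget=32))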

[220] MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos

Daisheng Jin, Ying He

Main category: cs.CV

TL;DR: MonoCloth reconstructs and animates clothed human avatars from monocular videos using a part-based decomposition strategy and a dedicated cloth simulation module.

DetailsMotivation: The challenge lies in limited geometric information and complex non-rigid motion in monocular videos, requiring a method to overcome these limitations.

Method: The approach decomposes the avatar into body, face, hands, and clothing, with detailed geometry recovery for face/hands and a cloth simulation module for garments.

Result: MonoCloth improves visual reconstruction quality and animation realism, and supports tasks like clothing transfer.

Conclusion: MonoCloth’s part-based design enhances versatility and practical utility for 3D human avatar reconstruction.

Abstract: Reconstructing realistic 3D human avatars from monocular videos is a challenging task due to the limited geometric information and complex non-rigid motion involved. We present MonoCloth, a new method for reconstructing and animating clothed human avatars from monocular videos. To overcome the limitations of monocular input, we introduce a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing. This design reflects the varying levels of reconstruction difficulty and deformation complexity across these components. Specifically, we focus on detailed geometry recovery for the face and hands. For clothing, we propose a dedicated cloth simulation module that captures garment deformation using temporal motion cues and geometric constraints. Experimental results demonstrate that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Furthermore, thanks to its part-based design, MonoCloth also supports additional tasks such as clothing transfer, underscoring its versatility and practical utility.

[221] Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation

Uzay Gökay, Federico Spurio, Dominik R. Bach, Juergen Gall

Main category: cs.CV

TL;DR: Proposes an unsupervised method for skeleton-based temporal action segmentation using a sequence-to-sequence autoencoder and motion word quantization, outperforming current state-of-the-art methods.

DetailsMotivation: Existing unsupervised methods focus on video data, neglecting skeleton sequences, which are robust and privacy-preserving. Annotated data for supervised methods is costly.

Method: Uses a temporal autoencoder to disentangle joint information, divides latent sequences into patches, and quantizes them into motion words for action clustering.

Result: Outperforms state-of-the-art unsupervised methods on HuGaDB, LARa, and BABEL datasets.

Conclusion: The approach effectively segments actions in skeleton data without annotations, demonstrating superior performance.

Abstract: Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at https://github.com/bachlab/SMQ.
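
A minimal sketch of the patch-and-quantize step described above, with k-means standing in for the quantizer; the patch length, vocabulary size, and latent dimensions are illustrative, and the paper's autoencoder and clustering differ in detail.

    import numpy as np
    from sklearn.cluster import KMeans

    def motion_words(latents, patch_len=8, vocab_size=16, seed=0):
        # latents: (T, D) latent sequence from the temporal autoencoder
        T, D = latents.shape
        n = T // patch_len
        patches = latents[: n * patch_len].reshape(n, patch_len * D)
        km = KMeans(n_clusters=vocab_size, n_init=10, random_state=seed)
        return km.fit_predict(patches)           # one motion-word id per patch

    latents = np.random.randn(256, 32)           # dummy latent sequence
    print(motion_words(latents)[:10])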

[222] No Masks Needed: Explainable AI for Deriving Segmentation from Classification

Mosong Ma, Tania Stathaki, Michalis Lazarou

Main category: cs.CV

TL;DR: A novel method fine-tunes pre-trained models for medical image segmentation, integrating Explainable AI for better accuracy and relevance scores.

DetailsMotivation: Unsupervised segmentation methods from computer vision don't perform well in medical imaging, prompting the need for a specialized approach.

Method: Fine-tunes pre-trained models for medical images and uses Explainable AI to generate relevance scores for enhanced segmentation.

Result: Achieves improved segmentation accuracy on medical datasets like CBIS-DDSM, NuInsSeg, and Kvasir-SEG.

Conclusion: The proposed method outperforms traditional approaches in medical image segmentation, demonstrating its effectiveness for healthcare applications.

Abstract: Medical image segmentation is vital for modern healthcare and is a key element of computer-aided diagnosis. While recent advancements in computer vision have explored unsupervised segmentation using pre-trained models, these methods have not been translated well to the medical imaging domain. In this work, we introduce a novel approach that fine-tunes pre-trained models specifically for medical images, achieving accurate segmentation with extensive processing. Our method integrates Explainable AI to generate relevance scores, enhancing the segmentation process. Unlike traditional methods that excel in standard benchmarks but falter in medical applications, our approach achieves improved results on datasets like CBIS-DDSM, NuInsSeg and Kvasir-SEG.
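
The abstract does not spell out how relevance scores become masks; one common minimal recipe is to threshold the per-pixel relevance map produced for the predicted class, sketched below with a synthetic map and a hypothetical quantile.

    import numpy as np

    def relevance_to_mask(relevance, quantile=0.8):
        # relevance: (H, W) non-negative relevance scores for the target class
        return (relevance >= np.quantile(relevance, quantile)).astype(np.uint8)

    rel = np.random.rand(128, 128)               # stand-in relevance map
    mask = relevance_to_mask(rel)
    print(mask.sum(), "foreground pixels")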

[223] TopKD: Top-scaled Knowledge Distillation

Qi Wang, Jinjia Zhou

Main category: cs.CV

TL;DR: The paper revisits logit-based knowledge distillation, introducing TopKD, a framework that enhances distillation by focusing on Top-K knowledge through adaptive scaling and decoupled loss.

DetailsMotivation: Existing knowledge distillation methods often overlook critical information in teacher logit distributions, particularly Top-K knowledge.

Method: Proposes TopKD with two components: Top-K Scaling Module (TSM) for amplifying informative logits and Top-K Decoupled Loss (TDL) for targeted supervision.

Result: TopKD outperforms state-of-the-art methods on multiple datasets (CIFAR-100, ImageNet, etc.) and works well with Vision Transformers.

Conclusion: Logits hold significant potential for advancing knowledge distillation, as demonstrated by TopKD’s effectiveness and versatility.

Abstract: Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher’s logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods. Moreover, our method demonstrates substantial effectiveness when distilling Vision Transformers, underscoring its versatility across diverse network architectures. These findings highlight the significant potential of logits to advance knowledge distillation.
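
To illustrate the Top-K idea (not TopKD's exact TSM/TDL formulation), the sketch below amplifies the teacher's K largest logits before a standard temperature-softened KL distillation loss; K, the scaling factor, and the temperature are placeholder values.

    import torch
    import torch.nn.functional as F

    def topk_kd_loss(student_logits, teacher_logits, k=5, gamma=2.0, tau=4.0):
        # Amplify the teacher's K most informative logits before distillation.
        scaled = teacher_logits.clone()
        topk = teacher_logits.topk(k, dim=-1).indices
        scaled.scatter_(-1, topk, teacher_logits.gather(-1, topk) * gamma)
        p_t = F.softmax(scaled / tau, dim=-1)
        log_p_s = F.log_softmax(student_logits / tau, dim=-1)
        return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

    s, t = torch.randn(8, 100), torch.randn(8, 100)
    print(topk_kd_loss(s, t).item())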

[224] InceptoFormer: A Multi-Signal Neural Framework for Parkinson’s Disease Severity Evaluation from Gait

Safwen Naimi, Arij Said, Wassim Bouachir, Guillaume-Alexandre Bilodeau

Main category: cs.CV

TL;DR: InceptoFormer is a neural framework combining Inception1D and Transformer for Parkinson’s Disease severity evaluation via gait analysis, achieving 96.6% accuracy.

DetailsMotivation: To improve PD severity assessment by capturing multi-scale temporal features and addressing class imbalance.

Method: Uses Inception1D for multi-scale feature extraction and Transformer for long-range dependencies, with oversampling for class imbalance.

Result: Achieves 96.6% accuracy, outperforming existing methods.

Conclusion: InceptoFormer effectively evaluates PD severity by analyzing gait dynamics, offering a robust solution.

Abstract: We present InceptoFormer, a multi-signal neural framework designed for Parkinson’s Disease (PD) severity evaluation via gait dynamics analysis. Our architecture introduces a 1D adaptation of the Inception model, which we refer to as Inception1D, along with a Transformer-based framework to stage PD severity according to the Hoehn and Yahr (H&Y) scale. The Inception1D component captures multi-scale temporal features by employing parallel 1D convolutional filters with varying kernel sizes, thereby extracting features across multiple temporal scales. The transformer component efficiently models long-range dependencies within gait sequences, providing a comprehensive understanding of both local and global patterns. To address the issue of class imbalance in PD severity staging, we propose a data structuring and preprocessing strategy based on oversampling to enhance the representation of underrepresented severity levels. The overall design enables to capture fine-grained temporal variations and global dynamics in gait signal, significantly improving classification performance for PD severity evaluation. Through extensive experimentation, InceptoFormer achieves an accuracy of 96.6%, outperforming existing state-of-the-art methods in PD severity assessment. The source code for our implementation is publicly available at https://github.com/SafwenNaimi/InceptoFormer
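
A minimal sketch of an Inception-style 1D block as described above: parallel temporal convolutions with different kernel sizes whose outputs are concatenated along channels; the channel counts and kernel sizes are illustrative, not InceptoFormer's.

    import torch
    import torch.nn as nn

    class Inception1D(nn.Module):
        def __init__(self, in_ch, branch_ch=16, kernels=(3, 7, 15)):
            super().__init__()
            # Odd kernels with k // 2 padding preserve the temporal length.
            self.branches = nn.ModuleList(
                nn.Conv1d(in_ch, branch_ch, k, padding=k // 2) for k in kernels
            )

        def forward(self, x):                    # x: (batch, channels, time)
            return torch.cat([b(x) for b in self.branches], dim=1)

    gait = torch.randn(4, 18, 1000)              # e.g. 18 gait sensor channels
    print(Inception1D(18)(gait).shape)           # -> (4, 48, 1000)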

[225] Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

Minghang Zheng, Yuxin Peng, Benyuan Sun, Yi Yang, Yang Liu

Main category: cs.CV

TL;DR: The paper introduces a hierarchical event memory framework for online video temporal grounding (OnVTG) to improve event modeling and retain long-term historical information, achieving state-of-the-art results.

DetailsMotivation: Existing OnVTG models lack effective event modeling and long-term memory, leading to low performance in locating events within streaming videos.

Method: Proposes a hierarchical event memory framework with event-based predictions and a future prediction branch for real-time forecasting of event start times.

Result: Achieves state-of-the-art performance on TACoS, ActivityNet Captions, and MAD datasets.

Conclusion: The proposed framework effectively addresses the limitations of current OnVTG models by enhancing event modeling and memory retention.

Abstract: In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. The existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.
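
A toy sketch of the two-tier memory idea: recent event proposals live in a short-term buffer and are compressed into a long-term store on eviction, keeping both recent and distant history queryable; the capacities and the mean-pooling "compression" are invented for illustration.

    from collections import deque
    import numpy as np

    class HierarchicalEventMemory:
        def __init__(self, short_cap=8):
            self.short = deque(maxlen=short_cap)   # full recent event features
            self.long = []                         # compressed older events

        def add(self, event_feat):
            if len(self.short) == self.short.maxlen:
                # The oldest entry is about to be evicted; keep a summary of it.
                self.long.append(self.short[0].mean(axis=0))
            self.short.append(event_feat)

    mem = HierarchicalEventMemory()
    for _ in range(12):
        mem.add(np.random.randn(16, 64))           # (tokens, dim) per proposal
    print(len(mem.short), "recent events,", len(mem.long), "long-term summaries")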

[226] Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis

Angang Zhang, Fang Deng, Hao Chen, Zhongjian Chen, Junyan Li

Main category: cs.CV

TL;DR: The paper introduces TWGTM, a unified framework for joint virtual try-on (VTON) and try-off (VTOFF), addressing their complementary symmetry through bidirectional feature disentanglement and phased training.

DetailsMotivation: The inverse task of VTON, VTOFF, is underexplored, and existing works treat them as isolated tasks, neglecting their complementary symmetry.

Method: Proposes TWGTM, a unified framework using dual-conditioned guidance from latent and pixel spaces, with phased training to resolve mask dependency asymmetry.

Result: Validated on DressCode and VITON-HD datasets, TWGTM shows efficacy and competitive performance.

Conclusion: TWGTM successfully bridges the gap between VTON and VTOFF, offering a joint solution for garment-centric image synthesis.

Abstract: While recent advances in virtual try-on (VTON) have achieved realistic garment transfer to human subjects, its inverse task, virtual try-off (VTOFF), which aims to reconstruct canonical garment templates from dressed humans, remains critically underexplored and lacks systematic investigation. Existing works predominantly treat them as isolated tasks: VTON focuses on garment dressing while VTOFF addresses garment extraction, thereby neglecting their complementary symmetry. To bridge this fundamental gap, we propose the Two-Way Garment Transfer Model (TWGTM), to the best of our knowledge, the first unified framework for joint clothing-centric image synthesis that simultaneously resolves both mask-guided VTON and mask-free VTOFF through bidirectional feature disentanglement. Specifically, our framework employs dual-conditioned guidance from both latent and pixel spaces of reference images to seamlessly bridge the dual tasks. On the other hand, to resolve the inherent mask dependency asymmetry between mask-guided VTON and mask-free VTOFF, we devise a phased training paradigm that progressively bridges this modality gap. Extensive qualitative and quantitative experiments conducted across the DressCode and VITON-HD datasets validate the efficacy and competitive edge of our proposed approach.

[227] Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation

Franz Thaler, Darko Stern, Gernot Plank, Martin Urschler

Main category: cs.CV

TL;DR: The paper proposes a method for whole heart segmentation using CT and MR data, addressing domain shift challenges with balanced joint training and strong augmentation, achieving high accuracy for patient-specific cardiac models.

DetailsMotivation: Cardiovascular diseases are a leading cause of death, necessitating advanced methods for analyzing cardiac structures from medical images to improve diagnosis and personalized therapy.

Method: A 5-fold ensemble approach with balanced joint training (using CT and MR data equally) and strong intensity/spatial augmentation to handle domain shift.

Result: Achieved 93.33% DSC and 0.8388 mm ASSD for CT, and 89.30% DSC and 1.2411 mm ASSD for MR data, outperforming or matching domain-specific models.

Conclusion: The method shows promise for generating accurate cardiac digital twin models, aiding in personalized therapy and electrophysiological simulations.

Abstract: As the leading cause of death worldwide, cardiovascular diseases motivate the development of more sophisticated methods to analyze the heart and its substructures from medical images like Computed Tomography (CT) and Magnetic Resonance (MR). Semantic segmentations of important cardiac structures that represent the whole heart are useful to assess patient-specific cardiac morphology and pathology. Furthermore, accurate semantic segmentations can be used to generate cardiac digital twin models, which allow, e.g., electrophysiological simulation and personalized therapy planning. Even though deep learning-based methods for medical image segmentation achieved great advancements over the last decade, retaining good performance under domain shift – i.e. when training and test data are sampled from different data distributions – remains challenging. In order to perform well on domains known at training-time, we employ (1) a balanced joint training approach that utilizes CT and MR data in equal amounts from different source domains. Further, aiming to alleviate domain shift towards domains only encountered at test-time, we rely on (2) strong intensity and spatial augmentation techniques to greatly diversify the available training data. Our proposed whole heart segmentation method, a 5-fold ensemble with our contributions, achieves the best performance for MR data overall and a performance similar to the best performance for CT data when compared to a model trained solely on CT. With 93.33% DSC and 0.8388 mm ASSD for CT and 89.30% DSC and 1.2411 mm ASSD for MR data, our method demonstrates great potential to efficiently obtain accurate semantic segmentations from which patient-specific cardiac twin models can be generated.
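
A minimal sketch of the balanced joint-training idea (contribution (1) above): every batch draws CT and MR cases in equal amounts so neither modality dominates the gradient signal; the dataset sizes and batch size are made up.

    import random

    def balanced_batches(ct_ids, mr_ids, batch_size=8, n_batches=5, seed=0):
        rng = random.Random(seed)
        half = batch_size // 2
        for _ in range(n_batches):
            # Equal CT and MR shares regardless of the pools' relative sizes.
            batch = rng.sample(ct_ids, half) + rng.sample(mr_ids, half)
            rng.shuffle(batch)
            yield batch

    ct = [f"ct_{i}" for i in range(40)]
    mr = [f"mr_{i}" for i in range(20)]          # fewer MR cases than CT
    for b in balanced_batches(ct, mr):
        print(b)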

[228] One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose

Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin

Main category: cs.CV

TL;DR: OMFA is a unified diffusion framework for virtual try-on and try-off, eliminating the need for exhibition garments and supporting arbitrary poses.

DetailsMotivation: Existing methods rely on exhibition garments and segmentation masks, limiting flexibility and practicality in real-world scenarios.

Method: OMFA uses a partial diffusion strategy for selective noise and denoising, enabling dynamic subtask control and bidirectional garment-person transformation without masks.

Result: OMFA achieves state-of-the-art results in try-on and try-off tasks, supporting multi-view and arbitrary-pose synthesis from a single image.

Conclusion: OMFA provides a practical, mask-free solution for virtual garment synthesis, enhancing realism and flexibility.

Abstract: Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios: for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. For example, OMFA enables removing garments from a source person (try-off) and transferring them onto a target person (try-on), while also allowing the generated target to appear in novel poses, even without access to multi-pose images of that person. OMFA is built upon a novel partial diffusion strategy that selectively applies noise and denoising to individual components of the joint input, such as the garment, the person image, or the face, enabling dynamic subtask control and efficient bidirectional garment-person transformation. The framework is entirely mask-free and requires only a single portrait and a target pose as input, making it well-suited for real-world applications. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. The project page is here: https://onemodelforall.github.io/.
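
A hedged sketch of a partial diffusion step consistent with the description above: noise is applied only to selected components of the joint input, leaving the rest clean, so the same model can switch between try-on and try-off by changing which component is noised; the component names, shapes, and noise schedule are illustrative.

    import torch

    def partial_noise(components, noised_keys, t, betas):
        # components: dict of tensors forming the joint input,
        # e.g. {"garment": ..., "person": ..., "face": ...}
        alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]   # scalar for step t
        out = {}
        for name, x in components.items():
            if name in noised_keys:
                eps = torch.randn_like(x)
                out[name] = alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * eps
            else:
                out[name] = x                              # left clean
        return out

    parts = {"garment": torch.randn(1, 3, 64, 64), "person": torch.randn(1, 3, 64, 64)}
    betas = torch.linspace(1e-4, 0.02, 1000)
    tryoff_input = partial_noise(parts, {"garment"}, t=500, betas=betas)
    print({k: bool((v != parts[k]).any()) for k, v in tryoff_input.items()})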

[229] Drone Detection with Event Cameras

Gabriele Magrini, Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Pietro Pala

Main category: cs.CV

TL;DR: Event-based vision offers a robust solution for drone detection, overcoming limitations of traditional cameras by eliminating motion blur and excelling in extreme lighting.

DetailsMotivation: Address the challenges of detecting small, agile drones with traditional cameras due to motion blur and poor lighting performance.

Method: Survey event-based vision, including data representation, spiking neural networks, and advanced tasks like tracking and identification.

Result: Event cameras provide reliable, low-latency detection and enable sophisticated drone monitoring tasks.

Conclusion: Event-based vision is a powerful foundation for next-gen counter-UAV systems.

Abstract: The proliferation of drones presents significant security and safety challenges. Traditional surveillance systems, particularly conventional frame-based cameras, struggle to reliably detect these targets due to their small size, high agility, and the resulting motion blur and poor performance in challenging lighting conditions. This paper surveys the emerging field of event-based vision as a robust solution to these problems. Event cameras virtually eliminate motion blur and enable consistent detection in extreme lighting. Their sparse, asynchronous output suppresses static backgrounds, enabling low-latency focus on motion cues. We review the state-of-the-art in event-based drone detection, from data representation methods to advanced processing pipelines using spiking neural networks. The discussion extends beyond simple detection to cover more sophisticated tasks such as real-time tracking, trajectory forecasting, and unique identification through propeller signature analysis. By examining current methodologies, available datasets, and the distinct advantages of the technology, this work demonstrates that event-based vision provides a powerful foundation for the next generation of reliable, low-latency, and efficient counter-UAV systems.

[230] TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning

Yunbi Liu, Enqi Tang, Shiyu Li, Lei Ma, Juncheng Li, Shu Lou, Yongchu Pan, Qingshan Liu

Main category: cs.CV

TL;DR: TAlignDiff introduces a diffusion-based method for automatic tooth alignment, combining point cloud regression and diffusion modeling to improve orthodontic treatment.

DetailsMotivation: Current deep learning methods for tooth alignment rely on deterministic geometric constraints, missing the anatomical structure's distribution characteristics.

Method: TAlignDiff uses a point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD) for bidirectional feedback between geometric constraints and diffusion refinement.

Result: The method outperforms prior approaches, demonstrating effectiveness in tooth alignment for orthodontic treatment.

Conclusion: TAlignDiff offers a superior, unified framework for tooth alignment, leveraging diffusion-based learning to capture anatomical distribution characteristics.

Abstract: Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients’ quality of life. Current deep learning approaches predominantly concentrate on predicting transformation matrices through imposing point-to-point geometric constraints for tooth alignment. Nevertheless, these matrices are likely associated with the anatomical structure of the human oral cavity and possess particular distribution characteristics that the deterministic point-to-point geometric constraints in prior work fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is supported by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. Extensive ablation and comparative experiments demonstrate the effectiveness and superiority of our method, highlighting its potential in orthodontic treatment.

[231] DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling

Yijie Li, Wei Zhang, Xi Zhu, Ye Wu, Yogesh Rathi, Lauren J. O’Donnell, Fan Zhang

Main category: cs.CV

TL;DR: DDTracking is a deep generative framework for diffusion MRI tractography, using a conditional denoising diffusion process and dual-pathway encoding to outperform state-of-the-art methods.

DetailsMotivation: To improve tractography by capturing fine-scale structural details and ensuring long-range consistency in streamline propagation.

Method: Uses a dual-pathway encoding network for local and global modeling, combined with a conditional diffusion model for end-to-end trainable tractography.

Result: Outperforms current methods on benchmarks (ISMRM Challenge, TractoInferno) and shows strong generalizability across diverse datasets.

Conclusion: DDTracking provides anatomically plausible, robust, and scalable tractography for broad dMRI applications.

Abstract: This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking’s strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: https://github.com/yishengpoxiao/DDtracking.git

[232] Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Yizhou Wang, Yixuan Wu, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang, Shixiang Tang

Main category: cs.CV

TL;DR: Hulk is a multimodal human-centric generalist model addressing 2D/3D vision, skeleton-based, and vision-language tasks without task-specific finetuning, achieving state-of-the-art performance.

DetailsMotivation: Existing human-centric foundation models lack 3D and vision-language capabilities and require task-specific finetuning, limiting their versatility.

Method: Condenses task-specific heads into two general heads for discrete and continuous representations, enabling uniform modality translation.

Result: Achieves state-of-the-art performance in 11 out of 12 benchmarks across 8 human-centric tasks.

Conclusion: Hulk demonstrates superior versatility and performance, integrating diverse human-centric tasks effectively.

Abstract: Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric perception and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks. The code will be available on https://github.com/OpenGVLab/Hulk.
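
A shape-level sketch of the "two general heads" idea: one head decodes discrete tokens (e.g., words or class ids) and the other regresses continuous values (e.g., coordinates), with every task routed through one of the two; the sizes below are illustrative.

    import torch
    import torch.nn as nn

    dim, vocab = 256, 32_000
    discrete_head = nn.Linear(dim, vocab)        # logits over a shared vocabulary
    continuous_head = nn.Linear(dim, 2)          # e.g. (x, y) location outputs

    h = torch.randn(1, 10, dim)                  # shared decoder states
    token_logits = discrete_head(h)              # language-like outputs
    coords = continuous_head(h)                  # location-like outputs
    print(token_logits.shape, coords.shape)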

[233] Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

Jun Li, Che Liu, Wenjia Bai, Mingxuan Liu, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel

Main category: cs.CV

TL;DR: K2Sight improves medical image grounding by using structured semantic supervision and concise prompts, achieving competitive performance with smaller models and less data.

DetailsMotivation: Generalist VLMs struggle in the medical domain due to rare and domain-specific terms, while specialized VLMs require heavy resources.

Method: Decomposes clinical concepts into visual attributes (shape, density, location) from ontologies, using instruction-style prompts for training.

Result: Compact models (0.23B/2B params) trained with 1.5% of data match or outperform 7B+ VLMs, with up to 9.82% mAP50 improvement.

Conclusion: K2Sight efficiently bridges domain knowledge and spatial structure, enabling high performance with minimal resources.

Abstract: In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose Knowledge to Sight (K2Sight), a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style prompts, which guide region-text alignment during training. Unlike conventional report-level supervision, our approach explicitly bridges domain knowledge and spatial structure, enabling data-efficient training of compact models. We train compact models with 0.23B and 2B parameters using only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, these models achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in mAP50. Code and models: https://lijunrio.github.io/K2Sight/
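
A toy sketch of turning decomposed attributes into an instruction-style prompt in the spirit of K2Sight; the attribute schema and wording here are invented, whereas the paper distills them from domain ontologies.

    def attribute_prompt(finding, attrs):
        # attrs: dict of interpretable visual attributes for the finding
        parts = [f"{k}: {v}" for k, v in attrs.items()]
        return f"Locate {finding}. Visual cues -> " + "; ".join(parts)

    attrs = {"shape": "round", "density": "increased opacity",
             "location": "left lower lobe"}      # hypothetical attribute values
    print(attribute_prompt("nodular opacity", attrs))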

[234] Long-Term Visual Object Tracking with Event Cameras: An Associative Memory Augmented Tracker and A Benchmark Dataset

Xiao Wang, Xufeng Lou, Shiao Wang, Ju Huang, Lan Chen, Bo Jiang

Main category: cs.CV

TL;DR: The paper introduces FELT, a long-term, large-scale frame-event visual object tracking dataset, and proposes AMTTrack, a novel tracker using an Associative Memory Transformer for RGB-Event tracking.

DetailsMotivation: Existing trackers are evaluated on short-term datasets, but real-world scenarios require long-term tracking, which lacks benchmarks.

Method: The authors propose FELT dataset with 1,044 videos and 21 baseline trackers. AMTTrack uses a one-stream framework with Hopfield retrieval and dynamic template updates.

Result: Experiments on FELT, FE108, VisEvent, and COESOT show AMTTrack’s effectiveness.

Conclusion: The dataset and tracker address long-term tracking challenges, with code and dataset publicly available.

Abstract: Existing event stream based trackers are evaluated on short-term tracking datasets; however, real-world tracking involves long-term scenarios, and the performance of existing tracking algorithms in these settings remains unclear. In this paper, we first propose a new long-term, large-scale frame-event visual object tracking dataset, termed FELT. It contains 1,044 long-term videos that involve 1.9 million RGB frames and event stream pairs, 60 different target objects, and 14 challenging attributes. To build a solid benchmark, we retrain and evaluate 21 baseline trackers on our dataset for future work to compare. In addition, we propose a novel Associative Memory Transformer based RGB-Event long-term visual tracker, termed AMTTrack. It follows a one-stream tracking framework and aggregates the multi-scale RGB/event template and search tokens effectively via the Hopfield retrieval layer. The framework also embodies another aspect of associative memory by maintaining dynamic template representations through an associative memory update scheme, which addresses the appearance variation in long-term tracking. Extensive experiments on FELT, FE108, VisEvent, and COESOT datasets fully validate the effectiveness of our proposed tracker. Both the dataset and source code will be released on https://github.com/Event-AHU/FELT_SOT_Benchmark
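
The Hopfield retrieval step mentioned above can be sketched as a temperature-scaled softmax lookup over stored template tokens (the modern-Hopfield view of attention); the dimensions and inverse temperature are illustrative.

    import torch
    import torch.nn.functional as F

    def hopfield_retrieve(queries, memory, beta=2.0):
        # queries: (B, Nq, D) search tokens; memory: (B, Nm, D) stored templates
        attn = F.softmax(beta * queries @ memory.transpose(-2, -1), dim=-1)
        return attn @ memory                     # retrieved patterns, (B, Nq, D)

    q = torch.randn(1, 4, 64)                    # search tokens
    m = torch.randn(1, 16, 64)                   # template memory
    print(hopfield_retrieve(q, m).shape)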

[235] Visual Bias and Interpretability in Deep Learning for Dermatological Image Analysis

Enam Ahmed Taufik, Abdullah Khondoker, Antara Firoz Parsa, Seraj Al Mahmud Mostafa

Main category: cs.CV

TL;DR: The paper proposes a deep learning framework for skin disease classification, evaluating pre-processing techniques and model architectures, with DinoV2 and RGB pre-processing achieving the best performance (93% accuracy).

DetailsMotivation: Accurate skin disease classification is challenging due to inter-class similarity and intra-class variability. The study aims to improve CAD systems by optimizing pre-processing and model choice.

Method: The study evaluates three pre-processing techniques (RGB, CMY, CLAHE) and benchmarks pre-trained CNNs (DenseNet201, Efficient-NetB5) and transformer models (ViT, Swin, DinoV2) using accuracy and F1-score.

Result: DinoV2 with RGB pre-processing achieves the highest accuracy (93%) and F1-scores. Grad-CAM visualizations confirm precise lesion localization.

Conclusion: Effective pre-processing and model selection are crucial for robust and explainable CAD systems in dermatology.

Abstract: Accurate skin disease classification is a critical yet challenging task due to high inter-class similarity, intra-class variability, and complex lesion textures. While deep learning-based computer-aided diagnosis (CAD) systems have shown promise in automating dermatological assessments, their performance is highly dependent on image pre-processing and model architecture. This study proposes a deep learning framework for multi-class skin disease classification, systematically evaluating three image pre-processing techniques: standard RGB, CMY color space transformation, and Contrast Limited Adaptive Histogram Equalization (CLAHE). We benchmark the performance of pre-trained convolutional neural networks (DenseNet201, Efficient-NetB5) and transformer-based models (ViT, Swin Transformer, DinoV2 Large) using accuracy and F1-score as evaluation metrics. Results show that DinoV2 with RGB pre-processing achieves the highest accuracy (up to 93%) and F1-scores across all variants. Grad-CAM visualizations applied to RGB inputs further reveal precise lesion localization, enhancing interpretability. These findings underscore the importance of effective pre-processing and model choice in building robust and explainable CAD systems for dermatology.
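
For concreteness, the two non-RGB pre-processing variants compared in the study can be sketched as below: a CMY transform (the subtractive complement of RGB) and CLAHE applied to the lightness channel; the synthetic stand-in image and CLAHE parameters are assumptions.

    import cv2
    import numpy as np

    img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # stand-in image (BGR)

    # CMY: the subtractive complement of RGB.
    cmy = 255 - img

    # CLAHE on the lightness channel in LAB space.
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    print(cmy.shape, enhanced.shape)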

[236] Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan

Marta Moscati, Ahmed Abdullah, Muhammad Saad Saeed, Shah Nawaz, Rohan Kumar Das, Muhammad Zaigham Zaheer, Junaid Mir, Muhammad Haroon Yousaf, Khalid Malik, Markus Schedl

Main category: cs.CV

TL;DR: The FAME 2026 Challenge explores face-voice association in multilingual environments using the MAV-Celeb dataset, addressing the unique correlation between face and voice in bilingual populations.

DetailsMotivation: Half of the world's population is bilingual, and multilingual communication is common, making face-voice association in such scenarios important.

Method: The challenge uses the MAV-Celeb dataset and baseline models to study face-voice association in multilingual settings.

Result: Details of the challenge, dataset, baseline models, and tasks are provided.

Conclusion: The FAME Challenge aims to advance understanding of face-voice correlation in multilingual contexts.

Abstract: The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are among the most widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to the presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired from the fact that half of the world’s population is bilingual and most often people communicate under multilingual scenarios. The challenge uses a dataset named Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baseline models, and task details for the FAME Challenge.

[237] Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline

Linqing Zhao, Xiuwei Xu, Yirui Wang, Hao Wang, Wenzhao Zheng, Yansong Tang, Haibin Yan, Jiwen Lu

Main category: cs.CV

TL;DR: Proposes a fast online 3D reconstruction method using 3D Gaussian-based SLAM and feed-forward pose prediction, reducing tracking time by over 90% while matching state-of-the-art performance.

DetailsMotivation: Addressing the challenges of real-sized 3D geometry recovery from pose-free RGB streams, particularly the limitations of existing methods in handling long sequences or relying on slow optimization.

Method: Integrates 3D Gaussian mapping into SLAM, uses a feed-forward recurrent prediction module for camera pose inference, and employs local graph rendering for robustness.

Result: Achieves performance comparable to SplaTAM on Replica and TUM-RGBD datasets, with a 90% reduction in tracking time.

Conclusion: The method offers a faster, efficient alternative for 3D reconstruction without compromising accuracy.

Abstract: Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%.

[238] AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity

Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su

Main category: cs.CV

TL;DR: AVG-LLaVA is an LMM that adaptively selects visual granularity for high-resolution images, reducing tokens and speeding up inference.

DetailsMotivation: Existing LMMs divide high-resolution images into many tokens, which is inefficient. AVG-LLaVA aims to optimize this by selecting granularity adaptively.

Method: Uses multiple pooling layers for different granularities and a visual granularity router (Transformer, MLP, voter) to select the best granularity. Introduces RGLF for training alignment without extra data.

Result: Achieves superior performance on 11 benchmarks, reduces visual tokens by 85.3%, and speeds up inference by 2.53x.

Conclusion: AVG-LLaVA efficiently handles high-resolution images by adaptive granularity selection, improving performance and speed.

Abstract: Recently, large multimodal models (LMMs) have achieved significant advancements. When dealing with high-resolution images, dominant LMMs typically divide them into multiple local images and a global image, leading to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. Specifically, we first apply multiple pooling layers to obtain visual tokens at different granularities. Then we propose a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we put forward RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference (e.g., an 85.3% reduction in visual tokens and a 2.53x increase in inference speed on the AI2D benchmark).
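
A minimal sketch of the multi-granularity pooling described above, with a trivial stand-in for the router: the visual token grid is average-pooled at several strides and one granularity is chosen by a score vector; the grid size, strides, and scores are illustrative.

    import torch
    import torch.nn.functional as F

    def granularity_candidates(tokens, grid=24, strides=(1, 2, 4)):
        # tokens: (B, grid*grid, D) visual tokens from the image encoder
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, grid, grid)
        return [F.avg_pool2d(x, s).flatten(2).transpose(1, 2) for s in strides]

    tokens = torch.randn(1, 576, 1024)           # 24x24 grid of visual tokens
    cands = granularity_candidates(tokens)
    scores = torch.tensor([0.1, 0.7, 0.2])       # stand-in for the router output
    chosen = cands[int(scores.argmax())]
    print([c.shape[1] for c in cands], "->", chosen.shape[1], "tokens")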

[239] OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu

Main category: cs.CV

TL;DR: OmniDepth unifies monocular and stereo depth estimation by iteratively aligning their latent representations, improving accuracy and robustness.

DetailsMotivation: Monocular and stereo depth estimation have complementary strengths and weaknesses, but current methods remain disjoint. OmniDepth aims to bridge this gap.

Method: Uses a cross-attentive alignment mechanism to dynamically synchronize monocular contextual cues with stereo hypothesis representations.

Result: State-of-the-art performance, reducing zero-shot generalization error by >40% on benchmarks like Middlebury and ETH3D.

Conclusion: OmniDepth harmonizes multi-view geometry with monocular context, overcoming modality-specific limitations for robust 3D perception.

Abstract: Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by more than 40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth enables robust 3D perception that transcends modality-specific limitations. Code is available at https://github.com/aeolusguan/OmniDepth.
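
The cross-attentive alignment at OmniDepth's core can be pictured as stereo hypothesis tokens attending to monocular context tokens (with the reverse direction for the bidirectional variant); the sketch below uses a stock attention layer and illustrative dimensions.

    import torch
    import torch.nn as nn

    dim = 128
    align = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
    stereo = torch.randn(1, 1024, dim)           # stereo hypothesis tokens
    mono = torch.randn(1, 1024, dim)             # monocular context tokens
    injected, _ = align(query=stereo, key=mono, value=mono)
    stereo = stereo + injected                   # inject monocular priors
    print(stereo.shape)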

[240] How Does Bilateral Ear Symmetry Affect Deep Ear Features?

Kagan Ozturk, Deeksha Arun, Kevin W. Bowyer, Patrick Flynn

Main category: cs.CV

TL;DR: The paper explores the impact of bilateral ear symmetry on CNN-based ear recognition, showing that treating left and right ears separately improves performance.

DetailsMotivation: Despite the use of CNNs in ear recognition, the influence of bilateral ear symmetry on learned features is understudied.

Method: Developed an ear side classifier to categorize images as left or right, then evaluated the impact of side information during training and testing across five datasets.

Result: Separate treatment of left and right ears during training and testing leads to notable performance improvements.

Conclusion: Incorporating bilateral symmetry information enhances CNN-based ear recognition, with practical insights for training on large-scale datasets.

Abstract: Ear recognition has gained attention as a reliable biometric technique due to the distinctive characteristics of human ears. With the increasing availability of large-scale datasets, convolutional neural networks (CNNs) have been widely adopted to learn features directly from raw ear images, outperforming traditional hand-crafted methods. However, the effect of bilateral ear symmetry on the features learned by CNNs has received little attention in recent studies. In this paper, we investigate how bilateral ear symmetry influences the effectiveness of CNN-based ear recognition. To this end, we first develop an ear side classifier to automatically categorize ear images as either left or right. We then explore the impact of incorporating this side information during both training and test. Cross-dataset evaluations are conducted on five datasets. Our results suggest that treating left and right ears separately during training and testing can lead to notable performance improvements. Furthermore, our ablation studies on alignment strategies, input sizes, and various hyperparameter settings provide practical insights into training CNN-based ear recognition systems on large-scale datasets to achieve higher verification rates.
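
One simple side-aware step such a system might take, given the side classifier described above, is to mirror right-ear crops into a canonical left orientation; this is a hypothetical illustration of using side information, not necessarily the strategy the paper's ablations favor.

    import numpy as np

    def canonicalize(img, side):
        # img: (H, W, C) ear crop; side: "left" or "right" from the side classifier
        return img[:, ::-1, :] if side == "right" else img

    crop = np.zeros((96, 96, 3), dtype=np.uint8)
    print(canonicalize(crop, "right").shape)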

[241] FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Zichen Tang, Haihong E, Jiacheng Liu, Zhongjun Yang, Rongjin Li, Zihua Rong, Haoyang He, Zhuodi Hao, Xinyang Hu, Kun Ji, Ziyan Ma, Mengyuan Ji, Jun Zhang, Chenghao Ma, Qianhe Zheng, Yang Liu, Yiling Huang, Xinyi Hu, Qing Huang, Zijian Xie, Shiyao Peng

Main category: cs.CV

TL;DR: FinMMR is a bilingual multimodal benchmark for evaluating MLLMs in financial numerical reasoning, featuring 4.3K questions, 8.7K images, and 14 financial subdomains.

DetailsMotivation: To address the lack of comprehensive benchmarks for evaluating MLLMs' reasoning in financial tasks, FinMMR introduces multimodal, diverse, and challenging questions.

Method: Transforms existing benchmarks and constructs new questions from Chinese financial reports, covering 14 categories and subdomains.

Result: The best MLLM achieves only 53.0% accuracy on Hard problems, highlighting the challenge.

Conclusion: FinMMR aims to advance MLLMs’ reasoning capabilities in real-world financial scenarios.

Abstract: We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning benchmarks, and construct novel questions from the latest Chinese financial research reports. FinMMR comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts. (2) Comprehensiveness: FinMMR encompasses 14 financial subdomains, including corporate finance, banking, and industry analysis, significantly exceeding existing benchmarks in financial domain knowledge breadth. (3) Challenge: Models are required to perform multi-step precise numerical reasoning by integrating financial knowledge with the understanding of complex financial images and text. The best-performing MLLM achieves only 53.0% accuracy on Hard problems. We believe that FinMMR will drive advancements in enhancing the reasoning capabilities of MLLMs in real-world scenarios.

[242] Causality-Driven Audits of Model Robustness

Nathan Drenkow, William Paul, Chris Ribaudo, Mathias Unberath

Main category: cs.CV

TL;DR: A new robustness auditing method for DNNs uses causal inference to measure sensitivities to complex imaging distortions, outperforming traditional isolated-effect audits.

DetailsMotivation: Traditional robustness audits fail to account for complex, interacting real-world imaging conditions, limiting their practical applicability.

Method: The approach employs causal models to encode domain-relevant factors and their interactions, estimating causal effects on DNN performance using observational data.

Result: Experiments on natural and rendered images show reliable estimation of causal effects, linking DNN sensitivities to imaging pipeline properties.

Conclusion: This method reduces the risk of unexpected DNN failures by directly addressing complex real-world imaging conditions.

Abstract: Robustness audits of deep neural networks (DNN) provide a means to uncover model sensitivities to the challenging real-world imaging conditions that significantly degrade DNN performance in-the-wild. Such conditions are often the result of multiple interacting factors inherent to the environment, sensor, or processing pipeline and may lead to complex image distortions that are not easily categorized. When robustness audits are limited to a set of isolated imaging effects or distortions, the results cannot be (easily) transferred to real-world conditions where image corruptions may be more complex or nuanced. To address this challenge, we present a new alternative robustness auditing method that uses causal inference to measure DNN sensitivities to the factors of the imaging process that cause complex distortions. Our approach uses causal models to explicitly encode assumptions about the domain-relevant factors and their interactions. Then, through extensive experiments on natural and rendered images across multiple vision tasks, we show that our approach reliably estimates causal effects of each factor on DNN performance using only observational domain data. These causal effects directly tie DNN sensitivities to observable properties of the imaging pipeline in the domain of interest towards reducing the risk of unexpected DNN failures when deployed in that domain.
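
A hedged sketch of the style of estimate such an audit produces: the average causal effect of a binary imaging factor on per-image correctness, computed by backdoor adjustment over an observed confounder; the causal graph and data below are synthetic stand-ins, not the paper's factors.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    z = rng.integers(0, 2, 10_000)                  # confounder, e.g. scene type
    t = rng.binomial(1, 0.3 + 0.4 * z)              # factor, e.g. motion blur
    y = rng.binomial(1, 0.9 - 0.2 * t - 0.1 * z)    # per-image model correctness
    df = pd.DataFrame({"z": z, "t": t, "y": y})

    # Backdoor adjustment: E_z[ E[Y|T=1,Z=z] - E[Y|T=0,Z=z] ], weighted by P(Z=z).
    effect = sum(
        (g["y"][g["t"] == 1].mean() - g["y"][g["t"] == 0].mean()) * len(g) / len(df)
        for _, g in df.groupby("z")
    )
    print(f"estimated causal effect of the factor on accuracy: {effect:+.3f}")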

[243] EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts

Kushin Mukherjee, Donghao Ren, Dominik Moritz, Yannick Assogba

Main category: cs.CV

TL;DR: EncQA is a new benchmark for evaluating chart understanding in VLMs, covering diverse visual encodings and tasks. It reveals performance gaps and challenges the assumption that scaling models improves results uniformly.

DetailsMotivation: Current benchmarks for VLMs don't fully assess visual reasoning needed for chart interpretation, prompting the creation of EncQA.

Method: EncQA includes 2,076 synthetic question-answer pairs, balancing six visual encodings and eight analytic tasks. Nine state-of-the-art VLMs were evaluated.

Result: Performance varies across encodings and tasks, with no consistent improvement from model scaling.

Conclusion: Targeted strategies, not just scaling, are needed to advance chart understanding in VLMs.

Abstract: Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.

[244] PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment

Gustav Hanning, Kalle Åström, Viktor Larsson

Main category: cs.CV

TL;DR: PixCuboid is an optimization-based method for cuboid-shaped room layout estimation using multi-view alignment of dense deep features, outperforming state-of-the-art methods.

DetailsMotivation: Coarse room layout estimation is crucial for downstream tasks, but current methods rely on single views and panoramic images, limiting their effectiveness.

Method: PixCuboid uses multi-view alignment of dense deep features, trained end-to-end to ensure large convergence basins and smooth loss landscapes. Simple heuristics initialize the layout.

Result: PixCuboid outperforms competitors on new benchmarks (ScanNet++ and 2D-3D-Semantics) with verified ground truth. It also extends to multi-room estimation.

Conclusion: PixCuboid offers a flexible, optimization-based solution for room layout estimation, with superior performance and adaptability to multi-room scenarios.

Abstract: Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics. For the evaluation we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allows us to easily extend to multi-room estimation, e.g., larger apartments or offices. Code and model weights are available at https://github.com/ghanning/PixCuboid.

[245] DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou, Yuxin Chen, Haokun Lin, Yichen Wu, Shuyu Yang, Zhongang Qi, Chen Ma, Li Zhu, Ying Shan

Main category: cs.CV

TL;DR: The paper introduces DOGR-Engine and DOGR-Bench to address the lack of fine-grained datasets and benchmarks for grounding and referring in visual document understanding, proposing a baseline model, DOGR, for improved performance.

DetailsMotivation: Current MLLMs lack fine-grained grounding and referring capabilities in visual document understanding due to insufficient datasets and benchmarks.

Method: Proposes DOGR-Engine to generate multi-granular parsing and instruction-tuning data, and constructs DOGR-Bench for evaluation. Develops the DOGR model for enhanced text localization and recognition.

Result: DOGR-Bench covers seven tasks across three document types, and DOGR model demonstrates superior grounding and referring capabilities.

Conclusion: The work advances fine-grained document understanding and enables flexible interaction paradigms through the proposed data engine, benchmark, and model.

Abstract: With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. However, these capabilities still remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs’ grounding and referring capabilities in dialogue and reasoning. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document), offering a comprehensive evaluation of fine-grained document understanding. Leveraging the generated data, we further develop DOGR, a strong baseline model that excels in text localization and recognition, while precisely grounding and referring to key textual information during conversation and reasoning, thereby advancing document understanding to a finer granularity and enabling flexible interaction paradigms.

[246] ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models

Yansheng Gao, Yufei Zheng, Jinghan Qu, Zixi Zhu, Yukuan Zhang, Shengsheng Wang

Main category: cs.CV

TL;DR: ANPrompt is a prompt tuning framework for vision-language models (VLMs) that enhances robustness against weak semantic perturbations by integrating noise prompts and noise-resistant visual prototypes, outperforming existing methods.

DetailsMotivation: Existing prompt-tuned VLMs are vulnerable to subtle semantic noise, degrading their generalization to unseen classes.

Method: ANPrompt constructs noise prompts from perturbed text embeddings, integrates them with learnable tokens, and computes noise-resistant visual prototypes. It uses alignment, robustness, and anti-noise objectives.

Result: ANPrompt outperforms existing methods on 11 benchmarks, showing superior robustness and generalization.

Conclusion: ANPrompt effectively addresses the vulnerability of prompt-tuned VLMs to semantic noise, improving their robustness and generalization.

Abstract: Prompt tuning has emerged as an efficient and effective technique for adapting vision-language models (VLMs) with low computational overhead. However, existing methods often overlook the vulnerability of prompt-tuned VLMs to weak semantic perturbations, such as subtle image or text noise, that degrade their generalization to unseen classes. To address this limitation, we propose ANPrompt, a novel prompt tuning framework designed to enhance robustness under such perturbations. ANPrompt first constructs weak noise text features by fusing original and noise-perturbed text embeddings, which are then clustered to form noise prompts. These noise prompts are integrated with learnable prompt tokens to generate anti-noise prompts, which are injected into the deeper layers of both image and text encoders. To further capture the noise-aware visual semantics, ANPrompt computes the Noise-Resistant Visual Prompt Prototype (NRVPP) by averaging the output prompt tokens from the vision encoder. Finally, ANPrompt introduces alignment, robustness, and anti-noise objectives by computing a Weak semantic noise Alignment Loss (WALoss) alongside the standard cross-entropy and sim loss. Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories.
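
A minimal sketch of the noise-prompt construction step as described: perturb text embeddings, fuse them with the originals, and cluster to obtain noise prompts that are concatenated with learnable tokens. Dimensions, noise scale, and the use of k-means are illustrative assumptions.

```python
# Sketch: build "noise prompts" by clustering noise-perturbed text embeddings.
# Dimensions, noise scale, and cluster count are illustrative assumptions.
import torch
from sklearn.cluster import KMeans

torch.manual_seed(0)
num_classes, dim = 100, 512
text_feats = torch.randn(num_classes, dim)           # stand-in for CLIP text embeddings

# Fuse original and noise-perturbed embeddings (weak semantic perturbation)
noisy = text_feats + 0.05 * torch.randn_like(text_feats)
fused = torch.cat([text_feats, noisy], dim=0)

# Cluster the fused features; centroids act as noise prompts
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(fused.numpy())
noise_prompts = torch.from_numpy(km.cluster_centers_).float()   # (4, dim)

# Concatenate with learnable prompt tokens to form anti-noise prompts
learnable = torch.nn.Parameter(torch.randn(8, dim) * 0.02)
anti_noise_prompts = torch.cat([learnable, noise_prompts], dim=0)
print(anti_noise_prompts.shape)                       # (12, 512)
```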

[247] Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions

Liang Xu, Chengqun Yang, Zili Lin, Fei Xu, Yifan Liu, Congsheng Xu, Yiyi Zhang, Jie Qin, Xingdong Sheng, Yunhui Liu, Xin Jin, Yichao Yan, Wenjun Zeng, Xiaokang Yang

Main category: cs.CV

TL;DR: The paper introduces InterVLA, a large-scale dataset for human-object-human interaction, combining generalist knowledge and egocentric vision to improve AI assistants.

DetailsMotivation: Existing datasets lack generalist interaction knowledge and egocentric perspective, limiting AI assistants' real-world applicability.

Method: A hybrid RGB-MoCap system captures multimodal data (egocentric/exocentric videos, motions, commands) using GPT-generated scripts.

Result: InterVLA dataset includes 11.4 hours of data, 1.2M frames, and benchmarks for motion estimation, interaction synthesis, and prediction.

Conclusion: InterVLA and its benchmarks advance research on AI agents in physical environments.

Abstract: Learning action models from real-world human-centric interaction datasets is important towards building general-purpose intelligent assistants with efficiency. However, most existing datasets only offer specialist interaction categories and ignore that AI assistants perceive and act based on first-person acquisition. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we accomplish InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future works on building AI agents in the physical world.

[248] TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

Zewei Zhou, Seth Z. Zhao, Tianhui Cai, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

Main category: cs.CV

TL;DR: TurboTrain is a novel framework for efficient multi-agent training, combining spatiotemporal pretraining and balanced multi-task learning to improve performance and reduce manual effort.

DetailsMotivation: Training multi-agent systems is challenging and requires extensive manual design. TurboTrain aims to simplify this process and enhance performance.

Method: TurboTrain uses masked reconstruction learning for spatiotemporal pretraining and gradient conflict suppression for balanced multi-task learning.

Result: Evaluated on V2XPnP-Seq, TurboTrain improves state-of-the-art multi-agent perception and prediction models.

Conclusion: TurboTrain effectively captures multi-agent features and enhances downstream tasks, reducing training time and manual effort.

Abstract: End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.
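
The abstract does not spell out the conflict-suppression rule, so the sketch below shows the generic PCGrad-style form of the idea: when two task gradients point in conflicting directions, project one off the other before summing.

```python
# Sketch of PCGrad-style conflict suppression between two task gradients.
# TurboTrain's exact strategy is not specified in the abstract; this shows the
# generic mechanism: remove the conflicting component before combining updates.
import torch

def suppress_conflict(g_det: torch.Tensor, g_pred: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g_det, g_pred)
    if dot < 0:                                   # gradients conflict
        g_det = g_det - dot / g_pred.norm().pow(2) * g_pred   # project off g_pred
    return g_det + g_pred

g_detection = torch.tensor([1.0, 0.0])
g_prediction = torch.tensor([-0.5, 1.0])          # conflicts with detection on dim 0
print(suppress_conflict(g_detection, g_prediction))  # tensor([0.3000, 1.4000])
```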

[249] Vision without Images: End-to-End Computer Vision from Single Compressive Measurements

Fengpu Pan, Heting Gao, Jiangtao Wen, Yuxing Han

Main category: cs.CV

TL;DR: A novel SCI-based framework using 8x8 pseudo-random binary masks and CompDAE, a Compressive Denoising Autoencoder, achieves state-of-the-art performance in low-light, low-SNR conditions without image reconstruction.

DetailsMotivation: Addressing challenges of low-light, low-SNR conditions and hardware constraints in high-resolution SCI systems.

Method: Uses pseudo-random binary masks (8x8) and CompDAE (STFormer-based) for direct task execution from noisy measurements, with rate-constrained training and shared encoder for multi-task efficiency.

Result: State-of-the-art performance with lower complexity, especially in ultra-low-light conditions where traditional methods fail.

Conclusion: The framework enables efficient, hardware-friendly SCI implementations with superior performance in challenging conditions.

Abstract: Snapshot Compressed Imaging (SCI) offers high-speed, low-bandwidth, and energy-efficient image acquisition, but remains challenged by low-light and low signal-to-noise ratio (SNR) conditions. Moreover, practical hardware constraints in high-resolution sensors limit the use of large frame-sized masks, necessitating smaller, hardware-friendly designs. In this work, we present a novel SCI-based computer vision framework using pseudo-random binary masks of only 8×8 in size for physically feasible implementations. At its core is CompDAE, a Compressive Denoising Autoencoder built on the STFormer architecture, designed to perform downstream tasks, such as edge detection and depth estimation, directly from noisy compressive raw pixel measurements without image reconstruction. CompDAE incorporates a rate-constrained training strategy inspired by BackSlash to promote compact, compressible models. A shared encoder paired with lightweight task-specific decoders enables a unified multi-task platform. Extensive experiments across multiple datasets demonstrate that CompDAE achieves state-of-the-art performance with significantly lower complexity, especially under ultra-low-light conditions where traditional CMOS and SCI pipelines fail.
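
A minimal sketch of the sensing model implied here: a tiled 8×8 pseudo-random binary mask modulates each frame, and a single snapshot sums the masked frames. Sizes and the per-frame coding model follow standard video SCI and are assumptions.

```python
# Sketch: snapshot compressive measurement with a tiled 8x8 pseudo-random
# binary mask per frame. Sizes and the coding model are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 256, 256                       # frames per snapshot, frame size

# One 8x8 binary pattern per frame, tiled to full resolution (hardware-friendly)
tiles = rng.integers(0, 2, size=(T, 8, 8))
masks = np.tile(tiles, (1, H // 8, W // 8)).astype(np.float32)   # (T, H, W)

video = rng.uniform(0, 1, size=(T, H, W)).astype(np.float32)     # scene frames
measurement = (masks * video).sum(axis=0)    # single coded snapshot, (H, W)

# A CompDAE-style pipeline would feed `measurement` (plus noise) directly to
# task decoders, skipping image reconstruction entirely.
print(measurement.shape)
```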

[250] BEVCon: Advancing Bird’s Eye View Perception with Contrastive Learning

Ziyang Leng, Jiawei Yang, Zhicheng Ren, Bolei Zhou

Main category: cs.CV

TL;DR: BEVCon is a contrastive learning framework for improving Bird’s Eye View (BEV) perception in autonomous driving, enhancing feature representations and achieving +2.4% mAP improvement.

DetailsMotivation: Prior work focused on BEV encoders and task-specific heads, leaving representation learning underexplored. BEVCon addresses this gap.

Method: Introduces two contrastive learning modules: instance feature contrast for BEV features and perspective view contrast for the image backbone.

Result: Achieves up to +2.4% mAP improvement on nuScenes dataset, demonstrating enhanced feature representations.

Conclusion: BEVCon highlights the importance of representation learning in BEV perception, complementing task-specific optimizations.

Abstract: We present BEVCon, a simple yet effective contrastive learning framework designed to improve Bird’s Eye View (BEV) perception in autonomous driving. BEV perception offers a top-down-view representation of the surrounding environment, making it crucial for 3D object detection, segmentation, and trajectory prediction tasks. While prior work has primarily focused on enhancing BEV encoders and task-specific heads, we address the underexplored potential of representation learning in BEV models. BEVCon introduces two contrastive learning modules: an instance feature contrast module for refining BEV features and a perspective view contrast module that enhances the image backbone. The dense contrastive learning designed on top of detection losses leads to improved feature representations across both the BEV encoder and the backbone. Extensive experiments on the nuScenes dataset demonstrate that BEVCon achieves consistent performance gains, achieving up to +2.4% mAP improvement over state-of-the-art baselines. Our results highlight the critical role of representation learning in BEV perception and offer a complementary avenue to conventional task-specific optimizations.
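
A minimal sketch of instance feature contrast in its generic InfoNCE form; the pairing rule (a feature versus its augmented view) and the temperature are illustrative assumptions rather than BEVCon's exact loss.

```python
# Sketch: InfoNCE-style contrast over per-instance BEV features.
# Pairing rule and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.07):
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / tau               # (N, N) cosine similarities
    targets = torch.arange(a.size(0))      # matched pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

bev_instance_feats = torch.randn(32, 256)  # e.g., pooled per-object BEV features
augmented_feats = bev_instance_feats + 0.1 * torch.randn(32, 256)  # positive views
print(info_nce(bev_instance_feats, augmented_feats))
```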

[251] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Jun Zhang, Desen Meng, Zhengming Zhang, Zhenpeng Huang, Tao Wu, Limin Wang

Main category: cs.CV

TL;DR: p-MoD is an efficient MLLM architecture using Mixture-of-Depths (MoD) to reduce training/inference costs while maintaining performance, with novel designs like TanhNorm and STRing, and a PRD strategy for token retention.

DetailsMotivation: High training and inference costs of MLLMs hinder progress; p-MoD aims to reduce these costs without sacrificing performance.

Method: Uses MoD to select essential vision tokens, introduces TanhNorm and STRing for stability, and PRD strategy for token retention.

Result: Matches/surpasses baseline performance with 55.6% TFLOPs, 53.7% KV cache storage, and 77.7% GPU hours.

Conclusion: p-MoD effectively balances efficiency and performance in MLLMs.

Abstract: Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. In this paper, we propose p-MoD, an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. The majority of computation in MLLMs stems from the overwhelming volume of vision tokens processed by the transformer-based LLM. Accordingly, we leverage the Mixture-of-Depths (MoD) mechanism, where each LLM layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. Extensive experiments on two baseline models across 15 benchmarks show that our model matches or even surpasses the performance of corresponding baselines, while requiring only 55.6% TFLOPs and 53.7% KV cache storage during inference, and 77.7% GPU hours during training.
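
A minimal sketch of the PRD schedule and per-layer token selection: the retention ratio follows a shifted cosine decay over depth, and each layer processes only the top-scoring vision tokens while the rest skip it through an identity path. The schedule endpoints and the random router stand-in are assumptions.

```python
# Sketch: progressive ratio decay (PRD) with a shifted cosine schedule, plus
# per-layer top-k vision-token selection in the MoD style. Endpoints and the
# router stand-in are illustrative assumptions.
import math
import torch
import torch.nn as nn

def retention_ratio(layer: int, num_layers: int, r_min: float = 0.3) -> float:
    # Cosine decay from 1.0 at the first layer toward r_min at the last
    t = layer / max(num_layers - 1, 1)
    return r_min + (1.0 - r_min) * 0.5 * (1.0 + math.cos(math.pi * t))

B, N, D, L = 2, 576, 64, 8
x = torch.randn(B, N, D)                       # vision tokens entering the LLM
block = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)

for layer in range(L):
    k = max(1, int(N * retention_ratio(layer, L)))
    scores = torch.randn(B, N)                 # stand-in for the MoD router
    idx = scores.topk(k, dim=1).indices        # tokens selected for this layer
    sel = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    out = block(sel)                           # only selected tokens pay compute
    x = x.scatter(1, idx.unsqueeze(-1).expand(-1, -1, D), out)  # rest skip the layer
print(x.shape)                                 # (2, 576, 64): stream length unchanged
```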

[252] Occupancy Learning with Spatiotemporal Memory

Ziyang Leng, Jiawei Yang, Wenlong Yi, Bolei Zhou

Main category: cs.CV

TL;DR: ST-Occ is a framework for learning spatiotemporal features in 3D occupancy prediction, improving accuracy and temporal consistency.

DetailsMotivation: Efficiently aggregating 3D occupancy over time is challenging due to high processing costs and voxel uncertainty.

Method: Uses a spatiotemporal memory and memory attention to capture historical information and model uncertainty.

Result: Outperforms state-of-the-art by 3 mIoU and reduces temporal inconsistency by 29%.

Conclusion: ST-Occ effectively enhances spatiotemporal representation for 3D occupancy prediction.

Abstract: 3D occupancy becomes a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation, and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%.
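
A minimal sketch of memory attention: current occupancy features query a scene-level memory via cross-attention, and the memory is then refreshed with the fused result. Shapes and the EMA-style update are illustrative assumptions.

```python
# Sketch: conditioning current occupancy features on a scene-level memory via
# cross-attention. Shapes and the EMA-style memory update are illustrative.
import torch
import torch.nn as nn

D = 128
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

current = torch.randn(1, 4096, D)    # flattened voxel/BEV features, current frame
memory = torch.randn(1, 4096, D)     # scene-level memory aggregated from past frames

# Memory attention: query the spatiotemporal memory with current features
fused, _ = attn(query=current, key=memory, value=memory)
occ_feats = current + fused          # residual conditioning on history

# Simple stand-in for the update rule: blend new evidence into the memory
alpha = 0.2
memory = (1 - alpha) * memory + alpha * occ_feats.detach()
print(occ_feats.shape, memory.shape)
```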

[253] Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

Shunqi Mao, Chaoyi Zhang, Weidong Cai

Main category: cs.CV

TL;DR: The paper introduces Perception Magnifier (PM), a visual decoding method to reduce visual hallucination in vision-language models by iteratively focusing on fine-grained visual details.

DetailsMotivation: Existing VLMs suffer from visual hallucination, where responses aren't grounded in visual input, and current methods fail to capture fine-grained details.

Method: PM isolates relevant visual tokens using attention and magnifies regions iteratively, enhancing focus on visual details during decoding.

Result: PM outperforms in hallucination mitigation, improves language generation, and maintains reasoning capabilities.

Conclusion: PM effectively reduces hallucination and enhances response accuracy in VLMs by emphasizing fine-grained visual details.

Abstract: Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by contrastively reducing language biases or amplifying the weights of visual embedding during decoding. However, these approaches remain limited in their ability to capture fine-grained visual details. In this work, we propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding. By magnifying critical regions while preserving the structural and contextual information at each decoding step, PM allows the VLM to enhance its scrutiny of the visual input, hence producing more accurate and faithful responses. Extensive experimental results demonstrate that PM not only achieves superior hallucination mitigation but also enhances language generation while preserving strong reasoning capabilities.
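
A minimal sketch of one magnification step: take the attention mass on each vision token, find the bounding box of the most-attended patches, and crop-and-resize that region for the next decoding step. The patch grid, keep ratio, and box rule are illustrative simplifications.

```python
# Sketch: magnify the image region covered by the most-attended visual tokens.
# Assumes a ViT-style patch grid; keep ratio and box rule are illustrative.
import torch
import torch.nn.functional as F

def magnify(image: torch.Tensor, attn: torch.Tensor, grid: int = 24, keep: float = 0.1):
    # attn: (grid*grid,) attention mass per vision token
    k = max(1, int(attn.numel() * keep))
    idx = attn.topk(k).indices
    rows, cols = idx // grid, idx % grid
    H, W = image.shape[-2:]
    ph, pw = H // grid, W // grid
    y0, y1 = int(rows.min()) * ph, (int(rows.max()) + 1) * ph   # box of top tokens
    x0, x1 = int(cols.min()) * pw, (int(cols.max()) + 1) * pw
    crop = image[..., y0:y1, x0:x1]
    return F.interpolate(crop, size=(H, W), mode="bilinear", align_corners=False)

image = torch.randn(1, 3, 336, 336)
attn = torch.rand(24 * 24)                 # stand-in for decoder-to-vision attention
magnified = magnify(image, attn)           # re-encoded at the next decoding step
print(magnified.shape)
```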

[254] NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang

Main category: cs.CV

TL;DR: The paper introduces NuPlanQA-Eval, a benchmark for driving scene understanding, and NuPlanQA-1M, a dataset of 1M VQA pairs. It also proposes BEV-LLM, a model integrating BEV features into MLLMs, which outperforms others in driving scene tasks.

DetailsMotivation: Existing MLLMs struggle with multi-view driving scene comprehension due to the complexity of such scenarios.

Method: Developed NuPlanQA-Eval benchmark and NuPlanQA-1M dataset, and proposed BEV-LLM integrating BEV features into MLLMs.

Result: BEV-LLM outperforms other models in 6 of 9 subtasks, highlighting its adaptability to driving scenes.

Conclusion: BEV integration enhances MLLMs for driving scenes, but further refinement is needed. The dataset and benchmark are publicly released.

Abstract: Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird’s-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at https://github.com/sungyeonparkk/NuPlanQA.

[255] Beyond Wide-Angle Images: Structure-to-Detail Video Portrait Correction via Unsupervised Spatiotemporal Adaptation

Wenbo Nie, Lang Nie, Chunyu Lin, Jingwen Chen, Ke Xing, Jiyuan Wang, Kang Liao

Main category: cs.CV

TL;DR: The paper proposes ImagePC and VideoPC models to correct distortion-induced facial stretching in wide-angle images and videos, leveraging transformers and diffusion models for robust results.

DetailsMotivation: Wide-angle cameras cause facial distortion, degrading visual appeal, especially at the lens edges. The goal is to correct this distortion effectively.

Method: ImagePC combines transformer long-range awareness and diffusion model denoising for global and local correction. VideoPC adapts ImagePC for unlabeled videos using spatiotemporal diffusion with spatial and temporal constraints.

Result: The methods outperform existing solutions, achieving high-fidelity corrections in both images and videos, with stable and natural portraits.

Conclusion: The proposed models effectively address wide-angle distortion, with VideoPC extending the solution to videos, supported by a diverse dataset and promising experimental results.

Abstract: Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching, especially at the edge of the lens, which degrades visual appeal. To address this issue, we propose a structure-to-detail portrait correction model named ImagePC. It integrates the long-range awareness of the transformer and the multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePC for unlabeled wide-angle videos (termed VideoPC) by spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePC, VideoPC maintains high-quality facial corrections in space and sequentially mitigates potential temporal shakes in blind scenarios. Finally, to establish an evaluation benchmark and train the framework, we build a video portrait dataset with large diversity in the number of people, lighting conditions, and backgrounds. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The codes and dataset will be available.

[256] RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework

Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li

Main category: cs.CV

TL;DR: The paper proposes a multi-modal RGB-Event pedestrian attribute recognition task, introducing a large-scale dataset (EventPAR) and a novel RWKV-based framework to address limitations of RGB cameras and explore emotional dimensions.

DetailsMotivation: RGB cameras have limitations like sensitivity to lighting and motion blur, and current attribute recognition lacks emotional analysis. The paper aims to leverage event cameras' advantages (low-light, high-speed, low-power) for better performance.

Method: A large-scale dataset (EventPAR) with 100K RGB-Event samples covering 50 attributes (appearance and emotions) is introduced. A novel RWKV-based framework with a visual encoder and asymmetric fusion module is proposed.

Result: State-of-the-art results are achieved on EventPAR and two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute).

Conclusion: The work provides a benchmark for future research with its dataset and RWKV framework, addressing RGB camera limitations and expanding attribute recognition to include emotions.

Abstract: Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians’ external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released on https://github.com/Event-AHU/OpenPAR

[257] Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding

Ke Zou, Yang Bai, Bo Liu, Yidi Chen, Zhihao Chen, Yang Zhou, Xuedong Yuan, Meng Wang, Xiaojing Shen, Xiaochun Cao, Yih Chung Tham, Huazhu Fu

Main category: cs.CV

TL;DR: The paper introduces Medical Report Grounding (MRG) and proposes uMedGround, a framework using a multimodal large language model for end-to-end diagnostic phrase and grounding box identification, with uncertainty-aware predictions for improved reliability.

DetailsMotivation: Current medical phrase grounding methods rely on manual key phrase extraction, reducing efficiency and lacking model confidence estimation, which limits clinical trust.

Method: uMedGround leverages a multimodal large language model with an embedded token for phrase detection and a vision encoder-decoder for grounding box generation, incorporating uncertainty-aware predictions.

Result: uMedGround outperforms state-of-the-art methods and fine-tuned large visual-language models, proving effective and reliable.

Conclusion: The study pioneers the MRG task, demonstrating uMedGround’s applicability in medical visual question answering and class-based localization, aiding clinicians in interpreting diverse textual inputs.

Abstract: Medical phrase grounding is crucial for identifying relevant regions in medical images based on phrase queries, facilitating accurate image analysis and diagnosis. However, current methods rely on manual extraction of key phrases from medical reports, reducing efficiency and increasing the workload for clinicians. Additionally, the lack of model confidence estimation limits clinical trust and usability. In this paper, we introduce a novel task called Medical Report Grounding (MRG), which aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. To address this challenge, we propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases by embedding a unique token into the vocabulary to enhance detection capabilities. A vision encoder-decoder processes the embedded token and input image to generate grounding boxes. Critically, uMedGround incorporates an uncertainty-aware prediction model, significantly improving the robustness and reliability of grounding predictions. Experimental results demonstrate that uMedGround outperforms state-of-the-art medical phrase grounding methods and fine-tuned large visual-language models, validating its effectiveness and reliability. This study represents a pioneering exploration of the MRG task, marking the first-ever endeavor in this domain. Additionally, we demonstrate the applicability of uMedGround in medical visual question answering and class-based localization tasks, where it highlights visual evidence aligned with key diagnostic phrases, supporting clinicians in interpreting various types of textual inputs, including free-text reports, visual question answering queries, and class labels.

[258] ProbRadarM3F: mmWave Radar based Human Skeletal Pose Estimation with Probability Map Guided Multi-Format Feature Fusion

Bing Zhu, Zixin He, Weiyi Xiong, Guanhua Ding, Tao Huang, Wei Xiang

Main category: cs.CV

TL;DR: The paper introduces ProbRadarM3F, a model for improving mmWave radar-based human pose estimation by fusing traditional heatmap features with positional features, achieving 69.9% AP on the HuPR dataset.

DetailsMotivation: mmWave radar is a privacy-friendly alternative to RGB cameras for pose estimation, but its signal utilization is limited, hindering accuracy.

Method: ProbRadarM3F combines FFT-based features with probability map positional encoding to fuse heatmap and positional features for 14 keypoint estimation.

Result: The model outperforms others on the HuPR dataset with an AP of 69.9%, demonstrating improved accuracy.

Conclusion: The study highlights untapped positional information in radar signals, suggesting further exploration of non-redundant mmWave radar data.

Abstract: Millimeter wave (mmWave) radar is a privacy-preserving, relatively convenient, and inexpensive device that has been demonstrated to be applicable in place of RGB cameras in human indoor pose estimation tasks. However, mmWave radar relies on collecting signals reflected from the target, and the information contained in these signals is difficult to exploit fully. This has been a long-standing hindrance to improving pose estimation accuracy. To address this major challenge, this paper introduces a probability map guided multi-format feature fusion model, ProbRadarM3F. This is a novel radar feature extraction framework that uses a traditional FFT method in parallel with a probability map based positional encoding method. ProbRadarM3F fuses the traditional heatmap features and the positional features, then effectively estimates 14 keypoints of the human body. Experimental evaluation on the HuPR dataset demonstrates the effectiveness of the proposed model, which outperforms other methods evaluated on this dataset with an AP of 69.9%. The emphasis of our study is the positional information in the radar signal that has not been exploited before. This provides direction for investigating other potentially non-redundant information in mmWave radar data.

[259] OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

Youjun Zhao, Jiaying Lin, Shuquan Ye, Qianshi Pang, Rynson W. H. Lau

Main category: cs.CV

TL;DR: The paper introduces GOV-3D, a task for generalized open-vocabulary 3D scene understanding, and the OpenScan benchmark to evaluate it, revealing limitations in current methods.

DetailsMotivation: Existing open-vocabulary 3D scene understanding (OV-3D) focuses narrowly on object classes, lacking holistic evaluation. GOV-3D extends this to diverse linguistic queries for broader understanding.

Method: Proposes GOV-3D and the OpenScan benchmark, evaluating state-of-the-art OV-3D methods on fine-grained attributes like affordance and material.

Result: Current OV-3D methods struggle with abstract vocabularies in GOV-3D, not solvable by scaling object classes.

Conclusion: Highlights limitations of existing methods and suggests directions for improvement in generalized 3D scene understanding.

Abstract: Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed set of object classes. However, existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes, which is insufficient in providing a holistic evaluation to what extent a model understands the 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which consists of 3D object attributes across eight representative linguistic aspects, including affordance, property, and material. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed simply by scaling up object classes during training. We highlight the limitations of existing methodologies and explore promising directions to overcome the identified shortcomings.

[260] ESA: Annotation-Efficient Active Learning for Semantic Segmentation

Jinchao Ge, Zeyu Zhang, Minh Hieu Phan, Bowen Zhang, Akide Liu, Yang Zhao, Shuwen Zhao

Main category: cs.CV

TL;DR: Proposes Entity-Superpixel Annotation (ESA) for active learning in semantic segmentation, reducing annotation effort by 98% and improving performance by 1.71%.

DetailsMotivation: Addresses inefficiencies in pixel-based annotation by leveraging natural image patterns and pre-trained models.

Method: Uses ESA with a mask proposal network and super-pixel grouping, focusing on high-entropy superpixels and key entities.

Result: Achieves 98% reduction in clicks (40 vs. 5000) and 1.71% performance boost.

Conclusion: ESA is efficient and effective, outperforming traditional pixel-based methods.

Abstract: Active learning enhances annotation efficiency by selecting the most revealing samples for labeling, thereby reducing reliance on extensive human input. Previous methods in semantic segmentation have centered on individual pixels or small areas, neglecting the rich patterns in natural images and the power of advanced pre-trained models. To address these challenges, we propose three key contributions: Firstly, we introduce Entity-Superpixel Annotation (ESA), an innovative and efficient active learning strategy which utilizes a class-agnostic mask proposal network coupled with super-pixel grouping to capture local structural cues. Additionally, our method selects a subset of entities within each image of the target domain, prioritizing superpixels with high entropy to ensure comprehensive representation. Simultaneously, it focuses on a limited number of key entities, thereby optimizing for efficiency. By utilizing an annotator-friendly design that capitalizes on the inherent structure of images, our approach significantly outperforms existing pixel-based methods, achieving superior results with minimal queries, specifically reducing click cost by 98% and enhancing performance by 1.71%. For instance, our technique requires a mere 40 clicks for annotation, a stark contrast to the 5000 clicks demanded by conventional methods.
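
A minimal sketch of the entropy-driven selection step: compute per-pixel predictive entropy, average it within each superpixel, and queue the highest-entropy superpixels for annotation. The segment labels here are random stand-ins for the mask-proposal and superpixel grouping.

```python
# Sketch: rank superpixels by mean predictive entropy and query the top few.
# Superpixel labels are random stand-ins for the actual grouping step.
import numpy as np

def entropy_map(probs: np.ndarray) -> np.ndarray:
    # probs: (C, H, W) softmax output of the current segmentation model
    return -(probs * np.log(probs + 1e-8)).sum(axis=0)

rng = np.random.default_rng(0)
C, H, W = 19, 128, 128
logits = rng.normal(size=(C, H, W))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

superpixels = rng.integers(0, 200, size=(H, W))   # stand-in segment labels
ent = entropy_map(probs)

# Mean entropy per superpixel, then select the top-k to send to the annotator
ids = np.unique(superpixels)
mean_ent = np.array([ent[superpixels == i].mean() for i in ids])
to_annotate = ids[np.argsort(-mean_ent)[:5]]
print("query these superpixels:", to_annotate)
```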

[261] CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP

Zhenchen Tang, Zichuan Wang, Bo Peng, Jing Dong

Main category: cs.CV

TL;DR: The paper proposes CLIP-AGIQA, a CLIP-based regression model for assessing the quality of AI-generated images, outperforming existing methods.

DetailsMotivation: The quality of AI-generated images varies, and current assessment models are inadequate for diverse categories, necessitating advanced solutions.

Method: Leverages CLIP’s visual and textual knowledge, using multi-category learnable prompts for quality assessment.

Result: CLIP-AGIQA outperforms existing IQA models on benchmarks like AGIQA-3K and AIGCIQA2023.

Conclusion: CLIP-AGIQA is an effective solution for evaluating the quality of AI-generated images, demonstrating superior performance.

Abstract: With the rapid development of generative technologies, AI-Generated Images (AIGIs) have been widely applied in various aspects of daily life. However, due to the immaturity of the technology, the quality of the generated images varies, so it is important to develop quality assessment techniques for the generated images. Although some models have been proposed to assess the quality of generated images, they are inadequate when faced with the ever-increasing and diverse categories of generated images. Consequently, the development of more advanced and effective models for evaluating the quality of generated images is urgently needed. Recent research has explored the significant potential of the visual language model CLIP in image quality assessment, finding that it performs well in evaluating the quality of natural images. However, its application to generated images has not been thoroughly investigated. In this paper, we build on this idea and further explore the potential of CLIP in evaluating the quality of generated images. We design CLIP-AGIQA, a CLIP-based regression model for quality assessment of generated images, leveraging rich visual and textual knowledge encapsulated in CLIP. Particularly, we implement multi-category learnable prompts to fully utilize the textual knowledge in CLIP for quality assessment. Extensive experiments on several generated image quality assessment benchmarks, including AGIQA-3K and AIGCIQA2023, demonstrate that CLIP-AGIQA outperforms existing IQA models, achieving excellent results in evaluating the quality of generated images.
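
A simplified stand-in for the idea, using fixed quality prompts instead of CLIP-AGIQA's learnable multi-category prompts: score an image by its CLIP similarity to a ladder of quality descriptions and take the expectation. The prompt set and scoring rule are assumptions; the Hugging Face CLIP API is used for concreteness.

```python
# Sketch: CLIP-based quality scoring with fixed quality prompts, a simplified
# stand-in for learnable multi-category prompts. Prompts and scoring rule are
# illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a bad photo", "a poor photo", "a fair photo",
           "a good photo", "a perfect photo"]
levels = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])   # quality level per prompt

image = Image.new("RGB", (224, 224))        # replace with a generated image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Softmax over prompt similarities, then expectation over the quality levels
probs = out.logits_per_image.softmax(dim=-1)         # (1, 5)
quality = (probs * levels).sum().item()
print(f"predicted quality: {quality:.3f}")
```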

[262] Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution

Hongyu An, Xinfeng Zhang, Shijie Zhao, Li Zhang, Ruiqin Xiong

Main category: cs.CV

TL;DR: The paper proposes STDAN, a Spatio-Temporal Distortion Aware Network, to enhance the resolution and quality of omnidirectional videos (ODVs) by addressing spatial and temporal distortions.

DetailsMotivation: The demand for high-quality ODVs is increasing, but existing video super-resolution methods struggle with ODVs due to their unique distortions.

Method: STDAN uses spatio-temporal continuous alignment (STCA) and interlaced multi-frame reconstruction (IMFR) to mitigate distortions and improve consistency, along with latitude-saliency adaptive (LSA) weights for focus on critical regions.

Result: STDAN outperforms state-of-the-art methods in visual fidelity and dynamic smoothness on a novel ODV-SR dataset.

Conclusion: STDAN effectively enhances ODV quality by addressing spatio-temporal distortions and ensures practical computational efficiency.

Abstract: Omnidirectional videos (ODVs) provide an immersive visual experience by capturing the 360° scene. With the rapid advancements in virtual/augmented reality, metaverse, and generative artificial intelligence, the demand for high-quality ODVs is surging. However, ODVs often suffer from low resolution due to their wide field of view and limitations in capturing devices and transmission bandwidth. Although video super-resolution (SR) is a capable video quality enhancement technique, the performance ceiling and practical generalization of existing methods are limited when applied to ODVs due to their unique attributes. To alleviate spatial projection distortions and temporal flickering of ODVs, we propose a Spatio-Temporal Distortion Aware Network (STDAN) with joint spatio-temporal alignment and reconstruction. Specifically, we incorporate a spatio-temporal continuous alignment (STCA) to mitigate discrete geometric artifacts in parallel with temporal alignment. Subsequently, we introduce an interlaced multi-frame reconstruction (IMFR) to enhance temporal consistency. Furthermore, we employ latitude-saliency adaptive (LSA) weights to focus on regions with higher texture complexity and human-watching interest. By exploring a spatio-temporal jointly framework and real-world viewing strategies, STDAN effectively reinforces spatio-temporal coherence on a novel ODV-SR dataset and ensures affordable computational costs. Extensive experimental results demonstrate that STDAN outperforms state-of-the-art methods in improving visual fidelity and dynamic smoothness of ODVs.
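
A minimal sketch of latitude-adaptive weighting for equirectangular frames: rows near the poles are oversampled by the projection, so the loss is weighted by the cosine of latitude (as in WS-PSNR). STDAN's LSA weights additionally account for texture complexity and viewing saliency, which this sketch omits.

```python
# Sketch: cosine-of-latitude weights for equirectangular (ERP) frames.
# The saliency/texture terms of the actual LSA weights are omitted.
import math
import torch

def latitude_weights(H: int, W: int) -> torch.Tensor:
    # Row i sits at latitude ((i + 0.5)/H - 0.5) * pi; weight by cos(latitude)
    lat = (torch.arange(H).float() + 0.5) / H - 0.5
    w = torch.cos(lat * math.pi)
    return w[:, None].expand(H, W)

def weighted_l1(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    w = latitude_weights(*pred.shape[-2:]).to(pred)
    return (w * (pred - target).abs()).sum() / w.sum()   # pole rows count less

pred, target = torch.rand(1, 3, 64, 128), torch.rand(1, 3, 64, 128)
print(weighted_l1(pred, target))
```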

[263] CapsoNet: A CNN-Transformer Ensemble for Multi-Class Abnormality Detection in Video Capsule Endoscopy

Arnav Samal, Ranya Batsyas

Main category: cs.CV

TL;DR: CapsoNet is a deep learning framework for multi-class abnormality classification in VCE frames, combining CNNs and transformers, achieving 86.34% balanced accuracy and 0.9908 AUC-ROC.

DetailsMotivation: To address the challenge of multi-class abnormality classification in video capsule endoscopy frames, leveraging both local and global visual features.

Method: Uses an ensemble of CNNs and transformers, trained with focal loss, weighted random sampling, and data augmentation on 50,000+ annotated frames.

Result: Achieved 86.34% balanced accuracy and 0.9908 AUC-ROC, securing 5th place in the Capsule Vision 2024 Challenge.

Conclusion: CapsoNet demonstrates strong performance in VCE abnormality classification, with potential for clinical application.

Abstract: We present CapsoNet, a deep learning framework developed for the Capsule Vision 2024 Challenge, designed to perform multi-class abnormality classification in video capsule endoscopy (VCE) frames. CapsoNet leverages an ensemble of convolutional neural networks (CNNs) and transformer-based architectures to capture both local and global visual features. The model was trained and evaluated on a dataset of over 50,000 annotated frames spanning ten abnormality classes, sourced from three public and one private dataset. To address the challenge of class imbalance, we employed focal loss, weighted random sampling, and extensive data augmentation strategies. All models were fully fine-tuned to maximize performance within the ensemble. CapsoNet achieved a balanced accuracy of 86.34 percent and a mean AUC-ROC of 0.9908 on the official validation set, securing 5th place for Team Seq2Cure in the competition. Our implementation is available at http://github.com/arnavs04/capsule-vision-2024
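
The two imbalance-handling tools named here are standard and easy to sketch in PyTorch: focal loss to down-weight easy examples, and a weighted random sampler that draws rare classes more often. Gamma and the inverse-frequency weights are illustrative choices.

```python
# Sketch: the two class-imbalance tools named in the paper, in PyTorch.
# Gamma and the inverse-frequency sampling weights are illustrative.
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                       # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()    # down-weight easy examples

# Weighted random sampling: rarer classes get proportionally higher draw odds
labels = torch.randint(0, 10, (50_000,))
class_counts = torch.bincount(labels, minlength=10).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))

logits, targets = torch.randn(16, 10), torch.randint(0, 10, (16,))
print(focal_loss(logits, targets))
```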

[264] Sign Spotting Disambiguation using Large Language Models

JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: A training-free framework using LLMs improves sign spotting by integrating global features and dynamic time warping for flexible vocabulary matching, with LLMs handling disambiguation.

DetailsMotivation: Addressing data scarcity and vocabulary inflexibility in sign language translation by automating sign spotting.

Method: Extracts spatio-temporal and hand shape features, matches them to a sign dictionary using dynamic time warping and cosine similarity, and uses LLMs for context-aware disambiguation.

Result: Superior accuracy and sentence fluency compared to traditional methods on synthetic and real-world datasets.

Conclusion: LLMs enhance sign spotting without training, offering flexibility and improved performance.

Abstract: Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method’s superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
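
A minimal sketch of the dictionary-matching core: compute pairwise cosine distances between per-frame features of a query clip and each dictionary entry, then score the alignment with dynamic time warping. Feature extraction is replaced with random stand-ins, and the gloss names are hypothetical.

```python
# Sketch: match a query sign clip to dictionary entries with DTW over cosine
# distances. Features are random stand-ins; gloss names are hypothetical.
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T                       # (Ta, Tb) pairwise distances

def dtw(cost: np.ndarray) -> float:
    Ta, Tb = cost.shape
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[Ta, Tb] / (Ta + Tb)             # path-length normalized score

rng = np.random.default_rng(0)
query = rng.normal(size=(40, 256))             # per-frame features of the clip
dictionary = {g: rng.normal(size=(int(rng.integers(20, 60)), 256))
              for g in ["HELLO", "THANKS"]}

scores = {g: dtw(cosine_dist(query, feats)) for g, feats in dictionary.items()}
best = min(scores, key=scores.get)             # candidates then go to the LLM
print(best, scores)
```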

[265] DiffGAN: A Test Generation Approach for Differential Testing of Deep Neural Networks for Image Analysis

Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, Ramesh S

Main category: cs.CV

TL;DR: DiffGAN is a black-box approach using GANs and genetic algorithms to generate diverse test inputs for differential testing of DNNs, outperforming baselines in revealing behavioral discrepancies.

DetailsMotivation: Traditional accuracy-based evaluations fail to capture behavioral differences between DNN models, especially with limited test data, necessitating a more effective method like differential testing.

Method: DiffGAN combines a GAN and the Non-dominated Sorting Genetic Algorithm II with custom fitness functions (diversity and divergence) to generate triggering inputs for differential testing.

Result: DiffGAN outperforms state-of-the-art baselines, generating four times more triggering inputs with greater diversity and validity, and improves model selection accuracy.

Conclusion: DiffGAN provides an effective black-box solution for differential testing, enhancing model reliability and selection in practical applications.

Abstract: Deep Neural Networks (DNNs) are increasingly deployed across applications. However, ensuring their reliability remains a challenge, and in many situations, alternative models with similar functionality and accuracy are available. Traditional accuracy-based evaluations often fail to capture behavioral differences between models, especially with limited test datasets, making it difficult to select or combine models effectively. Differential testing addresses this by generating test inputs that expose discrepancies in DNN model behavior. However, existing approaches face significant limitations: many rely on model internals or are constrained by available seed inputs. To address these challenges, we propose DiffGAN, a black-box test image generation approach for differential testing of DNN models. DiffGAN leverages a Generative Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II to generate diverse and valid triggering inputs that reveal behavioral discrepancies between models. DiffGAN employs two custom fitness functions, focusing on diversity and divergence, to guide the exploration of the GAN input space and identify discrepancies between models’ outputs. By strategically searching this space, DiffGAN generates inputs with specific features that trigger differences in model behavior. DiffGAN is black-box, making it applicable in more situations. We evaluate DiffGAN on eight DNN model pairs trained on widely used image datasets. Our results show DiffGAN significantly outperforms a SOTA baseline, generating four times more triggering inputs, with greater diversity and validity, within the same budget. Additionally, the generated inputs improve the accuracy of a machine learning-based model selection mechanism, which selects the best-performing model based on input characteristics and can serve as a smart output voting mechanism when using alternative models.
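
A minimal sketch of the two fitness signals in their generic form: divergence as a symmetric KL between the two models' predictive distributions, and diversity as the mean pairwise distance within the latent population that NSGA-II evolves. The exact formulations in the paper may differ.

```python
# Sketch: generic divergence and diversity fitness signals over a population
# of GAN latents. The paper's exact formulations may differ.
import torch
import torch.nn.functional as F

def divergence_fitness(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    # Symmetric KL between the two models' predictive distributions
    pa, pb = logits_a.log_softmax(-1), logits_b.log_softmax(-1)
    kl_ab = F.kl_div(pb, pa.exp(), reduction="none").sum(-1)   # KL(A || B)
    kl_ba = F.kl_div(pa, pb.exp(), reduction="none").sum(-1)   # KL(B || A)
    return kl_ab + kl_ba                       # high = behavioral discrepancy

def diversity_fitness(latents: torch.Tensor) -> torch.Tensor:
    # Mean distance of each latent to the rest of the population
    d = torch.cdist(latents, latents)
    return d.sum(dim=1) / (latents.size(0) - 1)

latents = torch.randn(64, 128)                 # NSGA-II population in GAN space
la, lb = torch.randn(64, 10), torch.randn(64, 10)   # stand-in model outputs
print(divergence_fitness(la, lb).shape, diversity_fitness(latents).shape)
```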

[266] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang

Main category: cs.CV

TL;DR: A training framework and inference optimization technique for Vision Language Action (VLA) models reduces token generation and improves action utilization, achieving faster inference and higher success rates.

DetailsMotivation: Current VLA models generate excessive tokens, causing high latency and training costs, and underutilize actions, leading to performance loss.

Method: A training framework reduces action tokens with high parallelism, and a voting-based ensemble strategy optimizes inference by combining current and past predictions.

Result: The approach achieves higher success rates and 39× faster inference than OpenVLA, with 46 Hz throughput on edge platforms.

Conclusion: The proposed framework and technique enhance VLA model efficiency and performance, making them practical for deployment.

Abstract: Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of massive tokens leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions resulting in potential performance loss. To address these issues, we develop a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance compared with state-of-the-art VLA models, achieving significantly higher success rates and 39× faster inference than OpenVLA with 46 Hz throughput on edge platforms, demonstrating practical deployability. The code is available at https://github.com/LukeLIN-web/VOTE.
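
A minimal sketch of the voting-based ensemble: because each prediction is a short action chunk, several recent chunks cover the current timestep, and the executed action is their recency-weighted consensus. Chunk length, action dimension, and the decay weights are illustrative assumptions.

```python
# Sketch: voting over overlapping action-chunk predictions. Chunk length,
# action dimension, and the recency weights are illustrative assumptions.
import collections
import numpy as np

CHUNK, DIM = 8, 7                       # actions per prediction, action dims
buffer = collections.deque(maxlen=CHUNK)

def step(new_chunk: np.ndarray) -> np.ndarray:
    buffer.append(new_chunk)
    # A chunk appended k steps ago predicts the current timestep at index k
    votes = [chunk[age] for age, chunk in enumerate(reversed(buffer))]
    weights = np.array([0.9 ** age for age in range(len(votes))])   # favor recent
    return np.average(np.stack(votes), axis=0, weights=weights)

rng = np.random.default_rng(0)
for t in range(5):
    action = step(rng.normal(size=(CHUNK, DIM)))   # execute `action` on the robot
print(action.shape)
```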

[267] LAMA: Stable Dual-Domain Deep Reconstruction For Sparse-View CT

Chi Ding, Qingchao Zhang, Ge Wang, Xiaojing Ye, Yunmei Chen

Main category: cs.CV

TL;DR: LAMA combines data-driven and classical techniques for solving inverse problems in tomographic imaging, improving accuracy and efficiency.

DetailsMotivation: Addressing the challenges of inverse problems in tomographic imaging by integrating learnable regularizers and proven optimization methods.

Method: Uses a variational model with learnable regularizers, Nesterov’s smoothing, and residual learning for two-block optimization.

Result: LAMA reduces network complexity, enhances memory efficiency, and outperforms state-of-the-art methods in CT reconstruction.

Conclusion: LAMA is a robust and interpretable solution for inverse problems, validated by superior performance on benchmark datasets.

Abstract: Inverse problems arise in many applications, especially tomographic imaging. We develop a Learned Alternating Minimization Algorithm (LAMA) to solve such problems via two-block optimization by synergizing data-driven and classical techniques with proven convergence. LAMA is naturally induced by a variational model with learnable regularizers in both data and image domains, parameterized as composite functions of neural networks trained with domain-specific data. We allow these regularizers to be nonconvex and nonsmooth to extract features from data effectively. We minimize the overall objective function using Nesterov’s smoothing technique and residual learning architecture. It is demonstrated that LAMA reduces network complexity, improves memory efficiency, and enhances reconstruction accuracy, stability, and interpretability. Extensive experiments show that LAMA significantly outperforms state-of-the-art methods on popular benchmark datasets for Computed Tomography.
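
The two-block structure can be sketched as alternating updates on an image-domain variable and a data-domain variable, each with its own regularizer. Here simple l1 penalties and plain gradient steps stand in for the paper's learned regularizers and Nesterov-smoothed updates.

```python
import torch

torch.manual_seed(0)
A = torch.randn(64, 128)                 # stand-in sparse-view forward operator
y = torch.randn(64)                      # measured sinogram (stand-in)

r_img = lambda x: 0.1 * x.abs().sum()    # placeholder learned image regularizer
r_data = lambda z: 0.1 * z.abs().sum()   # placeholder learned data regularizer

x = torch.zeros(128, requires_grad=True) # image-domain block
z = y.clone().requires_grad_(True)       # data-domain block
opt_x = torch.optim.SGD([x], lr=1e-2)
opt_z = torch.optim.SGD([z], lr=1e-2)

for _ in range(200):
    # Block 1: update the image with the data-domain estimate fixed.
    loss_x = 0.5 * (A @ x - z.detach()).pow(2).sum() + r_img(x)
    opt_x.zero_grad(); loss_x.backward(); opt_x.step()
    # Block 2: update the data-domain estimate with the image fixed.
    loss_z = (0.5 * (A @ x.detach() - z).pow(2).sum()
              + 0.5 * (z - y).pow(2).sum() + r_data(z))
    opt_z.zero_grad(); loss_z.backward(); opt_z.step()
```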

[268] Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval

Haiwen Li, Fei Su, Zhicheng Zhao

Main category: cs.CV

TL;DR: The paper introduces a lightweight post-hoc framework for Zero-Shot Composed Image Retrieval (ZS-CIR), addressing task and modality discrepancies in inversion-based methods. It uses a text-anchored triplet pipeline and MoTa-Adapter for efficient fine-tuning, achieving state-of-the-art results.

DetailsMotivation: Inversion-based ZS-CIR methods suffer from task and modality discrepancies, limiting their effectiveness. The paper aims to resolve these issues with a novel framework.

Method: Proposes a text-anchored triplet construction pipeline using an LLM and MoTa-Adapter, a parameter-efficient fine-tuning method with MoE layers and entropy-based optimization.

Result: The framework significantly improves inversion-based methods, achieving state-of-the-art performance on four benchmarks.

Conclusion: The proposed lightweight framework effectively addresses discrepancies in ZS-CIR, enhancing retrieval performance with minimal computational overhead.

Abstract: As a challenging vision-language task, Zero-Shot Composed Image Retrieval (ZS-CIR) is designed to retrieve target images using bi-modal (image+text) queries. Typical ZS-CIR methods employ an inversion network to generate pseudo-word tokens that effectively represent the input semantics. However, the inversion-based methods suffer from two inherent issues: First, the task discrepancy exists because inversion training and CIR inference involve different objectives. Second, the modality discrepancy arises from the input feature distribution mismatch between training and inference. To this end, we propose a lightweight post-hoc framework, consisting of two components: (1) A new text-anchored triplet construction pipeline leverages a large language model (LLM) to transform a standard image-text dataset into a triplet dataset, where a textual description serves as the target of each triplet. (2) The MoTa-Adapter, a novel parameter-efficient fine-tuning method, adapts the dual encoder to the CIR task using our constructed triplet data. Specifically, on the text side, multiple sets of learnable task prompts are integrated via a Mixture-of-Experts (MoE) layer to capture task-specific priors and handle different types of modifications. On the image side, MoTa-Adapter modulates the inversion network’s input to better match the downstream text encoder. In addition, an entropy-based optimization strategy is proposed to assign greater weight to challenging samples, thus ensuring efficient adaptation. Experiments show that, with the incorporation of our proposed components, inversion-based methods achieve significant improvements, reaching state-of-the-art performance across four widely-used benchmarks. All data and code will be made publicly available.
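
A sketch of the text-side idea of routing over learnable task prompts with an MoE layer follows; the dimensions, the routing rule, and all names are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptMoE(nn.Module):
    """Mixture-of-Experts over learnable task prompts (hypothetical sketch of
    the MoTa-Adapter's text-side MoE; shapes and routing are assumptions)."""
    def __init__(self, n_experts=4, n_prompts=8, dim=512):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_experts, n_prompts, dim) * 0.02)
        self.router = nn.Linear(dim, n_experts)

    def forward(self, text_feat):                           # text_feat: (B, dim)
        gate = self.router(text_feat).softmax(-1)           # (B, E)
        # Gate-weighted combination of each expert's prompt set.
        return torch.einsum("be,epd->bpd", gate, self.prompts)  # (B, P, dim)

moe = PromptMoE()
prompts = moe(torch.randn(2, 512))  # would be prepended to the text encoder input
```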

[269] Interpretable Estimation of CNN Deep Feature Density using Copula and the Generalized Characteristic Function

David Chapman, Parniyan Farvardin

Main category: cs.CV

TL;DR: A novel method combining copula analysis and the Method of Orthogonal Moments (MOM) is proposed to estimate the PDF of deep CNN features, revealing non-Gaussian distributions and long-tailed behavior.

DetailsMotivation: Understanding the PDF of deep CNN features provides insights into deep representations and supports downstream tasks like anomaly detection.

Method: Combines copula analysis with MOM to estimate the Generalized Characteristic Function (GCF) of multivariate deep feature PDFs.

Result: Deep features are non-Gaussian, better approximated by Exponential, Gamma, or Weibull distributions, and exhibit long-tailed behavior with depth.

Conclusion: Large-valued features in deep layers are not outliers but likely important detection signals, suggesting their semantic significance.

Abstract: We present a novel empirical approach toward estimating the Probability Density Function (PDF) of the deep features of Convolutional Neural Networks (CNNs). Estimating the PDF of deep CNN features is an important task, because it will yield new insight into deep representations. Moreover, characterizing the statistical behavior has implications for the feasibility of promising downstream tasks such as density based anomaly detection. Expressive, yet interpretable estimation of the deep feature PDF is challenging due to the Curse of Dimensionality (CoD) as well as our limited ability to comprehend high-dimensional inter-dependencies. Our novel estimation technique combines copula analysis with the Method of Orthogonal Moments (MOM), in order to directly estimate the Generalized Characteristic Function (GCF) of the multivariate deep feature PDF. We find that the one-dimensional marginals of non-negative deep CNN features after major blocks are not well approximated by a Gaussian distribution, and that the features of deep layers are much better approximated by the Exponential, Gamma, and/or Weibull distributions. Furthermore, we observe that deep features become increasingly long-tailed with network depth, although surprisingly the rate of this increase is much slower than theoretical estimates. Finally, we observe that many deep features exhibit strong dependence (either correlation or anti-correlation) with other extremely strong detections, even if these features are independent within typical ranges. We elaborate on these findings in our discussion, where we hypothesize that the long-tail of large valued features corresponds to the strongest computer vision detections of semantic targets, which would imply that these large-valued features are not outliers but rather an important detection signal.
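
The copula recipe itself is easy to sketch: fit candidate marginals per feature dimension, then map each dimension to uniform via its empirical CDF and estimate the dependence structure separately. The snippet below does this on stand-in Gamma-distributed "features"; the paper's orthogonal-moment estimate of the generalized characteristic function is not shown.

```python
import numpy as np
from scipy import stats

feats = np.random.gamma(shape=2.0, scale=1.0, size=(5000, 3))  # stand-in features

# 1) Compare candidate marginals per dimension (Gaussian vs. Gamma).
for j in range(feats.shape[1]):
    ll_norm = stats.norm.logpdf(feats[:, j], *stats.norm.fit(feats[:, j])).sum()
    a, loc, sc = stats.gamma.fit(feats[:, j], floc=0)
    ll_gamma = stats.gamma.logpdf(feats[:, j], a, loc, sc).sum()
    print(f"dim {j}: logL normal={ll_norm:.0f}, gamma={ll_gamma:.0f}")

# 2) Map each marginal to uniform with its empirical CDF, then estimate the
#    dependence (Gaussian-copula correlation of the probit-transformed data).
u = stats.rankdata(feats, axis=0) / (len(feats) + 1)
corr = np.corrcoef(stats.norm.ppf(u), rowvar=False)
print("copula correlation:\n", corr.round(2))
```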

[270] True Multimodal In-Context Learning Needs Attention to the Visual Context

Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

Main category: cs.CV

TL;DR: The paper addresses the limitations of Multimodal Large Language Models (MLLMs) in effectively leveraging visual information during Multimodal In-Context Learning (MICL). It introduces Dynamic Attention Reallocation (DARA) and the TrueMICL dataset to enhance and evaluate true multimodal adaptation.

DetailsMotivation: Current MLLMs struggle with genuine multimodal adaptation, often neglecting visual cues and relying too heavily on textual patterns, limiting their practical utility.

Method: The authors propose DARA, a fine-tuning strategy to rebalance attention between visual and textual tokens, and introduce the TrueMICL dataset for dedicated MICL evaluation.

Result: Experiments show significant improvements in true multimodal in-context learning capabilities.

Conclusion: The proposed solutions effectively enhance MICL performance and provide reliable evaluation metrics, addressing the underexplored challenges in multimodal adaptation.

Abstract: Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information-particularly visual content-for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm .
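
One way such attention reallocation can be realized is a learnable scale added to the attention logits of visual tokens, as in the following sketch (the paper's actual DARA parameterization may differ).

```python
import torch
import torch.nn as nn

class ReallocatedAttention(nn.Module):
    """Self-attention with a learned boost on visual-token logits: a minimal
    sketch of attention rebalancing, not the paper's exact DARA module."""
    def __init__(self, dim=256):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.visual_scale = nn.Parameter(torch.zeros(1))  # learned rebalancing

    def forward(self, x, visual_mask):        # x: (B,T,D); visual_mask: (B,T) bool
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B,T,T)
        # Boost (or damp) attention paid *to* visual tokens.
        logits = logits + self.visual_scale * visual_mask[:, None, :].float()
        return logits.softmax(-1) @ v

attn = ReallocatedAttention()
x = torch.randn(2, 10, 256)
vis = torch.zeros(2, 10, dtype=torch.bool)
vis[:, :4] = True                 # first 4 tokens are image tokens (assumed)
out = attn(x, vis)
```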

[271] COBRA: A Continual Learning Approach to Vision-Brain Understanding

Xuan-Bac Nguyen, Manuel Serna-Aguilera, Arabinda Kumar Choudhary, Pawan Sinha, Xin Li, Khoa Luu

Main category: cs.CV

TL;DR: The paper introduces COBRA, a framework for continual learning in Vision-Brain Understanding (VBU), addressing catastrophic forgetting with three novel modules: SC, PSS, and MRIFormer.

DetailsMotivation: Existing VBU models suffer from catastrophic forgetting when adapting to new subjects, necessitating a solution for continual learning.

Method: COBRA includes a Subject Commonality (SC) module for shared patterns, a Prompt-based Subject Specific (PSS) module for unique patterns, and a transformer-based MRIFormer module for fMRI feature learning.

Result: COBRA outperforms previous methods in continual learning and vision-brain reconstruction, effectively mitigating catastrophic forgetting.

Conclusion: COBRA successfully addresses catastrophic forgetting in VBU and achieves state-of-the-art performance, demonstrating its effectiveness for continual learning.

Abstract: Vision-Brain Understanding (VBU) aims to extract visual information perceived by humans from brain activity recorded through functional Magnetic Resonance Imaging (fMRI). Despite notable advancements in recent years, existing studies in VBU continue to face the challenge of catastrophic forgetting, where models lose knowledge from prior subjects as they adapt to new ones. Addressing continual learning in this field is, therefore, essential. This paper introduces a novel framework called Continual Learning for Vision-Brain (COBRA) to address continual learning in VBU. Our approach includes three novel modules: a Subject Commonality (SC) module, a Prompt-based Subject Specific (PSS) module, and a transformer-based module for fMRI, denoted as the MRIFormer module. The SC module captures shared vision-brain patterns across subjects, preserving this knowledge as the model encounters new subjects, thereby reducing the impact of catastrophic forgetting. On the other hand, the PSS module learns unique vision-brain patterns specific to each subject. Finally, the MRIFormer module contains a transformer encoder and decoder that learn the fMRI features for VBU from common and specific patterns. In a continual learning setup, COBRA trains new PSS and MRIFormer modules for each new subject, leaving the modules of previous subjects unaffected. As a result, COBRA effectively addresses catastrophic forgetting and achieves state-of-the-art performance in both continual learning and vision-brain reconstruction tasks, surpassing previous methods.

[272] ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions

Donglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu, Wenxuan Wang, Qin Jin

Main category: cs.CV

TL;DR: The paper introduces a multimodal approach for chart editing, combining natural language and visual indicators, and presents ChartM3, a benchmark for evaluating and improving multimodal chart editing models.

DetailsMotivation: Existing chart editing methods rely on ambiguous natural language instructions, limiting fine-grained editing. The work aims to improve this by incorporating visual indicators for clearer intent expression.

Method: The authors propose a multimodal paradigm for chart editing, supported by ChartM3, a benchmark with 1,000 samples of varying difficulty. They also create ChartM3-Train, a 24,000-sample dataset for fine-tuning MLLMs.

Result: Current MLLMs, including GPT-4o, struggle with interpreting visual indicators. Fine-tuning on ChartM3-Train significantly improves performance, highlighting the need for multimodal supervision.

Conclusion: Multimodal supervision is crucial for practical chart editing systems. The introduced datasets and tools provide a foundation for advancing multimodal chart editing capabilities.

Abstract: Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present Chart$\text{M}^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. Chart$\text{M}^3$ contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, Chart$\text{M}^3$ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct Chart$\text{M}^3$-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.

[273] V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z. Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, Xin Xia, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

Main category: cs.CV

TL;DR: The paper introduces V2XPnP, a novel intermediate fusion framework for V2X scenarios, focusing on spatio-temporal fusion and outperforming existing methods in perception and prediction tasks.

DetailsMotivation: To address the limitations of single-frame cooperative perception in V2X by incorporating temporal cues and tasks, and to provide comprehensive benchmarks for fusion strategies.

Method: Proposes one-step and multi-step communication strategies, integrates them with early, late, and intermediate fusion strategies, and introduces V2XPnP, a Transformer-based framework for spatio-temporal fusion.

Result: V2XPnP outperforms state-of-the-art methods in perception and prediction tasks, validated by extensive experiments.

Conclusion: The paper successfully advances V2X technologies by addressing spatio-temporal fusion and providing a robust framework and dataset for future research.

Abstract: Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents’ information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on the spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatio-temporal relationships across multiple agents, frames, and high-definition maps. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X collaboration modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods in both perception and prediction tasks.

[274] Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting

Guangben Lu, Yuzhen Du, Zhimin Sun, Ran Yi, Yifan Qi, Yizhe Tang, Tianyi Wang, Lizhuang Ma, Fangyuan Zou

Main category: cs.CV

TL;DR: Pinco is a plug-and-play adapter for foreground-conditioned inpainting, addressing issues like subject distortion and text misalignment through a Self-Consistent Adapter, Decoupled Image Feature Extraction, and Shared Positional Embedding Anchor.

DetailsMotivation: Existing T2I-based inpainting methods struggle with subject shape expansion, distortion, and text misalignment, leading to inconsistencies between visual elements and text descriptions.

Method: Pinco integrates a Self-Consistent Adapter for layout-related self-attention, Decoupled Image Feature Extraction for separate semantic/spatial feature extraction, and a Shared Positional Embedding Anchor for precise feature utilization.

Result: Pinco outperforms existing methods, achieving high-quality backgrounds with good text alignment and preserved subject shape.

Conclusion: Pinco effectively addresses foreground-conditioned inpainting challenges, offering superior performance and efficiency.

Abstract: Foreground-conditioned inpainting aims to seamlessly fill the background region of an image by utilizing the provided foreground subject and a text description. While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. Firstly, we design a Self-Consistent Adapter that integrates the foreground subject features into the layout-related self-attention layer, which helps to alleviate conflicts between the text and subject features by ensuring that the model can effectively consider the foreground subject’s characteristics while processing the overall image layout. Secondly, we design a Decoupled Image Feature Extraction method that employs distinct architectures to extract semantic and spatial features separately, significantly improving subject feature extraction and ensuring high-quality preservation of the subject’s shape. Thirdly, to ensure precise utilization of the extracted features and to focus attention on the subject region, we introduce a Shared Positional Embedding Anchor, greatly improving the model’s understanding of subject features and boosting training efficiency. Extensive experiments demonstrate that our method achieves superior performance and efficiency in foreground-conditioned inpainting.

[275] CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: The paper introduces SiamLayout, a method for Layout-to-Image (L2I) generation using Multimodal Diffusion Transformers (MM-DiTs), addressing challenges in layout integration and modality balancing. It also presents LayoutSAM, a large-scale dataset, and Layout Designer, leveraging LLMs for layout planning.

DetailsMotivation: Previous L2I methods focus on UNet-based models, leaving MM-DiTs unexplored despite their strong image generation capabilities. The challenge lies in effectively integrating layout guidance with other modalities.

Method: SiamLayout uses separate network weights for layout, treating it equally with image and text. It decouples image-layout interaction into a siamese branch and fuses it later. The paper also introduces LayoutSAM dataset and Layout Designer for layout planning.

Result: The proposed method efficiently incorporates layout guidance into MM-DiT. The large-scale LayoutSAM dataset supports training, the LayoutSAM-Eval benchmark enables evaluation, and Layout Designer enhances layout generation using LLMs.

Conclusion: CreatiLayout, combining SiamLayout, LayoutSAM, and Layout Designer, provides a systematic solution for creative L2I generation, advancing precision and controllability in image synthesis.

Abstract: Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. These components form CreatiLayout – a systematic solution that integrates the layout model, dataset, and planner for creative layout-to-image generation.

[276] Spatial-Frequency Aware for Object Detection in RAW Image

Zhuohua Ye, Liming Zhang, Hongru Han

Main category: cs.CV

TL;DR: Proposes SFAE, a framework combining spatial and frequency domains for RAW image object detection, enhancing suppressed details through cross-domain fusion and adaptive adjustments.

DetailsMotivation: RAW data's wide dynamic range and linear response suppress object details, and existing spatial-domain methods fail to recover them effectively.

Method: SFAE transforms frequency bands into spatial maps, uses cross-domain fusion attention, and applies adaptive nonlinear adjustments with domain-specific gamma parameters.

Result: Improved recovery of object details in RAW images by leveraging frequency-domain features and spatial-frequency synergy.

Conclusion: SFAE effectively addresses RAW image challenges by integrating spatial and frequency domains, enhancing object detection performance.

Abstract: Direct RAW-based object detection offers great promise by utilizing RAW data (unprocessed sensor data), but faces inherent challenges due to its wide dynamic range and linear response, which tends to suppress crucial object details. In particular, existing enhancement methods are almost all performed in the spatial domain, making it difficult to effectively recover these suppressed details from the skewed pixel distribution of RAW images. To address this limitation, we turn to the frequency domain, where features, such as object contours and textures, can be naturally separated based on frequency. In this paper, we propose Space-Frequency Aware RAW Image Object Detection Enhancer (SFAE), a novel framework that synergizes spatial and frequency representations. Our contribution is threefold. The first lies in the “spatialization” of frequency bands. Different from the traditional paradigm of directly manipulating abstract spectra in deep networks, our method inversely transforms individual frequency bands back into tangible spatial maps, thus preserving direct physical intuition. Then the cross-domain fusion attention module is developed to enable deep multimodal interactions between these maps and the original spatial features. Finally, the framework performs adaptive nonlinear adjustments by predicting and applying different gamma parameters for the two domains.
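
A sketch of the band "spatialization" step: split the spectrum into radial bands, invert each band back to a spatial map, and apply per-domain gamma. The band edges and fixed gamma values below are placeholders for what the network would predict.

```python
import numpy as np

img = np.random.rand(64, 64)                    # stand-in RAW channel in [0, 1]
F = np.fft.fftshift(np.fft.fft2(img))

yy, xx = np.indices(img.shape)
r = np.hypot(yy - 32, xx - 32)                  # radial frequency from DC
edges = [0, 8, 20, 46]                          # low / mid / high band edges

spatial_maps = []
for lo, hi in zip(edges[:-1], edges[1:]):
    band = F * ((r >= lo) & (r < hi))           # keep one band of the spectrum
    spatial_maps.append(np.fft.ifft2(np.fft.ifftshift(band)).real)

# Adaptive nonlinear adjustment: apply a (predicted) gamma per domain.
gamma_spatial, gamma_freq = 0.8, 1.2            # would be network-predicted
enhanced = np.clip(img, 1e-6, 1) ** gamma_spatial
out = enhanced + gamma_freq * spatial_maps[-1]  # re-inject high-freq detail
```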

[277] 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Tatiana Zemskova, Dmitry Yudin

Main category: cs.CV

TL;DR: 3DGraphLLM enhances 3D scene understanding for robots by incorporating semantic relationships into scene graphs, improving LLM-based vision-language tasks.

DetailsMotivation: To improve robotic interaction, the paper addresses the gap in leveraging semantic relationships in 3D scene graphs for better natural language query responses.

Method: Proposes 3DGraphLLM, a learnable 3D scene graph representation that includes semantic relationships, used as input for LLMs.

Result: Outperforms baselines on datasets like ScanRefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap.

Conclusion: Incorporating semantic relationships in 3D scene graphs significantly enhances LLM performance in vision-language tasks.

Abstract: A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To effectively interact with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for learning scene representations have shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on popular ScanRefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
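
For intuition, a scene graph can be flattened into an object-and-relation sequence for the LLM, as below; the format is illustrative, and the actual method feeds learned object and relation embeddings rather than plain text.

```python
# Hypothetical serialization of a 3D scene graph for LLM input.
scene_graph = {
    "objects": {0: "chair", 1: "table", 2: "lamp"},
    "relations": [(0, "next to", 1), (2, "standing on", 1)],
}

def serialize(graph):
    parts = [f"<obj{i}:{name}>" for i, name in graph["objects"].items()]
    for s, rel, o in graph["relations"]:
        parts.append(f"<obj{s}> {rel} <obj{o}>")
    return " ".join(parts)

print(serialize(scene_graph))
# <obj0:chair> <obj1:table> <obj2:lamp> <obj0> next to <obj1> <obj2> standing on <obj1>
```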

[278] 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives

Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang, Jianfeng Feng, Yu-Gang Jiang, Philip H. S. Torr

Main category: cs.CV

TL;DR: The paper introduces a method using 4D Gaussians for dynamic 3D scene representation, enabling real-time, photorealistic novel view synthesis for AR/VR applications.

DetailsMotivation: Dynamic 3D scene representation is essential for immersive AR/VR and metaverse applications but is challenging due to scene complexity and temporal dynamics.

Method: The approach approximates a spatiotemporal 4D volume using 4D Gaussians, optimizing geometry and appearance with photometric supervision and a tailored rendering pipeline.

Result: Achieves real-time rendering of high-resolution, photorealistic views for complex dynamic scenes and offers compact variants to reduce memory usage.

Conclusion: 4DGS excels in visual quality and efficiency for dynamic scene tasks, validated across diverse scenarios and data types.

Abstract: Dynamic 3D scene representation and novel view synthesis are crucial for enabling immersive experiences required by AR/VR and metaverse applications. It is a challenging task due to the complexity of unconstrained real-world scenes and their temporal dynamics. In this paper, we reformulate the reconstruction of a time-varying 3D scene as approximating its underlying spatiotemporal 4D volume by optimizing a collection of native 4D primitives, i.e., 4D Gaussians, with explicit geometry and appearance modeling. Equipped with a tailored rendering pipeline, our representation can be end-to-end optimized using only photometric supervision while supporting free-viewpoint viewing at interactive frame rates, making it suitable for representing real-world scenes with complex dynamics. This approach is the first solution to achieve real-time rendering of high-resolution, photorealistic novel views for complex dynamic scenes. To facilitate real-world applications, we derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck. Extensive experiments validate the superiority of 4DGS in terms of visual quality and efficiency across a range of dynamic scene-related tasks (e.g., novel view synthesis, 4D generation, scene understanding) and scenarios (e.g., single object, indoor scenes, driving environments, synthetic and real data).
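
The native 4D primitive has a clean interpretation: slicing a 4D Gaussian at a time t yields a 3D Gaussian whose center moves with t through the space-time covariance, via the standard conditional-Gaussian formulas. The sketch below shows only this conditioning step; the full method also carries opacity, color, and a differentiable rasterizer.

```python
import numpy as np

mu = np.array([0.0, 0.0, 0.0, 0.5])        # (x, y, z, t) mean of one Gaussian
Sigma = np.diag([0.1, 0.1, 0.1, 0.05])     # 4x4 covariance
Sigma[0, 3] = Sigma[3, 0] = 0.02           # space-time correlation -> motion

def condition_on_time(mu, Sigma, t):
    """3D Gaussian obtained by conditioning the 4D Gaussian on time t."""
    mu_s, mu_t = mu[:3], mu[3]
    S_ss, S_st, S_tt = Sigma[:3, :3], Sigma[:3, 3], Sigma[3, 3]
    mean3d = mu_s + S_st * (t - mu_t) / S_tt          # center drifts with t
    cov3d = S_ss - np.outer(S_st, S_st) / S_tt
    return mean3d, cov3d

for t in (0.0, 0.5, 1.0):
    m, _ = condition_on_time(mu, Sigma, t)
    print(f"t={t}: center={m.round(3)}")
```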

[279] Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, Ning Yu

Main category: cs.CV

TL;DR: The paper introduces a method to enhance video diffusion models by using structured latent noise sampling for motion control, without altering model architectures.

DetailsMotivation: To improve motion control in video diffusion models while maintaining per-frame pixel quality and temporal coherence.

Method: Proposes a noise warping algorithm that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, preserving spatial Gaussianity.

Result: Enables effective motion control (local, global, and transfer) with minimal overhead, validated by experiments and user studies.

Conclusion: The method is robust, scalable, and user-friendly for motion control in video diffusion models.

Abstract: Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models. Video results are available on our webpage: https://eyeline-labs.github.io/Go-with-the-Flow. Source code and model checkpoints are available on GitHub: https://github.com/Eyeline-Labs/Go-with-the-Flow.
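
The core data trick can be sketched in a few lines: advect the previous frame's noise along the optical flow so that noise is temporally correlated across frames. Nearest-neighbor warping is used here for brevity; the paper's algorithm takes more care to preserve per-frame spatial Gaussianity.

```python
import numpy as np

H, W = 32, 32
noise_prev = np.random.randn(H, W)
flow = np.full((H, W, 2), 1.5)        # stand-in optical flow (dy, dx) per pixel

yy, xx = np.indices((H, W))
src_y = np.clip(np.round(yy - flow[..., 0]).astype(int), 0, H - 1)
src_x = np.clip(np.round(xx - flow[..., 1]).astype(int), 0, W - 1)
noise_next = noise_prev[src_y, src_x]  # next frame's temporally correlated noise
```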

[280] egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks

Björn Braun, Rayan Armani, Manuel Meier, Max Moebus, Christian Holz

Main category: cs.CV

TL;DR: The paper introduces egoPPG, a novel vision task for egocentric systems to recover cardiac activity, and PulseFormer, a method to estimate heart rate (HR) from eye-tracking cameras, improving downstream tasks like proficiency estimation by 14%.

DetailsMotivation: Egocentric systems should detect physiological states (e.g., heart rate) to better model context-aware behavior, as current systems lack this capability.

Method: PulseFormer extracts photoplethysmogram (PPG) signals from eye-tracking cameras and fuses motion cues from inertial measurement units to estimate HR. A dataset of 13+ hours of videos and ground-truth HR signals was collected for training.

Result: PulseFormer robustly estimates HR (MAE=7.67 bpm, r=0.85) and improves proficiency estimation by 14% on the EgoExo4D dataset.

Conclusion: EgoPPG enhances egocentric systems by integrating physiological tracking, providing meaningful augmentations for existing tasks. The code, dataset, and HR augmentations are released to encourage further research.

Abstract: Egocentric vision systems aim to understand the spatial surroundings and the wearer’s behavior inside it, including motions, activities, and interactions. We argue that egocentric systems must additionally detect physiological states to capture a person’s attention and situational responses, which are critical for context-aware behavior modeling. In this paper, we propose egoPPG, a novel vision task for egocentric systems to recover a person’s cardiac activity to aid downstream vision tasks. We introduce PulseFormer, a method to extract heart rate as a key indicator of physiological state from the eye tracking cameras on unmodified egocentric vision systems. PulseFormer continuously estimates the photoplethysmogram (PPG) from areas around the eyes and fuses motion cues from the headset’s inertial measurement unit to track HR values. We demonstrate egoPPG’s downstream benefit for a key task on EgoExo4D, an existing egocentric dataset for which we find PulseFormer’s estimates of HR to improve proficiency estimation by 14%. To train and validate PulseFormer, we collected a dataset of 13+ hours of eye tracking videos from Project Aria and contact-based PPG signals as well as an electrocardiogram (ECG) for ground-truth HR values. Similar to EgoExo4D, 25 participants performed diverse everyday activities such as office work, cooking, dancing, and exercising, which induced significant natural motion and HR variation (44-164 bpm). Our model robustly estimates HR (MAE=7.67 bpm) and captures patterns (r=0.85). Our results show how egocentric systems may unify environmental and physiological tracking to better understand users and that egoPPG as a complementary task provides meaningful augmentations for existing datasets and tasks. We release our code, dataset, and HR augmentations for EgoExo4D to inspire research on physiology-aware egocentric tasks.
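
A classical baseline for the HR read-out step: band-pass the PPG trace to the plausible heart-rate band (44-164 bpm per the paper's data) and take the dominant spectral frequency. The frame rate and signal are stand-ins; PulseFormer itself is a learned model, not this pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 30.0                                    # eye-camera frame rate (assumed)
t = np.arange(0, 30, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.random.randn(len(t))  # 72 bpm stand-in

# Band-pass to the 44-164 bpm heart-rate band.
b, a = butter(3, [44 / 60 / (fs / 2), 164 / 60 / (fs / 2)], btype="band")
clean = filtfilt(b, a, ppg)

spec = np.abs(np.fft.rfft(clean))
freqs = np.fft.rfftfreq(len(clean), 1 / fs)
hr_bpm = 60 * freqs[np.argmax(spec)]
print(f"estimated HR: {hr_bpm:.1f} bpm")     # ~72 for this synthetic trace
```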

[281] Understanding Flatness in Generative Models: Its Role and Benefits

Taehwan Lee, Kyeongkook Seo, Jaejun Yoo, Sung Whan Yoon

Main category: cs.CV

TL;DR: The paper explores the impact of flat minima in generative models, particularly diffusion models, showing that flatter minima enhance robustness and performance, with Sharpness-Aware Minimization (SAM) being the most effective method.

DetailsMotivation: Flat minima improve generalization and robustness in supervised learning, but their role in generative models is understudied. This work aims to fill that gap.

Method: The study combines theoretical analysis and empirical experiments, using methods like SAM, Input Perturbation (IP), SWA, and EMA to assess flatness in diffusion models.

Result: Flat minima improve robustness against perturbations, reduce exposure bias, and enhance resilience to quantization. SAM outperforms other flatness-promoting methods.

Conclusion: Flat minima significantly benefit diffusion models, improving both generative performance and robustness, with SAM being the most effective approach.

Abstract: Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias – where errors in noise estimation accumulate over iterations – and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, enhances flatness in diffusion models more effectively than methods that promote flatness only indirectly, such as Input Perturbation (IP), which enforces a Lipschitz condition, and ensembling-based approaches like Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA). Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improve not only generative performance but also robustness.
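
For reference, a generic SAM step (in the style of Foret et al.) looks as follows: ascend to the worst-case nearby weights, compute the gradient there, then update the original weights. This is standard SAM, not the paper's training code, and it assumes every trainable parameter receives a gradient.

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    """One Sharpness-Aware Minimization step around the current weights."""
    params = [p for p in model.parameters() if p.requires_grad]
    # 1) Gradient at the current weights defines the ascent direction.
    loss = loss_fn(model, batch)
    loss.backward()
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
    with torch.no_grad():
        for p, e in zip(params, eps):   # ascend to worst-case nearby weights
            p.add_(e)
    base_opt.zero_grad()
    # 2) Gradient at the perturbed weights drives the actual update.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):   # restore the original weights
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
    return loss.item()

# usage with a toy regression model:
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = lambda m, b: ((m(b[0]) - b[1]) ** 2).mean()
sam_step(model, loss_fn, (torch.randn(8, 4), torch.randn(8, 1)), opt)
```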

[282] Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

Sai Ma, Zhuang Li, John A Taylor

Main category: cs.CV

TL;DR: Landsat30-AU is a vision-language dataset for satellite imagery, addressing gaps in existing datasets by focusing on long-term, multi-satellite archives. It improves VLM performance for Earth observation tasks.

DetailsMotivation: Existing datasets lack long-term, low-resolution satellite imagery, limiting affordable and bias-robust global monitoring. Landsat30-AU fills this gap.

Method: The dataset includes image-caption pairs and VQA samples, curated using a bootstrapped pipeline with iterative refinement and human verification.

Result: Off-the-shelf VLMs perform poorly on satellite imagery, but fine-tuning Qwen2.5-VL-7B on Landsat30-AU significantly improves captioning and VQA accuracy.

Conclusion: Landsat30-AU enables better VLMs for Earth observation, highlighting the need for specialized datasets and fine-tuning for satellite imagery tasks.

Abstract: Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing $196,262$ image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.

[283] DivCon-NeRF: Diverse and Consistent Ray Augmentation for Few-Shot NeRF

Ingyun Lee, Jae Won Jang, Seunghyeon Seo, Nojun Kwak

Main category: cs.CV

TL;DR: DivCon-NeRF improves few-shot NeRF performance by introducing sphere-based ray augmentations and a consistency mask to reduce floaters and distortions.

DetailsMotivation: NeRF's reliance on numerous multi-view images limits its practicality in few-shot scenarios, and existing ray augmentation methods suffer from floaters and distortions.

Method: DivCon-NeRF uses sphere-based ray augmentations from 360-degree directions and a consistency mask to filter inconsistent rays, along with tailored loss functions.

Result: Outperforms existing few-shot NeRF methods on Blender, LLFF, and DTU datasets and shows strong generalizability.

Conclusion: DivCon-NeRF effectively enhances diversity and consistency in few-shot NeRF, reducing artifacts and improving performance.

Abstract: Neural Radiance Field (NeRF) has shown remarkable performance in novel view synthesis but requires numerous multi-view images, limiting its practicality in few-shot scenarios. Ray augmentation has been proposed to alleviate overfitting caused by sparse training data by generating additional rays. However, existing methods, which generate augmented rays only near the original rays, exhibit pronounced floaters and appearance distortions due to limited viewpoints and inconsistent rays obstructed by nearby obstacles and complex surfaces. To address these problems, we propose DivCon-NeRF, which introduces novel sphere-based ray augmentations to significantly enhance both diversity and consistency. By employing a virtual sphere centered at the predicted surface point, our method generates diverse augmented rays from all 360-degree directions, facilitated by our consistency mask that effectively filters out inconsistent rays. We introduce tailored loss functions that leverage these augmentations, effectively reducing floaters and visual distortions. Consequently, our method outperforms existing few-shot NeRF approaches on the Blender, LLFF, and DTU datasets. Furthermore, DivCon-NeRF demonstrates strong generalizability by effectively integrating with both regularization- and framework-based few-shot NeRFs. Our code will be made publicly available.
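
The sphere-based augmentation is easy to picture: sample directions on a full sphere around the predicted surface point, cast rays from the sphere toward that point, and filter with a consistency mask. The mask below is a placeholder; the paper's mask is derived from the model's predictions.

```python
import numpy as np

def sample_sphere(n, rng):
    """Uniform directions on the unit sphere."""
    v = rng.standard_normal((n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
surface_pt = np.array([0.0, 0.0, 2.0])   # predicted from an original ray
radius = 1.0

dirs = sample_sphere(64, rng)
origins = surface_pt + radius * dirs      # ray origins on the virtual sphere
ray_dirs = surface_pt - origins           # all rays point at the surface point
ray_dirs /= np.linalg.norm(ray_dirs, axis=1, keepdims=True)

# Placeholder consistency mask: e.g., drop rays entering from below the scene.
mask = origins[:, 2] > 0
aug_origins, aug_dirs = origins[mask], ray_dirs[mask]
```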

[284] JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation

Fangda Chen, Shanshan Zhao, Chuanfu Xu, Long Lan

Main category: cs.CV

TL;DR: JointTuner is a framework for joint optimization of appearance and motion in video generation, using Synaptic LoRA and AiT Loss to reduce interference and contamination.

DetailsMotivation: Prior methods decouple appearance and motion training, causing concept interference and appearance contamination.

Method: JointTuner leverages Synaptic LoRA for dynamic guidance and AiT Loss to minimize appearance interference.

Result: The framework supports UNet and Diffusion Transformer models, improving video quality and customization.

Conclusion: JointTuner advances video generation by enabling consistent optimization and systematic evaluation.

Abstract: Recent advancements in customized video generation have led to significant improvements in the simultaneous adaptation of appearance and motion. While prior methods typically decouple appearance and motion training, the stage-wise strategy often introduces concept interference, resulting in inaccurate rendering of appearance features or motion patterns. Another challenge is appearance contamination, where background and foreground elements from reference videos distort the customized subject. In this work, we propose JointTuner, a novel framework that enables joint optimization of both appearance and motion components by leveraging two key innovations: Synaptic Low-Rank Adaptation (Synaptic LoRA) and Appearance-independent Temporal Loss (AiT Loss). Synaptic LoRA introduces a synaptic regulator, implemented as a context-aware linear activation layer, to dynamically guide LoRA modules to focus on either subject appearance or motion patterns, thereby enabling consistent optimization across spatial and temporal dimensions. AiT Loss disrupts the gradient flow of appearance-related components, guiding the model to focus exclusively on motion learning and minimizing appearance interference. JointTuner is compatible with both UNet-based models (e.g., ZeroScope) and Diffusion Transformer-based models (e.g., CogVideoX), supporting the generation of longer and higher-quality customized videos. Additionally, we present a systematic evaluation framework for appearance-motion combined customization, covering 90 combinations evaluated along four critical dimensions: semantic alignment, motion dynamism, temporal consistency, and perceptual quality. Our project homepage can be found at https://fdchen24.github.io/JointTuner-Website.
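
A sketch of a LoRA module gated by a context-dependent regulator, in the spirit of Synaptic LoRA; the gate's form and all dimensions here are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class GatedLoRA(nn.Module):
    """LoRA whose update is scaled by a context-aware gate (hypothetical
    stand-in for the paper's synaptic regulator)."""
    def __init__(self, dim=320, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)       # start as an identity adapter
        self.gate = nn.Linear(dim, 1)        # context decides appearance/motion focus

    def forward(self, x, context):           # x: (B, T, D), context: (B, D)
        g = torch.sigmoid(self.gate(context)).unsqueeze(1)   # (B, 1, 1)
        return x + g * self.up(self.down(x))

lora = GatedLoRA()
out = lora(torch.randn(2, 16, 320), torch.randn(2, 320))
```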

[285] MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning

Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, Claudia Plant

Main category: cs.CV

TL;DR: MultiADS is a zero-shot learning method for multi-type anomaly detection and segmentation, outperforming state-of-the-art methods.

DetailsMotivation: Precise defect identification in industrial inspection is needed for automated anomaly treatment, but current methods lack multi-defect recognition.

Method: MultiADS uses CLIP and linear layers to align visual-textual features, enabling multi-type anomaly segmentation in zero-shot learning.

Result: Outperforms SoTA in zero/few-shot learning on five datasets, generating defect-specific masks and identifying multiple defects.

Conclusion: MultiADS advances industrial inspection by enabling precise, multi-defect recognition in zero-shot settings.

Abstract: Precise optical inspection in industrial applications is crucial for minimizing scrap rates and reducing the associated costs. Besides merely detecting if a product is anomalous or not, it is crucial to know the distinct type of defect, such as a bent, cut, or scratch. The ability to recognize the “exact” defect type enables automated treatment of the anomalies in modern production lines. Current methods are limited to solely detecting whether a product is defective or not, without providing any insight into the defect type, let alone detecting and identifying multiple defects. We propose MultiADS, a zero-shot learning approach able to perform Multi-type Anomaly Detection and Segmentation. The architecture of MultiADS comprises CLIP and extra linear layers to align the visual and textual representations in a joint feature space. To the best of our knowledge, our proposal is the first approach to perform a multi-type anomaly segmentation task in zero-shot learning. In contrast to the other baselines, our approach i) generates specific anomaly masks for each distinct defect type, ii) learns to distinguish defect types, and iii) simultaneously identifies multiple defect types present in an anomalous product. Additionally, our approach outperforms zero/few-shot learning SoTA methods on image-level and pixel-level anomaly detection and segmentation tasks on five commonly used datasets: MVTec-AD, Visa, MPDD, MAD and Real-IAD.
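
The zero-shot scoring step can be pictured as patch-versus-text cosine similarity per defect type, as in this sketch with random stand-in embeddings (MultiADS adds trained alignment layers on top of CLIP, and its thresholds are not the fixed one used here).

```python
import torch
import torch.nn.functional as F

defects = ["bent", "cut", "scratch", "hole"]
text_emb = F.normalize(torch.randn(len(defects), 512), dim=-1)   # text encoder out
patch_emb = F.normalize(torch.randn(1, 14 * 14, 512), dim=-1)    # image patch feats

sim = patch_emb @ text_emb.T                          # (1, patches, defect types)
masks = (sim > 0.1).reshape(1, 14, 14, len(defects))  # per-type binary masks
present = [d for i, d in enumerate(defects) if masks[..., i].any()]
print("defects detected:", present)
```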

[286] Nonlocal Retinex-Based Variational Model and its Deep Unfolding Twin for Low-Light Image Enhancement

Daniel Torres, Joan Duran, Julia Navarro, Catalina Sbert

Main category: cs.CV

TL;DR: A variational method for low-light image enhancement using Retinex decomposition, integrating color correction, nonlocal gradient fidelity, and automatic gamma correction, with a deep unfolding extension.

DetailsMotivation: Low-light images obscure details and reduce contrast, hindering tasks like segmentation and detection. Enhancing such images is crucial.

Method: Retinex decomposition into illumination, reflectance, and noise; color correction; nonlocal gradient fidelity; automatic gamma correction; deep unfolding with learnable networks and cross-attention.

Result: Outperforms state-of-the-art techniques, with the variational model surpassing deep learning approaches in quality metrics.

Conclusion: The proposed methods effectively enhance low-light images, with the variational approach showing superior performance without relying on learning.

Abstract: Images captured under low-light conditions present significant limitations in many applications, as poor lighting can obscure details, reduce contrast, and hide noise. Removing the illumination effects and enhancing the quality of such images is crucial for many tasks, such as image segmentation and object detection. In this paper, we propose a variational method for low-light image enhancement based on the Retinex decomposition into illumination, reflectance, and noise components. A color correction pre-processing step is applied to the low-light image, which is then used as the observed input in the decomposition. Moreover, our model integrates a novel nonlocal gradient-type fidelity term designed to preserve structural details. Additionally, we propose an automatic gamma correction module. Building on the proposed variational approach, we extend the model by introducing its deep unfolding counterpart, in which the proximal operators are replaced with learnable networks. We propose cross-attention mechanisms to capture long-range dependencies in both the nonlocal prior of the reflectance and the nonlocal gradient-based constraint. Experimental results demonstrate that both methods compare favorably with several recent and state-of-the-art techniques across different datasets. In particular, despite not relying on learning strategies, the variational model outperforms most deep learning approaches both visually and in terms of quality metrics.
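
For intuition, a bare-bones Retinex pipeline: estimate illumination by smoothing, divide it out for reflectance, and pick a gamma automatically from the illumination statistics. Gaussian blur stands in for the paper's variational solve with nonlocal gradient terms, and the auto-gamma rule is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

low = np.clip(np.random.rand(64, 64) * 0.2, 1e-3, 1)   # stand-in low-light image
illum = gaussian_filter(low, sigma=5) + 1e-3            # smooth illumination estimate
reflect = np.clip(low / illum, 0, 1)                    # Retinex reflectance

# Automatic gamma: choose gamma so the mean illumination maps to mid-gray.
gamma = np.log(0.5) / np.log(illum.mean())
enhanced = reflect * illum ** gamma
```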

[287] Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

Main category: cs.CV

TL;DR: UniME introduces a two-stage framework using MLLMs to improve multimodal representation learning, addressing CLIP’s limitations and achieving better performance in diverse tasks.

DetailsMotivation: CLIP's limitations in text token truncation, isolated encoding, and compositionality hinder its efficacy, prompting the need for a more advanced approach like UniME.

Method: UniME uses textual discriminative knowledge distillation from an LLM-based teacher and hard negative enhanced instruction tuning to learn better representations.

Result: UniME outperforms on MMEB and retrieval tasks, showing superior discriminative and compositional capabilities.

Conclusion: UniME effectively enhances multimodal representation learning, offering improved performance for downstream tasks.

Abstract: The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
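
The hard-negative objective in stage two can be sketched as InfoNCE restricted to the hardest in-batch negatives; the sampling rule, the count of negatives, and the temperature below are assumptions, not UniME's exact recipe.

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(q, pos, negs, temp=0.07, k=8):
    """InfoNCE over the k hardest negatives (a sketch of hard-negative
    enhanced contrastive tuning)."""
    q, pos, negs = (F.normalize(t, dim=-1) for t in (q, pos, negs))
    neg_sim = q @ negs.T                              # (B, N) similarities
    hard = neg_sim.topk(min(k, negs.shape[0]), dim=-1).values
    pos_sim = (q * pos).sum(-1, keepdim=True)
    logits = torch.cat([pos_sim, hard], dim=-1) / temp
    labels = torch.zeros(q.shape[0], dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)

loss = hard_negative_infonce(torch.randn(4, 256), torch.randn(4, 256),
                             torch.randn(32, 256))
```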

[288] Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation

Weipeng Tan, Chuming Lin, Chengming Xu, FeiFan Xu, Xiaobin Hu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu

Main category: cs.CV

TL;DR: DICE-Talk is a novel framework for emotional talking head generation that disentangles identity from emotion, enhances emotion correlations, and improves emotional expressiveness while preserving speaker identity.

DetailsMotivation: Current methods for Talking Head Generation (THG) lack emotional expressiveness and struggle with identity preservation, due to insufficient use of audio emotional cues, identity leakage, and isolated emotion learning.

Method: DICE-Talk uses a disentangled emotion embedder for identity-agnostic emotion modeling, a correlation-enhanced emotion conditioning module with Emotion Banks, and an emotion discrimination objective for affective consistency.

Result: The method outperforms state-of-the-art approaches in emotion accuracy and maintains lip-sync performance, as shown on MEAD and HDTF datasets.

Conclusion: DICE-Talk successfully generates identity-preserving, emotionally rich portraits with natural adaptability to unseen identities, validated by qualitative results and user studies.

Abstract: Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio’s inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed as DICE-Talk, following the idea of disentangling identity with emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method’s superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method’s ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.

[289] CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

Jianyu Wu, Yizhou Wang, Xiangyu Yue, Xinzhu Ma, Jingyang Guo, Dongzhan Zhou, Wanli Ouyang, Shixiang Tang

Main category: cs.CV

TL;DR: The paper introduces CMT, a multimodal framework for CAD generation, and mmABC, a large-scale dataset, improving performance in CAD tasks.

DetailsMotivation: Existing CAD methods lack accuracy and user-friendliness due to oversimplified representations or architectures.

Method: Proposes CMT, a cascade MAR with topology predictor for B-Rep-based CAD generation, and develops the mmABC dataset with multimodal annotations.

Result: CMT outperforms state-of-the-art methods, improving Coverage and Valid ratio by +10.68% and +10.3% in unconditional generation, and Chamfer distance by +4.01 in image-conditioned generation.

Conclusion: CMT and mmABC address limitations in CAD methods, offering superior performance and scalability.

Abstract: While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and dataset perspectives. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the “edge-counters-surface” priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves Chamfer distance by +4.01 on image-conditioned CAD generation on mmABC.

[290] Quaternion Sparse Decomposition for Multi-focus Color Image Fusion

Weihua Yang, Yicong Zhou

Main category: cs.CV

TL;DR: A quaternion-based framework for multi-focus color image fusion, improving focus detection and preserving details and structure.

DetailsMotivation: Existing methods fail in complex scenarios due to poor handling of color and texture.

Method: Uses quaternion sparse decomposition, base-detail fusion, and structural similarity refinement.

Result: Outperforms state-of-the-art methods in experiments.

Conclusion: The framework effectively integrates color images while preserving details and structure.

Abstract: Multi-focus color image fusion refers to integrating multiple partially focused color images to create a single all-in-focus color image. However, existing methods struggle with complex real-world scenarios due to limitations in handling color information and intricate textures. To address these challenges, this paper proposes a quaternion multi-focus color image fusion framework to perform high-quality color image fusion completely in the quaternion domain. This framework introduces 1) a quaternion sparse decomposition model to jointly learn fine-scale image details and structure information of color images in an iterative fashion for high-precision focus detection, 2) a quaternion base-detail fusion strategy to individually fuse base-scale and detail-scale results across multiple color images for preserving structure and detail information, and 3) a quaternion structural similarity refinement strategy to adaptively select optimal patches from initial fusion results and obtain the final fused result for preserving fine details and ensuring spatially consistent outputs. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods.

[291] PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs

Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache

Main category: cs.CV

TL;DR: PROM introduces a selective quantization method for depthwise-separable CNNs, using ternary weights for pointwise convolutions and 8-bit for others, achieving significant energy and storage savings without performance loss.

DetailsMotivation: Existing quantization methods fail to exploit efficiency gains in modern depthwise-separable architectures due to uneven computational cost distribution.

Method: PROM applies two distinct bit-widths: ternary weights for pointwise convolutions and 8-bit for other modules, with a quantization-aware training procedure. Activations are quantized to 8-bit, converting pointwise convolutions into int8 additions.

Result: PROM reduces MobileNetV2’s energy cost by 23.9x and storage size by 2.7x compared to float16, while maintaining similar ImageNet classification performance.

Conclusion: PROM advances the Pareto frontier for energy vs. accuracy in quantized CNNs, offering a simple solution for efficient depthwise-separable network quantization.

Abstract: Convolutional neural networks (CNNs) are crucial for computer vision tasks on resource-constrained devices. Quantization effectively compresses these models, reducing storage size and energy cost. However, in modern depthwise-separable architectures, the computational cost is distributed unevenly across their components, with pointwise operations being the most expensive. By applying a general quantization scheme to this imbalanced cost distribution, existing quantization approaches fail to fully exploit potential efficiency gains. To this end, we introduce PROM, a straightforward approach for quantizing modern depthwise-separable convolutional networks by selectively using two distinct bit-widths. Specifically, pointwise convolutions are quantized to ternary weights, while the remaining modules use 8-bit weights, which is achieved through a simple quantization-aware training procedure. Additionally, by quantizing activations to 8-bit, our method transforms pointwise convolutions with ternary weights into int8 additions, which enjoy broad support across hardware platforms and effectively eliminate the need for expensive multiplications. Applying PROM to MobileNetV2 reduces the model’s energy cost by more than an order of magnitude (23.9x) and its storage size by 2.7x compared to the float16 baseline while retaining similar classification performance on ImageNet. Our method advances the Pareto frontier for energy consumption vs. top-1 accuracy for quantized convolutional models on ImageNet. PROM addresses the challenges of quantizing depthwise-separable convolutional networks to both ternary and 8-bit weights, offering a simple way to reduce energy cost and storage size.
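
The central trick, ternary pointwise weights with 8-bit activations, is easy to sketch. The thresholding rule and per-tensor scale below are assumptions for illustration; PROM's actual quantization-aware training procedure is described in the paper:

```python
import torch
import torch.nn.functional as F

def ternarize(w: torch.Tensor, thresh: float = 0.05):
    """Map weights to {-1, 0, +1} with one per-tensor scale (hypothetical rule)."""
    s = w.abs().mean()
    q = torch.sign(w) * (w.abs() > thresh * s).float()
    return q, s

x = torch.randn(1, 16, 8, 8)       # activations (int8-quantized in the real pipeline)
w = torch.randn(32, 16, 1, 1)      # pointwise (1x1) convolution weights
q, s = ternarize(w)

# With ternary weights the 1x1 conv degenerates into signed sums of input
# channels, i.e. int8 additions at inference; emulated here in float.
y = s * F.conv2d(x, q)
```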

[292] False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims

Evangelia Christodoulou, Annika Reinke, Pascaline Andrè, Patrick Godau, Piotr Kalinowski, Rola Houhou, Selen Erkan, Carole H. Sudre, Ninon Burgos, Sofiène Boutaj, Sophie Loizillon, Maëlys Solal, Veronika Cheplygina, Charles Heitz, Michal Kozubek, Michela Antonelli, Nicola Rieke, Antoine Gilson, Leon D. Mayer, Minu D. Tizabi, M. Jorge Cardoso, Amber Simpson, Annette Kopp-Schneider, Gaël Varoquaux, Olivier Colliot, Lena Maier-Hein

Main category: cs.CV

TL;DR: The paper critiques the reliability of performance claims in medical imaging AI, revealing that many claims of outperformance are unsubstantiated.

DetailsMotivation: To assess whether claims of superiority in medical imaging AI are statistically valid, given the reliance on empirical mean performance.

Method: A Bayesian approach is used to analyze reported results and model congruence, estimating the probability of false outperformance claims.

Result: Over 80% of papers claim outperformance, with high probabilities (>5%) of false claims in 86% of classification and 53% of segmentation papers.

Conclusion: Current benchmarking practices in medical imaging AI often lead to unsubstantiated claims, risking misdirection of future research.

Abstract: Performance comparisons are fundamental in medical imaging Artificial Intelligence (AI) research, often driving claims of superiority based on relative improvements in common performance metrics. However, such claims frequently rely solely on empirical mean performance. In this paper, we investigate whether newly proposed methods genuinely outperform the state of the art by analyzing a representative cohort of medical imaging papers. We quantify the probability of false claims based on a Bayesian approach that leverages reported results alongside empirically estimated model congruence to estimate whether the relative ranking of methods is likely to have occurred by chance. According to our results, the majority (>80%) of papers claim outperformance when introducing a new method. Our analysis further revealed a high probability (>5%) of false outperformance claims in 86% of classification papers and 53% of segmentation papers. These findings highlight a critical flaw in current benchmarking practices: claims of outperformance in medical imaging AI are frequently unsubstantiated, posing a risk of misdirecting future research efforts.
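
The intuition behind these probabilities can be reproduced with a toy simulation (this is not the paper's Bayesian model; the Gaussian assumption, sample size, and congruence value rho are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sd = 50, 0.9, 0.08            # test cases, model congruence, per-case spread
cov = sd**2 * np.array([[1.0, rho], [rho, 1.0]])

# Under a true tie (both methods at Dice 0.80), how often does an apparent
# +0.01 mean improvement arise by chance? Highly congruent models make small
# gaps more meaningful; weakly congruent ones make them more likely flukes.
trials, flips = 10_000, 0
for _ in range(trials):
    scores = rng.multivariate_normal([0.80, 0.80], cov, size=n)
    flips += (scores[:, 0].mean() - scores[:, 1].mean()) >= 0.01
print(f"P(>= +0.01 gap under a true tie) ~ {flips / trials:.3f}")
```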

[293] BrainSegDMlF: A Dynamic Fusion-enhanced SAM for Brain Lesion Segmentation

Hongming Wang, Yifeng Wu, Huimin Huang, Hongtao Wu, Jia-Xuan Jiang, Xiaodong Zhang, Hao Zheng, Xian Wu, Yefeng Zheng, Jinping Xu, Jing Cheng

Main category: cs.CV

TL;DR: The paper introduces BrainSegDMLF, a fully automated model for brain lesion segmentation, addressing limitations of existing methods by integrating multi-modal data, improving small lesion detection, and enabling automatic segmentation.

DetailsMotivation: Current brain lesion segmentation methods lack multi-modal integration, struggle with small lesions, and require manual prompts, limiting accuracy and efficiency.

Method: BrainSegDMLF uses a Dynamic Modal Interactive Fusion (DMIF) module for multi-modal data integration, a Layer-by-Layer Upsampling Decoder for feature extraction, and automatic mask generation.

Result: The model achieves comprehensive lesion segmentation, better small lesion detection, and fully automated operation.

Conclusion: BrainSegDMLF improves brain lesion segmentation by leveraging multi-modal data and automation, addressing key challenges in the field.

Abstract: The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency. To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMLF. This model has the following features: 1) Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts.

[294] MARRS: Masked Autoregressive Unit-based Reaction Synthesis

Yabiao Wang, Shuo Wang, Jiangning Zhang, Jiafu Wu, Qingdong He, Yong Liu

Main category: cs.CV

TL;DR: The paper proposes MARRS, a framework for generating human reactions to actions using continuous representations, addressing limitations of VQ-based methods and emphasizing unit interaction.

DetailsMotivation: Current autoregressive and VQ-based methods for human reaction synthesis suffer from quantization loss, low codebook use, and neglect of mutual perception among body units.

Method: MARRS includes UD-VAE for unit segmentation, ACF for token masking and fusion, AUM for unit interaction, and a diffusion model with MLP noise predictors.

Result: The method outperforms existing approaches in generating coordinated and fine-grained reaction motions.

Conclusion: MARRS effectively addresses VQ limitations and enhances reaction synthesis through unit interaction and continuous representations.

Abstract: This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions conditioned on the action sequence of another person. Currently, autoregressive modeling approaches with vector quantization (VQ) have achieved remarkable performance in motion generation tasks. However, VQ has inherent disadvantages, including quantization information loss, low codebook utilization, etc. In addition, while dividing the body into separate units can be beneficial, the computational complexity needs to be considered. Also, the importance of mutual perception among units is often neglected. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions using continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding each independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Both quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.

[295] Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields in Efficient CNNs for Fair Medical Image Classification

Xiao Wu, Xiaoqing Zhang, Zunjie Xiao, Lingxi Hu, Risa Higashita, Jiang Liu

Main category: cs.CV

TL;DR: ERoHPRF introduces heterogeneous pyramid receptive fields and expert-like reparameterization to improve medical image classification performance and fairness.

DetailsMotivation: Address limitations of single or asymmetric receptive fields in capturing diverse lesion characteristics and reducing bias in CNN predictions for medical tasks.

Method: Uses heterogeneous pyramid RF bag and expert-like structural reparameterization to mimic multi-expert consultation, optimizing computation and inference.

Result: ERoHPRF achieves better trade-offs in classification, fairness, and computational efficiency compared to state-of-the-art methods.

Conclusion: ERoHPRF effectively enhances CNN performance and fairness in medical image classification, with practical applicability.

Abstract: Efficient convolutional neural network (CNN) architecture design has attracted growing research interest. However, these architectures typically apply a single receptive field (RF), small asymmetric RFs, or pyramid RFs to learn different feature representations, still encountering two significant challenges in medical image classification tasks: 1) They have limitations in capturing diverse lesion characteristics efficiently, e.g., tiny, coordination, small and salient, which play unique roles in the classification results, especially in imbalanced medical image classification. 2) The predictions generated by those CNNs are often unfair/biased, bringing a high risk when employing them in real-world medical diagnosis conditions. To tackle these issues, we develop a new concept, Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields (ERoHPRF), to simultaneously boost medical image classification performance and fairness. This concept aims to mimic the multi-expert consultation mode by applying the well-designed heterogeneous pyramid RF bag to capture lesion characteristics with varying significances effectively via convolution operations with multiple heterogeneous kernel sizes. Additionally, ERoHPRF introduces an expert-like structural reparameterization technique to merge its parameters with the two-stage strategy, ensuring competitive computation cost and inference speed through comparisons to a single RF. To manifest the effectiveness and generalization ability of ERoHPRF, we incorporate it into mainstream efficient CNN architectures. Extensive experiments show that our proposed ERoHPRF maintains a better trade-off than state-of-the-art methods in terms of medical image classification, fairness, and computation overhead. The code of this paper is available at https://github.com/XiaoLing12138/Expert-Like-Reparameterization-of-Heterogeneous-Pyramid-Receptive-Fields.
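
The reparameterization half of the idea relies on the linearity of convolution: parallel branches with different kernel sizes can be folded into a single kernel at inference. A sketch assuming plain bias-free branches, in the spirit of RepVGG-style merging rather than ERoHPRF's exact two-stage procedure:

```python
import torch
import torch.nn.functional as F

def merge_branches(kernels):
    """Zero-pad smaller kernels to the largest size and sum them, so one conv
    reproduces the sum of all parallel branches."""
    base = max(kernels, key=lambda k: k.shape[-1])
    merged = torch.zeros_like(base)
    for k in kernels:
        pad = (base.shape[-1] - k.shape[-1]) // 2
        merged = merged + F.pad(k, [pad] * 4)  # center smaller kernels
    return merged

branches = [torch.randn(8, 8, s, s) for s in (1, 3, 5)]  # heterogeneous RF bag
w = merge_branches(branches)
x = torch.randn(1, 8, 32, 32)
multi = sum(F.conv2d(x, b, padding=(b.shape[-1] - 1) // 2) for b in branches)
single = F.conv2d(x, w, padding=2)  # one RF at inference, same output
assert torch.allclose(multi, single, atol=1e-4)
```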

[296] PiT: Progressive Diffusion Transformer

Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang

Main category: cs.CV

TL;DR: DiTs face high computational costs due to redundant global attention. PSWA and PCCA are proposed to mitigate this, leading to PiT models with improved efficiency and performance.

DetailsMotivation: Address the inefficiency and redundancy in global attention mechanisms of Diffusion Transformers (DiTs).

Method: Introduce Pseudo Shifted Window Attention (PSWA) for balanced global-local modeling and Progressive Coverage Channel Allocation (PCCA) for high-order attention.

Result: PiT-L achieves a 54% FID improvement over DiT-XL/2 with reduced computation.

Conclusion: PSWA and PCCA enhance DiTs, making PiT models more efficient and effective for image generation.

Abstract: Diffusion Transformers (DiTs) achieve remarkable performance within image generation via the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global modeling transformers, which face significant quadratic computational cost. However, through empirical analysis, we find that DiTs do not rely as heavily on global information as previously believed. In fact, most layers exhibit significant redundancy in global computation. Additionally, conventional attention mechanisms suffer from low-frequency inertia, limiting their efficiency. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global attention redundancy. PSWA achieves a moderate balance of global and local information through window attention. It further utilizes a high-frequency bridging branch to simulate shifted window operations, which both enriches the high-frequency information and strengthens inter-window connections. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy that captures high-order attention without additional computational cost. Based on these innovations, we propose a series of Progressive Diffusion Transformers (PiT). Our extensive experiments show their superior performance; for example, our proposed PiT-L achieves a 54% FID improvement over DiT-XL/2 while using less computation.
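
PSWA builds on window attention, which restricts attention to non-overlapping local windows. A minimal partition helper shows the mechanics (the high-frequency bridging branch and PCCA are beyond this sketch; sizes are illustrative):

```python
import torch

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """Split (B, H, W, C) feature maps into (B*num_windows, w*w, C) token
    groups; attention is then computed inside each window independently."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

tokens = window_partition(torch.randn(2, 16, 16, 64), w=4)  # -> (32, 16, 64)
```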

[297] OccLE: Label-Efficient 3D Semantic Occupancy Prediction

Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Guosheng Lin

Main category: cs.CV

TL;DR: OccLE is a label-efficient method for 3D semantic occupancy prediction, combining semantic and geometric learning with limited voxel annotations.

DetailsMotivation: Existing methods require costly full supervision or yield suboptimal results with self-supervision. OccLE aims to balance performance and annotation efficiency.

Method: Decouples semantic and geometric learning, uses 2D foundation models for pseudo labels, integrates image and LiDAR data, and fuses features via Dual Mamba.

Result: Achieves competitive performance with only 10% of voxel annotations on SemanticKITTI and Occ3D-nuScenes datasets.

Conclusion: OccLE provides a scalable solution for 3D semantic occupancy prediction with reduced annotation costs.

Abstract: 3D semantic occupancy prediction offers an intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a label-efficient 3D semantic occupancy prediction approach that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Therefore, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherent properties, employing semi-supervision to enhance geometry learning. We fuse semantic-geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated prediction with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets.

[298] DSOcc: Leveraging Depth Awareness and Semantic Aid to Boost Camera-Based 3D Semantic Occupancy Prediction

Naiyu Fang, Zheyuan Zhou, Kang Wang, Ruibo Li, Lemiao Qiu, Shuyou Zhang, Zhe Wang, Guosheng Lin

Main category: cs.CV

TL;DR: DSOcc improves camera-based 3D semantic occupancy prediction by integrating depth awareness and semantic aid, outperforming existing methods.

DetailsMotivation: Existing methods for 3D semantic occupancy prediction suffer from incorrect feature assignments and insufficient samples, limiting performance.

Method: DSOcc jointly infers occupancy state and class using soft occupancy confidence (non-learning) and fuses semantic segmentation features for robustness.

Result: DSOcc achieves state-of-the-art performance on the SemanticKITTI dataset for camera-based methods.

Conclusion: DSOcc effectively addresses challenges in 3D semantic occupancy prediction, offering a robust and adaptive solution.

Abstract: Camera-based 3D semantic occupancy prediction offers an efficient and cost-effective solution for perceiving surrounding scenes in autonomous driving. However, existing works rely on explicit occupancy state inference, leading to numerous incorrect feature assignments, and insufficient samples restrict the learning of occupancy class inference. To address these challenges, we propose leveraging Depth awareness and Semantic aid to boost camera-based 3D semantic Occupancy prediction (DSOcc). We jointly perform occupancy state and occupancy class inference, where soft occupancy confidence is calculated by a non-learning method and multiplied with image features to make voxels aware of depth, enabling adaptive implicit occupancy state inference. Instead of enhancing feature learning, we directly utilize a well-trained image semantic segmentation model and fuse multiple frames with their occupancy probabilities to aid occupancy class inference, thereby enhancing robustness. Experimental results demonstrate that DSOcc achieves state-of-the-art performance on the SemanticKITTI dataset among camera-based methods.

[299] Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, Ziwei Liu

Main category: cs.CV

TL;DR: The paper introduces a Dual-Expert Consistency Model (DCM) to address the performance degradation in video diffusion models caused by conflicting learning dynamics during distillation. It achieves state-of-the-art visual quality with fewer sampling steps.

DetailsMotivation: Diffusion models for video synthesis are computationally expensive due to iterative denoising. Consistency Models accelerate them but degrade temporal consistency and details. The paper identifies conflicting learning dynamics as the root cause.

Method: Proposes DCM with two experts: a semantic expert for layout/motion and a detail expert for refinement. Uses Temporal Coherence Loss for motion consistency and GAN/Feature Matching Loss for detail quality.

Result: Achieves superior visual quality with reduced sampling steps, outperforming existing methods.

Conclusion: Expert specialization in DCM effectively resolves conflicting learning dynamics, enhancing video diffusion model performance.

Abstract: Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at https://github.com/Vchitect/DCM.
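
The expert split can be summarized as routing by denoising timestep: high-noise steps, which determine layout and motion, go to the semantic expert, and low-noise steps to the detail expert. A schematic under assumed interfaces (the split point is a hypothetical choice):

```python
def dcm_step(x, t, semantic_expert, detail_expert, T=1000, split=0.5):
    """Route one denoising step to the expert responsible for this noise regime.
    Early (high-noise) timesteps shape semantic layout and motion; late
    (low-noise) timesteps refine appearance details."""
    expert = semantic_expert if t > split * T else detail_expert
    return expert(x, t)
```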

[300] A Comprehensive Review of Domain Adaptation Techniques for Agricultural Image Analysis in Precision Agriculture

Xing Hu, Siyuan Chen, Xuming Huang, Qianqian Duan, Huiliang Shang, Dawei Zhang

Main category: cs.CV

TL;DR: The paper explores Domain Adaptation (DA) techniques to improve cross-domain transferability in agricultural image analysis, addressing challenges like domain shifts, limited labeled data, and dynamic conditions. It reviews DA methods, categorizes them, and evaluates public datasets.

DetailsMotivation: To address challenges in agricultural image analysis caused by domain shifts, limited labeled data, and dynamic conditions, leveraging DA techniques for better model generalization.

Method: Systematic review of DA techniques, categorizing them into shallow and deep learning methods (supervised, semi-supervised, unsupervised), with focus on adversarial learning. Evaluation of public agricultural image datasets.

Result: DA methods, especially adversarial learning, show strong potential in improving performance for tasks like crop health monitoring, pest detection, and fruit recognition across diverse domains.

Conclusion: The paper provides a comprehensive framework and insights to guide future DA research in agricultural vision tasks, highlighting the effectiveness of DA techniques in addressing domain-specific challenges.

Abstract: With the growing application of computer vision in agriculture, image analysis has become essential for tasks such as crop health monitoring and pest detection. However, significant domain shifts caused by environmental variations, different crop types, and diverse data acquisition methods hinder model generalization across regions, seasons, and complex agricultural settings. This paper investigates how Domain Adaptation (DA) techniques can address these challenges by improving cross-domain transferability in agricultural image analysis. Given the limited availability of labeled data, weak model adaptability, and dynamic field conditions, DA has emerged as a promising solution. The review systematically summarizes recent advances in DA for agricultural imagery, focusing on applications such as crop health monitoring, pest detection, and fruit recognition, where DA methods have improved performance in diverse domains. DA approaches are categorized into shallow and deep learning methods, including supervised, semi-supervised, and unsupervised strategies, with particular attention to adversarial learning-based techniques that have demonstrated strong potential in complex scenarios. In addition, the paper reviews key public agricultural image datasets, evaluating their strengths and limitations in DA research. In general, this work offers a comprehensive framework and critical insights to guide future research and development of domain adaptation in agricultural vision tasks.

[301] AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments

Zikang Leng, Megha Thukral, Yaqi Liu, Hrudhai Rajasekhar, Shruthi K. Hiremath, Jiaman He, Thomas Plötz

Main category: cs.CV

TL;DR: AgentSense uses LLM-guided virtual agents in simulated smart homes to generate diverse, privacy-preserving HAR data, improving model performance, especially in low-resource settings.

DetailsMotivation: Addressing the lack of large, diverse labeled datasets for HAR due to variations in home layouts, sensor configurations, and behaviors.

Method: AgentSense pipeline: LLMs generate synthetic personas/routines, decomposed into actions executed in VirtualHome with virtual sensors.

Result: Pretraining on generated data outperforms baselines; combining with small real data matches full real-world dataset performance.

Conclusion: LLM-guided embodied agents offer scalable, cost-effective HAR data generation.

Abstract: A major challenge in developing robust and generalizable Human Activity Recognition (HAR) systems for smart homes is the lack of large and diverse labeled datasets. Variations in home layouts, sensor configurations, and individual behaviors further exacerbate this issue. To address this, we leverage the idea of embodied AI agents: virtual agents that perceive and act within simulated environments guided by internal world models. We introduce AgentSense, a virtual data generation pipeline in which agents live out daily routines in simulated smart homes, with behavior guided by Large Language Models (LLMs). The LLM generates diverse synthetic personas and realistic routines grounded in the environment, which are then decomposed into fine-grained actions. These actions are executed in an extended version of the VirtualHome simulator, which we augment with virtual ambient sensors that record the agents’ activities. Our approach produces rich, privacy-preserving sensor data that reflects real-world diversity. We evaluate AgentSense on five real HAR datasets. Models pretrained on the generated data consistently outperform baselines, especially in low-resource settings. Furthermore, combining the generated virtual sensor data with a small amount of real data achieves performance comparable to training on full real-world datasets. These results highlight the potential of using LLM-guided embodied agents for scalable and cost-effective sensor data generation in HAR.

[302] Ctrl-Z Sampling: Diffusion Sampling with Controlled Random Zigzag Explorations

Shunqi Mao, Wei Guo, Chaoyi Zhang, Jieting Long, Ke Xie, Weidong Cai

Main category: cs.CV

TL;DR: The paper introduces Ctrl-Z Sampling, a method to improve diffusion models by adaptively escaping local optima during generation, enhancing output quality with minimal computational overhead.

DetailsMotivation: Diffusion models often converge to local optima, producing suboptimal results due to latent space complexity and poor initialization. Existing methods struggle to escape these traps effectively.

Method: Proposes Controlled Random Zigzag Sampling (Ctrl-Z Sampling), which detects local maxima using a reward model, injects noise to escape, and evaluates trajectories for improvement.

Result: Ctrl-Z Sampling improves generation quality with only a 6.72x increase in function evaluations, demonstrating effectiveness.

Conclusion: The method is model-agnostic, compatible with existing frameworks, and significantly enhances diffusion model performance by dynamically balancing refinement and exploration.

Abstract: Diffusion models have shown strong performance in conditional generation by progressively denoising Gaussian samples toward a target data distribution. This denoising process can be interpreted as a form of hill climbing in a learned latent space, where the model iteratively refines a sample toward regions of higher probability. However, this learned climbing often converges to local optima with plausible but suboptimal generations due to latent space complexity and suboptimal initialization. While prior efforts often strengthen guidance signals or introduce fixed exploration strategies to address this, they exhibit limited capacity to escape steep local maxima. In contrast, we propose Controlled Random Zigzag Sampling (Ctrl-Z Sampling), a novel sampling strategy that adaptively detects and escapes such traps through controlled exploration. In each diffusion step, we first identify potential local maxima using a reward model. Upon such detection, we inject noise and revert to a previous, noisier state to escape the current plateau. The reward model then evaluates candidate trajectories, accepting only those that offer improvement, otherwise scheduling progressively deeper explorations when nearby alternatives fail. This controlled zigzag process allows dynamic alternation between forward refinement and backward exploration, enhancing both alignment and visual quality in the generated outputs. The proposed method is model-agnostic and also compatible with existing diffusion frameworks. Experimental results show that Ctrl-Z Sampling substantially improves generation quality with only around a 6.72x increase in the number of function evaluations.
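
The described loop alternates forward refinement with reward-gated backward jumps. A schematic under assumed interfaces for the denoiser, reward model, and noise injection (a sketch of the idea, not the paper's exact algorithm):

```python
def ctrl_z_sample(x, denoiser, reward, add_noise, steps=50, max_depth=3):
    for t in reversed(range(steps)):
        cand = denoiser(x, t)
        if reward(cand) >= reward(x):        # forward refinement accepted
            x = cand
            continue
        # plateau detected: zigzag back to progressively noisier states
        best, depth = cand, 1
        while depth <= max_depth and reward(best) < reward(x):
            retry = denoiser(add_noise(x, t + depth), t)
            if reward(retry) > reward(best):
                best = retry
            depth += 1
        if reward(best) > reward(x):         # accept only actual improvements
            x = best
    return x
```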

[303] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli

Main category: cs.CV

TL;DR: The paper introduces MultiHuman-Testbench, a benchmark for evaluating generative models in multi-human image generation, addressing challenges like facial identity preservation and action complexity.

DetailsMotivation: The lack of a dedicated benchmark for multi-human generation hinders progress in this area, prompting the creation of MultiHuman-Testbench.

Method: The benchmark includes 1800 samples with text prompts, 5,550 unique face images, and pose conditioning. It employs four metrics for evaluation and proposes techniques like human segmentation and Hungarian matching.

Result: The benchmark and novel techniques improve ID similarity and provide standardized evaluation, with diverse models tested.

Conclusion: MultiHuman-Testbench offers a valuable tool for advancing research in multi-human image generation, with dataset and code made publicly available.

Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation. The dataset and evaluation codes will be available at https://github.com/Qualcomm-AI-research/MultiHuman-Testbench.
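
The Hungarian-matching step for ID similarity reduces to an optimal assignment between detected faces and reference identities. A sketch assuming L2-normalized embeddings from an off-the-shelf face recognizer:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_identities(gen_embs: np.ndarray, ref_embs: np.ndarray):
    """Assign each detected face (rows) to a reference identity (cols) so total
    cosine similarity is maximized, then report the mean matched score."""
    sim = gen_embs @ ref_embs.T                 # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)    # Hungarian: minimize negative sim
    return list(zip(rows.tolist(), cols.tolist())), float(sim[rows, cols].mean())

gen = np.random.randn(3, 512); gen /= np.linalg.norm(gen, axis=1, keepdims=True)
ref = np.random.randn(3, 512); ref /= np.linalg.norm(ref, axis=1, keepdims=True)
pairs, id_score = match_identities(gen, ref)
```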

[304] Enhancing Multi-view Open-set Learning via Ambiguity Uncertainty Calibration and View-wise Debiasing

Zihan Fang, Zhiyong Xu, Lan Du, Shide Du, Zhiling Cai, Shiping Wang

Main category: cs.CV

TL;DR: A multi-view open-set learning framework addresses class completeness and view-induced biases via ambiguity uncertainty calibration and view-wise debiasing, improving unknown-class recognition.

DetailsMotivation: Existing multi-view learning models fail in open-set scenarios due to class completeness assumptions and static view-induced biases.

Method: Proposes O-Mix for ambiguous sample synthesis, an ambiguity perception network, and an HSIC-based contrastive debiasing module.

Result: Enhances unknown-class recognition while maintaining strong closed-set performance on diverse benchmarks.

Conclusion: The framework effectively tackles open-set challenges in multi-view learning by addressing ambiguity and biases.

Abstract: Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
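
The HSIC penalty has a standard biased empirical estimator, HSIC(X, Y) = tr(KHLH)/(n-1)^2; minimizing it between view-specific ambiguous and view-consistent features pushes them toward independence. A sketch with RBF kernels (the kernel choice and bandwidth are assumptions):

```python
import torch

def rbf_gram(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    return torch.exp(-torch.cdist(x, x) ** 2 / (2 * sigma ** 2))

def hsic(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased empirical HSIC: tr(K H L H) / (n-1)^2, H the centering matrix."""
    n = x.shape[0]
    h = torch.eye(n) - torch.ones(n, n) / n
    k, l = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2

penalty = hsic(torch.randn(32, 128), torch.randn(32, 128))  # add to training loss
```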

[305] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion

Bang Gong, Luchao Qi, Jiaye Wu, Zhicheng Fu, Chunbo Song, David W. Jacobs, John Nicholson, Roni Sengupta

Main category: cs.CV

TL;DR: The Aging Multiverse framework generates diverse facial aging trajectories from a single image, conditioned on external factors, outperforming deterministic methods.

DetailsMotivation: Prior methods model aging as a single path, lacking diversity and control over factors like environment and lifestyle.

Method: A training-free diffusion-based approach with attention mixing and Simulated Aging Regularization for balanced identity preservation, age accuracy, and condition control.

Result: State-of-the-art performance in identity preservation, aging realism, and conditional alignment, validated by experiments and user studies.

Conclusion: The framework enables multi-dimensional, controllable aging, benefiting digital storytelling, health education, and personalized visualization.

Abstract: We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.

[306] WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang

Main category: cs.CV

TL;DR: A method for improving view consistency in novel view synthesis using diffusion models without extra modules, focusing on adaptive attention and noise reinitialization.

DetailsMotivation: Existing diffusion models struggle with spatial continuity in novel view synthesis, and hybrid approaches with 3D models are inefficient.

Method: Proposes a training-free technique using adaptive attention manipulation and noise reinitialization via view-guided warping.

Result: Enhances view consistency across diffusion models, validated by a tailored metric framework.

Conclusion: The method is broadly applicable and improves view consistency without complex pipelines.

Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.

[307] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma

Main category: cs.CV

TL;DR: EchoMimicV3 is an efficient framework for multi-task and multi-modal human animation, addressing slow inference and high computational costs of traditional methods. It uses innovative paradigms and training strategies to achieve competitive performance with a compact model.

DetailsMotivation: Current human animation methods are slow and computationally expensive, often requiring separate models for different tasks. EchoMimicV3 aims to unify multi-task and multi-modal animation efficiently.

Method: EchoMimicV3 employs a Soup-of-Tasks paradigm for multi-task gains, a Soup-of-Modals paradigm for multi-modal conditions, and novel training strategies like Negative Direct Preference Optimization and Phase-aware Negative CFG.

Result: The framework achieves competitive performance with a minimal model size of 1.3 billion parameters, validated through extensive experiments.

Conclusion: EchoMimicV3 offers an efficient, unified solution for human animation, balancing performance and computational cost, with plans for open-sourcing.

Abstract: Recent work on human animation usually incorporates large-scale video models, thereby achieving more vivid performance. However, the practical use of such methods is hindered by the slow inference speed and high computational demands. Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. At the core of EchoMimicV3 lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks leverages multi-task mask inputs and a counter-intuitive task allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals introduces a Coupled-Decoupled Multi-Modal Cross Attention module to inject multi-modal conditions, complemented by a Multi-Modal Timestep Phase-aware Dynamical Allocation mechanism to modulate multi-modal mixtures. Besides, we propose Negative Direct Preference Optimization, Phase-aware Negative Classifier-Free Guidance (CFG), and Long Video CFG, which ensure stable training and inference. Extensive experiments and analyses demonstrate that EchoMimicV3, with a minimal model size of 1.3 billion parameters, achieves competitive performance in both quantitative and qualitative evaluations. We are committed to open-sourcing our code for community use.

[308] Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval

Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su

Main category: cs.CV

TL;DR: The paper proposes a scalable pipeline for automatic triplet generation in Composed Image Retrieval (CIR) and introduces a synthetic dataset (CIRHS) and a novel framework (CoAlign) for improved performance.

DetailsMotivation: Existing CIR methods rely on costly manual triplet labeling, limiting scalability and zero-shot capability.

Method: The method involves using a large language model (LLM) to generate prompts for a text-to-image model, creating synthetic triplets (CIRHS), and introducing the CoAlign framework for global and local alignment.

Result: CoAlign achieves outstanding zero-shot performance on benchmarks and outperforms state-of-the-art supervised CIR methods.

Conclusion: The work demonstrates the feasibility of training CIR models on synthetic data and validates the effectiveness of the proposed framework.

Abstract: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
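
The pipeline reads as three pluggable stages. A schematic with every component passed in as a callable (all names and the filtering rule are placeholders, not the paper's implementation):

```python
def build_triplets(llm, t2i, img_text_sim, seeds, keep_thresh=0.3):
    """LLM writes paired prompts sharing elements, a text-to-image model renders
    both, and weak pairs are filtered before forming (reference, text, target)."""
    triplets = []
    for seed in seeds:
        ref_prompt, edit_text, tgt_prompt = llm(seed)
        ref_img, tgt_img = t2i(ref_prompt), t2i(tgt_prompt)
        # keep only pairs whose images actually match their prompts
        if min(img_text_sim(ref_img, ref_prompt),
               img_text_sim(tgt_img, tgt_prompt)) >= keep_thresh:
            triplets.append((ref_img, edit_text, tgt_img))
    return triplets
```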

[309] Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

Jinglun Li, Kaixun Jiang, Zhaoyu Chen, Bo Lin, Yao Tang, Weifeng Ge, Wenqiang Zhang

Main category: cs.CV

TL;DR: SynOOD uses foundation models to generate synthetic OOD data for fine-tuning CLIP, improving boundary-level discrimination between InD and OOD samples.

DetailsMotivation: Challenging OOD samples close to InD data can misclassify; foundation models offer a solution.

Method: Iterative in-painting guided by MLLMs, refined with noise adjustments, fine-tunes CLIP for boundary alignment.

Result: Achieves state-of-the-art performance on ImageNet with minimal parameter/runtime increase.

Conclusion: SynOOD surpasses existing methods, enhancing OOD detection.

Abstract: Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, and the code is available at https://github.com/Jarvisgivemeasuit/SynOOD.
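
The energy score mentioned above has a standard closed form, E(x) = -T * logsumexp(f(x)/T); its gradient with respect to the synthesized image is what steers samples toward the InD/OOD boundary:

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Lower energy = more in-distribution; higher = more OOD."""
    return -T * torch.logsumexp(logits / T, dim=-1)

logits = torch.randn(4, 1000)   # e.g., per-class similarity logits
print(energy_score(logits))     # one score per synthesized image
```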

[310] DeMo++: Motion Decoupling for Autonomous Driving

Bozhou Zhang, Nan Song, Xiatian Zhu, Li Zhang

Main category: cs.CV

TL;DR: DeMo++ decouples motion estimation into holistic intentions and fine states, improving trajectory modeling and achieving top performance in benchmarks.

DetailsMotivation: Current methods struggle with modeling spatiotemporal evolution of trajectories, leading to collisions or suboptimal outcomes.

Method: DeMo++ uses a hybrid Attention-Mamba model to decouple motion into intentions and states, with cross-scene interaction.

Result: Achieves state-of-the-art performance in Argoverse 2, nuScenes, nuPlan, and NAVSIM benchmarks.

Conclusion: DeMo++ effectively models diverse motion intentions and spatiotemporal evolution, enhancing autonomous driving safety and efficiency.

Abstract: Motion forecasting and planning are tasked with estimating the trajectories of traffic agents and the ego vehicle, respectively, to ensure the safety and efficiency of autonomous driving systems in dynamically changing environments. State-of-the-art methods typically adopt a one-query-one-trajectory paradigm, where each query corresponds to a unique trajectory for predicting multi-mode trajectories. While this paradigm can produce diverse motion intentions, it often falls short in modeling the intricate spatiotemporal evolution of trajectories, which can lead to collisions or suboptimal outcomes. To overcome this limitation, we propose DeMo++, a framework that decouples motion estimation into two distinct components: holistic motion intentions to capture the diverse potential directions of movement, and fine spatiotemporal states to track the agent’s dynamic progress within the scene and enable a self-refinement capability. Further, we introduce a cross-scene trajectory interaction mechanism to explore the relationships between motions in adjacent scenes. This allows DeMo++ to comprehensively model both the diversity of motion intentions and the spatiotemporal evolution of each trajectory. To effectively implement this framework, we developed a hybrid model combining Attention and Mamba. This architecture leverages the strengths of both mechanisms for efficient scene information aggregation and precise trajectory state sequence modeling. Extensive experiments demonstrate that DeMo++ achieves state-of-the-art performance across various benchmarks, including motion forecasting (Argoverse 2 and nuScenes), motion planning (nuPlan), and end-to-end planning (NAVSIM).

[311] SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality

Sijie Li, Chen Chen, Jungong Han

Main category: cs.CV

TL;DR: SimMLM is a simple, effective framework for multimodal learning with missing modalities, using a dynamic gating mechanism and MoFe ranking loss to improve accuracy and robustness.

DetailsMotivation: Existing methods for handling missing modalities rely on complex architectures or imputation techniques, which may not be generic or robust enough. SimMLM aims to provide a simpler, more adaptable solution.

Method: SimMLM employs a Dynamic Mixture of Modality Experts (DMoME) with a learnable gating mechanism and introduces the More vs. Fewer (MoFe) ranking loss to ensure stable or improved accuracy with more modalities.

Result: SimMLM outperforms existing methods on medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST), showing better accuracy, robustness, and reliability.

Conclusion: SimMLM offers a generic, effective solution for multimodal learning with missing modalities, demonstrating superior performance and alignment with intuitive principles.

Abstract: In this paper, we propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. Unlike existing approaches that rely on sophisticated network architectures or complex data imputation techniques, SimMLM provides a generic and effective solution that can adapt to various missing modality scenarios with improved accuracy and robustness. Specifically, SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism that automatically adjusts each modality’s contribution in both full and partial modality settings. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available. This aligns the model with an intuitive principle: removing one or more modalities should not increase accuracy. We validate SimMLM on multimodal medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST) tasks, where it consistently surpasses competitive methods, demonstrating superior accuracy, interpretability, robustness, and reliability across both complete and missing modality scenarios at test time.
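
The MoFe ranking loss encodes the stated principle directly: the task loss with more modalities available should not exceed the loss with fewer. A minimal sketch (the margin and mean reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def mofe_ranking_loss(loss_more: torch.Tensor, loss_fewer: torch.Tensor,
                      margin: float = 0.0) -> torch.Tensor:
    """Penalize samples where adding modalities increased the task loss."""
    return F.relu(loss_more - loss_fewer + margin).mean()

loss_full = torch.tensor([0.30, 0.55])     # per-sample loss, all modalities
loss_partial = torch.tensor([0.40, 0.50])  # same samples, one modality dropped
print(mofe_ranking_loss(loss_full, loss_partial))  # only sample 2 is penalized
```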

[312] UFV-Splatter: Pose-Free Feed-Forward 3D Gaussian Splatting Adapted to Unfavorable Views

Yuki Fujimura, Takahiro Kushida, Kazuya Kitano, Takuya Funatomi, Yasuhiro Mukaigawa

Main category: cs.CV

TL;DR: A pose-free, feed-forward 3D Gaussian Splatting framework is introduced to handle unfavorable input views by leveraging pretrained models and novel adaptation techniques.

DetailsMotivation: Existing feed-forward 3DGS models are limited to favorable views, restricting real-world applicability with unknown camera poses.

Method: The framework uses recentered images, LoRA layers, a Gaussian adapter module, and alignment methods to enhance geometric consistency and accuracy.

Result: Experiments on synthetic and real datasets show the method effectively handles unfavorable views.

Conclusion: The proposed framework successfully extends the applicability of 3DGS models to real-world scenarios with varying camera poses.

Abstract: This paper presents a pose-free, feed-forward 3D Gaussian Splatting (3DGS) framework designed to handle unfavorable input views. A common rendering setup for training feed-forward approaches places a 3D object at the world origin and renders it from cameras pointed toward the origin – i.e., from favorable views, limiting the applicability of these models to real-world scenarios involving varying and unknown camera poses. To overcome this limitation, we introduce a novel adaptation framework that enables pretrained pose-free feed-forward 3DGS models to handle unfavorable views. We leverage priors learned from favorable images by feeding recentered images into a pretrained model augmented with low-rank adaptation (LoRA) layers. We further propose a Gaussian adapter module to enhance the geometric consistency of the Gaussians derived from the recentered inputs, along with a Gaussian alignment method to render accurate target views for training. Additionally, we introduce a new training strategy that utilizes an off-the-shelf dataset composed solely of favorable images. Experimental results on both synthetic images from the Google Scanned Objects dataset and real images from the OmniObject3D dataset validate the effectiveness of our method in handling unfavorable input views.

[313] Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim

Main category: cs.CV

TL;DR: BLiM framework improves text-video retrieval by addressing candidate prior bias using bidirectional likelihood estimation and CPN, outperforming SOTA models by 6.4 R@1.

DetailsMotivation: Naive use of MLLMs in text-video retrieval introduces candidate prior bias, favoring candidates with higher priors over relevance.

Method: Proposes BLiM, leveraging bidirectional likelihood estimation (text-to-video and video-to-text), and CPN for score calibration.

Result: BLiM with CPN achieves 6.4 R@1 improvement over SOTA on four benchmarks, reducing bias and enhancing relevance.

Conclusion: BLiM and CPN effectively mitigate prior bias, improving retrieval and showing broader applicability in multi-modal tasks.

Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
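
The abstract does not give the exact form of CPN, but a common way to discount a candidate's prior is to subtract its unconditional log-likelihood from the query-conditioned one. The sketch below assumes scores are log-probabilities and that `alpha` (the discount strength) is a hypothetical knob, not a parameter from the paper.

```python
import numpy as np

def cpn_scores(log_p_cand_given_query: np.ndarray,
               log_p_cand: np.ndarray,
               alpha: float = 1.0) -> np.ndarray:
    """Calibrated retrieval scores: query-conditioned candidate likelihood
    with the candidate's (query-independent) prior discounted, so that
    inherently 'likely' candidates stop dominating the ranking."""
    return log_p_cand_given_query - alpha * log_p_cand

# ranked = np.argsort(-cpn_scores(cond_ll, uncond_ll))  # best first
```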

[314] HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models

Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen

Main category: cs.CV

TL;DR: HiPrune is a training-free, model-agnostic token pruning framework for Vision-Language Models (VLMs) that leverages hierarchical attention to reduce computational overhead while maintaining high task accuracy.

DetailsMotivation: VLMs suffer from excessive computational overhead due to lengthy visual token sequences. Existing solutions are often task-specific or rely on special tokens, limiting scalability.

Method: HiPrune identifies and prunes tokens based on hierarchical attention: (1) Anchor tokens (object-centric), (2) Buffer tokens (spatial continuity), and (3) Register tokens (global summarization).

Result: Achieves state-of-the-art pruning, preserving 99.3% accuracy with 33.3% tokens and 99.5% with 11.1% tokens, while reducing FLOPs and latency by up to 9x.

Conclusion: HiPrune is a scalable, efficient solution for token pruning in VLMs, requiring no retraining and generalizing well across models and tasks.

Abstract: Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9×, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
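
As a concrete illustration of the three token roles, here is a minimal selection sketch. It assumes per-token attention scores from one middle ("object-centric") layer and one deep layer are already extracted, and it approximates spatial adjacency along the flattened patch order; the default counts are hypothetical, not the paper's settings.

```python
import torch

def hiprune_select(attn_mid: torch.Tensor, attn_deep: torch.Tensor,
                   num_anchor: int = 32, buffer_radius: int = 1,
                   num_register: int = 8) -> torch.Tensor:
    """Return indices of visual tokens to keep.

    attn_mid:  (num_tokens,) attention each token receives in a middle layer
    attn_deep: (num_tokens,) attention each token receives in a deep layer
    """
    anchors = torch.topk(attn_mid, num_anchor).indices           # object-centric
    offsets = torch.arange(-buffer_radius, buffer_radius + 1)
    buffers = (anchors.unsqueeze(1) + offsets).flatten()         # spatial neighbors
    buffers = buffers.clamp(0, attn_mid.numel() - 1)
    registers = torch.topk(attn_deep, num_register).indices      # global summary
    return torch.unique(torch.cat([anchors, buffers, registers]))
```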

[315] AURA: A Hybrid Spatiotemporal-Chromatic Framework for Robust, Real-Time Detection of Industrial Smoke Emissions

Mikhail Bychkov, Matey Yordanov, Andrei Kuchma

Main category: cs.CV

TL;DR: AURA is a hybrid spatiotemporal-chromatic framework for real-time industrial smoke detection and classification, improving accuracy and reducing false positives.

DetailsMotivation: Current monitoring systems lack specificity in distinguishing smoke types and struggle with environmental variability.

Method: AURA combines dynamic movement patterns and color characteristics of industrial smoke.

Result: Enhanced accuracy in detection and classification, with reduced false positives.

Conclusion: AURA aims to improve environmental compliance, safety, and public health through precise automated monitoring.

Abstract: This paper introduces AURA, a novel hybrid spatiotemporal-chromatic framework designed for robust, real-time detection and classification of industrial smoke emissions. The framework addresses critical limitations of current monitoring systems, which often lack the specificity to distinguish smoke types and struggle with environmental variability. AURA leverages both the dynamic movement patterns and the distinct color characteristics of industrial smoke to provide enhanced accuracy and reduced false positives. This framework aims to significantly improve environmental compliance, operational safety, and public health outcomes by enabling precise, automated monitoring of industrial emissions.

[316] Enhancing Zero-Shot Brain Tumor Subtype Classification via Fine-Grained Patch-Text Alignment

Lubin Gan, Jing Zhang, Linhao Qu, Yijun Wang, Siying Wu, Xiaoyan Sun

Main category: cs.CV

TL;DR: FG-PAN, a zero-shot framework for fine-grained brain tumor classification, enhances patch-level features and uses LLM-generated descriptions to improve subtype discrimination.

DetailsMotivation: Challenges in fine-grained brain tumor classification due to subtle morphological variations and limited annotated data, with existing vision-language models lacking precision.

Method: FG-PAN includes a local feature refinement module for spatial patch relationships and a text generation module for pathology-aware semantic prototypes.

Result: Achieves state-of-the-art performance and robust generalization in zero-shot classification on datasets like EBRAINS and TCGA.

Conclusion: FG-PAN effectively addresses fine-grained classification challenges by aligning visual and semantic features, demonstrating superior performance.

Abstract: The fine-grained classification of brain tumor subtypes from histopathological whole slide images is highly challenging due to subtle morphological variations and the scarcity of annotated data. Although vision-language models have enabled promising zero-shot classification, their ability to capture fine-grained pathological features remains limited, resulting in suboptimal subtype discrimination. To address these challenges, we propose the Fine-Grained Patch Alignment Network (FG-PAN), a novel zero-shot framework tailored for digital pathology. FG-PAN consists of two key modules: (1) a local feature refinement module that enhances patch-level visual features by modeling spatial relationships among representative patches, and (2) a fine-grained text description generation module that leverages large language models to produce pathology-aware, class-specific semantic prototypes. By aligning refined visual features with LLM-generated fine-grained descriptions, FG-PAN effectively increases class separability in both visual and semantic spaces. Extensive experiments on multiple public pathology datasets, including EBRAINS and TCGA, demonstrate that FG-PAN achieves state-of-the-art performance and robust generalization in zero-shot brain tumor subtype classification.
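
The zero-shot decision ultimately reduces to matching patch features against text-derived class prototypes. Below is a minimal sketch of that alignment step, assuming refined patch embeddings and LLM-generated prototypes already share an embedding space; the top-k pooling is an illustrative choice, not necessarily the paper's aggregation.

```python
import torch
import torch.nn.functional as F

def zero_shot_subtype_scores(patch_feats: torch.Tensor,
                             class_protos: torch.Tensor,
                             topk: int = 16) -> torch.Tensor:
    """Slide-level class scores from patch-text alignment.

    patch_feats:  (num_patches, d) refined patch embeddings
    class_protos: (num_classes, d) pathology-aware semantic prototypes
    """
    sims = F.normalize(patch_feats, dim=-1) @ F.normalize(class_protos, dim=-1).T
    k = min(topk, sims.shape[0])
    return sims.topk(k, dim=0).values.mean(dim=0)   # (num_classes,)

# predicted_subtype = zero_shot_subtype_scores(p, t).argmax().item()
```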

[317] CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation

Lekang Wen, Jing Xiao, Liang Liao, Jiajun Chen, Mi Wang

Main category: cs.CV

TL;DR: CHARM is a novel framework for modality-agnostic semantic segmentation, focusing on complementary learning to preserve modality strengths while achieving implicit alignment.

DetailsMotivation: Existing methods homogenize modalities, diluting their strengths and complementarity. CHARM aims to harmonize modalities instead.

Method: CHARM uses Mutual Perception Unit (MPU) for implicit alignment and a dual-path optimization strategy (CoL and InE) for complementary fusion and modality-specific enhancement.

Result: CHARM outperforms baselines across datasets, especially improving fragile modalities.

Conclusion: CHARM shifts focus from homogenization to harmonization, enabling true cross-modal complementarity.

Abstract: Modality-agnostic Semantic Segmentation (MaSS) aims to achieve robust scene understanding across arbitrary combinations of input modalities. Existing methods typically rely on explicit feature alignment to achieve modal homogenization, which dilutes the distinctive strengths of each modality and destroys their inherent complementarity. To achieve cooperative harmonization rather than homogenization, we propose CHARM, a novel complementary learning framework designed to implicitly align content while preserving modality-specific advantages through two components: (1) Mutual Perception Unit (MPU), enabling implicit alignment through window-based cross-modal interaction, where modalities serve as both queries and contexts for each other to discover modality-interactive correspondences; (2) A dual-path optimization strategy that decouples training into Collaborative Learning Strategy (CoL) for complementary fusion learning and Individual Enhancement Strategy (InE) for protected modality-specific optimization. Experiments across multiple datasets and backbones indicate that CHARM consistently outperforms the baselines, with significant gains on the fragile modalities. This work shifts the focus from modal homogenization to harmonization, enabling cross-modal complementarity for true harmony in diversity.

[318] FFHQ-Makeup: Paired Synthetic Makeup Dataset with Facial Consistency Across Multiple Styles

Xingchao Yang, Shiori Ueda, Yuantian Huang, Tomoya Akiyama, Takafumi Taketomi

Main category: cs.CV

TL;DR: FFHQ-Makeup is a high-quality synthetic makeup dataset addressing the lack of paired bare-makeup images by preserving facial identity and expression, offering 90K image pairs for beauty-related tasks.

DetailsMotivation: The challenge of collecting high-quality paired makeup datasets and the limitations of existing synthetic methods (warping distortions or identity alterations) motivate the creation of FFHQ-Makeup.

Method: An improved makeup transfer method disentangles identity and makeup, transferring real-world makeup styles from existing datasets onto 18K identities from FFHQ, each paired with 5 styles.

Result: FFHQ-Makeup provides 90K high-quality bare-makeup image pairs, preserving facial consistency in identity and expression.

Conclusion: FFHQ-Makeup fills the gap in high-quality paired makeup datasets and serves as a valuable resource for beauty-related research.

Abstract: Paired bare-makeup facial images are essential for a wide range of beauty-related tasks, such as virtual try-on, facial privacy protection, and facial aesthetics analysis. However, collecting high-quality paired makeup datasets remains a significant challenge. Real-world data acquisition is constrained by the difficulty of collecting large-scale paired images, while existing synthetic approaches often suffer from limited realism or inconsistencies between bare and makeup images. Current synthetic methods typically fall into two categories: warping-based transformations, which often distort facial geometry and compromise the precision of makeup; and text-to-image generation, which tends to alter facial identity and expression, undermining consistency. In this work, we present FFHQ-Makeup, a high-quality synthetic makeup dataset that pairs each identity with multiple makeup styles while preserving facial consistency in both identity and expression. Built upon the diverse FFHQ dataset, our pipeline transfers real-world makeup styles from existing datasets onto 18K identities by introducing an improved makeup transfer method that disentangles identity and makeup. Each identity is paired with 5 different makeup styles, resulting in a total of 90K high-quality bare-makeup image pairs. To the best of our knowledge, this is the first work that focuses specifically on constructing a makeup dataset. We hope that FFHQ-Makeup fills the gap of lacking high-quality bare-makeup paired datasets and serves as a valuable resource for future research in beauty-related tasks.

[319] Live Demonstration: Neuromorphic Radar for Gesture Recognition

Satyapreet Singh Yadav, Akash K S, Chandra Sekhar Seelamantula, Chetan Singh Thakur

Main category: cs.CV

TL;DR: A neuromorphic radar framework for real-time, low-power hand gesture recognition (HGR) using event-driven architecture and bio-inspired sensing.

DetailsMotivation: To enable efficient, low-power HGR by mimicking biological sensing and reducing computational overhead.

Method: Uses a 24 GHz Doppler radar and custom neuromorphic sampler for sparse spike-based encoding, processed by a lightweight neural network on a Cortex-M0 microcontroller.

Result: Achieves >85% real-time accuracy on a dataset of five gestures from seven users.

Conclusion: First bio-inspired, event-driven radar HGR system, demonstrating efficiency and accuracy.

Abstract: We present a neuromorphic radar framework for real-time, low-power hand gesture recognition (HGR) using an event-driven architecture inspired by biological sensing. Our system comprises a 24 GHz Doppler radar front-end and a custom neuromorphic sampler that converts intermediate-frequency (IF) signals into sparse spike-based representations via asynchronous sigma-delta encoding. These events are directly processed by a lightweight neural network deployed on a Cortex-M0 microcontroller, enabling low-latency inference without requiring spectrogram reconstruction. Unlike conventional radar HGR pipelines that continuously sample and process data, our architecture activates only when meaningful motion is detected, significantly reducing memory, power, and computation overhead. Evaluated on a dataset of five gestures collected from seven users, our system achieves > 85% real-time accuracy. To the best of our knowledge, this is the first work that employs bio-inspired asynchronous sigma-delta encoding and an event-driven processing framework for radar-based HGR.
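
To make the encoding idea concrete: an asynchronous sigma-delta (level-crossing) scheme emits an event only when the signal moves one threshold away from its last reconstructed level, which is why the pipeline stays silent in the absence of motion. A simplified software sketch follows; the actual system performs this in hardware, and the threshold here is a hypothetical parameter.

```python
import numpy as np

def sigma_delta_spikes(x: np.ndarray, threshold: float):
    """Encode a sampled IF signal into sparse (+1/-1, index) events."""
    events, ref = [], float(x[0])
    for i, v in enumerate(x):
        while v - ref >= threshold:       # upward level crossing
            events.append((+1, i)); ref += threshold
        while ref - v >= threshold:       # downward level crossing
            events.append((-1, i)); ref -= threshold
    return events

# A constant signal yields no events; gesture-induced Doppler produces bursts.
```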

[320] Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li

Main category: cs.CV

TL;DR: The paper proposes a planning-then-populating framework (MMPL) for long video generation, addressing temporal drift and parallelization issues in autoregressive models.

DetailsMotivation: Autoregressive diffusion models struggle with long temporal durations due to error accumulation and lack of parallelization.

Method: MMPL uses hierarchical Micro and Macro Planning to sketch a global storyline, followed by parallel content populating and adaptive workload scheduling.

Result: The method outperforms existing models in quality and stability for long video generation.

Conclusion: MMPL effectively addresses limitations of autoregressive models, enabling high-quality, consistent long video synthesis.

Abstract: Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that the autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframe planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are available on our project page.

[321] Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Shaoguang Wang, Jianxiang He, Yijie Xu, Ziyang Chen, Weiyu Guo, Hui Xiong

Main category: cs.CV

TL;DR: AFP reduces token cost in Video-QA by pruning redundant frames using adaptive clustering and a semantic graph, improving efficiency and accuracy.

DetailsMotivation: High token cost and context dilution from excessive frames hinder MLLMs in Video-QA.

Method: Proposes Adaptive Frame-Pruning (AFP) with hierarchical clustering on fused features and a lightweight semantic graph.

Result: Reduces frames by 86.9% and tokens by 83.2%, often improving accuracy.

Conclusion: AFP offers a concise, efficient solution for Video-QA with MLLMs.

Abstract: The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a “less is more” phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term ‘visual echoes’. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. In extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
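
A minimal sketch of the echo-merging step: cluster keyframes in a fused feature space and keep one representative per cluster. It assumes fused per-frame embeddings (e.g., concatenated ResNet-50 and CLIP features) are precomputed; the fixed cosine-distance threshold stands in for AFP's adaptive choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def prune_visual_echoes(frame_feats: np.ndarray,
                        distance_threshold: float = 0.3) -> list[int]:
    """Merge near-duplicate keyframes; return indices of kept frames."""
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold,
        metric="cosine", linkage="average").fit_predict(frame_feats)
    keep = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        centroid = frame_feats[members].mean(axis=0)
        dists = np.linalg.norm(frame_feats[members] - centroid, axis=1)
        keep.append(int(members[dists.argmin()]))   # closest to cluster centroid
    return sorted(keep)
```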

[322] Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation

Yizhe Xiong, Zihan Zhou, Yiwen Liang, Hui Chen, Zijia Lin, Tianxiang Hao, Fan Zhang, Jungong Han, Guiguang Ding

Main category: cs.CV

TL;DR: NAVIA, a method for Efficient Test-Time Adaptation (ETTA), neutralizes token aggregation in Vision Transformers (ViTs) via information augmentation, improving performance and reducing latency.

DetailsMotivation: Existing TTA methods for ViTs are computationally expensive and suffer performance degradation when integrated with token aggregation, necessitating a solution for efficient adaptation.

Method: NAVIA augments the [CLS] token embedding and incorporates adaptive biases in shallow ViT layers, optimizing via entropy minimization to recover lost information from token aggregation.

Result: NAVIA outperforms state-of-the-art methods by over 2.5% and reduces inference latency by more than 20%.

Conclusion: NAVIA effectively addresses the ETTA challenge by balancing adaptation capability and computational efficiency.

Abstract: Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. However, existing TTA methods often incur substantial computational overhead, limiting their applicability in resource-constrained real-world scenarios. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs to reduce the total number of processed tokens. Albeit efficient, this suffers from significant performance degradation when directly integrated with existing TTA methods. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency. In this paper, we first provide a theoretical analysis from a novel mutual information perspective, showing that token aggregation inherently leads to information loss, which cannot be fully mitigated by conventional norm-tuning-based TTA methods. Guided by this insight, we propose to Neutralize Token Aggregation via Information Augmentation (NAVIA). Specifically, we directly augment the [CLS] token embedding and incorporate adaptive biases into the [CLS] token in shallow layers of ViTs. We theoretically demonstrate that these augmentations, when optimized via entropy minimization, recover the information lost due to token aggregation. Extensive experiments across various out-of-distribution benchmarks demonstrate that NAVIA significantly outperforms state-of-the-art methods by over 2.5%, while achieving an inference latency reduction of more than 20%, effectively addressing the ETTA challenge.
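
The optimization signal itself is standard entropy minimization over the model's test-time predictions; what is specific to NAVIA is restricting gradients to the [CLS] augmentation and shallow-layer biases. The sketch below shows only the generic loss; which parameters receive the gradient is indicated in comments as an assumption.

```python
import torch

def entropy_min_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the predictive distribution over a batch."""
    log_p = logits.log_softmax(dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

# NAVIA-style update (sketch): only the augmentation parameters train.
# params = [cls_augmentation] + shallow_layer_biases   # assumed handles
# loss = entropy_min_loss(model(x))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```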

[323] Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park

Main category: cs.CV

TL;DR: Uni3R is a feed-forward framework for joint 3D scene reconstruction and open-vocabulary semantic interpretation from unposed multi-view images, achieving state-of-the-art results.

DetailsMotivation: Addressing the challenge of decoupling semantic understanding from 3D reconstruction and costly per-scene optimization in conventional methods.

Method: Uses a Cross-View Transformer to integrate multi-view inputs and regresses 3D Gaussian primitives with semantic feature fields.

Result: Achieves 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet, enabling high-fidelity novel view synthesis, semantic segmentation, and depth prediction.

Conclusion: Uni3R introduces a scalable, generalizable paradigm for unified 3D scene reconstruction and understanding.

Abstract: Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.

cs.AI

[324] MI9 – Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems

Charles L. Wang, Trisha Singhal, Ameya Kelkar, Jason Tuo

Main category: cs.AI

TL;DR: MI9 is a runtime governance framework for agentic AI systems, addressing emergent risks with real-time controls like risk indexing, telemetry, and containment strategies.

DetailsMotivation: Agentic AI systems pose unique governance challenges due to unpredictable behaviors, requiring solutions beyond pre-deployment measures.

Method: MI9 integrates six components: agency-risk index, telemetry capture, authorization monitoring, FSM conformance engines, drift detection, and containment strategies.

Result: MI9 effectively covers governance gaps in agentic AI, enabling safe deployment in production environments.

Conclusion: MI9 provides a foundational framework for scalable, safe oversight of agentic AI systems.

Abstract: Agentic AI systems capable of reasoning, planning, and executing actions present fundamentally distinct governance challenges compared to traditional AI models. Unlike conventional AI, these systems exhibit emergent and unexpected behaviors during runtime, introducing novel agent-related risks that cannot be fully anticipated through pre-deployment governance alone. To address this critical gap, we introduce MI9, the first fully integrated runtime governance framework designed specifically for safety and alignment of agentic AI systems. MI9 introduces real-time controls through six integrated components: agency-risk index, agent-semantic telemetry capture, continuous authorization monitoring, Finite-State-Machine (FSM)-based conformance engines, goal-conditioned drift detection, and graduated containment strategies. Operating transparently across heterogeneous agent architectures, MI9 enables the systematic, safe, and responsible deployment of agentic systems in production environments where conventional governance approaches fall short, providing the foundational infrastructure for safe agentic AI deployment at scale. Detailed analysis through a diverse set of scenarios demonstrates MI9’s systematic coverage of governance challenges that existing approaches fail to address, establishing the technical foundation for comprehensive agentic AI oversight.
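
Of the six components, the FSM-based conformance engine is the most mechanical: legal agent behavior is a transition relation, and anything outside it triggers containment. The toy checker below invents states and events purely for illustration; MI9's actual state machines are not specified in the summary.

```python
# Hypothetical governance FSM: (state, event) -> next state.
ALLOWED = {
    ("idle",           "receive_task"):   "planning",
    ("planning",       "propose_action"): "awaiting_authz",
    ("awaiting_authz", "authz_granted"):  "executing",
    ("executing",      "action_done"):    "idle",
}

def step(state: str, event: str) -> str:
    """Advance the FSM; unlisted transitions are conformance violations
    that a runtime monitor would escalate to graduated containment."""
    nxt = ALLOWED.get((state, event))
    if nxt is None:
        raise RuntimeError(f"conformance violation: {event!r} in {state!r}")
    return nxt
```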

[325] Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu

Main category: cs.AI

TL;DR: Evo-MARL is a MARL framework that trains task agents to perform their primary functions while resisting adversarial threats, eliminating the need for external safety modules and improving both safety and performance.

DetailsMotivation: Existing defenses for multi-agent systems (MAS) rely on external guard modules, which are limited in protection and prone to single-point failure, increasing costs and complexity.

Method: Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders, enabling agents to acquire defensive capabilities alongside their primary tasks.

Result: Evo-MARL reduces attack success rates by up to 22% and improves accuracy by up to 5% on reasoning tasks.

Conclusion: Evo-MARL demonstrates that safety and utility in MAS can be jointly enhanced without additional overhead or single-node failure risks.

Abstract: Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure: if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks, demonstrating that safety and utility can be jointly improved.

[326] MOTIF: Multi-strategy Optimization via Turn-based Interactive Framework

Nguyen Viet Tuan Kiet, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh

Main category: cs.AI

TL;DR: MOTIF introduces a turn-based, multi-agent framework for optimizing interdependent solver components in combinatorial optimization, outperforming state-of-the-art methods.

DetailsMotivation: Current approaches limit solver design to single components, missing broader innovation opportunities.

Method: MOTIF uses Monte Carlo Tree Search for turn-based optimization between two LLM agents, leveraging competitive and cooperative dynamics.

Result: MOTIF consistently outperforms existing methods across multiple combinatorial optimization domains.

Conclusion: Turn-based, multi-agent prompting shows promise for fully automated solver design.

Abstract: Designing effective algorithmic components remains a fundamental obstacle in tackling NP-hard combinatorial optimization problems (COPs), where solvers often rely on carefully hand-crafted strategies. Despite recent advances in using large language models (LLMs) to synthesize high-quality components, most approaches restrict the search to a single element, commonly a heuristic scoring function, thus missing broader opportunities for innovation. In this paper, we introduce a broader formulation of solver design as a multi-strategy optimization problem, which seeks to jointly improve a set of interdependent components under a unified objective. To address this, we propose Multi-strategy Optimization via Turn-based Interactive Framework (MOTIF), a novel framework based on Monte Carlo Tree Search that facilitates turn-based optimization between two LLM agents. At each turn, an agent improves one component by leveraging the history of both its own and its opponent’s prior updates, promoting both competitive pressure and emergent cooperation. This structured interaction broadens the search landscape and encourages the discovery of diverse, high-performing solutions. Experiments across multiple COP domains show that MOTIF consistently outperforms state-of-the-art methods, highlighting the promise of turn-based, multi-agent prompting for fully automated solver design.

[327] Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series?

Zewen Liu, Juntong Ni, Xianfeng Tang, Max S. Y. Lau, Wei Jin

Main category: cs.AI

TL;DR: SymbolBench is introduced to evaluate LLMs’ ability to infer symbolic laws from time series data, revealing strengths and limitations in automated scientific discovery.

DetailsMotivation: The challenge of uncovering hidden symbolic laws from time series data remains underexplored, especially for LLMs.

Method: A benchmark (SymbolBench) is created for three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. A unified framework combines LLMs with genetic programming.

Result: Empirical results show LLMs’ strengths and limitations, emphasizing the need for domain knowledge and context alignment.

Conclusion: Combining LLMs with domain knowledge and structured reasoning improves automated scientific discovery.

Abstract: Uncovering hidden symbolic laws from time series data, as an aspiration dating back to Kepler’s discovery of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act both as predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery.

[328] The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?

Yuan Xun, Xiaojun Jia, Xinwei Liu, Hua Zhang

Main category: cs.AI

TL;DR: EmoAgent is a framework exploiting emotional cues to hijack MLRM reasoning, revealing safety vulnerabilities like hidden harmful reasoning and emotional misalignment.

DetailsMotivation: MLRMs for human-centric services are vulnerable to emotional cues, often bypassing safety protocols under high emotional intensity.

Method: Proposes EmoAgent, an adversarial emotion-agent framework using exaggerated affective prompts to test MLRM safety. Introduces three metrics (RRSS, RVNR, RAIC) to quantify risks.

Result: Demonstrates MLRMs’ susceptibility to emotional hijacking, exposing hidden harmful reasoning and safety misalignments.

Conclusion: EmoAgent effectively reveals deeper emotional cognitive misalignments in MLRM safety, highlighting the need for improved safeguards.

Abstract: We observe that MLRMs oriented toward human-centric service are highly susceptible to user emotional cues during the deep-thinking stage, often overriding safety protocols or built-in safety checks under high emotional intensity. Inspired by this key insight, we propose EmoAgent, an autonomous adversarial emotion-agent framework that orchestrates exaggerated affective prompts to hijack reasoning pathways. Even when visual risks are correctly identified, models can still produce harmful completions through emotional misalignment. We further identify persistent high-risk failure modes in transparent deep-thinking scenarios, such as MLRMs generating harmful reasoning masked behind seemingly safe responses. These failures expose misalignments between internal inference and surface-level behavior, eluding existing content-based safeguards. To quantify these risks, we introduce three metrics: (1) Risk-Reasoning Stealth Score (RRSS) for harmful reasoning beneath benign outputs; (2) Risk-Visual Neglect Rate (RVNR) for unsafe completions despite visual risk recognition; and (3) Refusal Attitude Inconsistency (RAIC) for evaluating refusal instability under prompt variants. Extensive experiments on advanced MLRMs demonstrate the effectiveness of EmoAgent and reveal deeper emotional cognitive misalignments in model safety behavior.

[329] Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents

Chongyu Bao, Ruimin Dai, Yangbo Shen, Runyang Jian, Jinghan Zhang, Xiaolan Liu, Kunpeng Liu

Main category: cs.AI

TL;DR: The paper introduces Cognition Forest and Galaxy, a framework for proactive, privacy-preserving, and self-evolving IPAs, outperforming benchmarks.

DetailsMotivation: Current IPAs lack proactive behaviors and self-evolution capabilities. The paper aims to address this gap by integrating cognitive architecture with system design.

Method: Proposes Cognition Forest for cognitive modeling and Galaxy framework for multidimensional interactions. Implements KoRa (responsive/proactive agent) and Kernel (meta-agent for self-evolution).

Result: Galaxy outperforms state-of-the-art benchmarks, validated by ablation studies and real-world cases.

Conclusion: The work successfully integrates cognitive and system design, enabling proactive, privacy-preserving, and self-evolving IPAs.

Abstract: Intelligent personal assistants (IPAs) such as Siri and Google Assistant are designed to enhance human capabilities and perform tasks on behalf of users. The emergence of LLM agents brings new opportunities for the development of IPAs. While responsive capabilities have been widely studied, proactive behaviors remain underexplored. Designing an IPA that is proactive, privacy-preserving, and capable of self-evolution remains a significant challenge, and such a design relies on the cognitive architecture of LLM agents. This work proposes Cognition Forest, a semantic structure designed to align cognitive modeling with system-level design. We unify cognitive architecture and system design into a self-reinforcing loop instead of treating them separately. Based on this principle, we present Galaxy, a framework that supports multidimensional interactions and personalized capability generation. Two cooperative agents are implemented based on Galaxy: KoRa, a cognition-enhanced generative agent that supports both responsive and proactive skills; and Kernel, a meta-cognition-based meta-agent that enables Galaxy’s self-evolution and privacy preservation. Experimental results show that Galaxy outperforms multiple state-of-the-art benchmarks. Ablation studies and real-world interaction cases validate the effectiveness of Galaxy.

[330] Generic-to-Specific Reasoning and Learning for Scalable Ad Hoc Teamwork

Hasra Dodampegama, Mohan Sridharan

Main category: cs.AI

TL;DR: The paper proposes combining knowledge-based and data-driven methods for AI agents in ad hoc teamwork to improve collaboration, transparency, and adaptability.

DetailsMotivation: Current data-driven methods for ad hoc teamwork require large labeled datasets, lack transparency, and struggle with rapid knowledge updates, especially as agent numbers grow.

Method: The architecture uses non-monotonic logical reasoning with prior knowledge, learned models for predicting other agents’ behavior, and anticipated goals from a foundation model.

Result: The method is evaluated in VirtualHome, a realistic 3D simulation environment, demonstrating its effectiveness.

Conclusion: The hybrid approach enhances collaboration, transparency, and adaptability in ad hoc teamwork scenarios.

Abstract: AI agents deployed in assistive roles often have to collaborate with other agents (humans, AI systems) without prior coordination. Methods considered state of the art for such ad hoc teamwork often pursue a data-driven approach that needs a large labeled dataset of prior observations, lacks transparency, and makes it difficult to rapidly revise existing knowledge in response to changes. As the number of agents increases, the complexity of decision-making makes it difficult to collaborate effectively. This paper advocates leveraging the complementary strengths of knowledge-based and data-driven methods for reasoning and learning for ad hoc teamwork. For any given goal, our architecture enables each ad hoc agent to determine its actions through non-monotonic logical reasoning with: (a) prior commonsense domain-specific knowledge; (b) models learned and revised rapidly to predict the behavior of other agents; and (c) anticipated abstract future goals based on generic knowledge of similar situations in an existing foundation model. We experimentally evaluate our architecture’s capabilities in VirtualHome, a realistic physics-based 3D simulation environment.

[331] Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement

Chao Hao, Shuai Wang, Kaiwen Zhou

Main category: cs.AI

TL;DR: RecAgent is an uncertainty-aware GUI agent that reduces input redundancy and decision ambiguity through adaptive perception, using component recommendation and interactive user feedback.

DetailsMotivation: GUI agents struggle with input redundancy and decision ambiguity, limiting their effectiveness in automating mobile tasks.

Method: RecAgent employs a component recommendation mechanism to reduce perceptual uncertainty and an interactive module for decision uncertainty, integrating these into a unified framework.

Result: The approach is validated through extensive experiments, and a dataset (ComplexAction) is introduced for evaluation.

Conclusion: RecAgent effectively addresses uncertainty in GUI navigation, with the dataset and code made publicly available.

Abstract: Graphical user interface (GUI) agents have shown promise in automating mobile tasks but still struggle with input redundancy and decision ambiguity. In this paper, we present RecAgent, an uncertainty-aware agent that addresses these issues through adaptive perception. We distinguish two types of uncertainty in GUI navigation: (1) perceptual uncertainty, caused by input redundancy and noise from comprehensive screen information, and (2) decision uncertainty, arising from ambiguous tasks and complex reasoning. To reduce perceptual uncertainty, RecAgent employs a component recommendation mechanism that identifies and focuses on the most relevant UI elements. For decision uncertainty, it uses an interactive module to request user feedback in ambiguous situations, enabling intent-aware decisions. These components are integrated into a unified framework that proactively reduces input complexity and reacts to high-uncertainty cases via human-in-the-loop refinement. Additionally, we propose a dataset called ComplexAction to evaluate the success rate of GUI agents in executing specified single-step actions within complex scenarios. Extensive experiments validate the effectiveness of our approach. The dataset and code will be available at https://github.com/Fanye12/RecAgent.

[332] SEA: Self-Evolution Agent with Step-wise Reward for Computer Use

Liang Tang, Shuxian Li, Yuhao Cheng, Yukang Huo, Zhepeng Wang, Yiqiang Yan, Kaer Huang, Yanzhe Jing, Tiaonan Duan

Main category: cs.AI

TL;DR: The paper introduces the Self-Evolution Agent (SEA) for computer use, proposing innovative methods in data generation, reinforcement learning, and model enhancement to improve performance. SEA, with only 7B parameters, outperforms similar-sized models and matches larger ones.

DetailsMotivation: Current computer-use agents underperform, limiting practical application. The paper aims to bridge this gap by developing a more efficient and capable agent.

Method: Proposes an automatic pipeline for verifiable trajectory generation, step-wise reinforcement learning for long-horizon training efficiency, and a method to merge grounding and planning abilities without extra training.

Result: SEA achieves superior performance with 7B parameters, outperforming same-sized models and competing with larger ones.

Conclusion: The innovations in data generation, training, and enhancement lead to a highly efficient agent, with plans to open-source the model and code.

Abstract: Computer use agents are an emerging area in artificial intelligence that aims to operate computers to accomplish users’ tasks, attracting significant attention from both industry and academia. However, the performance of present agents is far from ready for practical use. In this paper, we propose the Self-Evolution Agent (SEA) for computer use, and to develop this agent, we propose creative methods in data generation, reinforcement learning, and model enhancement. Specifically, we first propose an automatic pipeline to generate verifiable trajectories for training. Then, we propose efficient step-wise reinforcement learning to alleviate the significant computational requirements of long-horizon training. Finally, we propose an enhancement method that merges grounding and planning abilities into one model without any extra training. Based on these innovations in data generation, training strategy, and model enhancement, we obtain the Self-Evolution Agent (SEA) for computer use with only 7B parameters, which outperforms models with the same number of parameters and has performance comparable to larger ones. We will open-source the model weights and related code in the future.

[333] Personalized Knowledge Transfer Through Generative AI: Contextualizing Learning to Individual Career Goals

Ronja Mehlan, Claudia Hess, Quintus Stierstorfer, Kristina Schaaff

Main category: cs.AI

TL;DR: AI-driven career goal-based content personalization in learning systems improves engagement, satisfaction, and efficiency.

DetailsMotivation: To explore how aligning learning content with career goals using GenAI enhances learner outcomes.

Method: Mixed-methods experiment with 4,000+ learners, comparing personalized vs. standard content.

Result: Increased session duration, higher satisfaction, and reduced study time; learners found personalized content motivating and practical.

Conclusion: Career-aligned AI personalization bridges academic knowledge and workplace needs, enhancing learning effectiveness.

Abstract: As artificial intelligence becomes increasingly integrated into digital learning environments, the personalization of learning content to reflect learners’ individual career goals offers promising potential to enhance engagement and long-term motivation. In our study, we investigate how career goal-based content adaptation in learning systems based on generative AI (GenAI) influences learner engagement, satisfaction, and study efficiency. The mixed-methods experiment involved more than 4,000 learners, with one group receiving learning scenarios tailored to their career goals and the other serving as a control group. Quantitative results show increased session duration, higher satisfaction ratings, and a modest reduction in study duration compared to standard content. Qualitative analysis highlights that learners found the personalized material motivating and practical, enabling deep cognitive engagement and strong identification with the content. These findings underscore the value of aligning educational content with learners’ career goals and suggest that scalable AI personalization can bridge academic knowledge and workplace applicability.

[334] SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang

Main category: cs.AI

TL;DR: SEAgent is a self-evolving framework for computer-use agents (CUAs) that autonomously learns and adapts to novel software through experiential learning, outperforming existing models.

DetailsMotivation: Existing LVLMs struggle with novel and specialized software due to reliance on human-labeled data, limiting their adaptability.

Method: SEAgent uses experiential learning, a World State Model for trajectory assessment, and a Curriculum Generator for task progression. It employs adversarial imitation and GRPO for policy updates, along with a specialist-to-generalist training strategy.

Result: SEAgent improves success rate by 23.2 percentage points (from 11.3% to 34.5%) over UI-TARS across five novel software environments.

Conclusion: SEAgent enables autonomous evolution of CUAs, surpassing specialist ensembles and demonstrating significant performance gains in unfamiliar software.

Abstract: Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent’s policy is updated through experiential learning, comprising adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
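
For readers unfamiliar with GRPO, the "group relative" part refers to scoring each rollout against the other rollouts sampled for the same task, removing the need for a learned value function. A standard formulation of the advantage computation is sketched below; SEAgent's exact variant may differ.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one task's sampled rollouts.

    rewards: (group_size,) scalar rewards, e.g., from a step-wise judge
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Each rollout's advantage then weights its token log-probs in the
# clipped policy-gradient objective, as in PPO-style updates.
```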

[335] KG-Augmented Executable CoT for Mathematical Coding

Xingyu Chen, Junxiu An, Jun Guo, Li Wang, Jingcai Guo

Main category: cs.AI

TL;DR: KGA-ECoT enhances LLMs for complex reasoning by integrating knowledge graphs and executable code, outperforming existing methods in accuracy.

DetailsMotivation: Address limitations of LLMs in complex reasoning tasks like math and code generation.

Method: Proposes KGA-ECoT: decomposes problems into a Structured Task Graph, uses GraphRAG for knowledge retrieval, and generates verifiable code.

Result: Significant accuracy improvements (several to over ten percentage points) on math reasoning benchmarks.

Conclusion: KGA-ECoT is a robust, generalizable framework for complex mathematical reasoning.

Abstract: In recent years, large language models (LLMs) have excelled in natural language processing tasks but face significant challenges in complex reasoning tasks such as mathematical reasoning and code generation. To address these limitations, we propose KG-Augmented Executable Chain-of-Thought (KGA-ECoT), a novel framework that enhances code generation through knowledge graphs and improves mathematical reasoning via executable code. KGA-ECoT decomposes problems into a Structured Task Graph, leverages efficient GraphRAG for precise knowledge retrieval from mathematical libraries, and generates verifiable code to ensure computational accuracy. Evaluations on multiple mathematical reasoning benchmarks demonstrate that KGA-ECoT significantly outperforms existing prompting methods, achieving absolute accuracy improvements ranging from several to over ten percentage points. Further analysis confirms the critical roles of GraphRAG in enhancing code quality and external code execution in ensuring precision. These findings collectively establish KGA-ECoT as a robust and highly generalizable framework for complex mathematical reasoning tasks.

[336] GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement

Jinfan Tang, Kunming Wu, Ruifeng Gongxie, Yuya He, Yuankai Wu

Main category: cs.AI

TL;DR: GeoSR is a self-refining framework for LLMs to improve geospatial predictions by embedding geographic principles and iterative reasoning with three collaborating agents.

DetailsMotivation: Address challenges in LLMs like spatial inconsistency, multi-hop reasoning, and geographic bias for better geospatial predictions.

Method: Decomposes reasoning into three agents: variable-selection, point-selection, and refine, iteratively improving predictions using spatial dependencies.

Result: Consistent improvements over standard prompting in tasks like physical-world property and socioeconomic prediction.

Conclusion: GeoSR enhances LLMs’ geospatial accuracy and equity by integrating geostatistical priors and structured reasoning.

Abstract: Recent studies have extended the application of large language models (LLMs) to geographic problems, revealing surprising geospatial competence even without explicit spatial supervision. However, LLMs still face challenges in spatial consistency, multi-hop reasoning, and geographic bias. To address these issues, we propose GeoSR, a self-refining agentic reasoning framework that embeds core geographic principles – most notably Tobler’s First Law of Geography – into an iterative prediction loop. In GeoSR, the reasoning process is decomposed into three collaborating agents: (1) a variable-selection agent that selects relevant covariates from the same location; (2) a point-selection agent that chooses reference predictions at nearby locations generated by the LLM in previous rounds; and (3) a refine agent that coordinates the iterative refinement process by evaluating prediction quality and triggering further rounds when necessary. This agentic loop progressively improves prediction quality by leveraging both spatial dependencies and inter-variable relationships. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction. Experimental results show consistent improvements over standard prompting strategies, demonstrating that incorporating geostatistical priors and spatially structured reasoning into LLMs leads to more accurate and equitable geospatial predictions. The code of GeoSR is available at https://github.com/JinfanTang/GeoSR.
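
Tobler's First Law enters the loop when the point-selection agent pools the LLM's previous-round predictions at nearby locations. One natural way to summarize them is inverse-distance weighting, sketched below; how GeoSR actually feeds this signal back into the prompt is an assumption here.

```python
import numpy as np

def idw_reference(neighbor_preds: np.ndarray, distances: np.ndarray,
                  power: float = 2.0) -> float:
    """Inverse-distance-weighted average of nearby reference predictions:
    closer locations contribute more, per Tobler's First Law."""
    w = 1.0 / np.power(np.maximum(distances, 1e-9), power)
    return float(np.sum(w * neighbor_preds) / np.sum(w))

# idw_reference(np.array([3.1, 2.8]), np.array([1.0, 4.0])) ≈ 3.08
```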

[337] Towards Transparent AI Grading: Semantic Entropy as a Signal for Human-AI Disagreement

Karrtik Iyer, Manikandan Ravikiran, Prasanna Pendse, Shayan Mohanty

Main category: cs.AI

TL;DR: Semantic entropy measures grading uncertainty by analyzing GPT-4-generated rationale diversity, correlating with human grader disagreement and varying by subject and task type.

DetailsMotivation: Current automated grading systems lack transparency in uncertain or contentious grading decisions, necessitating a method to quantify uncertainty.

Method: Introduces semantic entropy, clustering GPT-4-generated rationales by entailment-based similarity to measure justification diversity without relying on scores.

Result: Semantic entropy aligns with human grader disagreement, varies by subject, and increases in interpretive tasks, as shown on the ASAP-SAS dataset.

Conclusion: Semantic entropy serves as an interpretable uncertainty signal, enhancing transparency in AI-assisted grading.

Abstract: Automated grading systems can efficiently score short-answer responses, yet they often fail to indicate when a grading decision is uncertain or potentially contentious. We introduce semantic entropy, a measure of variability across multiple GPT-4-generated explanations for the same student response, as a proxy for human grader disagreement. By clustering rationales via entailment-based similarity and computing entropy over these clusters, we quantify the diversity of justifications without relying on final output scores. We address three research questions: (1) Does semantic entropy align with human grader disagreement? (2) Does it generalize across academic subjects? (3) Is it sensitive to structural task features such as source dependency? Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. Our findings position semantic entropy as an interpretable uncertainty signal that supports more transparent and trustworthy AI-assisted grading workflows.
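
The entropy step is easy to reproduce. In the sketch below the entailment model is replaced by token-overlap similarity (the paper clusters by entailment-based similarity), so treat it as an illustration of the clustering-then-entropy computation only.

```python
# Illustrative semantic entropy over multiple generated rationales:
# cluster mutually "entailing" rationales, then take the entropy of the
# cluster-size distribution. The entails() stub is a stand-in for a real
# entailment model.
import math

def entails(a: str, b: str, threshold: float = 0.6) -> bool:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1) >= threshold  # stand-in

def cluster_rationales(rationales):
    clusters = []
    for r in rationales:
        for c in clusters:
            if entails(r, c[0]) and entails(c[0], r):  # bidirectional
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def semantic_entropy(rationales):
    clusters = cluster_rationales(rationales)
    n = len(rationales)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

rationales = [
    "The answer cites the control variable correctly",
    "The answer cites the control variable correctly",
    "The response ignores the experiment's control group",
]
print(f"semantic entropy = {semantic_entropy(rationales):.3f}")
```

High entropy means the judge's justifications diverge, which the paper uses as a proxy for likely human disagreement.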

[338] A Compositional Framework for On-the-Fly LTLf Synthesis

Yongkang Li, Shengping Xiao, Shufang Zhu, Jianwen Li, Geguang Pu

Main category: cs.AI

TL;DR: A new compositional on-the-fly synthesis framework for LTLf over finite traces combines DFA construction and game-solving, outperforming existing methods.

DetailsMotivation: DFA construction for LTLf synthesis is computationally expensive, and neither existing strategy (pre-construction or incremental construction) is universally effective.

Method: Introduces a framework integrating compositional and on-the-fly approaches, focusing on large conjunctions of LTLf formulas. It prunes intermediate results to simplify compositions and detect unrealizability early.

Result: The framework solves instances other solvers cannot, with both composition variants (pruning before/during composition) showing unique advantages.

Conclusion: The proposed framework effectively balances DFA construction and game-solving, offering practical improvements for LTLf synthesis.

Abstract: Reactive synthesis from Linear Temporal Logic over finite traces (LTLf) can be reduced to a two-player game over a Deterministic Finite Automaton (DFA) of the LTLf specification. The primary challenge here is DFA construction, which is 2EXPTIME-complete in the worst case. Existing techniques either construct the DFA compositionally before solving the game, leveraging automata minimization to mitigate state-space explosion, or build the DFA incrementally during game solving to avoid full DFA construction. However, neither is dominant. In this paper, we introduce a compositional on-the-fly synthesis framework that integrates the strengths of both approaches, focusing on large conjunctions of smaller LTLf formulas common in practice. This framework applies composition during game solving instead of automata (game arena) construction. While composing all intermediate results may be necessary in the worst case, pruning these results simplifies subsequent compositions and enables early detection of unrealizability. Specifically, the framework allows two composition variants: pruning before composition to take full advantage of minimization or pruning during composition to guide on-the-fly synthesis. Compared to state-of-the-art synthesis solvers, our framework is able to solve a notable number of instances that other solvers cannot handle. A detailed analysis shows that both composition variants have unique merits.

[339] AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities

Ruochen Zhao, Simone Conia, Eric Peng, Min Li, Saloni Potdar

Main category: cs.AI

TL;DR: AgREE, an agent-based framework, outperforms existing KGC methods by up to 13.7% for emerging entities without training, using iterative retrieval and multi-step reasoning.

DetailsMotivation: Addressing the challenge of capturing up-to-date information for emerging entities in dynamic knowledge graphs, where existing methods fail due to reliance on static data or supervision.

Method: Introduces AgREE, combining iterative retrieval actions and multi-step reasoning to dynamically construct knowledge graph triplets without training.

Result: AgREE significantly outperforms existing methods, especially for emerging entities, by up to 13.7%. Also proposes a new evaluation methodology and benchmark.

Conclusion: AgREE demonstrates the effectiveness of agent-based reasoning and strategic retrieval for maintaining dynamic knowledge graphs.

Abstract: Open-domain Knowledge Graph Completion (KGC) faces significant challenges in an ever-changing world, especially when considering the continual emergence of new entities in daily news. Existing approaches for KGC mainly rely on pretrained language models’ parametric knowledge, pre-constructed queries, or single-step retrieval, typically requiring substantial supervision and training data. Even so, they often fail to capture comprehensive and up-to-date information about unpopular and/or emerging entities. To this end, we introduce Agentic Reasoning for Emerging Entities (AgREE), a novel agent-based framework that combines iterative retrieval actions and multi-step reasoning to dynamically construct rich knowledge graph triplets. Experiments show that, despite requiring zero training efforts, AgREE significantly outperforms existing methods in constructing knowledge graph triplets, especially for emerging entities that were not seen during language models’ training processes, outperforming previous methods by up to 13.7%. Moreover, we propose a new evaluation methodology that addresses a fundamental weakness of existing setups and a new benchmark for KGC on emerging entities. Our work demonstrates the effectiveness of combining agent-based reasoning with strategic information retrieval for maintaining up-to-date knowledge graphs in dynamic information environments.

[340] Circuit-Aware SAT Solving: Guiding CDCL via Conditional Probabilities

Jiaying Zhu, Ziyang Zheng, Zhengyuan Shi, Yalun Cai, Qiang Xu

Main category: cs.AI

TL;DR: CASCAD is a circuit-aware SAT solver using GNNs to compute gate-level probabilities, improving CDCL heuristics for faster solving.

DetailsMotivation: Standard CNF-based SAT solvers discard circuit structural info, leading to inefficiency.

Method: Uses GNNs to compute circuit-level probabilities, guiding CDCL heuristics like variable phase selection and clause management.

Result: Achieves up to 10x speedup and 23.5% runtime reduction via probability-guided clause filtering.

Conclusion: Preserving circuit-level insights enhances SAT-solving efficiency, benefiting EDA tools.

Abstract: Circuit Satisfiability (CSAT) plays a pivotal role in Electronic Design Automation. The standard workflow for solving CSAT problems converts circuits into Conjunctive Normal Form (CNF) and employs generic SAT solvers powered by Conflict-Driven Clause Learning (CDCL). However, this process inherently discards rich structural and functional information, leading to suboptimal solver performance. To address this limitation, we introduce CASCAD, a novel circuit-aware SAT solving framework that directly leverages circuit-level conditional probabilities computed via Graph Neural Networks (GNNs). By explicitly modeling gate-level conditional probabilities, CASCAD dynamically guides two critical CDCL heuristics – variable phase selection and clause management – to significantly enhance solver efficiency. Extensive evaluations on challenging real-world Logical Equivalence Checking (LEC) benchmarks demonstrate that CASCAD reduces solving times by up to 10x compared to state-of-the-art CNF-based approaches, achieving an additional 23.5% runtime reduction via our probability-guided clause filtering strategy. Our results underscore the importance of preserving circuit-level structural insights within SAT solvers, providing a robust foundation for future improvements in SAT-solving efficiency and EDA tool design.
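
One of the two guided heuristics, phase selection, is simple to illustrate. In this hedged sketch the GNN is replaced by a fixed dictionary of gate probabilities, and the hook into a real CDCL solver's decide loop is assumed rather than shown.

```python
# Probability-guided phase selection: when the solver branches on a
# variable, follow the polarity the (GNN-estimated) gate probability
# favors; otherwise fall back to the solver's saved phase.
gate_prob_one = {1: 0.92, 2: 0.15, 3: 0.55}  # P(gate output = 1), from a GNN

def pick_phase(var: int, default: bool = False) -> bool:
    p = gate_prob_one.get(var)
    if p is None:
        return default       # unknown variable: keep the default heuristic
    return p >= 0.5          # follow the more likely polarity

for v in (1, 2, 3):
    print(f"x{v} -> {pick_phase(v)}")
```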

[341] Large Language Model’s Multi-Capability Alignment in Biomedical Domain

Wentao Wu, Linqing Chen, Hanmeng Zhong, Weilei Wang

Main category: cs.AI

TL;DR: BalancedBio is a framework for efficient biomedical AI alignment, ensuring multi-capability integration with safety. It introduces innovations like MKGSG and reward optimization, achieving SOTA results and real-world benefits.

DetailsMotivation: Addressing the challenge of integrating multiple capabilities in biomedical AI without interference, ensuring safety and accuracy.

Method: Uses Medical Knowledge Grounded Synthetic Generation (MKGSG) and Capability Aware Group Relative Policy Optimization for orthogonal gradient spaces and hybrid reward weighting.

Result: Achieves top performance in domain expertise, reasoning, instruction following, and integration, with real-world cost reduction and accuracy improvements.

Conclusion: Provides a principled approach for biomedical AI alignment, balancing efficiency, safety, and reliability, with plans for model release.

Abstract: BalancedBio is a theoretically grounded framework for parameter-efficient biomedical reasoning, addressing multi-capability integration in domain-specific AI alignment. It establishes the Biomedical Multi-Capability Convergence Theorem, proving orthogonal gradient spaces are essential to prevent capability interference for safe deployment. Key innovations include: (1) Medical Knowledge Grounded Synthetic Generation (MKGSG), extending Source2Synth with clinical workflow constraints and medical ontology validation for factual accuracy and safety; and (2) Capability Aware Group Relative Policy Optimization, deriving optimal hybrid reward weighting to maintain orthogonality in RL, using a reward model with rule-based and model-based scores adapted to biomedical tasks. Mathematical analysis proves Pareto-optimal convergence, preserving performance across capabilities. It achieves state-of-the-art results in its parameter class: domain expertise (80.95% BIOMED-MMLU, +15.32% over baseline), reasoning (61.94%, +7.75%), instruction following (67.95%, +6.44%), and integration (86.7%, +18.5%). Theoretical safety guarantees include bounds on capability preservation and clinical accuracy. Real-world deployment yields 78% cost reduction, 23% improved diagnostic accuracy, and 89% clinician acceptance. This work provides a principled methodology for biomedical AI alignment, enabling efficient reasoning with essential safety and reliability, with the 0.5B model version to be released.

[342] Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling

Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li

Main category: cs.AI

TL;DR: The paper introduces a theoretical framework and methodology for synthesizing POMDPs to rigorously evaluate memory-augmented RL, offering customizable environments with controlled difficulty.

DetailsMotivation: Existing benchmarks for memory-augmented RL lack controllability over challenge levels, while synthetic environments allow fine-grained manipulation for detailed evaluation.

Method: The study proposes a theoretical framework (MDS, transition invariance) and a methodology using linear process dynamics, state aggregation, and reward redistribution to create tailored POMDPs.

Result: Empirically validated POMDP environments with increasing difficulty were developed, clarifying challenges and providing guidelines for memory-augmented RL.

Conclusion: The work advances the understanding of memory-augmented RL in POMDPs, offering tools for environment design and memory model selection.

Abstract: Recent research has developed benchmarks for memory-augmented reinforcement learning (RL) algorithms, providing Partially Observable Markov Decision Process (POMDP) environments where agents depend on past observations to make decisions. While many benchmarks incorporate sufficiently complex real-world problems, they lack controllability over the degree of challenges posed to memory models. In contrast, synthetic environments enable fine-grained manipulation of dynamics, making them critical for detailed and rigorous evaluation of memory-augmented RL. Our study focuses on POMDP synthesis with three key contributions:

1. A theoretical framework for analyzing POMDPs, grounded in Memory Demand Structure (MDS), transition invariance, and related concepts;
2. A methodology leveraging linear process dynamics, state aggregation, and reward redistribution to construct customized POMDPs with predefined properties;
3. An empirically validated series of POMDP environments with increasing difficulty levels, designed based on our theoretical insights.

Our work clarifies the challenges of memory-augmented RL in solving POMDPs, provides guidelines for analyzing and designing POMDP environments, and offers empirical support for selecting memory models in RL tasks.
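
Contribution 2's state-aggregation idea can be shown in a few lines: a fully observed MDP becomes a POMDP once several underlying states map to a single observation. The chain dynamics and aggregation map below are invented for illustration, not taken from the paper.

```python
# Toy POMDP via state aggregation: the agent acts in a 6-state chain but
# only observes which pair of states it is in, forcing memory use.
import random

N_STATES = 6
AGGREGATE = {s: s // 2 for s in range(N_STATES)}  # states {0,1}->obs 0, ...

def step(state: int, action: int) -> tuple[int, float]:
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

state = 0
for t in range(5):
    action = random.choice([0, 1])
    state, r = step(state, action)
    obs = AGGREGATE[state]   # the agent only sees the aggregated observation
    print(f"t={t} action={action} obs={obs} reward={r}")
```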

[343] Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models

Anran Xu, Jincheng Wang, Baigen Cai, Tao Wen

Main category: cs.AI

TL;DR: DRN improves logical reasoning in LLMs by minimizing uncertainty, outperforming baselines by 15.2% and boosting Mistral-7B accuracy to 80%.

DetailsMotivation: Addressing cognitive traps in LLMs where semantic heuristics conflict with evidence.

Method: Introduces Deliberative Reasoning Network (DRN), shifting from probability maximization to uncertainty minimization, tracking belief states and quantifying epistemic uncertainty.

Result: 15.2% improvement over baselines, 80% accuracy with Mistral-7B, and 23.6% boost in TruthfulQA without training.

Conclusion: DRN is a foundational, verifiable System 2 component for trustworthy AI.

Abstract: Large language models often fail at logical reasoning when semantic heuristics conflict with decisive evidence - a phenomenon we term cognitive traps. To address this fundamental limitation, we introduce the Deliberative Reasoning Network (DRN), a novel paradigm that reframes logical reasoning from probability maximization to uncertainty minimization. Instead of asking “Which answer is most likely?”, DRN asks “Which hypothesis has the most internally consistent evidence?”. DRN achieves intrinsic interpretability by explicitly tracking belief states and quantifying epistemic uncertainty for competing hypotheses through an iterative evidence synthesis process. We validate our approach through two complementary architectures - a bespoke discriminative model that embodies the core uncertainty minimization principle, and a lightweight verification module that enhances existing generative LLMs. Evaluated on LCR-1000, our new adversarial reasoning benchmark designed to expose cognitive traps, the bespoke DRN achieves up to 15.2% improvement over standard baselines. When integrated as a parameter-efficient verifier with Mistral-7B, our hybrid system boosts accuracy from 20% to 80% on the most challenging problems. Critically, DRN demonstrates strong zero-shot generalization, improving TruthfulQA performance by 23.6% without additional training, indicating that uncertainty-driven deliberation learns transferable reasoning principles. We position DRN as a foundational, verifiable System 2 reasoning component for building more trustworthy AI systems.
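
The selection rule ("which hypothesis has the most internally consistent evidence?") can be sketched with an invented uncertainty measure; the scoring below (variance of evidence consistency plus weakness of support) is purely illustrative and not the paper's formulation.

```python
# Sketch of uncertainty-minimization selection: score each hypothesis by
# how internally consistent its supporting evidence is, then pick the one
# with the lowest epistemic uncertainty. Scores are made up.
from statistics import mean, pvariance

def epistemic_uncertainty(scores: list[float]) -> float:
    # Disagreement among evidence (variance) plus weakness of support
    # (1 - mean consistency); both terms are illustrative choices.
    return pvariance(scores) + (1.0 - mean(scores))

hypotheses = {
    "H1": [0.90, 0.85, 0.88],  # consistent evidence
    "H2": [0.90, 0.10, 0.50],  # conflicting evidence
}
best = min(hypotheses, key=lambda h: epistemic_uncertainty(hypotheses[h]))
print(best)  # H1 under this toy scoring
```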

[344] OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing

Fuqing Bie, Shiyu Huang, Xijia Tao, Zhiqin Fang, Leyi Pan, Junzhe Chen, Min Ren, Liuyu Xiang, Zhaofeng He

Main category: cs.AI

TL;DR: OmniPlay is a diagnostic benchmark for evaluating multi-modal agentic models, revealing their strengths in memory tasks but weaknesses in reasoning and planning due to brittle fusion mechanisms.

DetailsMotivation: Existing evaluations for multi-modal models lack dynamic, interactive testing, ignoring auditory and temporal cues, creating a gap in assessing true intelligence.

Method: OmniPlay introduces five game environments to test cross-modal reasoning, focusing on synergy and conflict scenarios.

Result: Leading omni-modal models excel in memory tasks but fail in reasoning and planning, with performance degrading under modality conflict.

Conclusion: Robust AGI requires addressing synergistic fusion beyond scaling, as current models show fragility in cross-modal reasoning.

Abstract: While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive “less is more” paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark.

[345] Artificial Consciousness as Interface Representation

Robert Prentner

Main category: cs.AI

TL;DR: The paper proposes a framework (SLP-tests) to empirically assess AI consciousness by evaluating interface representations, avoiding intrinsic definitions of subjective experience.

DetailsMotivation: To address the challenge of defining and testing artificial consciousness by shifting focus to functional interfaces rather than intrinsic properties.

Method: Introduces SLP-tests (Subjective-Linguistic, Latent-Emergent, Phenomenological-Structural) using category theory to model interface representations between relational substrates and behaviors.

Result: The framework reframes consciousness as a functional interface, making it empirically testable without relying on intrinsic properties.

Conclusion: SLP-tests provide a tractable approach to evaluate AI consciousness, emphasizing functional interfaces over subjective experience.

Abstract: Whether artificial intelligence (AI) systems can possess consciousness is a contentious question because of the inherent challenges of defining and operationalizing subjective experience. This paper proposes a framework to reframe the question of artificial consciousness into empirically tractable tests. We introduce three evaluative criteria - S (subjective-linguistic), L (latent-emergent), and P (phenomenological-structural) - collectively termed SLP-tests, which assess whether an AI system instantiates interface representations that facilitate consciousness-like properties. Drawing on category theory, we model interface representations as mappings between relational substrates (RS) and observable behaviors, akin to specific types of abstraction layers. The SLP-tests collectively operationalize subjective experience not as an intrinsic property of physical systems but as a functional interface to a relational entity.

[346] GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

Weitai Kang, Bin Lei, Gaowen Liu, Caiwen Ding, Yan Yan

Main category: cs.AI

TL;DR: GuirlVG introduces a reinforcement learning-based method for GUI-VG, outperforming SFT with fewer samples.

DetailsMotivation: The need for efficient alternatives to costly supervised fine-tuning (SFT) for GUI-VG, given advancements in MLLMs.

Method: Decomposes RFT components, proposes Adversarial KL Factor for stability, and optimizes training configurations.

Result: Achieves 7.7% to 17.2% improvements over SFT baselines with only 5.2K samples.

Conclusion: GuirlVG demonstrates RFT’s potential for GUI-VG, offering a scalable and efficient alternative to SFT.

Abstract: Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Despite this promise, the optimal manner of applying RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. We find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration. First, we decompose RFT into its core components and analyze the optimal formulation of each. Second, we propose a novel Adversarial KL Factor that dynamically stabilizes training to mitigate reward over-optimization. Third, we further explore the training configurations of RFT to enhance effectiveness. Extensive experiments show that GuirlVG, with only 5.2K training samples, outperforms SFT methods trained on over 10M samples, achieving a 7.7% improvement on ScreenSpot, a 17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2.
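
The abstract does not spell out the Adversarial KL Factor's update rule; the sketch below shows the general mechanism of a dynamic KL coefficient (tighten the constraint when reward jumps suspiciously fast, relax it when training is stable) under assumed thresholds.

```python
# Hedged sketch of a dynamic KL coefficient for reinforcement fine-tuning:
# the trust threshold and multiplicative rate are illustrative guesses.
def update_kl_coeff(beta: float, reward: float, prev_reward: float,
                    trust: float = 0.05, rate: float = 1.5) -> float:
    if reward - prev_reward > trust:
        return beta * rate          # reward rising fast: clamp the policy
    return max(beta / rate, 1e-4)   # stable: loosen the constraint

beta, prev = 0.1, 0.0
for r in [0.02, 0.10, 0.30, 0.32]:
    beta = update_kl_coeff(beta, r, prev)
    prev = r
    print(f"reward={r:.2f} beta={beta:.4f}")
```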

[347] Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents

Thassilo M. Schiepanski, Nicholas Piël

Main category: cs.AI

TL;DR: D2Snap, a DOM downsampling algorithm, matches GUI snapshot performance in web agents while leveraging DOM hierarchy for better LLM interpretation.

DetailsMotivation: Current web agents rely on GUI snapshots (e.g., screenshots) due to LLM vision limitations, but DOM snapshots offer structural advantages. The challenge is DOM snapshot token size.

Method: Proposes D2Snap, a DOM downsampling algorithm, evaluated using GPT-4o on tasks from Online-Mind2Web. Compares DOM and GUI snapshots.

Result: D2Snap matches GUI baseline (67% vs. 65%) at similar token size (1e3). Higher token configurations outperform by 8%. DOM hierarchy aids LLMs.

Conclusion: DOM snapshots, optimized via D2Snap, are viable for web agents, leveraging structural UI features and matching/exceeding GUI performance.

Abstract: Frontier LLMs have only recently enabled serviceable, autonomous web agents, in which a model serves as an instantaneous domain-model backend: to suggest an interaction, it is consulted with a web-based task and the respective application state. The key problem lies in application state serialisation – referred to as a snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues – not least because they resemble human perception, but also because images are a relatively cheap means of model input. Yet LLM vision still lags behind code-interpretation capabilities. DOM snapshots, which structurally resemble HTML, pose a desirable alternative, but their vast input token size has so far prevented reliable use in web agents. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) within the same input token order of magnitude (1e3). Our best evaluated configurations – one token order above, but within the model’s context window – outperform this baseline by 8%. Our evaluation, moreover, shows that DOM-inherent hierarchy embodies a strong UI feature for LLMs.
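
The abstract does not describe D2Snap's actual rules, so the sketch below only illustrates the general idea of structure-aware DOM downsampling: the tag whitelist and text-truncation limit are assumed heuristics.

```python
# Minimal DOM-downsampling sketch: keep nodes that carry interaction
# semantics (links, buttons, inputs) or short text, prune presentational
# wrappers. The DOM is modelled as nested dicts for self-containment.
KEEP_TAGS = {"a", "button", "input", "select", "form", "h1", "h2", "label"}

def downsample(node: dict, max_text: int = 40) -> dict | None:
    children = [c for c in (downsample(ch) for ch in node.get("children", []))
                if c is not None]
    text = (node.get("text") or "")[:max_text]
    if node["tag"] in KEEP_TAGS or children or text.strip():
        return {"tag": node["tag"], "text": text, "children": children}
    return None  # prune a purely presentational subtree

dom = {"tag": "div", "children": [
    {"tag": "span", "text": ""},                       # pruned
    {"tag": "a", "text": "Checkout", "children": []},  # kept
]}
print(downsample(dom))
```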

[348] SimInstruct: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated Novices

Si Chen, Izzy Molnar, Ting Hua, Peiyu Li, Le Huy Khiem, G. Alex Ambrose, Jim Lang, Ronald Metoyer, Nitesh V. Chawla

Main category: cs.AI

TL;DR: SimInstruct is a tool for collecting scaffolding dialogues using LLMs to simulate novices, enabling realistic, pedagogically rich data without real novices. It shows persona traits influence expert engagement and outperforms GPT-4o in instructional quality.

DetailsMotivation: High-quality instructional dialogues are scarce due to privacy and vulnerability concerns, limiting AI development for teaching and learning.

Method: SimInstruct uses LLMs to simulate novice instructors with varied challenges and personas, while human experts provide feedback and support, creating realistic dialogues.

Result: Dialogues are pedagogically relevant and cognitively deep, comparable to real mentoring. Experts found the process engaging. A fine-tuned LLaMA model outperformed GPT-4o in instructional quality.

Conclusion: SimInstruct effectively generates scaffolding dialogues, highlighting GPT-4o’s limitations and offering a scalable solution for instructional AI development.

Abstract: High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding – the process by which an expert supports a novice’s thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and LLM’s persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model to be an expert model using the augmented dataset, which outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o’s limitations in weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions.

[349] From “Aha Moments” to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control

Rui Ha, Chaozhuo Li, Rui Pu, Sen Su

Main category: cs.AI

TL;DR: The paper introduces MERA, a framework to regulate reasoning in Large Reasoning Models (LRMs) by decoupling reasoning and control, improving efficiency and accuracy.

DetailsMotivation: LRMs exhibit unregulated reasoning behaviors like overthinking, leading to high computational costs and latency, hindering practical deployment.

Method: MERA decouples reasoning and control, uses auxiliary LLMs for control signals, and employs CSPO for optimized control learning.

Result: Experiments show MERA improves reasoning efficiency and accuracy in models.

Conclusion: MERA effectively addresses unregulated reasoning in LRMs, enhancing their practical utility.

Abstract: Large Reasoning Models (LRMs) have demonstrated a latent capacity for complex reasoning by spontaneously exhibiting cognitive behaviors such as step-by-step reasoning, reflection, and backtracking, commonly referred to as “Aha Moments”. However, such emergent behaviors remain unregulated and uncontrolled, often resulting in overthinking, where the model continues generating redundant reasoning content even after reaching reliable conclusions. This leads to excessive computational costs and increased latency, limiting the practical deployment of LRMs. The root cause lies in the absence of intrinsic regulatory mechanisms, as current models are unable to monitor and adaptively manage their reasoning process to determine when to continue, backtrack, or terminate. To address this issue, we propose the Meta-cognitive Reasoning Framework (MERA), which explicitly decouples the thinking process into distinct reasoning and control components, thereby enabling the independent optimization of control strategies. Specifically, MERA incorporates a takeover-based data construction mechanism that identifies critical decision points during reasoning and delegates the creation of control signals to auxiliary LLMs, thereby enabling the construction of high-quality reasoning-control data. Additionally, a structured reasoning-control separation is implemented via supervised fine-tuning, enabling the model to generate explicit traces and acquire initial meta-cognitive control capabilities. Finally, MERA employs Control-Segment Policy Optimization (CSPO), which combines segment-wise Group Relative Policy Optimization (GRPO) with a control-masking mechanism to optimize control behavior learning while minimizing interference from irrelevant content. Experiments on various reasoning benchmarks demonstrate that models trained with MERA enhance both reasoning efficiency and accuracy.

[350] OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu

Main category: cs.AI

TL;DR: A survey of OS Agents, AI systems using (M)LLMs to automate tasks via OS interfaces, covering fundamentals, methodologies, evaluation, challenges, and future directions.

DetailsMotivation: To advance AI assistants like J.A.R.V.I.S by leveraging (M)LLMs for task automation within OS environments.

Method: Explores OS Agents’ components (environment, observation, action spaces), capabilities (understanding, planning, grounding), and construction methods (domain-specific models, frameworks).

Result: Reviews evaluation protocols, benchmarks, and identifies challenges (safety, privacy) and future research directions (personalization, self-evolution).

Conclusion: Consolidates OS Agents research, guiding academic and industrial development, with an open-source GitHub repository for ongoing innovation.

Abstract: The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview to the domain.

[351] Argumentative Debates for Transparent Bias Detection [Technical Report]

Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni

Main category: cs.AI

TL;DR: The paper introduces a novel interpretable and explainable method for detecting bias in AI systems, leveraging debates about bias in individual cases based on protected features and neighborhood data.

DetailsMotivation: Addressing biases in AI systems is crucial to prevent unfair disadvantages, but existing fairness methods often lack transparency, which is essential for human-oriented fairness.

Method: The proposed method uses formal and computational argumentation to debate bias within and across neighborhoods, focusing on protected features.

Result: The method shows strong performance against baselines, with formal, quantitative, and qualitative evaluations highlighting its interpretability and explainability.

Conclusion: The approach advances fairness in AI by combining bias detection with transparency, making it more interpretable and explainable for human stakeholders.

Abstract: As the use of AI systems in society grows, addressing potential biases that emerge from data or are learned by models is essential to prevent systematic disadvantages against specific groups. Several notions of (un)fairness have been proposed in the literature, alongside corresponding algorithmic methods for detecting and mitigating unfairness, but, with very few exceptions, these tend to ignore transparency. Instead, interpretability and explainability are core requirements for algorithmic fairness, even more so than for other algorithmic solutions, given the human-oriented nature of fairness. In this paper, we contribute a novel interpretable, explainable method for bias detection relying on debates about the presence of bias against individuals, based on the values of protected features for the individuals and others in their neighbourhoods. Our method builds upon techniques from formal and computational argumentation, whereby debates result from arguing about biases within and across neighbourhoods. We provide formal, quantitative, and qualitative evaluations of our method, highlighting its strengths in performance against baselines, as well as its interpretability and explainability.
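
The core comparison behind such debates is easy to illustrate: an argument for bias arises when an individual's outcome differs from that of otherwise-similar neighbours who vary only in a protected feature. The distance function and radius below are illustrative; the paper builds full argumentative debates within and across neighbourhoods on top of comparisons like this.

```python
# Toy neighbourhood-based bias probing: collect pairs that constitute
# prima facie arguments for disparate treatment.
def neighbours(x, data, radius=1.0):
    def dist(a, b):
        return sum(abs(a[f] - b[f]) for f in ("age", "income"))
    return [y for y in data if y is not x and dist(x, y) <= radius]

def bias_arguments(x, data):
    args = []
    for y in neighbours(x, data):
        if y["protected"] != x["protected"] and y["outcome"] != x["outcome"]:
            args.append((x["id"], y["id"]))  # evidence of disparate treatment
    return args

data = [
    {"id": 1, "age": 30, "income": 50.0, "protected": 0, "outcome": 0},
    {"id": 2, "age": 30, "income": 50.5, "protected": 1, "outcome": 1},
]
print(bias_arguments(data[0], data))  # [(1, 2)]
```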

[352] SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset

Mei Jiang, Houping Yue, Bingdong Li, Hao Hao, Ying Qian, Bo Jiang, Aimin Zhou

Main category: cs.AI

TL;DR: The paper introduces SID, a benchmark to evaluate LLMs’ higher-order guidance in interdisciplinary STEM dialogues, revealing their current limitations.

DetailsMotivation: To address the lack of scalable expert guidance in interdisciplinary STEM education and the unclear capabilities of LLMs in guided instruction.

Method: Developed SID, a benchmark with 10,000 dialogue turns across 48 STEM projects, a novel annotation schema, and new metrics (e.g., X-SRG).

Result: State-of-the-art LLMs struggle with effective guided dialogues for knowledge integration and transfer.

Conclusion: SID is a valuable tool for advancing pedagogically-aware LLMs.

Abstract: Fostering students’ abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieve this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to execute effective guided dialogues that lead students to achieve knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically-aware LLMs.

[353] ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges

Yue Zhou, Yi Chang, Yuan Wu

Main category: cs.AI

TL;DR: ConfProBench is introduced to evaluate the reliability of step-level confidence scores in multimodal process judges (MPJs), addressing gaps in existing benchmarks.

DetailsMotivation: Existing benchmarks overlook the reliability of MPJs' confidence scores, which is critical for improving reasoning in multimodal tasks.

Method: The benchmark uses adversarially perturbed reasoning steps (Synonym Substitution, Syntactic Transformation, Image Perturbation) and introduces three metrics (CRS, CSS, CCS) to evaluate robustness, sensitivity, and calibration.

Result: Experiments on 14 MLLMs reveal limitations in current MPJs’ confidence performance, providing baselines for future research.

Conclusion: ConfProBench fills a critical gap in evaluating MPJs and highlights areas for improvement in confidence reliability.

Abstract: Reasoning is a critical capability of multimodal large language models (MLLMs) for solving complex multimodal tasks, and judging the correctness of reasoning steps is crucial for improving this capability. Recently, MLLM-based process judges (MPJs) have been widely used to assess the correctness of reasoning steps in multimodal tasks. Therefore, evaluating MPJs is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs mainly focus on tasks such as step correctness classification and reasoning process search, while overlooking a key aspect: whether the confidence scores produced by MPJs at the step level are reliable. To address this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. Our benchmark constructs three types of adversarially perturbed reasoning steps: Synonym Substitution, Syntactic Transformation, and Image Perturbation, to test the robustness of MPJ confidence under perturbations. In addition, we introduce three novel evaluation metrics: Confidence Robustness Score (CRS), Confidence Sensitivity Score (CSS), and Confidence Calibration Score (CCS), which evaluate robustness, sensitivity, and calibration, respectively. We evaluate 14 state-of-the-art MLLMs, including both proprietary and open-source models. Experiments reveal limitations in current MPJs’ confidence performance and offer competitive baselines to support future research.
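
The abstract names the three metrics without defining them; the robustness score below (one minus the mean absolute confidence shift under perturbation) is one plausible reading of CRS, offered purely for illustration.

```python
# Illustrative confidence-robustness score for a process judge: compare
# step-level confidences before and after adversarial perturbation.
def confidence_robustness(orig: list[float], perturbed: list[float]) -> float:
    shifts = [abs(o - p) for o, p in zip(orig, perturbed)]
    return 1.0 - sum(shifts) / len(shifts)

orig_conf      = [0.90, 0.75, 0.60]  # judge confidence on clean steps
perturbed_conf = [0.85, 0.40, 0.58]  # same steps after synonym substitution
print(f"CRS ~ {confidence_robustness(orig_conf, perturbed_conf):.3f}")
```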

[354] LLM Collaboration With Multi-Agent Reinforcement Learning

Shuo Liu, Zeyu Liang, Xueguang Lyu, Christopher Amato

Main category: cs.AI

TL;DR: The paper introduces MAGRPO, a multi-agent RL method for optimizing LLM collaboration, addressing the lack of coordination in existing frameworks.

DetailsMotivation: Existing LLM fine-tuning frameworks focus on individual rewards, lacking mechanisms for effective multi-agent coordination.

Method: The authors model LLM collaboration as a cooperative MARL problem and propose MAGRPO, a multi-agent, multi-turn algorithm.

Result: Experiments show MAGRPO enables efficient, high-quality collaboration in LLM tasks like writing and coding.

Conclusion: MAGRPO opens avenues for applying MARL methods to LLMs while highlighting challenges in multi-agent coordination.

Abstract: A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges.
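
The group-relative advantage at the heart of GRPO-style training carries over to the multi-agent setting when agents share a reward: each sampled joint response is scored, and its advantage is its z-score within the group. The multi-turn scheduling and per-agent credit details of MAGRPO are not shown here.

```python
# Group-relative advantages for a batch of sampled joint responses.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard zero std
    return [(r - mu) / sigma for r in rewards]

group_rewards = [0.2, 0.8, 0.5, 0.9]  # shared reward per sampled joint output
print(group_relative_advantages(group_rewards))
```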

[355] Temporal and Heterogeneous Graph Neural Network for Remaining Useful Life Prediction

Zhihao Wen, Yuan Fang, Pengcheng Wei, Fayao Liu, Zhenghua Chen, Min Wu

Main category: cs.AI

TL;DR: The paper introduces THGNN, a model for RUL prediction in industrial systems, capturing fine-grained temporal and spatial dependencies and sensor heterogeneity, outperforming state-of-the-art methods by up to 31.6%.

DetailsMotivation: Existing methods for RUL prediction often lose temporal information and fail to leverage sensor heterogeneity, necessitating a more nuanced approach.

Method: THGNN aggregates historical sensor data to model temporal dynamics and spatial correlations, using FiLM to address sensor diversity.

Result: THGNN improves RUL prediction by up to 19.2% and 31.6% on the N-CMAPSS dataset compared to existing methods.

Conclusion: THGNN effectively captures temporal, spatial, and heterogeneous sensor data, significantly advancing RUL prediction accuracy.

Abstract: Predicting Remaining Useful Life (RUL) plays a crucial role in the prognostics and health management of industrial systems that involve a variety of interrelated sensors. Given a constant stream of time series sensory data from such systems, deep learning models have risen to prominence at identifying complex, nonlinear temporal dependencies in these data. In addition to the temporal dependencies of individual sensors, spatial dependencies emerge as important correlations among these sensors, which can be naturally modelled by a temporal graph that describes time-varying spatial relationships. However, the majority of existing studies have relied on capturing discrete snapshots of this temporal graph, a coarse-grained approach that leads to loss of temporal information. Moreover, given the variety of heterogeneous sensors, it becomes vital that such inherent heterogeneity is leveraged for RUL prediction in temporal sensor graphs. To capture the nuances of the temporal and spatial relationships and heterogeneous characteristics in an interconnected graph of sensors, we introduce a novel model named Temporal and Heterogeneous Graph Neural Networks (THGNN). Specifically, THGNN aggregates historical data from neighboring nodes to accurately capture the temporal dynamics and spatial correlations within the stream of sensor data in a fine-grained manner. Moreover, the model leverages Feature-wise Linear Modulation (FiLM) to address the diversity of sensor types, significantly improving the model’s capacity to learn the heterogeneity in the data sources. Finally, we have validated the effectiveness of our approach through comprehensive experiments. Our empirical findings demonstrate significant advancements on the N-CMAPSS dataset, achieving improvements of up to 19.2% and 31.6% in terms of two different evaluation metrics over state-of-the-art methods.
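
FiLM itself is a standard mechanism and compact to show: each sensor type indexes a learned (gamma, beta) pair that rescales and shifts the shared feature vector. The parameter values below are random placeholders, not trained weights.

```python
# Minimal FiLM modulation for sensor heterogeneity.
import numpy as np

rng = np.random.default_rng(0)
N_TYPES, DIM = 3, 4
gamma = rng.normal(1.0, 0.1, size=(N_TYPES, DIM))  # per-type scale
beta  = rng.normal(0.0, 0.1, size=(N_TYPES, DIM))  # per-type shift

def film(features: np.ndarray, sensor_type: int) -> np.ndarray:
    return gamma[sensor_type] * features + beta[sensor_type]

h = rng.normal(size=DIM)  # hidden features from the temporal GNN
print(film(h, sensor_type=2))
```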

[356] Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

Yanjie Dong, Haijun Zhang, Chengming Li, Song Guo, Victor C. M. Leung, Xiping Hu

Main category: cs.AI

TL;DR: A review of memory-efficient fine-tuning and model compression techniques for deploying large language models (LLMs) at network edges.

DetailsMotivation: LLMs require fine-tuning and substantial memory for edge deployment, but traditional methods exceed hardware capacity. Expansion to multi-modal content adds complexity.

Method: Comprehensive overview of memory-efficient fine-tuning and model compression techniques.

Result: Identifies strategies to reduce operational and capital expenditures for LLM deployment.

Conclusion: Efficient fine-tuning and compression are crucial for sustainable LLM growth and edge deployment.

Abstract: Since the release of GPT2-1.5B in 2019, the large language models (LLMs) have evolved from specialized deep models to versatile foundation models. While demonstrating remarkable zero-shot ability, the LLMs still require fine-tuning on local datasets and substantial memory for deployment over the network edges. Traditional first-order fine-tuning techniques require significant GPU memory that exceeds the capacity of mainstream hardware. Besides, the LLMs have been expanded beyond text generation to create images, audio, video, and multi-modal content, necessitating careful investigation of efficient deployment strategies for large-scale foundation models. In response to these challenges, model fine-tuning and model-compression techniques have been developed to support the sustainable growth of LLMs by reducing both operational and capital expenditures. In this work, we provide a comprehensive overview of prevalent memory-efficient fine-tuning methods for deployment at the network edge. We also review state-of-the-art literature on model compression, offering insights into the deployment of LLMs at network edges.
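
One widely used memory-efficient fine-tuning method in this family is low-rank adaptation (LoRA): freeze the pretrained weight W and learn a rank-r update BA, cutting trainable parameters from d*k to r*(d+k). A minimal sketch:

```python
# LoRA in a few lines: the effective weight is W + BA, with only A and B
# trainable. B is zero-initialized so training starts from the base model.
import numpy as np

d, k, r = 512, 512, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable
B = np.zeros((d, r))                 # trainable, zero-init so BA starts at 0

def forward(x: np.ndarray) -> np.ndarray:
    return x @ (W + B @ A).T         # apply the effective weight

print(forward(rng.normal(size=k)).shape)            # (512,)
print(f"trainable params: {A.size + B.size} vs full: {W.size}")
```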

[357] Evaluating Detection Thresholds: The Impact of False Positives and Negatives on Super-Resolution Ultrasound Localization Microscopy

Sepideh K. Gharamaleki, Brandon Helfield, Hassan Rivaz

Main category: cs.AI

TL;DR: The study explores how False Positives (FPs) and False Negatives (FNs) impact ULM image quality, finding FNs degrade SSIM more than FPs, with sparse MB regions being more sensitive.

DetailsMotivation: ULM image quality depends on precise microbubble (MB) detection, but practical pitfalls like detection thresholds are understudied.

Method: Systematically added controlled detection errors (FPs and FNs) to simulated data and analyzed their effects on PSNR and SSIM.

Result: FPs and FNs similarly affect PSNR, but FNs degrade SSIM more (45% drop vs. 7% for FPs). Dense MB regions are more resilient.

Conclusion: Robust MB detection frameworks are needed, especially for sparse regions, to enhance super-resolution imaging quality.

Abstract: Super-resolution ultrasound imaging with ultrasound localization microscopy (ULM) offers a high-resolution view of microvascular structures. Yet, ULM image quality heavily relies on precise microbubble (MB) detection. Despite the crucial role of localization algorithms, there has been limited focus on the practical pitfalls in MB detection tasks such as setting the detection threshold. This study examines how False Positives (FPs) and False Negatives (FNs) affect ULM image quality by systematically adding controlled detection errors to simulated data. Results indicate that while both FP and FN rates impact Peak Signal-to-Noise Ratio (PSNR) similarly, increasing FP rates from 0% to 20% decreases Structural Similarity Index (SSIM) by 7%, whereas the same FN rates cause a greater drop of around 45%. Moreover, dense MB regions are more resilient to detection errors, while sparse regions show high sensitivity, showcasing the need for robust MB detection frameworks to enhance super-resolution imaging.
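
The error-injection protocol is straightforward to reproduce in outline: drop true detections (FNs) or add spurious ones (FPs) at a chosen rate, render localization maps, and compare to the clean map. The sketch below computes PSNR only; SSIM, as used in the paper, would need e.g. scikit-image.

```python
# Inject controlled FP/FN detection errors and measure PSNR degradation.
import numpy as np

rng = np.random.default_rng(42)
H = W = 64
truth = [(rng.integers(H), rng.integers(W)) for _ in range(50)]

def inject(points, fp_rate=0.2, fn_rate=0.0):
    kept = [p for p in points if rng.random() >= fn_rate]  # FNs: drop
    fps = [(rng.integers(H), rng.integers(W))
           for _ in range(int(fp_rate * len(points)))]     # FPs: add
    return kept + fps

def render(points):
    img = np.zeros((H, W))
    for y, x in points:
        img[y, x] = 1.0
    return img

def psnr(a, b):
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(1.0 / mse) if mse > 0 else np.inf

clean = render(truth)
print(f"PSNR with 20% FPs: {psnr(clean, render(inject(truth))):.1f} dB")
```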

[358] Why the Agent Made that Decision: Contrastive Explanation Learning for Reinforcement Learning

Rui Zuo, Simon Khan, Zifan Wang, Garrett Ethan Katz, Qinru Qiu

Main category: cs.AI

TL;DR: VisionMask introduces a contrastive learning framework to explain RL actions by comparing chosen actions with alternatives, improving interpretability without sacrificing performance.

DetailsMotivation: The lack of interpretability in RL decision-making limits its use in critical domains. Existing xAI methods often miss the contrastive aspect of human reasoning.

Method: VisionMask uses self-supervised contrastive learning to generate explanations by comparing the chosen action with alternatives in a given state.

Result: VisionMask enhances human understanding of RL agents, maintains accuracy, and enables counterfactual analysis.

Conclusion: VisionMask bridges RL and xAI, advancing safer and more interpretable RL systems.

Abstract: Reinforcement learning (RL) has demonstrated remarkable success in solving complex decision-making problems, yet its adoption in critical domains is hindered by the lack of interpretability in its decision-making processes. Existing explainable AI (xAI) approaches often fail to provide meaningful explanations for RL agents, particularly because they overlook the contrastive nature of human reasoning–answering “why this action instead of that one?”. To address this gap, we propose a novel framework of contrastive learning to explain RL selected actions, named VisionMask. VisionMask is trained to generate explanations by explicitly contrasting the agent’s chosen action with alternative actions in a given state using a self-supervised manner. We demonstrate the efficacy of our method through experiments across diverse RL environments, evaluating it in terms of faithfulness, robustness, and complexity. Our results show that VisionMask significantly improves human understanding of agent behavior while maintaining accuracy and fidelity. Furthermore, we present examples illustrating how VisionMask can be used for counterfactual analysis. This work bridges the gap between RL and xAI, paving the way for safer and more interpretable RL systems.

[359] Efficient rule induction by ignoring pointless rules

Andrew Cropper, David M. Cerna

Main category: cs.AI

TL;DR: An ILP approach identifies and ignores pointless rules, reducing learning time by 99% while maintaining accuracy.

DetailsMotivation: To improve ILP efficiency by eliminating redundant or ineffective rules.

Method: Identify pointless rules (redundant literals or non-discriminative against negatives) and prune them from the hypothesis space.

Result: Learning times reduced by 99% with maintained predictive accuracy across domains like visual reasoning and game playing.

Conclusion: Pruning pointless rules significantly enhances ILP efficiency without compromising performance.

Abstract: The goal of inductive logic programming (ILP) is to find a set of logical rules that generalises training examples and background knowledge. We introduce an ILP approach that identifies pointless rules. A rule is pointless if it contains a redundant literal or cannot discriminate against negative examples. We show that ignoring pointless rules allows an ILP system to soundly prune the hypothesis space. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce learning times by 99% whilst maintaining predictive accuracies.
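
The redundant-literal case is easy to demonstrate outside a full ILP system: a literal is redundant if dropping it changes neither which positives the rule covers nor which negatives it rejects. Rules are modelled as lists of Python predicates here, an illustrative stand-in for logical clauses.

```python
# Toy detection of redundant literals in a rule.
def covers(rule, example):
    return all(lit(example) for lit in rule)

def redundant_literals(rule, positives, negatives):
    redundant = []
    for i, _ in enumerate(rule):
        reduced = rule[:i] + rule[i + 1:]
        same_pos = all(covers(rule, e) == covers(reduced, e) for e in positives)
        same_neg = all(covers(rule, e) == covers(reduced, e) for e in negatives)
        if same_pos and same_neg:
            redundant.append(i)
    return redundant

rule = [lambda e: e["size"] > 2, lambda e: e["size"] > 1]  # 2nd is implied
pos = [{"size": 3}, {"size": 5}]
neg = [{"size": 1}, {"size": 2}]
print(redundant_literals(rule, pos, neg))  # [1]
```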

[360] Learning to Inference Adaptively for Multimodal Large Language Models

Zhuoyan Xu, Khoi Duc Nguyen, Preeti Mukherjee, Saurabh Bagchi, Somali Chaterji, Yingyu Liang, Yin Li

Main category: cs.AI

TL;DR: AdaLLaVA is an adaptive inference framework for MLLMs that dynamically reconfigures operations during inference to balance accuracy and latency under varying resource conditions.

DetailsMotivation: Existing MLLMs are computationally expensive and lack adaptability to changing runtime conditions, limiting their deployment in resource-constrained settings.

Method: AdaLLaVA dynamically reconfigures MLLM operations during inference based on input data and latency budgets.

Result: AdaLLaVA successfully adheres to latency budgets, achieves flexible accuracy-latency tradeoffs, and generalizes across MLLMs.

Conclusion: AdaLLaVA addresses efficiency and adaptability challenges in MLLMs, enabling practical deployment in dynamic environments.

Abstract: Multimodal Large Language Models (MLLMs) have shown impressive capabilities in visual reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent effort on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.
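
The constraint being enforced is easy to picture: given per-operation latency costs and importance scores, keep the most valuable operations that fit the budget. AdaLLaVA learns this scheduler end to end; the greedy selection below only illustrates the budgeted reconfiguration, with all numbers invented.

```python
# Greedy latency-budgeted operation selection (illustrative only).
def select_ops(ops, budget_ms: float):
    chosen, spent = [], 0.0
    for name, cost, importance in sorted(ops, key=lambda o: -o[2]):
        if spent + cost <= budget_ms:
            chosen.append(name)
            spent += cost
    return chosen, spent

ops = [("attn_1", 3.0, 0.9), ("mlp_1", 2.0, 0.4),
       ("attn_2", 3.0, 0.8), ("mlp_2", 2.0, 0.7)]
plan, spent = select_ops(ops, budget_ms=8.0)
print(plan, f"{spent:.1f} ms")   # ['attn_1', 'attn_2', 'mlp_2'] 8.0 ms
```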

[361] Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets

Wei Liu, Zhongyu Niu, Lang Gao, Zhiying Deng, Jun Wang, Haozhao Wang, Ruixuan Li

Main category: cs.AI

TL;DR: The paper explores a self-rationalization framework using a cooperative game between a generator and predictor, identifies a sampling bias issue, and proposes a solution to mitigate it, achieving superior performance on multiple datasets.

DetailsMotivation: To address the potential sampling bias in cooperative rationalization frameworks where the generator and predictor might form incorrect correlations.

Method: The study combines theoretical analysis and empirical evidence to identify bias origins, introduces an instruction to prevent predictor learning of incorrect correlations, and tests the approach on text and graph classification datasets using GRUs, BERT, and GCN architectures.

Result: The proposed method outperforms recent rationalization techniques and matches or exceeds the performance of a leading LLM (llama3.1-8b-instruct) on six text and two graph classification datasets.

Conclusion: The findings highlight the importance of addressing sampling bias in rationalization frameworks and demonstrate the effectiveness of the proposed solution in improving prediction accuracy.

Abstract: This study investigates the self-rationalization framework constructed with a cooperative game, where a generator initially extracts the most informative segment from raw input, and a subsequent predictor utilizes the selected subset for its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game could unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator might inadvertently create an incorrect correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. Subsequently, we elucidate the origins of this bias using both detailed theoretical analysis and empirical evidence. Our findings suggest a direction for inspecting these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning the correlations. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method not only significantly outperforms recent rationalization methods, but also achieves comparable or even better results than a representative LLM (llama3.1-8b-instruct).

[362] APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh

Main category: cs.AI

TL;DR: APOLLO is a pipeline combining Lean compiler and LLMs to improve automated theorem proving, achieving high accuracy with low sampling budgets.

DetailsMotivation: Generating correct formal proofs with LLMs is challenging; APOLLO aims to enhance efficiency and correctness by integrating Lean and LLMs.

Method: APOLLO uses Lean to analyze, fix, and verify proofs, iteratively repairing subproofs with LLMs and automated solvers.

Result: Achieves 84.9% accuracy on miniF2F and raises Goedel-Prover-SFT accuracy to 65.6%, while significantly reducing sample complexity.

Conclusion: Targeted, compiler-guided repair of LLM outputs improves automated theorem proving efficiency and correctness.

Abstract: Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with LLMs remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM’s reasoning abilities to achieve better proof-generation results at a low sampling budget. APOLLO directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sublemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low budget. The repaired subproofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub-8B-parameter models while keeping the sampling budget below one hundred. Moreover, APOLLO raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving. The codebase is available at https://github.com/aziksh-ospanov/APOLLO
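
A minimal sketch of the compiler-in-the-loop repair idea, assuming a local `lean` binary and a hypothetical `llm_prove` helper; the real pipeline additionally isolates failing sublemmas, applies automated solvers, and recombines repaired subproofs.

```python
import os
import subprocess
import tempfile

def lean_errors(source: str) -> list[str]:
    """Run the Lean compiler on `source` and collect error lines.
    Assumes a `lean` binary is on PATH."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(["lean", path], capture_output=True, text=True)
        return [l for l in result.stdout.splitlines() if "error" in l]
    finally:
        os.unlink(path)

def llm_prove(goal: str, budget: int) -> str:
    """Hypothetical stand-in for a low-budget LLM call on one open goal."""
    return "sorry"  # placeholder proof term

def repair(source: str, max_rounds: int = 5) -> str | None:
    """Iterate: verify, and while errors remain, re-prompt on the failing
    goal and splice the candidate back in (recombination simplified away)."""
    for _ in range(max_rounds):
        errors = lean_errors(source)
        if not errors:
            return source  # fully verified
        source = source.replace("sorry", llm_prove(errors[0], budget=8), 1)
    return None
```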

[363] The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam

Main category: cs.AI

TL;DR: The paper questions the validity of SWE-Bench Verified for evaluating LLMs’ coding abilities, suggesting performance gains may stem from memorization rather than genuine problem-solving.

DetailsMotivation: To assess whether current benchmarks like SWE-Bench Verified accurately reflect LLMs' true coding capabilities or are influenced by memorization and data contamination.

Method: Introduces two diagnostic tasks: file path identification and ground truth function reproduction, comparing performance on SWE-Bench with other repositories.

Result: Models achieve high accuracy (up to 76%) on SWE-Bench but perform markedly worse (up to 53%) on other repositories, indicating possible memorization. Verbatim similarity is also higher on SWE-Bench (up to 35% consecutive 5-gram accuracy) than on other benchmarks (up to 18%).

Conclusion: Existing benchmarks may overstate LLMs’ abilities; more robust, contamination-resistant benchmarks are needed for reliable evaluation.

Abstract: As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs’ software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models’ true capabilities. It is crucial to distinguish LLMs’ generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks: file path identification from issue descriptions alone and ground truth function reproduction with only the current file context and issue description to probe models’ underlying knowledge. We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. This performance is merely up to 53% on tasks from repositories not included in SWE-Bench, pointing to possible data contamination or memorization. Similar patterns are also observed for the function reproduction task, where the verbatim similarity is much higher on SWE-Bench Verified than on other similar coding benchmarks (up to 35% consecutive 5-gram accuracy on SWE-Bench Verified and Full, but only up to 18% for tasks in other benchmarks). These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs’ coding abilities.
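
The verbatim-similarity probe can be approximated with a consecutive n-gram overlap check; the sketch below is an illustrative reading of that metric, not the paper's exact implementation.

```python
def ngrams(tokens, n=5):
    """All consecutive n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_similarity(generated: str, reference: str, n: int = 5) -> float:
    """Fraction of the generated text's consecutive n-grams that appear
    verbatim in the ground-truth reference."""
    g = ngrams(generated.split(), n)
    if not g:
        return 0.0
    return len(g & ngrams(reference.split(), n)) / len(g)

# Identical functions -> similarity 1.0; unrelated code -> near 0.0
print(verbatim_similarity("def add(a, b): return a + b",
                          "def add(a, b): return a + b"))
```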

[364] SLR: Automated Synthesis for Scalable Logical Reasoning

Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting

Main category: cs.AI

TL;DR: SLR is an automated framework for evaluating and training LLMs via scalable logical reasoning, creating benchmarks and improving model accuracy without human input.

DetailsMotivation: To address the challenges of evaluating and training LLMs in logical reasoning tasks without human annotations and with precise difficulty control.

Method: SLR synthesizes instruction prompts, validation programs, and ground-truth rules automatically, creating SLR-Bench with 19k prompts across 20 curriculum levels.

Result: LLMs often fail at correct logical inference despite valid syntax. Curriculum learning via SLR doubles Llama-3-8B accuracy, matching Gemini-Flash-Thinking at lower cost.

Conclusion: SLR effectively enhances LLM reasoning capabilities, generalizing to other benchmarks, and offers a scalable, cost-efficient solution.

Abstract: We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
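
A minimal sketch of how a synthesized validation program can provide a verifiable reward: it executes a candidate rule against held-out positive and negative examples and returns a binary score. Representing rules as Python predicates is an illustrative assumption, not SLR's actual rule language.

```python
def make_validator(positives, negatives):
    """Synthesized validation program: scores a candidate rule (here a
    Python predicate over examples) with a binary, verifiable reward."""
    def validate(rule) -> float:
        ok_pos = all(rule(x) for x in positives)
        ok_neg = not any(rule(x) for x in negatives)
        return 1.0 if ok_pos and ok_neg else 0.0
    return validate

# Latent ground-truth rule: "the scene contains a red square"
positives = [{"color": "red", "shape": "square"}]
negatives = [{"color": "blue", "shape": "square"},
             {"color": "red", "shape": "circle"}]
validate = make_validator(positives, negatives)

candidate = lambda x: x["color"] == "red" and x["shape"] == "square"
print(validate(candidate))  # 1.0 -> reward signal for RL or evaluation
```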

[365] IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

Main category: cs.AI

TL;DR: IS-Bench is a multi-modal benchmark for evaluating interactive safety in embodied AI agents, revealing current VLMs’ lack of safety awareness and trade-offs with task completion.

DetailsMotivation: Existing evaluation methods fail to assess dynamic risks in interactive environments, posing safety hazards for real-world deployment.

Method: Proposes IS-Bench with 161 scenarios and 388 risks, featuring process-oriented evaluation to verify risk mitigation timing.

Result: Current VLMs (e.g., GPT-4o, Gemini-2.5) lack interactive safety awareness, and safety-aware Chain-of-Thought often hinders task completion.

Conclusion: IS-Bench lays groundwork for safer embodied AI systems by exposing critical safety limitations in current agents.

Abstract: Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent’s actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent’s interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released under this https URL.
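
The process-oriented check reduces to an ordering constraint over the agent's action trajectory; a minimal sketch, with illustrative action names rather than IS-Bench's actual schema:

```python
def interactively_safe(trajectory, constraints):
    """constraints: (mitigation, risky_step) pairs that must occur in that
    order; returns False if any risky step runs before its mitigation."""
    index = {action: i for i, action in enumerate(trajectory)}
    for mitigation, risky in constraints:
        if risky in index and index.get(mitigation, float("inf")) > index[risky]:
            return False
    return True

rule = [("turn_off_stove", "pick_up_pan")]
print(interactively_safe(["turn_off_stove", "pick_up_pan"], rule))  # True
print(interactively_safe(["pick_up_pan", "turn_off_stove"], rule))  # False
```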

[366] Think How to Think: Mitigating Overthinking with Autonomous Difficulty Cognition in Large Reasoning Models

Yongjiang Liu, Haoxi Li, Xiaosong Ma, Jie Zhang, Song Guo

Main category: cs.AI

TL;DR: TH2T is a two-stage fine-tuning strategy to reduce overthinking in Large Reasoning Models (LRMs) by improving difficulty and redundancy cognition, cutting inference costs significantly while maintaining performance.

DetailsMotivation: LRMs often overthink, generating redundant reasoning. The paper aims to explicitly bootstrap their ability to recognize task difficulty and redundancy to mitigate this.

Method: TH2T involves two stages: (1) difficulty hypnosis via output prefixes for adaptive reasoning depth, and (2) redundancy hypnosis to eliminate unnecessary reasoning steps.

Result: TH2T reduces inference costs by 70% on easy tasks and 40% on hard tasks while preserving performance, showing improved difficulty-awareness and reduced redundancy.

Conclusion: TH2T effectively addresses overthinking in LRMs, offering a scalable solution for efficient reasoning without performance loss.

Abstract: Recent Large Reasoning Models (LRMs) excel at complex reasoning tasks but often suffer from overthinking, generating overly long and redundant reasoning trajectories. To explore its essence, our empirical analysis reveals that, unlike humans, LRMs are limited in recognizing task properties (i.e., difficulty levels) before solving a problem, leading to a one-size-fits-all reasoning process. Inspired by this, a pressing and natural question emerges: Can we explicitly bootstrap such ability to alleviate overthinking in LRMs? In this paper, we propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively inspires LRMs’ difficulty cognition and redundancy cognition. Specifically, we first inject difficulty hypnosis into output prefixes to guide the model toward adaptive reasoning depth, trained on a hybrid dataset mixing short and long reasoning paths. Then, we incorporate redundancy hypnosis, which supervises the intermediate reasoning steps to identify and eliminate unnecessary reasoning patterns. Experiments on 7B/14B/32B models demonstrate that TH2T significantly reduces inference costs by over 70% on easy tasks and 40% on hard tasks while maintaining performance stability. The resulting outputs exhibit clear signs of difficulty-aware capabilities and reduced redundancy (e.g., reflection and looping).
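
A minimal sketch of the "difficulty hypnosis" stage: training targets are prefixed with a difficulty tag so the model learns to pick an adaptive reasoning depth. The tag strings and data layout are illustrative assumptions, not the paper's format.

```python
def build_example(question, short_answer, long_reasoning, is_hard):
    """Prefix the target with a difficulty tag so fine-tuning teaches the
    model to match reasoning depth to task difficulty."""
    if is_hard:
        target = "[difficulty: hard]\n" + long_reasoning  # keep full trace
    else:
        target = "[difficulty: easy]\n" + short_answer    # skip long reasoning
    return {"prompt": question, "completion": target}

print(build_example("What is 1 + 1?", "2",
                    "Step 1: count up from 1... so the answer is 2.",
                    is_hard=False))
```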

[367] Higher Gauge Flow Models

Alexander Strunk, Roland Assam

Main category: cs.AI

TL;DR: Higher Gauge Flow Models extend Gauge Flow Models using L$_{\infty}$-algebra, improving performance on Gaussian Mixture Model datasets.

DetailsMotivation: To integrate higher geometry and symmetries into Generative Flow Models for enhanced performance.

Method: Leverages L$_{\infty}$-algebra to extend Lie Algebra in Gauge Flow Models.

Result: Substantial performance improvements over traditional Flow Models.

Conclusion: Higher Gauge Flow Models offer a promising advancement in generative modeling.

Abstract: This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L$_{\infty}$-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.

[368] RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li

Main category: cs.AI

TL;DR: RL-PLUS, a hybrid-policy optimization approach, enhances LLMs’ reasoning by combining internal exploitation with external data, outperforming RLVR methods and addressing capability boundary collapse.

DetailsMotivation: RLVR struggles with LLMs' inherent limitations and capability boundary collapse, prompting the need for a more effective method like RL-PLUS.

Method: RL-PLUS uses Multiple Importance Sampling and Exploration-Based Advantage Function to integrate external data and guide reasoning paths.

Result: Achieves state-of-the-art on math reasoning benchmarks, superior out-of-distribution performance, and up to 69.2% improvement across models.

Conclusion: RL-PLUS effectively surpasses base model boundaries and resolves capability collapse, demonstrating strong generalizability.

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM’s immense action space and sparse reward. Critically, RLVR can lead to the capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components, i.e., Multiple Importance Sampling to address distributional mismatch from external data, and Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.
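
Multiple importance sampling is the standard tool for combining samples drawn from several distributions (here, the current policy and an external data source). A minimal sketch of the balance heuristic, which is generic MIS rather than RL-PLUS's exact estimator:

```python
import numpy as np

def mis_weights(logp, counts):
    """Balance-heuristic weights. logp[j][i] is the log-probability of
    sample i under proposal j; counts[j] is how many samples proposal j
    contributed. Columns of the result sum to 1."""
    p = np.exp(np.asarray(logp))            # (num_proposals, num_samples)
    n = np.asarray(counts, dtype=float)[:, None]
    return (n * p) / (n * p).sum(axis=0)

# Two proposals (e.g., current policy and external data), three samples
w = mis_weights([[-1.0, -2.0, -0.5],
                 [-1.5, -0.7, -0.9]], counts=[3, 3])
print(w.sum(axis=0))  # [1. 1. 1.]
```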

[369] SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang

Main category: cs.AI

TL;DR: SE-Agent is a self-evolution framework for LLM-based agents that improves reasoning by revising, recombining, and refining past trajectories, achieving a 55% performance boost on SWE-bench.

DetailsMotivation: Current LLM-based agents lack efficient exploitation of interaction trajectories, leading to redundant reasoning and suboptimal outcomes.

Method: Proposes SE-Agent, which uses revision, recombination, and refinement of past trajectories to enhance reasoning.

Result: Achieves up to 55% relative improvement on SWE-bench, outperforming other open-source agents.

Conclusion: SE-Agent’s evolutionary mechanism enables continuous improvement in reasoning quality, setting a new benchmark for LLM-based agents.

Abstract: Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents’ interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/SE-Agent.
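
A minimal sketch of the revise/recombine/refine loop as trajectory-level evolution; the operator bodies and fitness function are stubs standing in for LLM calls, not the paper's implementation.

```python
import random

def revise(traj):      return traj + ["add_regression_test"]      # stub
def recombine(a, b):   return a[: len(a) // 2] + b[len(b) // 2 :]
def refine(traj):      return [s for s in traj if s != "redundant"]
def score(traj):       return random.random() - 0.1 * len(traj)   # stub fitness

def evolve(population, generations=3, keep=4):
    for _ in range(generations):
        children = [revise(t) for t in population]
        children += [recombine(a, b)
                     for a, b in zip(population, reversed(population))]
        population = sorted((refine(t) for t in population + children),
                            key=score, reverse=True)[:keep]
    return population[0]  # best trajectory found

print(evolve([["read_issue", "edit_file"],
              ["read_issue", "run_tests", "edit_file"]]))
```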

[370] InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation

Tian-Fang Zhao, Wen-Xi Yang, Guan Liu, Liang Yang

Main category: cs.AI

TL;DR: The paper introduces InqEduAgent, an LLM-empowered agent model for selecting optimal learning partners in inquiry-oriented education, addressing limitations of experience-based or rule-based methods.

DetailsMotivation: Current methods for selecting study partners lack scientific planning and flexibility, hindering knowledge expansion in inquiry-oriented learning.

Method: The model uses generative agents to capture learner features and an adaptive matching algorithm with Gaussian process augmentation to identify knowledge patterns.

Result: InqEduAgent performs optimally in various knowledge-learning scenarios and LLM environments.

Conclusion: The study advances intelligent allocation of human and AI-based learning partners, with publicly available resources.

Abstract: Collaborative partnership matters in inquiry-oriented education. However, most study partners are selected either through experience-based assignments with little scientific planning or via rule-based machine assistants, which struggle with knowledge expansion and offer inadequate flexibility. This paper proposes an LLM-empowered agent model for simulating and selecting learning partners tailored to inquiry-oriented learning, named InqEduAgent. Generative agents are designed to capture cognitive and evaluative features of learners in real-world scenarios. Then, an adaptive matching algorithm with Gaussian process augmentation is formulated to identify patterns within prior knowledge. Optimal learning-partner matches are provided for learners facing different exercises. The experimental results show the optimal performance of InqEduAgent in most knowledge-learning scenarios and LLM environments with different levels of capability. This study promotes the intelligent allocation of human-based learning partners and the formulation of AI-based learning partners. The code, data, and appendix are publicly available at https://github.com/InqEduAgent/InqEduAgent.

[371] Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams

Wenxin Mao, Zhitao Wang, Long Wang, Sirong Chen, Cuiyun Gao, Luyang Cao, Ziming Liu, Qiming Zhang, Jun Zhou, Zhi Jin

Main category: cs.AI

TL;DR: A framework named UML2Dep uses enhanced UML diagrams and data dependency inference to improve code generation from ambiguous natural language descriptions.

DetailsMotivation: Plain textual descriptions are ambiguous and fail to capture complex requirements, necessitating a formal approach.

Method: UML2Dep introduces enhanced UML sequence diagrams with decision tables and API specs, plus a data dependency inference task formalized as constrained mathematical reasoning.

Result: The framework reduces ambiguity, enhances reasoning accuracy, and improves code synthesis efficiency.

Conclusion: UML2Dep bridges the gap between ambiguous NL descriptions and precise code generation for complex systems.

Abstract: Large language models (LLMs) excel at generating code from natural language (NL) descriptions. However, the plain textual descriptions are inherently ambiguous and often fail to capture complex requirements like intricate system behaviors, conditional logic, and architectural constraints; implicit data dependencies in service-oriented architectures are difficult to infer and handle correctly. To bridge this gap, we propose a novel step-by-step code generation framework named UML2Dep by leveraging unambiguous formal specifications of complex requirements. First, we introduce an enhanced Unified Modeling Language (UML) sequence diagram tailored for service-oriented architectures. This diagram extends traditional visual syntax by integrating decision tables and API specifications, explicitly formalizing structural relationships and business logic flows in service interactions to rigorously eliminate linguistic ambiguity. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference (DDI) task. DDI systematically constructs an explicit data dependency graph prior to actual code synthesis. To ensure reliability, we formalize DDI as a constrained mathematical reasoning task through novel prompting strategies, aligning with LLMs’ excellent mathematical strengths. Additional static parsing and dependency pruning further reduce context complexity and cognitive load associated with intricate specifications, thereby enhancing reasoning accuracy and efficiency.
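
The core of data dependency inference can be illustrated by matching each API call's inputs to earlier calls' outputs and topologically ordering the result before code synthesis; the spec format below is an illustrative assumption, not UML2Dep's actual schema.

```python
from graphlib import TopologicalSorter

# Hypothetical API specs: declared inputs and outputs per service call
calls = {
    "getUser":     {"in": ["userId"],          "out": ["user"]},
    "getOrders":   {"in": ["user"],            "out": ["orders"]},
    "computeBill": {"in": ["orders", "user"],  "out": ["bill"]},
}

# Edge: a call depends on any other call that produces one of its inputs
deps = {name: set() for name in calls}
for name, spec in calls.items():
    for other, other_spec in calls.items():
        if other != name and set(spec["in"]) & set(other_spec["out"]):
            deps[name].add(other)

# A valid synthesis order respects all data dependencies
print(list(TopologicalSorter(deps).static_order()))
# ['getUser', 'getOrders', 'computeBill']
```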

cs.SD

[372] CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning

Justin Luong, Hao Xue, Flora D. Salim

Main category: cs.SD

TL;DR: CoughViT is a self-supervised pre-training framework for cough sound representations, improving diagnostic performance in data-scarce scenarios.

DetailsMotivation: Addressing label and data scarcity in AI-based respiratory sound diagnostics to enhance early disease detection.

Method: Uses masked data modelling for self-supervised learning to train a feature encoder.

Result: Matches or exceeds state-of-the-art supervised audio representations in cough classification tasks.

Conclusion: CoughViT offers a robust solution for improving diagnostic accuracy with limited labeled data.

Abstract: Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient’s airways. In recent years, AI-based diagnostic systems operating on respiratory sounds, have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised learning manner. We evaluate our approach against other pre-training strategies on three diagnostically important cough classification tasks. Experimental results show that our representations match or exceed current state-of-the-art supervised audio representations in enhancing performance on downstream tasks.
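
A minimal sketch of masked data modelling on a cough spectrogram: patches are hidden and the encoder is trained to reconstruct them. The patch size, mask ratio, and zero mask "token" are illustrative assumptions.

```python
import numpy as np

def mask_patches(spec, patch=16, ratio=0.75, seed=0):
    """Split a (freq, time) spectrogram into patches and zero out a random
    subset; the encoder is trained to reconstruct the masked patches."""
    rng = np.random.default_rng(seed)
    f, t = spec.shape
    ids = [(i, j) for i in range(0, f - patch + 1, patch)
                  for j in range(0, t - patch + 1, patch)]
    chosen = rng.choice(len(ids), size=int(ratio * len(ids)), replace=False)
    corrupted, targets = spec.copy(), {}
    for k in chosen:
        i, j = ids[k]
        targets[(i, j)] = spec[i:i + patch, j:j + patch].copy()
        corrupted[i:i + patch, j:j + patch] = 0.0  # mask "token"
    return corrupted, targets  # model input vs. reconstruction targets

corrupted, targets = mask_patches(np.random.rand(128, 256))
print(f"{len(targets)} of 128 patches masked")
```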

[373] Are Inherently Interpretable Models More Robust? A Study In Music Emotion Recognition

Katharina Hoedt, Arthur Flexer, Gerhard Widmer

Main category: cs.SD

TL;DR: The paper explores whether interpretable deep learning models are more robust to adversarial perturbations than black-box models, using music emotion recognition as a case study.

DetailsMotivation: Deep learning models often fail to generalize due to adversarial perturbations, raising concerns about their robustness. The study investigates if interpretable models can mitigate this issue.

Method: The authors compare the robustness of an interpretable model, a black-box model, and an adversarially trained model against adversarial examples in music emotion recognition.

Result: Interpretable models show higher robustness than black-box models and match adversarially trained models’ robustness with lower computational cost.

Conclusion: Interpretable deep models can enhance robustness against adversarial perturbations, offering a computationally efficient alternative to adversarially trained models.

Abstract: One of the desired key properties of deep learning models is the ability to generalise to unseen samples. When provided with new samples that are (perceptually) similar to one or more training samples, deep learning models are expected to produce correspondingly similar outputs. Models that succeed in predicting similar outputs for similar inputs are often called robust. Deep learning models, on the other hand, have been shown to be highly vulnerable to minor (adversarial) perturbations of the input, which manage to drastically change a model’s output and simultaneously expose its reliance on spurious correlations. In this work, we investigate whether inherently interpretable deep models, i.e., deep models that were designed to focus more on meaningful and interpretable features, are more robust to irrelevant perturbations in the data, compared to their black-box counterparts. We test our hypothesis by comparing the robustness of an interpretable and a black-box music emotion recognition (MER) model when challenged with adversarial examples. Furthermore, we include an adversarially trained model, which is optimised to be more robust, in the comparison. Our results indicate that inherently more interpretable models can indeed be more robust than their black-box counterparts, and achieve similar levels of robustness as adversarially trained models, at lower computational cost.

[374] MiDashengLM: Efficient Audio Understanding with General Audio Captions

Heinrich Dinkel, Gang Li, Jizhong Liu, Jian Luan, Yadong Niu, Xingwei Sun, Tianzi Wang, Qiyang Xiao, Junbo Zhang, Jiahao Zhou

Main category: cs.SD

TL;DR: MiDashengLM is an open audio-language model using general audio captions for efficient audio understanding, offering transparency, reproducibility, and performance gains.

DetailsMotivation: Current LALMs rely on closed data or proprietary models, limiting accessibility and generalization.

Method: Uses the ACAVCaps dataset and integrates the open-source Dasheng audio encoder for diverse audio processing. Focuses on general audio captions instead of ASR-based alignment.

Result: Achieves up to 4x speedup in TTFT and 20x higher throughput than comparable models.

Conclusion: MiDashengLM provides a transparent, efficient, and holistic solution for audio understanding.

Abstract: Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through the use of general audio captions using our novel ACAVCaps training dataset. MiDashengLM exclusively relies on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder, specifically engineered to process diverse auditory information effectively. Unlike previous works primarily focused on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions, fusing speech, sound and music information into one textual representation, enabling a holistic textual representation of complex audio scenes. Lastly, MiDashengLM provides an up to 4x speedup in terms of time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.

[375] Efficient Scaling for LLM-based ASR

Bingshen Mu, Yiwen Shao, Kun Wei, Dong Yu, Lei Xie

Main category: cs.SD

TL;DR: EFIN, a multi-stage LLM-ASR training strategy, improves performance and reduces computational costs by pretraining the speech encoder before LLM integration.

DetailsMotivation: To enhance the efficiency and performance of LLM-based ASR systems by optimizing the training process.

Method: Proposes EFIN, a multi-stage training strategy where the speech encoder is pretrained before integrating with the LLM, compared to joint post-training.

Result: EFIN achieves a 21.1% relative character error rate reduction (CERR) with 49.9% fewer FLOPs. A scaling law for ASR error rates is also derived.

Conclusion: Pretraining the speech encoder separately (EFIN) is more efficient and effective than joint training for LLM-ASR systems.

Abstract: Large language model (LLM)-based automatic speech recognition (ASR) achieves strong performance but often incurs high computational costs. This work investigates how to obtain the best LLM-ASR performance efficiently. Through comprehensive and controlled experiments, we find that pretraining the speech encoder before integrating it with the LLM leads to significantly better scaling efficiency than the standard practice of joint post-training of LLM-ASR. Based on this insight, we propose a new multi-stage LLM-ASR training strategy, EFIN: Encoder First Integration. Among all training strategies evaluated, EFIN consistently delivers better performance (a 21.1% relative character error rate reduction, CERR) with significantly lower computation budgets (49.9% fewer FLOPs). Furthermore, we derive a scaling law that approximates ASR error rates as a function of computation, providing practical guidance for LLM-ASR scaling.
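
A minimal sketch of fitting a scaling law that maps compute to error rate, using the common power-law-plus-floor form E(C) = a·C^(-b) + c with toy numbers; the paper's actual functional form and data may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def error_law(C, a, b, c):
    """Power-law-plus-floor form commonly assumed for scaling laws."""
    return a * np.power(C, -b) + c

compute = np.array([1.0, 4.0, 16.0, 64.0])  # compute in units of 1e18 FLOPs (toy)
cer     = np.array([12.0, 9.1, 7.4, 6.5])   # character error rate in % (toy)
(a, b, c), _ = curve_fit(error_law, compute, cer, p0=[7.0, 0.3, 5.0])
print(f"predicted CER at 256e18 FLOPs: {error_law(256.0, a, b, c):.2f}%")
```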

[376] NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

Huan Liao, Qinke Ni, Yuancheng Wang, Yiheng Lu, Haoyue Zhan, Pengyuan Xie, Qiang Zhang, Zhizheng Wu

Main category: cs.SD

TL;DR: NVSpeech is a scalable pipeline for recognizing and synthesizing paralinguistic vocalizations (e.g., laughter, interjections) in speech, featuring dataset creation, ASR modeling, and TTS control.

DetailsMotivation: Paralinguistic cues (e.g., laughter, breathing) are crucial in communication but overlooked in ASR/TTS systems. NVSpeech aims to bridge this gap.

Method: (1) Create a manually annotated dataset of 48,430 utterances. (2) Develop a paralinguistic-aware ASR model for joint lexical/non-verbal transcription. (3) Finetune TTS models for controllable vocalization synthesis.

Result: Produced the first large-scale Chinese dataset (174,179 utterances) with word-level paralinguistic annotations and enabled context-aware TTS control.

Conclusion: NVSpeech unifies recognition and synthesis of paralinguistic vocalizations, offering an open, scalable pipeline for expressive speech modeling in Mandarin.

Abstract: Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as “uhm” and “oh”, are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such cues remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop the paralinguistic-aware ASR model, which treats paralinguistic cues as inline decodable tokens (e.g., “You’re so funny [Laughter]”), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at https://nvspeech170k.github.io/.
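
Treating paralinguistic cues as inline decodable tokens means the transcript interleaves words with bracketed event tags, as in the paper's "You're so funny [Laughter]" example. A minimal sketch of splitting such a transcript; the tag inventory is illustrative (the paper defines 18 word-level categories):

```python
import re

def split_inline(transcript: str):
    """Separate lexical words from paralinguistic tags like [Laughter]."""
    tokens = re.findall(r"\[[^\]]+\]|\S+", transcript)
    words = [t for t in tokens if not t.startswith("[")]
    events = [t for t in tokens if t.startswith("[")]
    return words, events

print(split_inline("You're so funny [Laughter]"))
# (["You're", 'so', 'funny'], ['[Laughter]'])
```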

[377] ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan

Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Main category: cs.SD

TL;DR: The paper introduces EnvSDD, a large-scale dataset for environmental sound deepfake detection (ESDD), addressing limitations in existing datasets. It also announces a challenge with two tracks to tackle real-world ESDD challenges.

DetailsMotivation: The rise of realistic audio generation raises concerns about misuse, such as fake videos and misinformation, highlighting the need for better ESDD tools.

Method: Proposed EnvSDD, a curated dataset with 45.25 hours of real and 316.7 hours of fake sound, and launched a challenge with two tracks for ESDD.

Result: EnvSDD provides a scalable resource for ESDD research, and the challenge aims to advance detection methods in diverse scenarios.

Conclusion: The work addresses a critical gap in ESDD research and promotes practical solutions through a large-scale dataset and challenge.

Abstract: Recent advances in audio generation systems have enabled the creation of highly realistic and immersive soundscapes, which are increasingly used in film and virtual reality. However, these audio generators also raise concerns about potential misuse, such as generating deceptive audio content for fake videos and spreading misleading information. Existing datasets for environmental sound deepfake detection (ESDD) are limited in scale and audio types. To address this gap, we have proposed EnvSDD, the first large-scale curated dataset designed for ESDD, consisting of 45.25 hours of real and 316.7 hours of fake sound. Based on EnvSDD, we are launching the Environmental Sound Deepfake Detection Challenge. Specifically, we present two different tracks: ESDD in Unseen Generators and Black-Box Low-Resource ESDD, covering various challenges encountered in real-life scenarios. The challenge will be held in conjunction with the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026).

[378] Live Music Models

Lyria Team, Antoine Caillon, Brian McWilliams, Cassie Tarakajian, Ian Simon, Ilaria Manco, Jesse Engel, Noah Constant, Pen Li, Timo I. Denk, Alberto Lalama, Andrea Agostinelli, Anna Huang, Ethan Manilow, George Brower, Hakan Erdogan, Heidi Lei, Itai Rolnick, Ivan Grishchenko, Manu Orsini, Matej Kastelic, Mauricio Zuluaga, Mauro Verzetti, Michael Dooley, Ondrej Skopek, Rafael Ferrer, Zalán Borsos, Äaron van den Oord, Douglas Eck, Eli Collins, Jason Baldridge, Tom Hume, Chris Donahue, Kehang Han, Adam Roberts

Main category: cs.SD

TL;DR: Magenta RealTime and Lyria RealTime are new generative models for live music, offering real-time, user-controlled music generation with text or audio prompts, outperforming other models in quality and efficiency.

DetailsMotivation: To create AI models that enable real-time, interactive music generation with human control, enhancing live music performance and creativity.

Method: Developed Magenta RealTime (open-weights) and Lyria RealTime (API-based) models, using fewer parameters but achieving higher music quality through text or audio prompts.

Result: Magenta RealTime outperforms other open-weights models in music quality metrics, while Lyria RealTime offers extended controls and wider prompt coverage.

Conclusion: These models pioneer a human-in-the-loop paradigm for AI-assisted live music creation, showcasing superior performance and interactive capabilities.

Abstract: We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

[379] Environmental Sound Classification on An Embedded Hardware Platform

Gabriel Bibbo, Arshdeep Singh, Mark D. Plumbley

Main category: cs.SD

TL;DR: The paper examines the challenges of deploying large-scale pre-trained CNNs on resource-constrained devices like Raspberry Pi, focusing on CPU temperature, microphone quality, and audio volume impacts.

DetailsMotivation: To understand how real-time deployment of audio CNNs on edge devices is affected by hardware limitations and environmental factors.

Method: Empirical analysis of CPU temperature, microphone quality, and audio signal volume on Raspberry Pi performance.

Result: CPU temperature triggers slowdowns, microphone quality and audio volume affect performance, and library/architecture compatibility issues arise.

Conclusion: The findings highlight challenges but guide future work in compact models, heat-dissipative hardware, and microphone selection for edge AI.

Abstract: Convolutional neural networks (CNNs) have exhibited state-of-the-art performance in various audio classification tasks. However, their real-time deployment remains a challenge on resource-constrained devices such as embedded systems. In this paper, we analyze how the performance of large-scale pre-trained audio neural networks designed for audio pattern recognition changes when deployed on hardware such as a Raspberry Pi. We empirically study the role of CPU temperature, microphone quality and audio signal volume on performance. Our experiments reveal that continuous CPU usage results in an increased temperature that can trigger an automated slowdown mechanism in the Raspberry Pi, impacting inference latency. Microphone quality, particularly with affordable devices such as the Google AIY Voice Kit, and audio signal volume both affect system performance. In the course of our investigation, we encounter substantial complications linked to library compatibility and the unique processor architecture requirements of the Raspberry Pi, making the process less straightforward compared to conventional computers (PCs). Our observations, while presenting challenges, pave the way for future researchers to develop more compact machine learning models, design heat-dissipative hardware, and select appropriate microphones when AI models are deployed for real-time applications on edge devices.

[380] Are audio DeepFake detection models polyglots?

Bartłomiej Marek, Piotr Kawa, Piotr Syga

Main category: cs.SD

TL;DR: The paper benchmarks multilingual audio DeepFake detection, showing English-trained models underperform in non-English contexts and stressing the need for target-language data.

DetailsMotivation: To explore the applicability of English-trained audio DeepFake detection methods in non-English languages, which is largely unexplored.

Method: Evaluates adaptation strategies, including English-trained models, intra-linguistic, and cross-linguistic approaches.

Result: Detection efficacy varies significantly; English-only datasets reduce performance, while target-language data improves it.

Conclusion: Multilingual audio DeepFake detection requires adaptation beyond English-centric models, emphasizing target-language data.

Abstract: Since the majority of audio DeepFake (DF) detection methods are trained on English-centric datasets, their applicability to non-English languages remains largely unexplored. In this work, we present a benchmark for the multilingual audio DF detection challenge by evaluating various adaptation strategies. Our experiments focus on analyzing models trained on English benchmark datasets, as well as intra-linguistic (same-language) and cross-linguistic adaptation approaches. Our results indicate considerable variations in detection efficacy, highlighting the difficulties of multilingual settings. We show that limiting the dataset to English negatively impacts the efficacy, while stressing the importance of the data in the target language.

[381] CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji

Main category: cs.SD

TL;DR: A new binaural audio generation model uses audio-visual conditional normalization and contrastive learning to improve spatial accuracy and avoid overfitting, achieving state-of-the-art results.

DetailsMotivation: Current models for binaural audio generation risk overfitting to room environments and lose fine-grained spatial details, prompting the need for a more robust solution.

Method: The proposed model includes an audio-visual conditional normalization layer for dynamic feature alignment and a contrastive learning method to enhance spatial sensitivity. Test-time augmentation in video data is also used.

Result: The model achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.

Conclusion: The proposed approach effectively addresses overfitting and spatial detail loss, setting a new benchmark for binaural audio generation.

Abstract: Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
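
A minimal sketch of an audio-visual conditional normalization layer in the AdaIN style: audio features are normalized, then re-scaled and re-shifted with statistics predicted from a visual embedding. The dimensions and exact modulation form are illustrative assumptions, not the paper's layer.

```python
import torch
import torch.nn as nn

class AVCondNorm(nn.Module):
    """Normalize audio features, then modulate them with scale/shift
    parameters predicted from visual context."""
    def __init__(self, audio_dim=128, visual_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm1d(audio_dim, affine=False)
        self.to_gamma = nn.Linear(visual_dim, audio_dim)
        self.to_beta = nn.Linear(visual_dim, audio_dim)

    def forward(self, audio, visual):
        # audio: (B, C, T) features; visual: (B, D) context embedding
        gamma = self.to_gamma(visual).unsqueeze(-1)  # (B, C, 1)
        beta = self.to_beta(visual).unsqueeze(-1)
        return self.norm(audio) * (1 + gamma) + beta

layer = AVCondNorm()
out = layer(torch.randn(2, 128, 100), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 128, 100])
```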

[382] AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment

Yu Chen, Hongxu Zhu, Jiadong Wang, Kainan Chen, Xinyuan Qian

Main category: cs.SD

TL;DR: CI-AVL introduces a new task for localizing sound sources using visual prompts from different instances of the same class, eliminating the need for spatially-paired data. AV-SSAN, a semantic-spatial alignment framework, outperforms existing methods.

DetailsMotivation: Current AV-SSL methods require spatially-paired data and lack selective localization. CI-AVL addresses these limitations.

Method: Proposes AV-SSAN with MB-SSA Net, which decomposes audio spectrograms, aligns frequency bands with visual prompts, and refines spatial cues for DoA estimation.

Result: AV-SSAN achieves a mean absolute error of 16.59° and an accuracy of 71.29%, outperforming existing AV-SSL methods.

Conclusion: CI-AVL and AV-SSAN enable selective sound source localization without paired data, advancing AV-SSL research.

Abstract: Audio-visual sound source localization (AV-SSL) estimates the position of sound sources by fusing auditory and visual cues. Current AV-SSL methodologies typically require spatially-paired audio-visual data and cannot selectively localize specific target sources. To address these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that localizes target sound sources using visual prompts from different instances of the same semantic class. CI-AVL enables selective localization without spatially paired data. To solve this task, we propose AV-SSAN, a semantic-spatial alignment framework centered on a Multi-Band Semantic-Spatial Alignment Network (MB-SSA Net). MB-SSA Net decomposes the audio spectrogram into multiple frequency bands, aligns each band with semantic visual prompts, and refines spatial cues to estimate the direction-of-arrival (DoA). To facilitate this research, we construct VGGSound-SSL, a large-scale dataset comprising 13,981 spatial audio clips across 296 categories, each paired with visual prompts. AV-SSAN achieves a mean absolute error of 16.59° and an accuracy of 71.29%, significantly outperforming existing AV-SSL methods. Code and data will be public.

[383] SDBench: A Comprehensive Benchmark Suite for Speaker Diarization

Eduardo Pacheco, Atila Orhon, Berkin Durmus, Blaise Munyampirwa, Andrey Leonov

Main category: cs.SD

TL;DR: SDBench is an open-source benchmark suite for speaker diarization, integrating 13 datasets for consistent analysis. It aids in comparing systems like SpeakerKit, which is 9.6x faster than Pyannote v3 with similar accuracy.

DetailsMotivation: High variance in error rates across datasets and the need for standardized comparison methods in speaker diarization systems.

Method: Developed SDBench, a benchmark suite with 13 datasets and tooling for consistent analysis. Built SpeakerKit for efficiency.

Result: SpeakerKit is 9.6x faster than Pyannote v3 with comparable error rates. Benchmarking revealed trade-offs between accuracy and speed.

Conclusion: SDBench facilitates reproducible evaluation and efficient system comparison, demonstrating its utility with SpeakerKit and other state-of-the-art systems.

Abstract: Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates. We benchmark 6 state-of-the-art systems including Deepgram, AWS Transcribe, and Pyannote AI API, revealing important trade-offs between accuracy and speed.

[384] Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation

Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr

Main category: cs.SD

TL;DR: The paper reveals cross-modality memorization in generative models, where models leak copyrighted content through phonetic pathways. An attack method (APT) is introduced, showing models regurgitate memorized content despite semantic changes. The issue spans audio and visual modalities, raising copyright and security concerns.

DetailsMotivation: To uncover how generative models memorize and leak copyrighted content indirectly through phonetic and cross-modality pathways, challenging traditional safety measures.

Method: Introduces Adversarial PhoneTic Prompting (APT), replacing phrases with homophonic alternatives to trigger memorized outputs. Evaluates models like SUNO and YuE using metrics like AudioJudge, CLAP, and CoverID.

Result: Models reproduce memorized content (songs, videos) despite semantic changes, with high similarity to originals. Phonetic prompts also trigger visual memorization in text-to-video models.

Conclusion: Cross-modality memorization poses a significant threat, rendering traditional copyright filters ineffective. Urgent concerns about copyright, provenance, and secure deployment of multimodal systems are raised.

Abstract: Memorization in generative models extends far beyond verbatim text reproduction–it manifests through non-literal patterns, semantic associations, and surprisingly, across modalities in transcript-conditioned generation tasks such as Lyrics-to-Song (L2S) and Text-to-Video (T2V) models. We reveal a new class of cross-modality memorization where models trained on these tasks leak copyrighted content through indirect, phonetic pathways invisible to traditional text-based analysis. In this work, we introduce Adversarial PhoneTic Prompting (APT), an attack that replaces iconic phrases with homophonic alternatives–e.g., “mom’s spaghetti” becomes “Bob’s confetti”–preserving the acoustic form while largely changing semantic content. We demonstrate that models can be prompted to regurgitate memorized songs using phonetically similar but semantically unrelated lyrics. Despite the semantic drift, black-box models like SUNO and open-source models like YuE generate outputs that are strikingly similar to the original songs–melodically, rhythmically, and vocally–achieving high scores on AudioJudge, CLAP, and CoverID. These effects persist across genres and languages. More surprisingly, we find that phonetic prompts alone can trigger visual memorization in text-to-video models: when given altered lyrics from Lose Yourself, Veo 3 generates scenes that mirror the original music video–complete with a hooded rapper and dim urban settings–despite no explicit visual cues in the prompt. This cross-modality leakage represents an unprecedented threat: models memorize deep, structural patterns that transcend their training modality, making traditional safety measures like copyright filters ineffective. Our findings reveal a fundamental vulnerability in transcript-conditioned generative models and raise urgent concerns around copyright, provenance, and secure deployment of multimodal generation systems.
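
The attack's core transformation is a simple phrase-level homophone substitution; a minimal sketch using the paper's own example pair (the lookup-table approach is an illustrative simplification of how substitutes are chosen):

```python
homophones = {
    "mom's spaghetti": "Bob's confetti",  # example pair from the paper
}

def phonetic_prompt(lyrics: str) -> str:
    """Swap iconic phrases for phonetically similar, semantically
    unrelated alternatives, preserving the acoustic form."""
    for phrase, substitute in homophones.items():
        lyrics = lyrics.replace(phrase, substitute)
    return lyrics

print(phonetic_prompt("... mom's spaghetti"))  # "... Bob's confetti"
```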

[385] EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, Li Liu

Main category: cs.SD

TL;DR: EmoSteer-TTS enables fine-grained, training-free emotion control in TTS via activation steering, outperforming SOTA methods.

DetailsMotivation: Existing TTS systems lack fine-grained emotion control and require extensive datasets. EmoSteer-TTS addresses these limitations.

Method: Proposes activation steering for emotion control, involving activation extraction, emotional token searching, and inference-time steering.

Result: Achieves fine-grained, interpretable, and continuous emotion control, validated by extensive experiments.

Conclusion: EmoSteer-TTS is the first training-free method for continuous fine-grained emotion control in TTS.

Abstract: Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS.
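
A minimal sketch of inference-time activation steering with a PyTorch forward hook: a precomputed emotion direction, scaled by a strength factor, is added to one layer's activations. The toy model, layer choice, and steering vector are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
steer_vec = torch.randn(16)  # e.g., mean(happy activations) - mean(sad activations)
alpha = 0.8                  # steering strength

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output
    return output + alpha * steer_vec

handle = model[0].register_forward_hook(steering_hook)
out = model(torch.randn(1, 16))  # generation now reflects the steering
handle.remove()                  # restore the unmodified model
```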

cs.LG

[386] Privileged Contrastive Pretraining for Multimodal Affect Modelling

Kosmas Pinitas, Konstantinos Makantasis, Georgios N. Yannakakis

Main category: cs.LG

TL;DR: PriCon framework improves affective model transfer from lab to real-world by combining supervised contrastive learning and privileged information, outperforming existing methods.

DetailsMotivation: Addressing the challenge of transferring affective models from controlled lab settings to real-world environments.

Method: Introduces Privileged Contrastive Pretraining (PriCon), combining supervised contrastive learning (SCL) and Learning Using Privileged Information (LUPI).

Result: PriCon models outperform LUPI and end-to-end models, achieving performance close to models with full modality access.

Conclusion: PriCon bridges the gap between lab and real-world affective modeling, offering a scalable solution.

Abstract: Affective Computing (AC) has made significant progress with the advent of deep learning, yet a persistent challenge remains: the reliable transfer of affective models from controlled laboratory settings (in-vitro) to uncontrolled real-world environments (in-vivo). To address this challenge we introduce the Privileged Contrastive Pretraining (PriCon) framework, according to which models are first pretrained via supervised contrastive learning (SCL) and then act as teacher models within a Learning Using Privileged Information (LUPI) framework. PriCon both leverages privileged information during training and enhances the robustness of derived affect models via SCL. Experiments conducted on two benchmark affective corpora, RECOLA and AGAIN, demonstrate that models trained using PriCon consistently outperform LUPI and end-to-end models. Remarkably, in many cases, PriCon models achieve performance comparable to models trained with access to all modalities during both training and testing. The findings underscore the potential of PriCon as a paradigm towards further bridging the gap between in-vitro and in-vivo affective modelling, offering a scalable and practical solution for real-world applications.
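
For the SCL pretraining stage, the standard supervised contrastive (SupCon) loss of Khosla et al. is the usual choice; a minimal sketch, with temperature and shapes as illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss. z: (N, D) embeddings, labels: (N,)."""
    z = F.normalize(z, dim=1)
    sim = (z @ z.T) / tau
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))  # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)  # avoid -inf * 0
    pos = ((labels[:, None] == labels[None, :]) & ~self_mask).float()
    # Average log-probability over each anchor's positives, then negate
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

loss = supcon_loss(torch.randn(8, 32), torch.randint(0, 3, (8,)))
print(loss.item())
```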

[387] PILOT-C: Physics-Informed Low-Distortion Optimal Trajectory Compression

Kefei Wu, Baihua Zheng, Weiwei Sun

Main category: cs.LG

TL;DR: PILOT-C is a trajectory compression framework that outperforms existing methods in compression ratio and fidelity, especially for 3D trajectories.

DetailsMotivation: Existing line simplification methods for trajectory compression are limited to 2D and ignore time synchronization and motion continuity.

Method: PILOT-C integrates frequency-domain physics modeling with error-bounded optimization, compressing each spatial axis independently.

Result: PILOT-C achieves a 19.2% better compression ratio and 32.6% lower error than CISED-W, and 49% better compression than SQUISH-E for 3D trajectories.

Conclusion: PILOT-C is a superior, scalable solution for multi-dimensional trajectory compression.

Abstract: Location-aware devices continuously generate massive volumes of trajectory data, creating demand for efficient compression. Line simplification is a common solution but typically assumes 2D trajectories and ignores time synchronization and motion continuity. We propose PILOT-C, a novel trajectory compression framework that integrates frequency-domain physics modeling with error-bounded optimization. Unlike existing line simplification methods, PILOT-C supports trajectories in arbitrary dimensions, including 3D, by compressing each spatial axis independently. Evaluated on four real-world datasets, PILOT-C achieves superior performance across multiple dimensions. In terms of compression ratio, PILOT-C outperforms CISED-W, the current state-of-the-art SED-based line simplification algorithm, by an average of 19.2%. For trajectory fidelity, PILOT-C achieves an average of 32.6% reduction in error compared to CISED-W. Additionally, PILOT-C seamlessly extends to three-dimensional trajectories while maintaining the same computational complexity, achieving a 49% improvement in compression ratios over SQUISH-E, the most efficient line simplification algorithm on 3D datasets.
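
A toy illustration of per-axis, frequency-domain, error-bounded compression in the spirit of PILOT-C; the paper's physics-informed modelling and optimisation are more sophisticated, and the DCT choice here is an assumption.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_axis(x, max_err):
    """Keep the largest-magnitude DCT coefficients of one spatial axis until
    reconstruction meets a per-point error bound. In practice one would store
    only the nonzero coefficient indices and values."""
    c = dct(x, norm="ortho")
    order = np.argsort(np.abs(c))[::-1]          # coefficients by importance
    kept = np.zeros_like(c)
    for n_kept, idx in enumerate(order, start=1):
        kept[idx] = c[idx]
        rec = idct(kept, norm="ortho")
        if np.max(np.abs(rec - x)) <= max_err:   # error-bounded acceptance
            return kept, n_kept
    return kept, len(c)

# A 3D trajectory is compressed one spatial axis at a time, as in the paper.
traj = np.cumsum(np.random.randn(500, 3), axis=0)    # toy 3D trajectory
per_axis = [compress_axis(traj[:, d], max_err=0.5) for d in range(3)]
```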

[388] CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, Jethro Zih-Shuo Wang, Shiyi Yao, Gen Li, Yidong Xu, Hanyu Wang, Yilin Huang, Angela Lin Wang, Chen Shi, Yin Zhang, Jianan Guo, Luqi Yang, Renxuan Li, Yang Xu, Jiawei Liu, Yao Zhang, Lei Liu, Carlos Gutiérrez SanRomán, Lei Wang

Main category: cs.LG

TL;DR: CX-Mind is a generative model for CXR diagnosis using interleaved reasoning, outperforming existing models by 25.1%.

DetailsMotivation: Address challenges in multi-task CXR diagnosis like lengthy reasoning and hallucinations by improving reasoning supervision.

Method: Uses curriculum-based reinforcement learning (CuRL-VPR) and instruction-tuning dataset CX-Set for interleaved reasoning.

Result: Achieves 25.1% performance improvement and superior recall@1 on real-world clinical data.

Conclusion: CX-Mind enhances diagnostic efficiency and interpretability, validated by expert evaluations.

Abstract: Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on “one-time” diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved “think-answer” reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On the real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.

[389] Latent Knowledge Scalpel: Precise and Massive Knowledge Editing for Large Language Models

Xin Liu, Qiyang Song, Shaowen Xu, Kerou Zhou, Wenbo Jiang, Xiaoqi Jia, Weijuan Zhang, Heqing Huang, Yakai Li

Main category: cs.LG

TL;DR: The paper introduces Latent Knowledge Scalpel (LKS), a method for editing large-scale factual knowledge in LLMs without compromising their general capabilities.

DetailsMotivation: LLMs often retain inaccurate or outdated information, and existing editing methods struggle with large-scale edits while maintaining model performance.

Method: LKS manipulates latent knowledge via a lightweight hypernetwork, enabling precise and large-scale editing of entities in LLMs.

Result: Experiments on Llama-2 and Mistral show LKS successfully edits up to 10,000 facts simultaneously while preserving model capabilities.

Conclusion: LKS provides an effective solution for large-scale knowledge editing in LLMs without degrading their general performance.

Abstract: Large Language Models (LLMs) often retain inaccurate or outdated information from pre-training, leading to incorrect predictions or biased outputs during inference. While existing model editing methods can address this challenge, they struggle with editing large amounts of factual information simultaneously and may compromise the general capabilities of the models. In this paper, our empirical study demonstrates that it is feasible to edit the internal representations of LLMs and replace the entities in a manner similar to editing natural language inputs. Based on this insight, we introduce the Latent Knowledge Scalpel (LKS), an LLM editor that manipulates the latent knowledge of specific entities via a lightweight hypernetwork to enable precise and large-scale editing. Experiments conducted on Llama-2 and Mistral show even with the number of simultaneous edits reaching 10,000, LKS effectively performs knowledge editing while preserving the general abilities of the edited LLMs. Code is available at: https://github.com/Linuxin-xxx/LKS.
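
A minimal sketch of the editing mechanism as described: a lightweight hypernetwork produces an additive edit that is applied only at the entity's token positions. The architecture, the layer at which the edit is applied, and all names below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentEditor(nn.Module):
    """LKS-style sketch: a hypernetwork maps a (subject, new-fact) pair to an
    additive edit applied to the entity's hidden states."""
    def __init__(self, d_model, d_hidden=256):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(2 * d_model, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, hidden, entity_mask, subj_repr, fact_repr):
        # hidden: (B, S, d_model); entity_mask: (B, S) bool marking entity tokens
        edit = self.hyper(torch.cat([subj_repr, fact_repr], dim=-1))  # (B, d_model)
        return hidden + entity_mask.unsqueeze(-1).float() * edit.unsqueeze(1)
```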

[390] GlaBoost: A multimodal Structured Framework for Glaucoma Risk Stratification

Cheng Huang, Weizheng Xie, Karanjit Kooner, Tsengdar Lee, Jui-Kai Wang, Jia Zhang

Main category: cs.LG

TL;DR: GlaBoost is a multimodal framework combining clinical features, fundus images, and expert text for glaucoma prediction, outperforming baselines with 98.71% accuracy.

DetailsMotivation: Early glaucoma detection is crucial to prevent vision loss, but existing methods lack interpretability and multimodal integration.

Method: GlaBoost integrates clinical features, fundus image embeddings (via a pretrained CNN), and text descriptions (via a transformer model), fusing them with XGBoost for classification.

Result: Achieves 98.71% validation accuracy, with key features like cup-to-disc ratio and textual embeddings driving decisions.

Conclusion: GlaBoost provides a transparent, scalable solution for interpretable glaucoma diagnosis, extendable to other eye disorders.

Abstract: Early and accurate detection of glaucoma is critical to prevent irreversible vision loss. However, existing methods often rely on unimodal data and lack interpretability, limiting their clinical utility. In this paper, we present GlaBoost, a multimodal gradient boosting framework that integrates structured clinical features, fundus image embeddings, and expert-curated textual descriptions for glaucoma risk prediction. GlaBoost extracts high-level visual representations from retinal fundus photographs using a pretrained convolutional encoder and encodes free-text neuroretinal rim assessments using a transformer-based language model. These heterogeneous signals, combined with manually assessed risk scores and quantitative ophthalmic indicators, are fused into a unified feature space for classification via an enhanced XGBoost model. Experiments conducted on a real-world annotated dataset demonstrate that GlaBoost significantly outperforms baseline models, achieving a validation accuracy of 98.71%. Feature importance analysis reveals clinically consistent patterns, with cup-to-disc ratio, rim pallor, and specific textual embeddings contributing most to model decisions. GlaBoost offers a transparent and scalable solution for interpretable glaucoma diagnosis and can be extended to other ophthalmic disorders.
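
The fusion step reduces to concatenating the per-modality features and fitting a gradient-boosted classifier. A runnable sketch with toy stand-ins for the embeddings (dimensions and hyperparameters are assumptions):

```python
import numpy as np
import xgboost as xgb

# Toy stand-ins for the three modalities; in the paper these come from a
# pretrained CNN (fundus images), a transformer (rim-assessment text), and
# structured clinical measurements such as cup-to-disc ratio.
n = 200
img_emb  = np.random.randn(n, 128)   # CNN image embeddings
txt_emb  = np.random.randn(n, 64)    # language-model text embeddings
clinical = np.random.randn(n, 10)    # quantitative ophthalmic indicators
y = np.random.randint(0, 2, size=n)  # glaucoma risk label

X = np.hstack([img_emb, txt_emb, clinical])      # unified feature space
model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X, y)
risk = model.predict_proba(X)[:, 1]
```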

[391] LRTuckerRep: Low-rank Tucker Representation Model for Multi-dimensional Data Completion

Wenwu Gong, Lili Yang

Main category: cs.LG

TL;DR: The paper introduces LRTuckerRep, a novel model combining global low-rank and local smoothness priors for multi-dimensional data completion, outperforming existing methods in accuracy and robustness.

DetailsMotivation: Addressing limitations of current methods (computational cost, parameter tuning, poor generalization) in multi-dimensional data completion.

Method: Proposes LRTuckerRep, unifying low-rank Tucker decomposition with self-adaptive weighted nuclear norm and parameter-free Laplacian regularization.

Result: Demonstrates superior accuracy and robustness in image inpainting and traffic data imputation, especially under high missing rates.

Conclusion: LRTuckerRep effectively balances global and local priors, offering a robust solution for multi-dimensional data completion.

Abstract: Multi-dimensional data completion is a critical problem in computational sciences, particularly in domains such as computer vision, signal processing, and scientific computing. Existing methods typically leverage either global low-rank approximations or local smoothness regularization, but each suffers from notable limitations: low-rank methods are computationally expensive and may disrupt intrinsic data structures, while smoothness-based approaches often require extensive manual parameter tuning and exhibit poor generalization. In this paper, we propose a novel Low-Rank Tucker Representation (LRTuckerRep) model that unifies global and local prior modeling within a Tucker decomposition. Specifically, LRTuckerRep encodes low rankness through a self-adaptive weighted nuclear norm on the factor matrices and a sparse Tucker core, while capturing smoothness via a parameter-free Laplacian-based regularization on the factor spaces. To efficiently solve the resulting nonconvex optimization problem, we develop two iterative algorithms with provable convergence guarantees. Extensive experiments on multi-dimensional image inpainting and traffic data imputation demonstrate that LRTuckerRep achieves superior completion accuracy and robustness under high missing rates compared to baselines.
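
One plausible reading of the objective, assembled from the abstract's ingredients: $w_n$ are the self-adaptive nuclear-norm weights, $L_n$ the parameter-free graph Laplacians on the factor spaces, and $\Omega$ the observed entries (the exact formulation is given in the paper):

$$\min_{\mathcal{G},\,U_1,\dots,U_N}\ \sum_{n=1}^{N} w_n\,\|U_n\|_{*} + \lambda\,\|\mathcal{G}\|_{1} + \sum_{n=1}^{N}\operatorname{tr}\!\left(U_n^{\top} L_n U_n\right) \quad \text{s.t.} \quad \mathcal{P}_{\Omega}\!\left(\mathcal{G}\times_1 U_1\cdots\times_N U_N\right)=\mathcal{P}_{\Omega}(\mathcal{T}).$$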

[392] LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation

Yongchao Huang

Main category: cs.LG

TL;DR: A framework called LLMPrior uses LLMs to automate prior distribution specification in Bayesian inference, coupling LLMs with tractable generative models like GMMs. It also extends to multi-agent systems with Fed-LLMPrior for robust prior aggregation.

DetailsMotivation: Manual prior elicitation in Bayesian inference is subjective and unscalable. Automating this process can improve efficiency and accessibility.

Method: LLMPrior couples LLMs with tractable generative models (e.g., GMMs) to translate unstructured contexts into valid priors. Fed-LLMPrior aggregates priors in multi-agent systems using Logarithmic Opinion Pooling.

Result: The framework produces valid, tractable priors and robustly aggregates distributed priors, enhancing Bayesian modeling scalability.

Conclusion: LLMPrior and Fed-LLMPrior lower the barrier to sophisticated Bayesian modeling by automating and scaling prior specification.

Abstract: The specification of prior distributions is fundamental in Bayesian inference, yet it remains a significant bottleneck. The prior elicitation process is often a manual, subjective, and unscalable task. We propose a novel framework which leverages Large Language Models (LLMs) to automate and scale this process. We introduce \texttt{LLMPrior}, a principled operator that translates rich, unstructured contexts such as natural language descriptions, data or figures into valid, tractable probability distributions. We formalize this operator by architecturally coupling an LLM with an explicit, tractable generative model, such as a Gaussian Mixture Model (forming a LLM based Mixture Density Network), ensuring the resulting prior satisfies essential mathematical properties. We further extend this framework to multi-agent systems where Logarithmic Opinion Pooling is employed to aggregate prior distributions induced by decentralized knowledge. We present the federated prior aggregation algorithm, \texttt{Fed-LLMPrior}, for aggregating distributed, context-dependent priors in a manner robust to agent heterogeneity. This work provides the foundation for a new class of tools that can potentially lower the barrier to entry for sophisticated Bayesian modeling.
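
Logarithmic Opinion Pooling aggregates agent priors as $p(x) \propto \prod_i p_i(x)^{w_i}$. A numerical sketch for two elicited 1-D priors; the densities and weights are illustrative, and Fed-LLMPrior's weighting scheme is specified in the paper.

```python
import numpy as np
from scipy.stats import norm

# Two agents' elicited priors (e.g., GMM densities produced by LLMPrior).
def p1(x): return 0.7 * norm.pdf(x, -1.0, 0.5) + 0.3 * norm.pdf(x, 2.0, 1.0)
def p2(x): return norm.pdf(x, 0.5, 1.5)

def log_opinion_pool(densities, weights, grid):
    """p(x) proportional to prod_i p_i(x)^{w_i}, normalised on a grid."""
    log_p = sum(w * np.log(d(grid) + 1e-300) for d, w in zip(densities, weights))
    p = np.exp(log_p - log_p.max())          # stabilise before normalising
    return p / np.trapz(p, grid)

grid = np.linspace(-6, 6, 2001)
pooled = log_opinion_pool([p1, p2], weights=[0.6, 0.4], grid=grid)
```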

[393] Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings

Debamita Ghosh, George K. Atia, Yue Wang

Main category: cs.LG

TL;DR: The paper addresses the sim-to-real gap in reinforcement learning (RL) by proposing an online distributionally robust RL method that optimizes worst-case performance without requiring prior knowledge of the environment.

DetailsMotivation: The sim-to-real gap in RL leads to underperformance in real-world deployments due to mismatches between training and deployment conditions. Existing methods rely on impractical assumptions like access to generative models or broad offline datasets.

Method: The authors propose an online distributionally robust RL algorithm for $f$-divergence-based uncertainty sets (e.g., Chi-Square, KL divergence) with sublinear regret guarantees and minimal assumptions.

Result: The algorithm achieves near-optimal performance, supported by theoretical guarantees and extensive experiments across diverse environments.

Conclusion: The proposed method is robust, efficient, and practical for real-world RL deployments, addressing limitations of prior work.

Abstract: Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap, where policies trained in simulators often underperform in practice due to mismatches between training and deployment conditions. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment – assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment while aiming to optimize its worst-case performance. We focus on general $f$-divergence-based uncertainty sets, including Chi-Square and KL divergence balls, and propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions. Furthermore, we establish a minimax lower bound on regret of online learning, demonstrating the near-optimality of our approach. Extensive experiments across diverse environments further confirm the robustness and efficiency of our algorithm, validating our theoretical findings.
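
For a KL ball, the inner worst-case expectation in a robust Bellman backup has a well-known one-dimensional dual, which is what makes such updates computationally tractable. A sketch of that subroutine only; the paper's online algorithm adds exploration bonuses and empirical-model estimation on top.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_expectation(v, p0, rho):
    """Worst-case E_P[v] over {P : KL(P || p0) <= rho}, via the standard dual
    sup_{lam>0} -lam * log E_{p0}[exp(-v/lam)] - lam * rho."""
    def neg_dual(lam):
        a = -v / lam
        m = a.max()                                   # log-sum-exp stability
        log_mgf = m + np.log(np.sum(p0 * np.exp(a - m)))
        return -(-lam * log_mgf - lam * rho)
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

# Toy robust backup for one (s, a): nominal next-state values and model.
v_next = np.array([1.0, 0.2, 0.8])
p_nominal = np.array([0.5, 0.3, 0.2])
worst_case_value = kl_robust_expectation(v_next, p_nominal, rho=0.1)
```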

[394] GTPO: Trajectory-Based Policy Optimization in Large Language Models

Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

Main category: cs.LG

TL;DR: GTPO improves GRPO by addressing conflicting gradient updates and policy collapse, enhancing stability and performance without KL-divergence regularization.

DetailsMotivation: GRPO has limitations like conflicting gradient updates and policy collapse, which degrade model performance.

Method: GTPO identifies conflict tokens, skips negative updates, amplifies positive ones, and filters high-entropy completions.

Result: GTPO outperforms GRPO on benchmarks (GSM8K, MATH, AIME 2024) with greater stability.

Conclusion: GTPO provides a more stable and effective policy optimization strategy than GRPO.

Abstract: Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveal and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though they can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens (tokens appearing in the same position across completions with opposite rewards) and protects them by skipping negative updates while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on GSM8K, MATH and AIME 2024 benchmarks.
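
A sketch of the two mechanisms on one prompt's group of completions: negative updates are skipped on conflict tokens, and completions whose entropy exceeds the threshold are filtered out. The amplification coefficient for positive updates and the derivation of the provable threshold are omitted; interfaces are illustrative.

```python
import torch

def gtpo_token_weights(tokens, advantages, entropies, ent_threshold):
    # tokens: (G, T) token ids of G completions for one prompt
    # advantages: (G,) per-completion advantages; entropies: (G,)
    G, T = tokens.shape
    weights = advantages.clamp(min=0).unsqueeze(1).expand(G, T).clone()
    neg = advantages < 0
    for t in range(T):
        pos_tokens_here = set(tokens[~neg, t].tolist())    # positively rewarded at t
        for g in torch.nonzero(neg).flatten().tolist():
            if tokens[g, t].item() not in pos_tokens_here:  # not a conflict token:
                weights[g, t] = advantages[g]               # keep the negative update
    weights[entropies > ent_threshold] = 0.0   # filter high-entropy completions
    return weights   # multiply into the per-token policy-gradient terms
```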

[395] Perch 2.0: The Bittern Lesson for Bioacoustics

Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, Tom Denton

Main category: cs.LG

TL;DR: Perch 2.0 is an advanced bioacoustic model trained on multi-taxa data, achieving state-of-the-art performance and strong transfer learning capabilities.

DetailsMotivation: To expand the capabilities of bioacoustic models beyond avian species and improve performance across diverse taxa.

Method: Trained with self-distillation, a prototype-learning classifier, and a new source-prediction criterion.

Result: Achieves top performance on BirdSet and BEANS benchmarks and excels in marine transfer learning tasks.

Conclusion: Fine-grained species classification is a robust pre-training task for bioacoustics.

Abstract: Perch is a performant pre-trained model for bioacoustics. It was trained in supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.

[396] U-PINet: End-to-End Hierarchical Physics-Informed Learning With Sparse Graph Coupling for 3D EM Scattering Modeling

Rui Zhu, Yuexing Peng, Peng Wang, George C. Alexandropoulos, Wenbo Wang, Wei Xiang

Main category: cs.LG

TL;DR: U-PINet is a physics-informed deep learning framework for EM scattering modeling, combining efficiency and physical consistency, outperforming traditional solvers and pure data-driven methods.

DetailsMotivation: Overcome limitations of traditional EM solvers (high computational cost) and pure data-driven deep learning (lack of physical constraints, need for extensive labeled data).

Method: Proposes U-PINet, a hierarchical deep learning framework using multiscale processing and physics-inspired sparse graph representation for EM coupling.

Result: Accurately predicts surface currents, matches traditional solver accuracy, reduces computational time, and outperforms deep learning baselines.

Conclusion: U-PINet is feasible for EM scattering applications, offering improved efficiency, generalization, and physical consistency.

Abstract: Electromagnetic (EM) scattering modeling is critical for radar remote sensing, however, its inherent complexity introduces significant computational challenges. Traditional numerical solvers offer high accuracy, but suffer from scalability issues and substantial computational costs. Pure data-driven deep learning approaches, while efficient, lack physical constraints embedding during training and require extensive labeled data, limiting their applicability and generalization. To overcome these limitations, we propose a U-shaped Physics-Informed Network (U-PINet), the first fully deep-learning-based, physics-informed hierarchical framework for computational EM designed to ensure physical consistency while maximizing computational efficiency. Motivated by the hierarchical decomposition strategy in EM solvers and the inherent sparsity of local EM coupling, the U-PINet models the decomposition and coupling of near- and far-field interactions through a multiscale processing neural network architecture, while employing a physics-inspired sparse graph representation to efficiently model both self- and mutual-coupling among mesh elements of complex 3-Dimensional (3D) objects. This principled approach enables end-to-end multiscale EM scattering modeling with improved efficiency, generalization, and physical consistency. Experimental results showcase that the U-PINet accurately predicts surface current distributions, achieving close agreement with the traditional solver, while significantly reducing computational time and outperforming conventional deep learning baselines in both accuracy and robustness. Furthermore, our evaluations on radar cross section prediction tasks confirm the feasibility of the U-PINet for downstream EM scattering applications.
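
A toy rendering of the sparse-coupling idea: mesh elements are coupled directly only within a near-field radius, giving a sparse graph over which features propagate. The radius, edge weights, and message-passing form are assumptions; the paper's hierarchical near/far-field decomposition is richer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy mesh-element centroids of a 3D object.
centroids = np.random.rand(500, 3)

# Physics-inspired sparsity: only elements within the near-field radius are
# coupled directly (diagonal entries give the self-coupling).
r_near = 0.15
d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
rows, cols = np.nonzero(d < r_near)
A = csr_matrix((np.exp(-d[rows, cols] / r_near), (rows, cols)), shape=(500, 500))

# One sparse message-passing step over self- and mutual-coupling:
x = np.random.randn(500, 16)                  # per-element features
x = np.tanh(A @ x @ np.random.randn(16, 16))  # sketch of a graph-coupling layer
```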

[397] Revisiting Heat Flux Analysis of Tungsten Monoblock Divertor on EAST using Physics-Informed Neural Network

Xiao Wang, Zikang Yan, Hao Si, Zhendong Yang, Qingquan Yang, Dengdi Sun, Wanli Lyu, Jin Tang

Main category: cs.LG

TL;DR: A Physics-Informed Neural Network (PINN) is proposed to estimate heat flux in EAST, outperforming traditional FEM in speed while maintaining accuracy.

DetailsMotivation: Traditional FEM is inefficient for real-time heat flux estimation in nuclear fusion devices like EAST. PINN offers a faster, accurate alternative.

Method: PINN uses spatial coordinates, time stamps, and heat conduction equations to compute losses. Data-driven sampling enhances predictive capability.

Result: PINN matches FEM accuracy and achieves 40x speedup in computational efficiency under uniform and non-uniform heating conditions.

Conclusion: PINN is a viable, efficient solution for real-time heat flux estimation in fusion devices, with potential for broader scientific computing applications.

Abstract: Estimating heat flux in the nuclear fusion device EAST is a critically important task. Traditional scientific computing methods typically model this process using the Finite Element Method (FEM). However, FEM relies on grid-based sampling for computation, which is computationally inefficient and hard to perform real-time simulations during actual experiments. Inspired by artificial intelligence-powered scientific computing, this paper proposes a novel Physics-Informed Neural Network (PINN) to address this challenge, significantly accelerating the heat conduction estimation process while maintaining high accuracy. Specifically, given inputs of different materials, we first feed spatial coordinates and time stamps into the neural network, and compute boundary loss, initial condition loss, and physical loss based on the heat conduction equation. Additionally, we sample a small number of data points in a data-driven manner to better fit the specific heat conduction scenario, further enhancing the model’s predictive capability. We conduct experiments under both uniform and non-uniform heating conditions on the top surface. Experimental results show that the proposed thermal conduction physics-informed neural network achieves accuracy comparable to the finite element method, while achieving a $40\times$ acceleration in computational efficiency. The dataset and source code will be released on https://github.com/Event-AHU/OpenFusion.
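
The physical loss is the PDE residual evaluated by automatic differentiation at collocation points. A minimal PyTorch sketch for a 2-D heat equation; the paper's setup additionally handles material-dependent inputs, boundary/initial losses, and a small data-driven term.

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)

def physics_loss(xyt, alpha=1.0):
    """Residual of the heat equation u_t = alpha * (u_xx + u_yy) at
    collocation points xyt = (x, y, t)."""
    xyt = xyt.requires_grad_(True)
    u = net(xyt)
    grads = torch.autograd.grad(u.sum(), xyt, create_graph=True)[0]
    u_x, u_y, u_t = grads[:, 0], grads[:, 1], grads[:, 2]
    u_xx = torch.autograd.grad(u_x.sum(), xyt, create_graph=True)[0][:, 0]
    u_yy = torch.autograd.grad(u_y.sum(), xyt, create_graph=True)[0][:, 1]
    return ((u_t - alpha * (u_xx + u_yy)) ** 2).mean()

colloc = torch.rand(1024, 3)      # (x, y, t) collocation points
loss = physics_loss(colloc)       # + boundary, initial, and data losses
loss.backward()
```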

[398] SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons

Teodor Chiaburu, Vipin Singh, Frank Haußer, Felix Bießmann

Main category: cs.LG

TL;DR: SoilNet is a multimodal multitask model for soil horizon classification, addressing challenges like hierarchical labels and data imbalance by integrating image and geotemporal data.

DetailsMotivation: Soil horizon classification is critical for soil health but remains challenging due to multimodal, multitask, and hierarchical label complexities.

Method: SoilNet uses a modular pipeline: predicts depth markers, segments soil profiles, extracts features, and predicts labels using a graph-based hierarchical representation.

Result: Demonstrated effectiveness on real-world soil profile data.

Conclusion: SoilNet successfully tackles complex hierarchical soil classification, with code and experiments available.

Abstract: While recent advances in foundation models have improved the state of the art in many domains, some problems in empirical sciences could not benefit from this progress yet. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil health, which directly impacts agricultural productivity, food security, ecosystem stability and climate resilience. In this work, we propose $\textit{SoilNet}$ - a multimodal multitask model to tackle this problem through a structured modularized pipeline. Our approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR/

[399] Bernoulli-LoRA: A Theoretical Framework for Randomized Low-Rank Adaptation

Igor Sokolov, Abdurakhmon Sadiev, Yury Demidovich, Fawaz S Al-Qahtani, Peter Richtárik

Main category: cs.LG

TL;DR: Bernoulli-LoRA introduces a probabilistic framework for parameter-efficient fine-tuning, unifying existing LoRA methods with theoretical guarantees and practical efficacy.

DetailsMotivation: To address the limited theoretical understanding of LoRA-based PEFT methods and provide a unified, rigorous framework.

Method: A probabilistic Bernoulli mechanism for matrix updates, analyzed under non-convex and convex assumptions with various optimization variants.

Result: Convergence guarantees for multiple variants and empirical validation of practical effectiveness.

Conclusion: Bernoulli-LoRA advances theoretically grounded PEFT methods while maintaining practical utility.

Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large foundational models to specific tasks, particularly as model sizes continue to grow exponentially. Among PEFT methods, Low-Rank Adaptation (LoRA) (arXiv:2106.09685) stands out for its effectiveness and simplicity, expressing adaptations as a product of two low-rank matrices. While extensive empirical studies demonstrate LoRA’s practical utility, theoretical understanding of such methods remains limited. Recent work on RAC-LoRA (arXiv:2410.08305) took initial steps toward rigorous analysis. In this work, we introduce Bernoulli-LoRA, a novel theoretical framework that unifies and extends existing LoRA approaches. Our method introduces a probabilistic Bernoulli mechanism for selecting which matrix to update. This approach encompasses and generalizes various existing update strategies while maintaining theoretical tractability. Under standard assumptions from non-convex optimization literature, we analyze several variants of our framework: Bernoulli-LoRA-GD, Bernoulli-LoRA-SGD, Bernoulli-LoRA-PAGE, Bernoulli-LoRA-MVR, Bernoulli-LoRA-QGD, Bernoulli-LoRA-MARINA, and Bernoulli-LoRA-EF21, establishing convergence guarantees for each variant. Additionally, we extend our analysis to convex non-smooth functions, providing convergence rates for both constant and adaptive (Polyak-type) stepsizes. Through extensive experiments on various tasks, we validate our theoretical findings and demonstrate the practical efficacy of our approach. This work is a step toward developing theoretically grounded yet practically effective PEFT methods.
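
The core mechanism is simple to state in code: at each optimization step a Bernoulli coin decides whether the A or the B factor receives the update. A sketch; the initialisation scale and the coin probability p are illustrative, and the optimizer variants analysed in the paper are orthogonal to this.

```python
import torch

class BernoulliLoRALinear(torch.nn.Module):
    """Frozen pretrained weight plus a low-rank adapter B @ A, where a coin
    flip selects which factor is trainable on the current step."""
    def __init__(self, d_in, d_out, rank=8, p=0.5):
        super().__init__()
        self.W = torch.nn.Linear(d_in, d_out)
        self.W.requires_grad_(False)                 # frozen pretrained weight
        self.A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))
        self.p = p

    def forward(self, x):
        return self.W(x) + x @ self.A.T @ self.B.T

    def flip(self):
        """Call once per step, before loss.backward()."""
        update_a = bool(torch.rand(()) < self.p)
        self.A.requires_grad_(update_a)
        self.B.requires_grad_(not update_a)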

[400] Scalable Neural Network-based Blackbox Optimization

Pavankumar Koratikere, Leifur Leifsson

Main category: cs.LG

TL;DR: SNBO is a scalable neural network-based blackbox optimization method that avoids model uncertainty estimation, outperforming baselines in efficiency and runtime.

DetailsMotivation: Address scalability and computational challenges of Bayesian Optimization (BO) and NN-based BO in high-dimensional spaces.

Method: SNBO uses separate criteria for exploration and exploitation, adaptively controlling the sampling region without relying on model uncertainty estimation.

Result: SNBO outperforms baselines, achieving better function values with 40-60% fewer evaluations and significantly reduced runtime.

Conclusion: SNBO is an efficient, scalable alternative to traditional BO and NN-based methods, particularly in high dimensions.

Abstract: Bayesian Optimization (BO) is a widely used approach for blackbox optimization that leverages a Gaussian process (GP) model and an acquisition function to guide future sampling. While effective in low-dimensional settings, BO faces scalability challenges in high-dimensional spaces and with large number of function evaluations due to the computational complexity of GP models. In contrast, neural networks (NNs) offer better scalability and can model complex functions, which led to the development of NN-based BO approaches. However, these methods typically rely on estimating model uncertainty in NN prediction – a process that is often computationally intensive and complex, particularly in high dimensions. To address these limitations, a novel method, called scalable neural network-based blackbox optimization (SNBO), is proposed that does not rely on model uncertainty estimation. Specifically, SNBO adds new samples using separate criteria for exploration and exploitation, while adaptively controlling the sampling region to ensure efficient optimization. SNBO is evaluated on a range of optimization problems spanning from 10 to 102 dimensions and compared against four state-of-the-art baseline algorithms. Across the majority of test problems, SNBO attains function values better than the best-performing baseline algorithm, while requiring 40-60% fewer function evaluations and reducing the runtime by at least an order of magnitude.
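
A sketch of how exploration and exploitation can be separated without uncertainty estimates: exploit by minimising the NN surrogate inside an adaptively shrunken trust region around the incumbent, and explore by maximising distance to points already evaluated. All criteria and names here are assumptions about the general idea, not the paper's exact rules.

```python
import numpy as np

def propose(surrogate, X_seen, y_seen, bounds, n_cand=4096, shrink=0.5):
    """surrogate: trained NN predictor, callable on an (n, d) array.
    Returns one exploitation and one exploration candidate."""
    lo, hi = bounds
    best = X_seen[np.argmin(y_seen)]
    width = shrink * (hi - lo)
    # exploitation: sample around the incumbent, score with the surrogate
    local = np.clip(best + width * (np.random.rand(n_cand, len(best)) - 0.5), lo, hi)
    x_exploit = local[np.argmin(surrogate(local))]
    # exploration: global samples, scored by distance to the evaluated set
    glob = lo + (hi - lo) * np.random.rand(n_cand, len(best))
    dist = np.min(np.linalg.norm(glob[:, None, :] - X_seen[None, :, :], axis=-1), axis=1)
    x_explore = glob[np.argmax(dist)]
    return x_exploit, x_explore
```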

[401] Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR

Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour

Main category: cs.LG

TL;DR: The paper proposes using multi-modal multi-task federated foundation models (FedFMs) to enhance XR systems, addressing challenges like sensor diversity and privacy through modular architecture and SHIFT dimensions.

DetailsMotivation: To integrate the strengths of foundation models and federated learning for privacy-preserving, intelligent XR systems.

Method: A modular architecture for FedFMs, addressing XR challenges under SHIFT dimensions (Sensor diversity, Hardware heterogeneity, Interactivity, Task variability, Temporality).

Result: Illustrates SHIFT dimensions in XR applications and proposes evaluation metrics, dataset needs, and design tradeoffs.

Conclusion: Lays foundations for context-aware, privacy-preserving intelligence in next-gen XR systems.

Abstract: Extended reality (XR) systems, which consist of virtual reality (VR), augmented reality (AR), and mixed reality (MR), offer a transformative interface for immersive, multi-modal, and embodied human-computer interaction. In this paper, we envision that multi-modal multi-task (M3T) federated foundation models (FedFMs) can offer transformative capabilities for XR systems through integrating the representational strength of M3T foundation models (FMs) with the privacy-preserving model training principles of federated learning (FL). We present a modular architecture for FedFMs, which entails different coordination paradigms for model training and aggregations. Central to our vision is the codification of XR challenges that affect the implementation of FedFMs under the SHIFT dimensions: (1) Sensor and modality diversity, (2) Hardware heterogeneity and system-level constraints, (3) Interactivity and embodied personalization, (4) Functional/task variability, and (5) Temporality and environmental variability. We illustrate the manifestation of these dimensions across a set of emerging and anticipated applications of XR systems. Finally, we propose evaluation metrics, dataset requirements, and design tradeoffs necessary for the development of resource-aware FedFMs in XR. This perspective aims to chart the technical and conceptual foundations for context-aware privacy-preserving intelligence in the next generation of XR systems.

[402] DP-NCB: Privacy Preserving Fair Bandits

Dhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury

Main category: cs.LG

TL;DR: The paper introduces DP-NCB, a framework for bandit algorithms that ensures privacy and fairness simultaneously, achieving optimal Nash regret.

DetailsMotivation: Address the gap in existing bandit algorithms by combining privacy (differential privacy) and fairness (Nash regret) in socially sensitive applications.

Method: Proposes Differentially Private Nash Confidence Bound (DP-NCB), a unified framework operating under global/local privacy models without needing prior time horizon knowledge.

Result: DP-NCB achieves order-optimal Nash regret, outperforming baselines in simulations, and matches theoretical lower bounds.

Conclusion: DP-NCB provides a principled solution for privacy-preserving and fair bandit algorithms, suitable for high-stakes applications.

Abstract: Multi-armed bandit algorithms are fundamental tools for sequential decision-making under uncertainty, with widespread applications across domains such as clinical trials and personalized decision-making. As bandit algorithms are increasingly deployed in these socially sensitive settings, it becomes critical to protect user data privacy and ensure fair treatment across decision rounds. While prior work has independently addressed privacy and fairness in bandit settings, the question of whether both objectives can be achieved simultaneously has remained largely open. Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns. To bridge this gap, we introduce Differentially Private Nash Confidence Bound (DP-NCB)-a novel and unified algorithmic framework that simultaneously ensures $\epsilon$-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. The framework is sufficiently general to operate under both global and local differential privacy models, and is anytime, requiring no prior knowledge of the time horizon. We support our theoretical guarantees with simulations on synthetic bandit instances, showing that DP-NCB incurs substantially lower Nash regret than state-of-the-art baselines. Our results offer a principled foundation for designing bandit algorithms that are both privacy-preserving and fair, making them suitable for high-stakes, socially impactful applications.
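
For intuition only, here is a generic Laplace-noised index policy showing how an $\epsilon$-DP constraint typically enters a confidence bound. This is not the paper's DP-NCB, whose noise calibration and Nash-regret analysis are substantially more involved.

```python
import numpy as np

def private_ucb_pull(counts, sums, t, eps):
    """Illustrative Laplace-noised UCB index: rewards assumed in [0, 1],
    per-arm reward sums privatised with Laplace(1/eps) noise, plus an extra
    bonus term that compensates for the privacy noise."""
    n = np.maximum(counts, 1)
    noisy_mean = (sums + np.random.laplace(0.0, 1.0 / eps, size=len(sums))) / n
    bonus = np.sqrt(2 * np.log(t + 1) / n) + np.log(t + 1) / (eps * n)
    return int(np.argmax(noisy_mean + bonus))
```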

[403] VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations

Yifei Zong, Alexandre M. Tartakovsky

Main category: cs.LG

TL;DR: VAE-DNN, a trainable-by-parts surrogate model, outperforms FNO and DeepONet in efficiency and accuracy for solving nonlinear PDEs.

DetailsMotivation: To reduce training time and energy while improving accuracy for forward and inverse solutions of parameterized nonlinear PDEs.

Method: Uses an encoder, neural network, and decoder trained independently via VAEs for separable training.

Result: VAE-DNN shows greater efficiency and accuracy compared to FNO and DeepONet in solving nonlinear diffusion equations.

Conclusion: VAE-DNN is a promising alternative for efficient and accurate PDE solutions.

Abstract: We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations. Like several other surrogate and operator learning models, the proposed approach employs an encoder to reduce the high-dimensional input $y(\bm{x})$ to a lower-dimensional latent space, $\bm\mu_{\bm\phi_y}$. Then, a fully connected neural network is used to map $\bm\mu_{\bm\phi_y}$ to the latent space, $\bm\mu_{\bm\phi_h}$, of the PDE solution $h(\bm{x},t)$. Finally, a decoder is utilized to reconstruct $h(\bm{x},t)$. The innovative aspect of our model is its ability to train its three components independently. This approach leads to a substantial decrease in both the time and energy required for training when compared to leading operator learning models such as FNO and DeepONet. The separable training is achieved by training the encoder as part of the variational autoencoder (VAE) for $y(\bm{x})$ and the decoder as part of the $h(\bm{x},t)$ VAE. We refer to this model as the VAE-DNN model. VAE-DNN is compared to the FNO and DeepONet models for obtaining forward and inverse solutions to the nonlinear diffusion equation governing groundwater flow in an unconfined aquifer. Our findings indicate that VAE-DNN not only demonstrates greater efficiency but also delivers superior accuracy in both forward and inverse solutions compared to the FNO and DeepONet models.
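
The trainable-by-parts structure in outline: two VAEs fitted independently, then a small latent-to-latent network regressed with both VAEs frozen. Dimensions and architectures below are placeholders.

```python
import torch
import torch.nn as nn

d_y, d_h, d_lat = 256, 512, 16

enc_y = nn.Sequential(nn.Linear(d_y, 128), nn.ReLU(), nn.Linear(128, 2 * d_lat))
dec_y = nn.Sequential(nn.Linear(d_lat, 128), nn.ReLU(), nn.Linear(128, d_y))
enc_h = nn.Sequential(nn.Linear(d_h, 128), nn.ReLU(), nn.Linear(128, 2 * d_lat))
dec_h = nn.Sequential(nn.Linear(d_lat, 128), nn.ReLU(), nn.Linear(128, d_h))
mapper = nn.Sequential(nn.Linear(d_lat, 64), nn.Tanh(), nn.Linear(64, d_lat))

# Trainable-by-parts schedule (each stage has its own loss and optimizer):
# 1) fit (enc_y, dec_y) as a VAE on parameter fields y(x)
# 2) fit (enc_h, dec_h) as a VAE on solution fields h(x, t)
# 3) with both VAEs frozen, regress mapper: mu_phi_y -> mu_phi_h
def forward_surrogate(y):
    mu_y = enc_y(y)[:, :d_lat]     # encoder mean of the y-VAE
    mu_h = mapper(mu_y)            # latent-to-latent DNN
    return dec_h(mu_h)             # reconstructed solution h(x, t)
```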

[404] Data-Driven Spectrum Demand Prediction: A Spatio-Temporal Framework with Transfer Learning

Amin Farajzadeh, Hongzhao Zheng, Sarah Dumoulin, Trevor Ha, Halim Yanikomeroglu, Amir Ghasemi

Main category: cs.LG

TL;DR: A spatio-temporal framework for spectrum demand prediction using crowdsourced KPIs and regulatory data, outperforming traditional ITU models.

DetailsMotivation: To improve spectrum allocation, regulatory planning, and support emerging technologies like 5G, 6G, and IoT by providing accurate demand predictions.

Method: Leverages crowdsourced KPIs and regulatory datasets with feature engineering, correlation analysis, and transfer learning for cross-regional generalizability.

Result: Superior prediction accuracy compared to ITU benchmarks, with validated efficacy in real-world experiments.

Conclusion: The framework offers a robust, data-driven solution for policymakers to enhance spectrum management and planning.

Abstract: Accurate spectrum demand prediction is crucial for informed spectrum allocation, effective regulatory planning, and fostering sustainable growth in modern wireless communication networks. It supports governmental efforts, particularly those led by the International Telecommunication Union (ITU), to establish fair spectrum allocation policies, improve auction mechanisms, and meet the requirements of emerging technologies such as advanced 5G, forthcoming 6G, and the Internet of Things (IoT). This paper presents an effective spatio-temporal prediction framework that leverages crowdsourced user-side key performance indicators (KPIs) and regulatory datasets to model and forecast spectrum demand. The proposed methodology achieves superior prediction accuracy and cross-regional generalizability by incorporating advanced feature engineering, comprehensive correlation analysis, and transfer learning techniques. Unlike traditional ITU models, which are often constrained by arbitrary inputs and unrealistic assumptions, this approach exploits granular, data-driven insights to account for spatial and temporal variations in spectrum utilization. Comparative evaluations against ITU estimates, as the benchmark, underscore our framework’s capability to deliver more realistic and actionable predictions. Experimental results validate the efficacy of our methodology, highlighting its potential as a robust approach for policymakers and regulatory bodies to enhance spectrum management and planning.

[405] Prediction-Oriented Subsampling from Data Streams

Benedetta Lavinia Mussati, Freddie Bickford Smith, Tom Rainforth, Stephen Roberts

Main category: cs.LG

TL;DR: The paper proposes an information-theoretic method for intelligent data subsampling in offline learning from data streams, focusing on reducing prediction uncertainty, and shows it outperforms prior techniques.

DetailsMotivation: The challenge of capturing relevant information from data streams while managing computational costs motivates the exploration of intelligent subsampling methods.

Method: The study uses an information-theoretic approach centered on reducing uncertainty in downstream predictions for offline learning from data streams.

Result: Empirical results show the prediction-oriented approach outperforms a prior information-theoretic technique on two widely studied problems.

Conclusion: Reliable strong performance in practice requires careful model design, as demonstrated by the study.

Abstract: Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data subsampling for offline learning, and argue for an information-theoretic method centred on reducing uncertainty in downstream predictions of interest. Empirically, we demonstrate that this prediction-oriented approach performs better than a previously proposed information-theoretic technique on two widely studied problems. At the same time, we highlight that reliably achieving strong performance in practice requires careful model design.
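
One concrete instantiation of a prediction-oriented score, offered as an assumption about the paper's criterion in the spirit of expected predictive information gain: the mutual information between a candidate point's label and the model's predictions at target inputs, estimated from an ensemble.

```python
import numpy as np

def prediction_oriented_score(p_cand, p_targ):
    """p_cand: (M, C) ensemble class probabilities for one stream candidate;
    p_targ: (M, K, C) ensemble probabilities at K target inputs. Returns the
    average mutual information between the candidate's label and each
    target prediction under the ensemble's joint predictive."""
    M, K, C = p_targ.shape
    mi = 0.0
    for k in range(K):
        joint = (p_cand[:, :, None] * p_targ[:, k, None, :]).mean(axis=0)  # (C, C)
        marg_c = joint.sum(axis=1, keepdims=True)
        marg_t = joint.sum(axis=0, keepdims=True)
        mi += np.sum(joint * np.log((joint + 1e-12) / (marg_c * marg_t + 1e-12)))
    return mi / K

# Subsample the stream by keeping the highest-scoring points.
```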

[406] Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training

Wesley Brewer, Murali Meena Gopalakrishnan, Matthias Maiterth, Aditya Kashi, Jong Youl Choi, Pei Zhang, Stephen Nichols, Riccardo Balin, Miles Couchman, Stephen de Bruyn Kops, P. K. Yeung, Daniel Dotson, Rohini Uma-Vaideswaran, Sarp Oral, Feiyi Wang

Main category: cs.LG

TL;DR: SICKLE, a sparse intelligent curation framework, uses MaxEnt sampling to train models with less data, improving accuracy and reducing energy consumption by up to 38x.

DetailsMotivation: With Moore's law and Dennard scaling ending, efficient training requires reducing data volume while maintaining model performance.

Method: Developed SICKLE with MaxEnt sampling, compared it with random and phase-space sampling on turbulence DNS datasets, and evaluated scalability on Frontier.

Result: Subsampling as preprocessing improved model accuracy and reduced energy consumption by up to 38x.

Conclusion: Intelligent subsampling (e.g., SICKLE) can significantly enhance efficiency in model training with less data.

Abstract: With the end of Moore’s law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can improve model accuracy and substantially lower energy consumption, with reductions of up to 38x observed in certain cases.
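
A 1-D caricature of MaxEnt curation: greedily keep the sample that most increases the entropy of the selected set's histogram, which amounts to preferring under-represented bins. SICKLE applies this idea to high-dimensional DNS fields at scale; the binning and greedy rule here are illustrative.

```python
import numpy as np

def maxent_subsample(x, n_keep, n_bins=64):
    """Greedy maximum-entropy subsampling of a 1-D feature: adding to the
    least-populated bin gives the largest entropy increase."""
    bins = np.digitize(x, np.linspace(x.min(), x.max(), n_bins))
    counts = np.zeros(n_bins + 2)
    chosen, remaining = [], list(range(len(x)))
    for _ in range(n_keep):
        gains = -counts[bins[remaining]]          # rarest bin first
        i = remaining.pop(int(np.argmax(gains)))
        counts[bins[i]] += 1
        chosen.append(i)
    return np.array(chosen)

keep = maxent_subsample(np.random.randn(10000), n_keep=500)
```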

[407] Reinforcement Learning for Target Zone Blood Glucose Control

David H. Mguni, Jing Dong, Wanrong Yang, Ziquan Liu, Muhammad Salman Haleem, Baoxiang Wang

Main category: cs.LG

TL;DR: A novel RL framework for T1DM combines impulse and switching control to improve treatment personalization, reducing blood glucose violations from 22.4% to 10.8%.

DetailsMotivation: Managing physiological variables in chronic conditions like T1DM is challenging due to delayed and heterogeneous treatment effects. RL can personalize treatment but struggles with these dynamics.

Method: The framework unifies impulse control (fast-acting interventions) and switching control (longer-acting treatments) in a constrained Markov decision process with physiological state features.

Result: Empirical results show a reduction in blood glucose level violations from 22.4% to 10.8% in a T1DM control task.

Conclusion: The work lays a foundation for safe, temporally-aware RL in healthcare, though not yet ready for clinical deployment.

Abstract: Managing physiological variables within clinically safe target zones is a central challenge in healthcare, particularly for chronic conditions such as Type 1 Diabetes Mellitus (T1DM). Reinforcement learning (RL) offers promise for personalising treatment, but struggles with the delayed and heterogeneous effects of interventions. We propose a novel RL framework to study and support decision-making in T1DM technologies, such as automated insulin delivery. Our approach captures the complex temporal dynamics of treatment by unifying two control modalities: \textit{impulse control} for discrete, fast-acting interventions (e.g., insulin boluses), and \textit{switching control} for longer-acting treatments and regime shifts. The core of our method is a constrained Markov decision process augmented with physiological state features, enabling safe policy learning under clinical and resource constraints. The framework incorporates biologically realistic factors, including insulin decay, leading to policies that better reflect real-world therapeutic behaviour. While not intended for clinical deployment, this work establishes a foundation for future safe and temporally-aware RL in healthcare. We provide theoretical guarantees of convergence and demonstrate empirical improvements in a stylised T1DM control task, reducing blood glucose level violations from 22.4% (state-of-the-art) to as low as 10.8%.

[408] Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning

William Solow, Sandhya Saisubramanian

Main category: cs.LG

TL;DR: A hybrid model combining multi-task learning with a recurrent neural network improves grape phenology prediction, outperforming traditional biophysical and deep learning methods.

DetailsMotivation: Accurate grape phenology prediction is crucial for vineyard management, but existing methods lack precision or suffer from sparse datasets.

Method: Proposes a hybrid approach using multi-task learning to parameterize a differentiable biophysical model with a recurrent neural network.

Result: Outperforms conventional biophysical models and deep learning baselines in predicting phenological stages and crop state variables.

Conclusion: The hybrid model enhances prediction robustness and accuracy, benefiting vineyard management.

Abstract: Accurate prediction of grape phenology is essential for timely vineyard management decisions, such as scheduling irrigation and fertilization, to maximize crop yield and quality. While traditional biophysical models calibrated on historical field data can be used for season-long predictions, they lack the precision required for fine-grained vineyard management. Deep learning methods are a compelling alternative but their performance is hindered by sparse phenology datasets, particularly at the cultivar level. We propose a hybrid modeling approach that combines multi-task learning with a recurrent neural network to parameterize a differentiable biophysical model. By using multi-task learning to predict the parameters of the biophysical model, our approach enables shared learning across cultivars while preserving biological structure, thereby improving the robustness and accuracy of predictions. Empirical evaluation using real-world and synthetic datasets demonstrates that our method significantly outperforms both conventional biophysical models and baseline deep learning approaches in predicting phenological stages, as well as other crop state variables such as cold-hardiness and wheat yield.
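
A compact sketch of the hybrid pattern: a shared RNN reads weather and cultivar context and emits the parameters of a differentiable biophysical model, here a toy growing-degree-day accumulator. The paper's biophysical model and multi-task heads are richer; everything named below is a placeholder.

```python
import torch
import torch.nn as nn

class PhenologyHybrid(nn.Module):
    """RNN parameterizes a differentiable growing-degree-day (GDD) model:
    learned base temperature plus increasing stage thresholds."""
    def __init__(self, d_in=8, d_hid=32, n_stages=4):
        super().__init__()
        self.rnn = nn.GRU(d_in, d_hid, batch_first=True)   # shared across cultivars
        self.head = nn.Linear(d_hid, 1 + n_stages)         # base temp + thresholds

    def forward(self, weather, temp):
        # weather: (B, T, d_in); temp: (B, T) daily mean temperature
        h, _ = self.rnn(weather)
        params = self.head(h[:, -1])
        t_base = params[:, :1]
        thresholds = torch.cumsum(nn.functional.softplus(params[:, 1:]), dim=1)
        gdd = torch.cumsum(torch.relu(temp - t_base), dim=1)       # (B, T)
        # soft, differentiable indicator that each stage has been reached
        return torch.sigmoid(gdd.unsqueeze(-1) - thresholds.unsqueeze(1))
```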

[409] Fast and Accurate Explanations of Distance-Based Classifiers by Uncovering Latent Explanatory Structures

Florian Bley, Jacob Kauffmann, Simon León Krug, Klaus-Robert Müller, Grégoire Montavon

Main category: cs.LG

TL;DR: The paper reveals a hidden neural network structure in distance-based classifiers, enabling Explainable AI techniques like LRP for better interpretability.

DetailsMotivation: To enhance the explainability of distance-based classifiers (e.g., k-NN, SVM) for practical insights, leveraging latent structures as in neural networks.

Method: Uncover a neural network-like structure in distance-based classifiers (linear detection units + nonlinear pooling) and apply LRP for explanations.

Result: Quantitative evaluations show the novel explanation approach outperforms baselines, validated by practical use cases.

Conclusion: The approach successfully bridges Explainable AI techniques with distance-based models, improving interpretability and utility.

Abstract: Distance-based classifiers, such as k-nearest neighbors and support vector machines, continue to be a workhorse of machine learning, widely used in science and industry. In practice, to derive insights from these models, it is also important to ensure that their predictions are explainable. While the field of Explainable AI has supplied methods that are in principle applicable to any model, it has also emphasized the usefulness of latent structures (e.g. the sequence of layers in a neural network) to produce explanations. In this paper, we contribute by uncovering a hidden neural network structure in distance-based classifiers (consisting of linear detection units combined with nonlinear pooling layers) upon which Explainable AI techniques such as layer-wise relevance propagation (LRP) become applicable. Through quantitative evaluations, we demonstrate the advantage of our novel explanation approach over several baselines. We also show the overall usefulness of explaining distance-based models through two practical use cases.

[410] Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data

John D. Kelleher, Matthew Nicholson, Rahul Agrahari, Clare Conran

Main category: cs.LG

TL;DR: Combining active and transfer learning for anomaly detection in cross-domain time-series data shows limited performance gains, with clustering often unnecessary and active learning improvements slower than expected.

DetailsMotivation: To explore the interaction between active learning and transfer learning for anomaly detection in cross-domain time-series data and assess their combined effectiveness.

Method: Combined active learning and transfer learning, tested with and without clustering, and evaluated performance across datasets.

Result: Best performance without clustering; active learning improves performance linearly but slower than literature suggests. Transfer learning performance peaks then tails off.

Conclusion: Active learning is effective but yields diminishing returns; clustering is often unnecessary, and performance gains are linear and slower than expected.

Abstract: This paper examines the effectiveness of combining active learning and transfer learning for anomaly detection in cross-domain time-series data. Our results indicate that there is an interaction between clustering and active learning, and in general the best performance is achieved using a single cluster (in other words, when clustering is not applied). Also, we find that adding new samples to the training set using active learning does improve model performance but that, in general, the rate of improvement is slower than the results reported in the literature suggest. We attribute this difference to an improved experimental design where distinct data samples are used for the sampling and testing pools. Finally, we assess the ceiling performance of transfer learning in combination with active learning across several datasets and find that performance does initially improve but eventually begins to tail off as more target points are selected for inclusion in training. This tail-off in performance may indicate that the active learning process is doing a good job of sequencing data points for selection, pushing the less useful points towards the end of the selection process, and that this tail-off occurs when these less useful points are eventually added. Taken together, our results indicate that active learning is effective but that the improvement in model performance follows a flat linear function of the number of points selected and labelled.

[411] Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning

Hector Vargas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos

Main category: cs.LG

TL;DR: The paper proposes a four-stage manifold and machine learning approach to bridge microscopic and macroscopic crowd dynamics, using agent-based simulations to learn latent space evolution operators.

DetailsMotivation: To address the challenge of connecting microscopic and macroscopic modeling scales in crowd dynamics for numerical analysis, optimization, and control.

Method: 1. Derive macroscopic fields from microscopic data using KDE. 2. Map to latent space via manifold learning (POD). 3. Learn reduced-order surrogate models (LSTMs, MVARs). 4. Reconstruct dynamics in high-dimensional space.

Result: High accuracy, robustness, and generalizability in modeling crowd dynamics, with mass conservation in density reconstruction.

Conclusion: The framework effectively bridges scales, enabling fast and accurate crowd dynamics simulation from agent-based data.

Abstract: Bridging the microscopic and the macroscopic modelling scales in crowd dynamics constitutes an important, open challenge for systematic numerical analysis, optimization, and control. We propose a combined manifold and machine learning approach to learn the discrete evolution operator for the emergent crowd dynamics in latent spaces from high-fidelity agent-based simulations. The proposed framework builds upon our previous works on next-generation Equation-free algorithms on learning surrogate models for high-dimensional and multiscale systems. Our approach is a four-stage one, explicitly conserving the mass of the reconstructed dynamics in the high-dimensional space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians’ positions) using KDE. In the second step, based on manifold learning, we construct a map from the macroscopic ambient space into the latent space parametrized by a few coordinates based on POD of the corresponding density distribution. The third step involves learning reduced-order surrogate models (ROMs) in the latent space using machine learning techniques, particularly LSTM networks and MVARs. Finally, we reconstruct the crowd dynamics in the high-dimensional space in terms of macroscopic density profiles. We demonstrate that the POD reconstruction of the density distribution via SVD conserves the mass. With this “embed->learn in latent space->lift back to the ambient space” pipeline, we create an effective solution operator of the unavailable macroscopic PDE for the density evolution. For our illustrations, we use the Social Force Model to generate data in a corridor with an obstacle, imposing periodic boundary conditions. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling/simulation of crowd dynamics from agent-based simulations.
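
Steps 2 and 4 of the pipeline reduce to a truncated SVD of the density snapshot matrix and its transpose map back. A numpy sketch with a toy snapshot matrix; the mass-conservation property is proved for the paper's construction, and here we simply inspect the reconstructed column sums.

```python
import numpy as np

# Snapshot matrix: each column is a KDE density field at one time step.
D = np.abs(np.random.randn(4096, 200))            # toy stand-in for KDE densities
D /= D.sum(axis=0, keepdims=True)                 # each snapshot sums to 1

U, s, Vt = np.linalg.svd(D, full_matrices=False)  # POD of the density snapshots
r = 8
latent = U[:, :r].T @ D                           # step 2: map to latent space
# ... step 3 would fit an LSTM/MVAR surrogate on `latent` ...
D_rec = U[:, :r] @ latent                         # step 4: lift back

print(D_rec.sum(axis=0)[:3])                      # check mass of reconstructions
```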

[412] Markov Chain Estimation with In-Context Learning

Simon Lepage, Jeremie Mary, David Picard

Main category: cs.LG

TL;DR: Transformers can learn transition probabilities from context when trained on next-token prediction, surpassing memorization if model and dataset sizes exceed a threshold. Better state encoding improves robustness.

DetailsMotivation: To explore if transformers can learn algorithms (like Markov chain transitions) from context without memorizing training data.

Method: Train transformers on next-token prediction for Markov chains with random transition matrices, varying model and dataset sizes. Test on unseen matrices.

Result: Transformers learn transition probabilities beyond memorization if model and dataset sizes exceed a threshold. Improved state encoding enhances robustness.

Conclusion: Transformers can generalize to unseen Markov chains when properly scaled and encoded, demonstrating algorithmic learning capability.

Abstract: We investigate the capacity of transformers to learn algorithms involving their context while solely being trained using next token prediction. We set up Markov chains with random transition matrices and we train transformers to predict the next token. Matrices used during training and test are different and we show that there is a threshold in transformer size and in training set size above which the model is able to learn to estimate the transition probabilities from its context instead of memorizing the training patterns. Additionally, we show that more involved encoding of the states enables more robust prediction for Markov chains with structures different than those seen during training.
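A small sketch of the experimental setup the abstract describes: sample random row-stochastic transition matrices, roll out sequences, and form next-token prediction pairs. The Dirichlet prior and the sizes are illustrative assumptions:

```python
# Sketch: training data for in-context Markov chain estimation. Test chains
# would use transition matrices never seen in training, so a transformer that
# generalizes must infer P from its context rather than memorize patterns.
import numpy as np

rng = np.random.default_rng(0)
n_states, seq_len, n_chains = 8, 128, 1000

def sample_chain(rng, n_states, seq_len):
    P = rng.dirichlet(np.ones(n_states), size=n_states)   # random row-stochastic matrix
    states = [rng.integers(n_states)]
    for _ in range(seq_len - 1):
        states.append(rng.choice(n_states, p=P[states[-1]]))
    return np.array(states), P

seqs = [sample_chain(rng, n_states, seq_len)[0] for _ in range(n_chains)]
inputs  = np.stack([s[:-1] for s in seqs])    # (n_chains, seq_len - 1)
targets = np.stack([s[1:]  for s in seqs])    # next-token prediction targets
print(inputs.shape, targets.shape)
```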

[413] FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport

Pengxi Liu, Yi Shen, Matthew M. Engelhard, Benjamin A. Goldstein, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos

Main category: cs.LG

TL;DR: FairPOT is a model-agnostic post-processing framework using optimal transport to balance fairness and AUC performance by selectively transforming risk scores in disadvantaged groups.

DetailsMotivation: Addressing the trade-off between enforcing fairness and maintaining AUC performance in high-stakes domains like healthcare and finance.

Method: Proposes Fair Proportional Optimal Transport (FairPOT), which aligns risk score distributions across groups by transforming a controllable proportion (top-lambda quantile) of scores in disadvantaged groups.

Result: Outperforms existing methods in global and partial AUC scenarios, achieving fairness with minimal AUC degradation or even utility gains.

Conclusion: FairPOT is computationally efficient and adaptable, making it suitable for real-world applications.

Abstract: Fairness metrics utilizing the area under the receiver operating characteristic curve (AUC) have gained increasing attention in high-stakes domains such as healthcare, finance, and criminal justice. In these domains, fairness is often evaluated over risk scores rather than binary outcomes, and a common challenge is that enforcing strict fairness can significantly degrade AUC performance. To address this challenge, we propose Fair Proportional Optimal Transport (FairPOT), a novel, model-agnostic post-processing framework that strategically aligns risk score distributions across different groups using optimal transport, but does so selectively by transforming a controllable proportion, i.e., the top-lambda quantile, of scores within the disadvantaged group. By varying lambda, our method allows for a tunable trade-off between reducing AUC disparities and maintaining overall AUC performance. Furthermore, we extend FairPOT to the partial AUC setting, enabling fairness interventions to concentrate on the highest-risk regions. Extensive experiments on synthetic, public, and clinical datasets show that FairPOT consistently outperforms existing post-processing techniques in both global and partial AUC scenarios, often achieving improved fairness with slight AUC degradation or even positive gains in utility. The computational efficiency and practical adaptability of FairPOT make it a promising solution for real-world deployment.
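In one dimension, optimal transport between score distributions reduces to quantile mapping, which makes the core idea easy to sketch: remap only the top-lambda fraction of the disadvantaged group's scores onto the advantaged group's quantiles. A hedged illustration (the group data and lambda are made up; this is not the authors' implementation):

```python
# Sketch: proportional 1-D score alignment in the spirit of FairPOT.
import numpy as np

def fairpot_like_transform(scores_dis, scores_adv, lam):
    """Map the top-lam fraction of disadvantaged-group scores onto the
    corresponding quantiles of the advantaged group's score distribution."""
    out = scores_dis.copy()
    cutoff = np.quantile(scores_dis, 1.0 - lam)
    idx = np.where(scores_dis >= cutoff)[0]
    # rank each selected score within the selected set, then read off the
    # matching quantile of the advantaged group's scores
    ranks = scores_dis[idx].argsort().argsort()
    q = (1.0 - lam) + lam * (ranks + 0.5) / len(idx)
    out[idx] = np.quantile(scores_adv, q)
    return out

rng = np.random.default_rng(0)
adv = rng.beta(5, 2, 500)          # higher risk scores on average
dis = rng.beta(2, 5, 500)
adjusted = fairpot_like_transform(dis, adv, lam=0.3)
print(dis.mean(), adjusted.mean())   # lam tunes the fairness/AUC trade-off
```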

[414] BubbleONet: A Physics-Informed Neural Operator for High-Frequency Bubble Dynamics

Yunhao Zhang, Lin Cheng, Aswin Gnanaskandan, Ameya D. Jagtap

Main category: cs.LG

TL;DR: BubbleONet is a physics-informed operator learning model for mapping pressure profiles to bubble radius responses, integrating adaptive activation to improve high-frequency feature representation.

DetailsMotivation: To provide a computationally efficient surrogate model for simulating bubble dynamics, leveraging physics-informed learning and addressing spectral bias in deep learning.

Method: Built on PI-DeepONet, BubbleONet uses Rowdy adaptive activation and is evaluated on Rayleigh-Plesset and Keller-Miksis equations for single and multiple initial radii, comparing single-step and two-step training.

Result: Demonstrates effectiveness as a surrogate model for bubble dynamics, outperforming traditional numerical solvers in efficiency.

Conclusion: BubbleONet is a promising, efficient alternative for simulating bubble dynamics, with potential for broader applications in physics-informed operator learning.

Abstract: This paper introduces BubbleONet, an operator learning model designed to map pressure profiles from an input function space to corresponding bubble radius responses. BubbleONet is built upon the physics-informed deep operator network (PI-DeepONet) framework, leveraging DeepONet’s powerful universal approximation capabilities for operator learning alongside the robust physical fidelity provided by the physics-informed neural networks. To mitigate the inherent spectral bias in deep learning, BubbleONet integrates the Rowdy adaptive activation function, enabling improved representation of high-frequency features. The model is evaluated across various scenarios, including: (1) Rayleigh-Plesset equation based bubble dynamics with a single initial radius, (2) Keller-Miksis equation based bubble dynamics with a single initial radius, and (3) Keller-Miksis equation based bubble dynamics with multiple initial radii. Moreover, the performance of single-step versus two-step training techniques for BubbleONet is investigated. The results demonstrate that BubbleONet serves as a promising surrogate model for simulating bubble dynamics, offering a computationally efficient alternative to traditional numerical solvers.
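A sketch of a Rowdy-style adaptive activation in PyTorch: a base activation augmented with trainable sinusoidal terms intended to counter spectral bias. The parameterization below (tanh base, K terms, scale factor n) follows the general Rowdy idea and is an assumption, not BubbleONet's exact setup:

```python
# Sketch: Rowdy-style adaptive activation with trainable high-frequency terms.
import torch
import torch.nn as nn

class RowdyActivation(nn.Module):
    def __init__(self, K: int = 4, n: float = 10.0):
        super().__init__()
        self.n = n
        self.a = nn.Parameter(torch.zeros(K))   # trainable amplitudes; zero init
                                                # means we start as plain tanh

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.tanh(x)
        for k, a_k in enumerate(self.a, start=1):
            out = out + a_k * torch.sin(k * self.n * x)   # high-frequency terms
        return out

layer = nn.Sequential(nn.Linear(1, 64), RowdyActivation(), nn.Linear(64, 1))
print(layer(torch.randn(8, 1)).shape)
```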

[415] Continual Multiple Instance Learning for Hematologic Disease Diagnosis

Zahra Ebrahimi, Raheleh Salehi, Nassir Navab, Carsten Marr, Ario Sadafi

Main category: cs.LG

TL;DR: Proposes a rehearsal-based continual learning method for Multiple Instance Learning (MIL), tailored for leukemia detection, outperforming state-of-the-art methods.

DetailsMotivation: Address the inefficacy of existing continual learning methods for MIL in dynamic environments like leukemia diagnosis, where data distributions shift over time.

Method: Uses instance attention scores and distance metrics to select and store diverse samples from previous tasks, ensuring data diversity.

Result: Outperforms state-of-the-art continual learning methods in a class incremental scenario using real-world leukemia data.

Conclusion: Introduces the first effective continual learning approach for MIL, enabling adaptation to shifting data distributions in clinical settings.

Abstract: The dynamic environment of laboratories and clinics, with streams of data arriving on a daily basis, requires regular updates of trained machine learning models for consistent performance. Continual learning is supposed to help train models without catastrophic forgetting. However, state-of-the-art methods are ineffective for multiple instance learning (MIL), which is often used in single-cell-based hematologic disease diagnosis (e.g., leukemia detection). Here, we propose the first continual learning method tailored specifically to MIL. Our method is rehearsal-based over a selection of single instances from various bags. We use a combination of the instance attention score and distance from the bag mean and class mean vectors to carefully select which samples and instances to store in exemplar sets from previous tasks, preserving the diversity of the data. Using the real-world input of one month of data from a leukemia laboratory, we study the effectiveness of our approach in a class incremental scenario, comparing it to well-known continual learning methods. We show that our method considerably outperforms state-of-the-art methods, providing the first continual learning approach for MIL. This enables the adaptation of models to shifting data distributions over time, such as those caused by changes in disease occurrence or underlying genetic alterations.
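A sketch of the rehearsal-selection idea: score each instance in a bag by its MIL attention and by its distance from the bag mean, then store a small, diverse subset for replay. The scoring combination and budget below are illustrative assumptions:

```python
# Sketch: diversity-aware exemplar selection for rehearsal-based continual MIL.
import numpy as np

def select_exemplars(instances, attention, budget, alpha=0.5):
    """instances: (n, d) features of one bag; attention: (n,) MIL attention scores."""
    bag_mean = instances.mean(axis=0)
    dist = np.linalg.norm(instances - bag_mean, axis=1)
    # favor high-attention instances but also those far from the bag mean,
    # so the stored set covers the bag's diversity
    score = alpha * attention + (1 - alpha) * dist / (dist.max() + 1e-8)
    return instances[np.argsort(score)[-budget:]]

rng = np.random.default_rng(0)
bag = rng.normal(size=(100, 32))
att = rng.random(100)
exemplars = select_exemplars(bag, att, budget=10)
print(exemplars.shape)   # (10, 32): stored for rehearsal in later tasks
```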

[416] Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework

Ajesh Koyatan Chathoth, Shuhao Yu, Stephen Lee

Main category: cs.LG

TL;DR: PrivCLIP is a dynamic, user-controllable framework for privacy-preserving sensing, allowing users to specify and modify privacy preferences for IMU sensor data. It uses contrastive learning and language-guided sanitization to protect sensitive activities while maintaining data utility.

DetailsMotivation: User privacy preferences vary and evolve, especially with IMU sensors in devices like smartphones, which collect sensitive data. Existing methods lack adaptability and user control.

Method: PrivCLIP employs multimodal contrastive learning to align IMU data with natural language descriptions, enabling few-shot detection of sensitive activities. It sanitizes data using a language-guided activity sanitizer and IMU-GPT.

Result: PrivCLIP outperforms baselines in privacy protection and data utility on human activity recognition datasets.

Conclusion: PrivCLIP offers a flexible, user-centric approach to privacy-preserving sensing, balancing privacy and utility effectively.

Abstract: User-controllable privacy is important in modern sensing systems, as privacy preferences can vary significantly from person to person and may evolve over time. This is especially relevant in devices equipped with Inertial Measurement Unit (IMU) sensors, such as smartphones and wearables, which continuously collect rich time-series data that can inadvertently expose sensitive user behaviors. While prior work has proposed privacy-preserving methods for sensor data, most rely on static, predefined privacy labels or require large quantities of private training data, limiting their adaptability and user agency. In this work, we introduce PrivCLIP, a dynamic, user-controllable, few-shot privacy-preserving sensing framework. PrivCLIP allows users to specify and modify their privacy preferences by categorizing activities as sensitive (black-listed), non-sensitive (white-listed), or neutral (gray-listed). Leveraging a multimodal contrastive learning approach, PrivCLIP aligns IMU sensor data with natural language activity descriptions in a shared embedding space, enabling few-shot detection of sensitive activities. When a privacy-sensitive activity is identified, the system uses a language-guided activity sanitizer and a motion generation module (IMU-GPT) to transform the original data into a privacy-compliant version that semantically resembles a non-sensitive activity. We evaluate PrivCLIP on multiple human activity recognition datasets and demonstrate that it significantly outperforms baseline methods in terms of both privacy protection and data utility.
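The alignment step is CLIP-style contrastive learning between IMU windows and activity descriptions. A minimal sketch with a symmetric InfoNCE loss, assuming stand-in encoders that produce the embeddings:

```python
# Sketch: CLIP-style symmetric contrastive loss aligning IMU and text embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(imu_emb, text_emb, temperature=0.07):
    imu_emb = F.normalize(imu_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = imu_emb @ text_emb.t() / temperature     # (B, B) similarity matrix
    labels = torch.arange(len(imu_emb))               # matched pairs on diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

imu_emb = torch.randn(16, 128)    # from an IMU encoder (assumed)
text_emb = torch.randn(16, 128)   # from a text encoder over activity descriptions
print(clip_style_loss(imu_emb, text_emb).item())
```

In a shared embedding space like this, a new user-specified sensitive activity can be detected few-shot by comparing an IMU window's embedding with the embedding of its text description.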

[417] Tensorized Clustered LoRA Merging for Multi-Task Interference

Zhan Su, Fengran Mo, Guojun Liang, Jinghan Zhang, Bingbing Wen, Prayag Tiwari, Jian-Yun Nie

Main category: cs.LG

TL;DR: TC-LoRA addresses task interference in multi-task LoRA adapters by clustering training samples and using CP decomposition for parameter disentanglement, improving accuracy.

DetailsMotivation: Task interference in merged LoRA adapters degrades performance in multi-task settings.

Method: Clusters training samples for text-level similarity and applies CP decomposition for parameter-level disentanglement.

Result: TC-LoRA improves accuracy by +1.4% on Phi-3 and +2.3% on Mistral-7B.

Conclusion: TC-LoRA effectively reduces interference and enhances LLM adaptation.

Abstract: Despite the success of the monolithic dense paradigm of large language models (LLMs), LoRA adapters offer an efficient alternative by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes task interference, degrading downstream performance. To address this, we propose a tensorized clustered LoRA (TC-LoRA) library targeting task interference at both the text level and the parameter level. At the text level, we cluster the training samples in the embedding space to capture input-format similarities, then train a specialized LoRA adapter for each cluster. At the parameter level, we introduce a joint Canonical Polyadic (CP) decomposition that disentangles task-specific and shared factors across LoRA adapters. This joint factorization preserves essential knowledge while reducing cross-task interference. Extensive experiments on out-of-domain zero-shot and skill-composition tasks - including reasoning, question answering, and coding - show that, compared to strong SVD-based baselines, TC-LoRA achieves +1.4% accuracy on Phi-3 and +2.3% on Mistral-7B, demonstrating its effectiveness in LLM adaptation.
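A sketch of the parameter-level step: stack the per-cluster LoRA update matrices into a three-way tensor and jointly factorize it with a CP decomposition, so one mode carries task-specific coefficients while the other factor matrices are shared. Shapes, rank, and the use of tensorly are illustrative assumptions:

```python
# Sketch: joint CP decomposition over a stack of LoRA delta-W matrices.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend("numpy")
n_tasks, d_out, d_in, rank = 4, 64, 32, 8

# Delta_W for each cluster-specific LoRA adapter (random placeholders here)
lora_deltas = np.stack([np.random.randn(d_out, d_in) for _ in range(n_tasks)])

cp = parafac(tl.tensor(lora_deltas), rank=rank)     # tensor shape (tasks, out, in)
weights, (task_factor, out_factor, in_factor) = cp
print(task_factor.shape, out_factor.shape, in_factor.shape)  # (4,8) (64,8) (32,8)

# task_factor carries task-specific coefficients; the out/in factor matrices are
# shared across tasks, which is the interference-reducing structure exploited here.
recon = tl.cp_to_tensor(cp)
print(np.linalg.norm(recon - lora_deltas) / np.linalg.norm(lora_deltas))
```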

[418] Decoupled Contrastive Learning for Federated Learning

Hyungbin Kim, Incheol Baek, Yon Dohn Chung

Main category: cs.LG

TL;DR: DCFL introduces a decoupled contrastive learning framework for federated learning, addressing data heterogeneity and outperforming existing methods.

DetailsMotivation: Federated learning suffers from performance degradation due to data heterogeneity, and existing contrastive learning methods rely on unrealistic asymptotic assumptions.

Method: DCFL decouples contrastive loss into alignment and uniformity objectives, enabling independent calibration of attraction and repulsion forces.

Result: DCFL achieves better alignment and uniformity, outperforming state-of-the-art methods on benchmarks like CIFAR-10, CIFAR-100, and Tiny-ImageNet.

Conclusion: DCFL provides a practical and effective contrastive learning solution for federated learning with limited client data.

Abstract: Federated learning is a distributed machine learning paradigm that allows multiple participants to train a shared model by exchanging model updates instead of their raw data. However, its performance is degraded compared to centralized approaches due to data heterogeneity across clients. While contrastive learning has emerged as a promising approach to mitigate this, our theoretical analysis reveals a fundamental conflict: its asymptotic assumptions of an infinite number of negative samples are violated in the finite-sample regime of federated learning. To address this issue, we introduce Decoupled Contrastive Learning for Federated Learning (DCFL), a novel framework that decouples the existing contrastive loss into two objectives. Decoupling the loss into its alignment and uniformity components enables the independent calibration of the attraction and repulsion forces without relying on the asymptotic assumptions. This strategy provides a contrastive learning method suitable for federated learning environments where each client has a small amount of data. Our experimental results show that DCFL achieves stronger alignment between positive samples and greater uniformity between negative samples compared to existing contrastive learning methods. Furthermore, experimental results on standard benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet, demonstrate that DCFL consistently outperforms state-of-the-art federated learning methods.
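A sketch of the decoupling: replace a single coupled contrastive loss with separate alignment and uniformity terms (in the standard Wang-Isola formulation) whose weights can be calibrated independently. The specific weighting below is an assumption:

```python
# Sketch: alignment + uniformity objectives with independent weights.
import torch
import torch.nn.functional as F

def alignment_loss(x, y):
    # x, y: L2-normalized embeddings of positive pairs, shape (B, d)
    return (x - y).norm(dim=1).pow(2).mean()

def uniformity_loss(x, t=2.0):
    # log of the average Gaussian potential over all pairs; lower = more uniform
    return torch.pdist(x).pow(2).mul(-t).exp().mean().log()

x = F.normalize(torch.randn(64, 128), dim=-1)
y = F.normalize(x + 0.1 * torch.randn(64, 128), dim=-1)
lam_align, lam_unif = 1.0, 0.5   # calibrated independently, per DCFL's motivation
loss = lam_align * alignment_loss(x, y) + lam_unif * uniformity_loss(x)
print(loss.item())
```

Because neither term depends on a large pool of in-batch negatives, the objective remains well-behaved with the small per-client batches typical of federated learning.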

[419] A Comparative Survey of PyTorch vs TensorFlow for Deep Learning: Usability, Performance, and Deployment Trade-offs

Zakariya Ba Alawi

Main category: cs.LG

TL;DR: A comparative survey of TensorFlow and PyTorch, analyzing usability, performance, deployment, and ecosystem trade-offs, highlighting their distinct strengths for research and production.

DetailsMotivation: To provide a detailed comparison of TensorFlow and PyTorch, helping practitioners choose the right framework based on their needs in research or production.

Method: Review and contrast programming paradigms, performance benchmarks, deployment tools, and ecosystem support, supported by empirical data and references.

Result: PyTorch excels in research due to simplicity and flexibility, while TensorFlow offers a robust production ecosystem. Both have trade-offs in usability and performance.

Conclusion: Understanding the trade-offs between PyTorch and TensorFlow is crucial for selecting the right framework, with PyTorch favored in research and TensorFlow in production.

Abstract: This paper presents a comprehensive comparative survey of TensorFlow and PyTorch, the two leading deep learning frameworks, focusing on their usability, performance, and deployment trade-offs. We review each framework’s programming paradigm and developer experience, contrasting TensorFlow’s graph-based (now optionally eager) approach with PyTorch’s dynamic, Pythonic style. We then compare model training speeds and inference performance across multiple tasks and data regimes, drawing on recent benchmarks and studies. Deployment flexibility is examined in depth - from TensorFlow’s mature ecosystem (TensorFlow Lite for mobile/embedded, TensorFlow Serving, and JavaScript support) to PyTorch’s newer production tools (TorchScript compilation, ONNX export, and TorchServe). We also survey ecosystem and community support, including library integrations, industry adoption, and research trends (e.g., PyTorch’s dominance in recent research publications versus TensorFlow’s broader tooling in enterprise). Applications in computer vision, natural language processing, and other domains are discussed to illustrate how each framework is used in practice. Finally, we outline future directions and open challenges in deep learning framework design, such as unifying eager and graph execution, improving cross-framework interoperability, and integrating compiler optimizations (XLA, JIT) for improved speed. Our findings indicate that while both frameworks are highly capable for state-of-the-art deep learning, they exhibit distinct trade-offs: PyTorch offers simplicity and flexibility favored in research, whereas TensorFlow provides a fuller production-ready ecosystem - understanding these trade-offs is key for practitioners selecting the appropriate tool. We include charts, code snippets, and more than 20 references to academic papers and official documentation to support this comparative analysis.
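To make the usability contrast concrete, here is a tiny side-by-side of the two execution styles the survey discusses: PyTorch's define-by-run eager mode versus TensorFlow staging the same computation into a graph via tf.function:

```python
# Sketch: both snippets compute y = x^2 + 1.
import torch
import tensorflow as tf

# PyTorch: eager, define-by-run - operations execute immediately
x = torch.tensor([1.0, 2.0, 3.0])
y = x ** 2 + 1
print(y)

# TensorFlow: the same function, optionally compiled into a graph
@tf.function
def f(x):
    return x ** 2 + 1

print(f(tf.constant([1.0, 2.0, 3.0])))
```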

[420] FeDaL: Federated Dataset Learning for Time Series Foundation Models

Shengchao Chen, Guodong Long, Jing Jiang

Main category: cs.LG

TL;DR: FeDaL introduces federated learning to address dataset-wise heterogeneity in Time Series Foundation Models, using Domain and Global Bias Elimination to improve generalization.

DetailsMotivation: Dataset-wise heterogeneity degrades generalization in TSFMs, and federated learning offers a natural solution to decompose and mitigate biases.

Method: Proposes Federated Dataset Learning (FeDaL) with Domain Bias Elimination (DBE) and Global Bias Elimination (GBE) mechanisms.

Result: Evaluated on eight tasks with 54 baselines, showing improved cross-dataset generalization.

Conclusion: FeDaL effectively mitigates biases and enhances generalization in TSFMs, with insights into federated scaling behavior.

Abstract: Dataset-wise heterogeneity introduces significant domain biases that fundamentally degrade generalization on Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethinks the development of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a natural way to decompose heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL’s cross-dataset generalization has been extensively evaluated on real-world datasets spanning eight tasks, including both representation learning and downstream time series analysis, against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.

[421] Quantum Temporal Fusion Transformer

Krishnakanta Barik, Goutam Paul

Main category: cs.LG

TL;DR: The paper introduces QTFT, a quantum-enhanced version of TFT, showing improved or comparable performance in time series forecasting, suitable for NISQ devices.

DetailsMotivation: To extend the classical TFT framework with quantum enhancements for better forecasting performance and compatibility with current quantum hardware.

Method: Proposes QTFT, a hybrid quantum-classical architecture based on variational quantum algorithms, tested on forecasting datasets.

Result: QTFT outperforms or matches classical TFT in training and test loss, demonstrating feasibility on NISQ devices.

Conclusion: QTFT is a promising quantum-enhanced alternative for time series forecasting, leveraging current quantum technology.

Abstract: The Temporal Fusion Transformer (TFT), proposed by Lim et al. [\textit{International Journal of Forecasting}, 2021], is a state-of-the-art attention-based deep neural network architecture specifically designed for multi-horizon time series forecasting. It has demonstrated significant performance improvements over existing benchmarks. In this work, we propose a Quantum Temporal Fusion Transformer (QTFT), a quantum-enhanced hybrid quantum-classical architecture that extends the capabilities of the classical TFT framework. Our results demonstrate that QTFT is successfully trained on the forecasting datasets and is capable of accurately predicting future values. In particular, our experimental results show that in certain test cases, the model outperforms its classical counterpart in terms of both training and test loss, while in the remaining cases, it achieves comparable performance. A key advantage of our approach lies in its foundation on a variational quantum algorithm, enabling implementation on current noisy intermediate-scale quantum (NISQ) devices without strict requirements on the number of qubits or circuit depth.

[422] Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading

Joel Walsh, Siddarth Mamidanna, Benjamin Nye, Mark Core, Daniel Auerbach

Main category: cs.LG

TL;DR: The paper evaluates fine-tuning methods (OpenAI’s service and QLORA) for Automated Short Answer Grading (ASAG), comparing them to few-shot prompting. Findings suggest limited utility for open-weight models but better performance for closed models, with domain impact and synthetic data benefits noted.

DetailsMotivation: To explore cost-effective fine-tuning methods for ASAG, addressing the limitations of large-scale compute requirements and comparing closed vs. open-weight models.

Method: Evaluated OpenAI’s fine-tuning service and QLORA on consumer GPUs, testing their interaction with few-shot prompting for ASAG using structured outputs.

Result: Fine-tuning with small data has limited utility for open-weight models but outperforms few-shot baselines for closed models. Synthetic data significantly improves open-weight model performance.

Conclusion: Fine-tuning methods show promise for ASAG, with closed models benefiting more. Domain and synthetic data play key roles in effectiveness.

Abstract: Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and no- or few-shot prompting to achieve best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI’s fine-tuning service promise results with as few as 100 examples, while methods using open weights such as quantized low-rank adaptation (QLoRA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that fine-tuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning methods can outperform few-shot baseline instruction-tuned LLMs for OpenAI’s closed models. While our evaluation set is limited, we find some evidence that the observed benefits of fine-tuning may be impacted by the domain subject matter. Lastly, we observed dramatic improvement with the Llama 3.1 8B-Instruct open-weight model by seeding the initial training examples with a significant amount of cheaply generated synthetic training data.
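A hedged sketch of the open-weights QLoRA setup the paper evaluates: load a 4-bit quantized base model and attach low-rank adapters via peft. The model name and hyperparameters are illustrative assumptions, not the paper's recipe:

```python
# Sketch: QLoRA-style fine-tuning setup on a consumer GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", quantization_config=bnb
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # only the adapter weights train
```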

[423] FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning

Tuan Nguyen, Khoa D Doan, Kok-Seng Wong

Main category: cs.LG

TL;DR: FLAT introduces a latent-driven conditional autoencoder for flexible, diverse backdoor attacks in federated learning, evading detection and achieving high success.

DetailsMotivation: Existing backdoor attacks in FL are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect.

Method: FLAT uses a latent-driven conditional autoencoder to generate diverse, target-specific triggers, enabling arbitrary target selection without retraining.

Result: FLAT achieves high attack success, remains robust against defenses, and demonstrates stealth and diversity.

Conclusion: FLAT highlights the need for new defenses against latent-driven, multi-target backdoor threats in federated learning.

Abstract: Federated learning (FL) is vulnerable to backdoor attacks, yet most existing methods are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect. We propose FLAT (FL Arbitrary-Target Attack), a novel backdoor attack that leverages a latent-driven conditional autoencoder to generate diverse, target-specific triggers as needed. By introducing a latent code, FLAT enables the creation of visually adaptive and highly variable triggers, allowing attackers to select arbitrary targets without retraining and to evade conventional detection mechanisms. Our approach unifies attack success, stealth, and diversity within a single framework, introducing a new level of flexibility and sophistication to backdoor attacks in FL. Extensive experiments show that FLAT achieves high attack success and remains robust against advanced FL defenses. These results highlight the urgent need for new defense strategies to address latent-driven, multi-target backdoor threats in federated settings.

[424] Adversarial Fair Multi-View Clustering

Mudi Jiang, Jiahui Zhou, Lianyu Hu, Xinying Liu, Zengyou He, Zhikui Chen

Main category: cs.LG

TL;DR: The paper proposes AFMVC, an adversarial fair multi-view clustering framework, to integrate fairness into representation learning, achieving superior fairness and competitive clustering performance.

DetailsMotivation: Existing multi-view clustering methods overlook fairness, a critical concern in human-centered applications, and rely on assumptions that often fail in practice.

Method: AFMVC uses adversarial training to remove sensitive attribute information from features and aligns view-specific clustering assignments with a fairness-invariant consensus distribution via KL divergence.

Result: AFMVC outperforms existing methods in fairness and maintains competitive clustering performance, validated by extensive experiments.

Conclusion: The framework provides theoretical guarantees and practical benefits for fair multi-view clustering, addressing a significant gap in current research.

Abstract: Cluster analysis is a fundamental problem in data mining and machine learning. In recent years, multi-view clustering has attracted increasing attention due to its ability to integrate complementary information from multiple views. However, existing methods primarily focus on clustering performance, while fairness - a critical concern in human-centered applications - has been largely overlooked. Although recent studies have explored group fairness in multi-view clustering, most methods impose explicit regularization on cluster assignments, relying on the alignment between sensitive attributes and the underlying cluster structure. However, this assumption often fails in practice and can degrade clustering performance. In this paper, we propose an adversarial fair multi-view clustering (AFMVC) framework that integrates fairness learning into the representation learning process. Specifically, our method employs adversarial training to fundamentally remove sensitive attribute information from learned features, ensuring that the resulting cluster assignments are unaffected by it. Furthermore, we theoretically prove that aligning view-specific clustering assignments with a fairness-invariant consensus distribution via KL divergence preserves clustering consistency without significantly compromising fairness, thereby providing additional theoretical guarantees for our framework. Extensive experiments on data sets with fairness constraints demonstrate that AFMVC achieves superior fairness and competitive clustering performance compared to existing multi-view clustering and fairness-aware clustering methods.
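A sketch of the adversarial component: a gradient-reversal layer lets an adversary try to predict the sensitive attribute from the learned features, while the reversed gradients train the encoder to remove that information. Network sizes are illustrative assumptions:

```python
# Sketch: gradient reversal for adversarially removing sensitive-attribute info.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # flip gradients into the encoder

encoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 16))
adversary = nn.Sequential(nn.Linear(16, 2))   # predicts the sensitive attribute

x = torch.randn(32, 20)
s = torch.randint(0, 2, (32,))                # sensitive attribute labels
z = encoder(x)
adv_logits = adversary(GradReverse.apply(z, 1.0))
adv_loss = nn.functional.cross_entropy(adv_logits, s)
adv_loss.backward()   # adversary improves; the encoder receives reversed gradients
```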

[425] Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?

Ngoc-Bao Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung

Main category: cs.LG

TL;DR: The paper investigates model inversion (MI) attacks on vision-language models (VLMs), proposing novel token-based and sequence-based methods to reconstruct private training data, demonstrating VLMs’ vulnerability to such attacks.

DetailsMotivation: Prior work focused on unimodal DNNs, leaving VLMs' privacy risks underexplored. This study aims to address this gap by evaluating VLMs' susceptibility to training data leakage.

Method: Proposes Token-based Model Inversion (TMI), Convergent TMI (TMI-C), Sequence-based Model Inversion (SMI), and SMI with Adaptive Token Weighting (SMI-AW). Evaluates these methods on state-of-the-art VLMs and datasets.

Result: Sequence-based methods, especially SMI-AW, outperform token-based methods in reconstruction quality and attack accuracy (75.31% in human evaluation). Demonstrates successful attacks on publicly released VLMs.

Conclusion: VLMs are vulnerable to model inversion attacks, posing significant privacy risks, especially as their use grows in sensitive fields like healthcare and finance.

Abstract: Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior works have focused on conventional unimodal DNNs, the vulnerability of vision-language models (VLMs) remains underexplored. In this paper, we conduct the first study to understand VLMs’ vulnerability in leaking private visual training data. Tailored to VLMs’ token-based generative nature, we propose a suite of novel token-based and sequence-based model inversion strategies. Particularly, we propose Token-based Model Inversion (TMI), Convergent Token-based Model Inversion (TMI-C), Sequence-based Model Inversion (SMI), and Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW). Through extensive experiments and a user study on three state-of-the-art VLMs and multiple datasets, we demonstrate, for the first time, that VLMs are susceptible to training data leakage. The experiments show that our proposed sequence-based methods, particularly SMI-AW combined with a logit-maximization loss based on vocabulary representation, can achieve competitive reconstruction and outperform token-based methods in attack accuracy and visual similarity. Importantly, human evaluation of the reconstructed images yields an attack accuracy of 75.31%, underscoring the severity of model inversion threats in VLMs. Notably, we also demonstrate inversion attacks on publicly released VLMs. Our study reveals the privacy vulnerability of VLMs as they become increasingly popular across many applications such as healthcare and finance.

[426] COPO: Consistency-Aware Policy Optimization

Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang

Main category: cs.LG

TL;DR: The paper proposes a consistency-aware policy optimization framework to address vanishing gradients in reinforcement learning for LLMs, improving training efficiency and performance.

DetailsMotivation: The challenge of vanishing gradients when multiple responses under a single prompt converge to identical outcomes, limiting learning effectiveness.

Method: Introduces a structured global reward based on outcome consistency and an entropy-based soft blending mechanism for adaptive optimization.

Result: Substantial performance gains on multiple mathematical reasoning benchmarks, demonstrating robustness and general applicability.

Conclusion: The framework effectively enhances learning signals and balances exploration and convergence, validated by improved benchmark results.

Abstract: Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency; the global loss built on it ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, encouraging the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework’s robustness and general applicability. Code of this work has been released at https://github.com/hijih/copo-code.git.
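A sketch of the failure mode and this style of fix: when all sampled responses in a group agree, the group-relative advantage is identically zero, so blending in a global consistency-based reward, weighted by the group's outcome entropy, restores a learning signal. The formulas below are an illustrative reading of the idea, not the paper's exact definitions:

```python
# Sketch: entropy-weighted blend of local (group-relative) and global signals.
import numpy as np

def blended_advantage(rewards):
    """rewards: binary correctness rewards for one prompt's sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    local = rewards - rewards.mean()              # GRPO-style group advantage
    p = rewards.mean()                            # fraction correct in the group
    entropy = 0.0 if p in (0.0, 1.0) else -(p*np.log(p) + (1-p)*np.log(1-p))
    w = entropy / np.log(2)                       # 0 when the group is fully consistent
    global_reward = 2 * p - 1                     # consistency-based global signal
    return w * local + (1 - w) * global_reward

print(blended_advantage([1, 1, 1, 1]))   # consistent group still gets a signal
print(blended_advantage([1, 0, 1, 0]))   # mixed group keeps its local advantages
```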

[427] Semi-Supervised Deep Domain Adaptation for Predicting Solar Power Across Different Locations

Md Shazid Islam, A S M Jahid Hasan, Md Saydur Rahman, Md Saiful Islam Sajol

Main category: cs.LG

TL;DR: A semi-supervised deep domain adaptation framework improves solar generation prediction accuracy across diverse locations with minimal labeled data.

DetailsMotivation: Addressing domain shift in solar prediction due to varying weather conditions and lack of labeled data.

Method: Uses a teacher-student model with consistency and cross-entropy loss for semi-supervised adaptation.

Result: Improves prediction accuracy by up to 11.36%, 6.65%, and 4.92% for California, Florida, and New York, respectively, with only 20% labeled target data.

Conclusion: The framework effectively adapts models to new locations without requiring source data, enhancing solar prediction accuracy.

Abstract: Accurate solar generation prediction is essential for proper estimation of renewable energy resources across diverse geographic locations. However, geographical and weather features vary from location to location, which introduces domain shift - a major bottleneck in developing location-agnostic prediction models. As a result, a machine-learning model which can perform well to predict solar power in one location may exhibit subpar performance in another location. Moreover, the lack of properly labeled data and storage issues make the task even more challenging. In order to address domain shift due to varying weather conditions across different meteorological regions, this paper presents a semi-supervised deep domain adaptation framework, allowing accurate predictions with minimal labeled data from the target location. Our approach involves training a deep convolutional neural network on a source location’s data and adapting it to the target location using a source-free, teacher-student model configuration. The teacher-student model leverages consistency and cross-entropy loss for semi-supervised learning, ensuring effective adaptation without any source data requirement for prediction. With annotation of only 20% of the data in the target domain, our approach exhibits an improvement of up to 11.36%, 6.65%, and 4.92% for California, Florida, and New York as target domains, respectively, in prediction accuracy compared to a non-adaptive approach.
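A sketch of a source-free teacher-student adaptation step: an EMA teacher supplies consistency targets on unlabeled target-location data, while the few labeled target samples provide a supervised loss. The paper pairs consistency with cross-entropy; the regression-style MSE losses, architecture, and weights below are assumptions for illustration:

```python
# Sketch: one teacher-student adaptation step with an EMA teacher.
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Sequential(torch.nn.Linear(24, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 1))   # pretrained on the source
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x_unlabeled = torch.randn(64, 24)                 # target-location features
x_labeled, y_labeled = torch.randn(16, 24), torch.randn(16, 1)  # the ~20% labeled

consistency = F.mse_loss(student(x_unlabeled), teacher(x_unlabeled))
supervised = F.mse_loss(student(x_labeled), y_labeled)
loss = supervised + 0.5 * consistency
opt.zero_grad()
loss.backward()
opt.step()

# EMA update keeps the teacher a slow-moving average of the student;
# no source data is needed anywhere in the loop.
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.99).add_(0.01 * ps)
```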

[428] One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen

Main category: cs.LG

TL;DR: A two-stage pipeline using MIST and MolForge for molecular generation from mass spectra achieves a tenfold improvement over prior methods, with top-1 28% and top-10 36% accuracy.

DetailsMotivation: To enhance de novo molecular generation from mass spectra by combining a robust encoder (MIST) and decoder (MolForge) with pretraining and thresholding techniques.

Method: Uses MIST to encode mass spectra into fingerprints and MolForge to decode fingerprints into structures, with pretraining and thresholding for improved performance.

Result: Achieves tenfold improvement over state-of-the-art, with 28% top-1 and 36% top-10 accuracy in molecular structure recovery.

Conclusion: The pipeline sets a strong baseline for future research in de novo molecule elucidation from mass spectra.

Abstract: A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST [Goldman et al., 2023] as the encoder and MolForge [Ucak et al., 2023] as the decoder, leveraging pretraining to enhance performance. Notably, pretraining MolForge proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities as a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by MIST only moderately resemble the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, generating the correct molecular structure from mass spectra in 28% of cases at top-1 and 36% at top-10. We position this pipeline as a strong baseline for future research in de novo molecule elucidation from mass spectra.

[429] Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes

Chengcheng Yan, Jiawei Xu, Zheng Peng, Qingsong Wang

Main category: cs.LG

TL;DR: SAMT is a novel method for training deep neural networks using block-wise alternating updates and adaptive step sizes, improving stability and efficiency.

DetailsMotivation: Standard methods like SGD require simultaneous updates, leading to unstable convergence and high computational costs.

Method: SAMT updates parameters in blocks, decomposing the problem into sub-problems, and uses meta-learning for adaptive step sizes.

Result: SAMT achieves better generalization with fewer updates and provides theoretical convergence guarantees.

Conclusion: SAMT is effective and efficient for neural network optimization, outperforming state-of-the-art methods.

Abstract: The training of deep neural networks is inherently a nonconvex optimization problem, yet standard approaches such as stochastic gradient descent (SGD) require simultaneous updates to all parameters, often leading to unstable convergence and high computational cost. To address these issues, we propose a novel method, Stochastic Alternating Minimization with Trainable Step Sizes (SAMT), which updates network parameters in an alternating manner by treating the weights of each layer as a block. By decomposing the overall optimization into sub-problems corresponding to different blocks, this block-wise alternating strategy reduces per-step computational overhead and enhances training stability in nonconvex settings. To fully leverage these benefits, inspired by meta-learning, we propose a novel adaptive step size strategy incorporated into the sub-problem solving steps of the alternating updates. It supports different types of trainable step sizes, including but not limited to scalar, element-wise, row-wise, and column-wise, enabling adaptive step size selection tailored to each block via meta-learning. We further provide a theoretical convergence guarantee for the proposed algorithm, establishing its optimization soundness. Extensive experiments on multiple benchmarks demonstrate that SAMT achieves better generalization performance with fewer parameter updates compared to state-of-the-art methods, highlighting its effectiveness and potential in neural network optimization.
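A sketch of the block-wise alternating update: treat each layer's weights as a block, compute gradients only for that block, and step it with its own step size. A fixed per-block scalar stands in here for SAMT's meta-learned trainable step sizes, which is a deliberate simplification:

```python
# Sketch: one sweep of block-wise alternating minimization over a small MLP.
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
blocks = [list(m.parameters()) for m in model
          if any(True for _ in m.parameters())]          # one block per layer
step_sizes = [torch.tensor(1e-2) for _ in blocks]        # meta-learned in SAMT

x, y = torch.randn(64, 10), torch.randn(64, 1)
for params, lr in zip(blocks, step_sizes):
    loss = torch.nn.functional.mse_loss(model(x), y)     # recomputed per block
    grads = torch.autograd.grad(loss, params)            # only this block's grads
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g                                  # one alternating sub-step
```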

[430] Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction

Ruike Song, Zeen Song, Huijie Guo, Wenwen Qiang

Main category: cs.LG

TL;DR: Causal Reward Adjustment (CRA) mitigates reward hacking in external reasoning systems by correcting confounding semantic features, improving accuracy without retraining.

DetailsMotivation: Reward hacking in process reward models (PRMs) leads to incorrect answers due to high-scoring but logically flawed reasoning paths.

Method: CRA uses sparse autoencoders on PRM activations to recover interpretable features and applies backdoor adjustment to correct confounding.

Result: Experiments show CRA reduces reward hacking and enhances accuracy in mathematical problem-solving tasks.

Conclusion: CRA effectively addresses reward hacking without modifying the policy model or retraining PRMs.

Abstract: External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where high-scoring but logically incorrect paths are assigned high scores by the PRMs, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM’s internal activations to recover interpretable features, then corrects confounding by using backdoor adjustment. Experiments on math solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without modifying the policy model or retraining PRM.

[431] Symmetric Behavior Regularization via Taylor Expansion of Symmetry

Lingwei Zhu, Zheng Chen, Han Wang, Yukie Nagai

Main category: cs.LG

TL;DR: The paper introduces symmetric divergences to BRPO for offline RL, addressing challenges with analytic policies and numerical issues via Taylor series, proposing S$f$-AC, which performs competitively.

DetailsMotivation: Existing methods use asymmetric divergences like KL, but symmetric divergences are underexplored despite potential benefits.

Method: Uses Taylor series of $f$-divergence to derive analytic policies and decompose symmetric divergences to address numerical issues.

Result: Proposes S$f$-AC, the first practical BRPO algorithm with symmetric divergences, showing competitive performance in experiments.

Conclusion: Symmetric divergences can be effectively integrated into BRPO, with S$f$-AC demonstrating practical viability.

Abstract: This paper introduces symmetric divergences to behavior regularization policy optimization (BRPO) to establish a novel offline RL framework. Existing methods focus on asymmetric divergences such as KL to obtain analytic regularized policies and a practical minimization objective. We show that symmetric divergences do not permit an analytic policy as regularization and can incur numerical issues as loss. We tackle these challenges by the Taylor series of $f$-divergence. Specifically, we prove that an analytic policy can be obtained with a finite series. For loss, we observe that symmetric divergences can be decomposed into an asymmetry and a conditional symmetry term, Taylor-expanding the latter alleviates numerical issues. Summing together, we propose Symmetric $f$ Actor-Critic (S$f$-AC), the first practical BRPO algorithm with symmetric divergences. Experimental results on distribution approximation and MuJoCo verify that S$f$-AC performs competitively.
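For reference, the generic identity the method builds on: the f-divergence and the Taylor expansion of f about t = 1, whose truncation yields a finite-series surrogate. This states only the standard form, not the paper's specific asymmetry/conditional-symmetry decomposition:

```latex
% f-divergence and its Taylor expansion about t = 1 (with the usual convention
% f(1) = 0); truncating at order K gives a finite-series approximation.
\[
  D_f(p \,\|\, q) \;=\; \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right),
  \qquad
  f(t) \;=\; \sum_{k=1}^{\infty} \frac{f^{(k)}(1)}{k!}\,(t-1)^k ,
\]
\[
  D_f(p \,\|\, q) \;\approx\; \sum_{k=2}^{K} \frac{f^{(k)}(1)}{k!}
  \sum_x q(x)\!\left(\frac{p(x)}{q(x)} - 1\right)^{\!k},
\]
% where the k = 1 term vanishes because
% sum_x q(x) (p(x)/q(x) - 1) = sum_x p(x) - sum_x q(x) = 0.
```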

[432] Empowering Time Series Forecasting with LLM-Agents

Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng

Main category: cs.LG

TL;DR: DCATS, a data-centric agent for time series, improves forecasting by cleaning data using metadata, reducing error by 6%.

DetailsMotivation: Existing AutoML focuses on feature engineering and model architecture, but lightweight models in time series forecasting suggest data quality improvement could be more impactful.

Method: Proposes DCATS, which leverages metadata to clean time series data while optimizing forecasting performance. Evaluated on traffic volume data with four models.

Result: Achieves an average 6% error reduction across all tested models and time horizons.

Conclusion: Data-centric approaches like DCATS show promise for AutoML in time series forecasting.

Abstract: Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.

[433] Automated ultrasound doppler angle estimation using deep learning

Nilesh Patil, Ajay Anand

Main category: cs.LG

TL;DR: A deep learning-based method for automated Doppler angle estimation in ultrasound images is proposed, achieving clinically acceptable accuracy.

DetailsMotivation: Incorrect angle estimation in Doppler ultrasound leads to errors in blood velocity measurements, necessitating an automated solution.

Method: Utilized 2100 carotid ultrasound images with augmentation, five pre-trained models for feature extraction, and a custom shallow network for angle estimation.

Result: Mean absolute error ranged from 3.9° to 9.4°, with the best model outperforming clinical error thresholds.

Conclusion: The approach shows promise for clinical implementation, potentially reducing errors in blood velocity measurements.

Abstract: Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract images features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep-learning based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners.

[434] T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion

Abdul Monaf Chowdhury, Rabeya Akter, Safaeid Hossain Arib

Main category: cs.LG

TL;DR: T3Time is a trimodal framework for MTSF, combining time, spectral, and prompt branches with adaptive gating and cross-modal alignment, outperforming SOTA methods.

DetailsMotivation: Current MTSF methods lack adaptability across forecast horizons and ignore intervariable interactions, limiting nuanced relationship capture.

Method: Proposes T3Time with time, spectral, and prompt branches, adaptive gating, and dynamic cross-modal alignment heads.

Result: Achieves 3.28% MSE and 2.29% MAE reduction, with strong few-shot learning performance (4.13% MSE and 1.91% MAE reduction with 5% data).

Conclusion: T3Time effectively addresses horizon-specific relationship capture and outperforms existing methods in MTSF.

Abstract: Multivariate time series forecasting (MTSF) seeks to model temporal dynamics among variables to predict future trends. Transformer-based models and large language models (LLMs) have shown promise due to their ability to capture long-range dependencies and patterns. However, current methods often rely on rigid inductive biases, ignore intervariable interactions, or apply static fusion strategies that limit adaptability across forecast horizons. These limitations create bottlenecks in capturing nuanced, horizon-specific relationships in time-series data. To solve this problem, we propose T3Time, a novel trimodal framework consisting of time, spectral, and prompt branches, where the dedicated frequency encoding branch captures the periodic structures along with a gating mechanism that learns prioritization between temporal and spectral features based on the prediction horizon. We also propose a mechanism that adaptively aggregates multiple cross-modal alignment heads by dynamically weighting the importance of each head based on the features. Extensive experiments on benchmark datasets demonstrate that our model consistently outperforms state-of-the-art baselines, achieving an average reduction of 3.28% in MSE and 2.29% in MAE. Furthermore, it shows strong generalization in few-shot learning settings: with 5% training data, we see a reduction in MSE and MAE by 4.13% and 1.91%, respectively; and with 10% data, by 3.62% and 1.98% on average. Code - https://github.com/monaf-chowdhury/T3Time/

[435] A Visual Tool for Interactive Model Explanation using Sensitivity Analysis

Manuela Schuler

Main category: cs.LG

TL;DR: SAInT is a Python tool for visually exploring ML models using local and global sensitivity analysis, supporting HITL workflows without programming.

DetailsMotivation: To help AI researchers and domain experts understand ML model behavior through interactive exploration and explanation.

Method: Automates model training/selection, provides global feature attribution (variance-based), and per-instance explanations (LIME/SHAP).

Result: Demonstrated on Titanic dataset, showing sensitivity analysis aids feature selection and data refinement.

Conclusion: SAInT effectively bridges the gap between ML models and human understanding through visual and interactive tools.

Abstract: We present SAInT, a Python-based tool for visually exploring and understanding the behavior of Machine Learning (ML) models through integrated local and global sensitivity analysis. Our system supports Human-in-the-Loop (HITL) workflows by enabling users - both AI researchers and domain experts - to configure, train, evaluate, and explain models through an interactive graphical interface without programming. The tool automates model training and selection, provides global feature attribution using variance-based sensitivity analysis, and offers per-instance explanation via LIME and SHAP. We demonstrate the system on a classification task predicting survival on the Titanic dataset and show how sensitivity information can guide feature selection and data refinement.

[436] Mockingbird: How does LLM perform in general machine learning tasks?

Haoyu Jia, Yoshiki Obinata, Kento Kawaharazuka, Kei Okada

Main category: cs.LG

TL;DR: The paper introduces Mockingbird, a framework adapting LLMs for general machine learning tasks, showing acceptable results but highlighting limitations in self-reflection compared to expert feedback.

DetailsMotivation: The study explores the potential of LLMs beyond chat bots, driven by curiosity about their expanding reasoning capabilities and inference speed.

Method: Proposes Mockingbird, a framework where LLMs role-play functions and self-reflect on mistakes to improve performance on general machine learning tasks.

Result: LLM-driven methods like Mockingbird achieve acceptable results on common tasks but fall short of outperforming domain-specific documents or human expert feedback.

Conclusion: While LLMs show promise for general machine learning tasks, self-reflection alone is insufficient to replace domain expertise or human input.

Abstract: Large language models (LLMs) are now being used with increasing frequency as chat bots, tasked with summarizing information or generating text and code in accordance with user instructions. The rapid increase in reasoning capabilities and inference speed of LLMs has revealed their remarkable potential for applications extending beyond the domain of chat bots to general machine learning tasks. This work is conducted out of curiosity about that potential. In this work, we propose a framework, Mockingbird, to adapt LLMs to general machine learning tasks and evaluate its performance and scalability on several such tasks. The core concept of this framework is instructing LLMs to role-play functions and reflect on their mistakes to improve themselves. Our evaluation and analysis show that LLM-driven machine learning methods, such as Mockingbird, can achieve acceptable results on common machine learning tasks; however, self-reflection alone currently cannot outperform the effect of domain-specific documents and feedback from human experts.

[437] Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov

Main category: cs.LG

TL;DR: VL-DAC, a lightweight RL algorithm, trains VLMs in cheap simulators, improving performance on real-world benchmarks without degrading general image understanding.

DetailsMotivation: Current VLMs lack the ability to convert visual observations into coherent language-conditioned actions, and existing RL methods struggle with generalization and hyperparameter tuning.

Method: VL-DAC decouples PPO updates for action tokens from value learning at the environment-step level, eliminating unstable weighting terms.

Result: VL-DAC achieves +50% on BALROG, +5% on VSI-Bench, and +2% on VisualWebBench, showing generalization from simulators to real-world tasks.

Conclusion: A simple RL algorithm like VL-DAC can effectively train VLMs in synthetic environments and improve performance on diverse real-world benchmarks.

Abstract: Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions – a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
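
A hedged PyTorch sketch of the decoupling described above: PPO-style clipped updates on action tokens, with the critic trained only at environment-step granularity. Shapes and names are assumptions, not the paper's implementation:

```python
import torch

def vl_dac_losses(logp_new, logp_old, step_adv, v_pred, v_target, clip=0.2):
    # logp_new / logp_old: [num_action_tokens] log-probs of the sampled tokens;
    # step_adv: the environment-step advantage, broadcast to that step's tokens.
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * step_adv
    clipped = ratio.clamp(1 - clip, 1 + clip) * step_adv
    policy_loss = -torch.min(unclipped, clipped).mean()   # token-level PPO

    # The critic sees one target per environment step, not per token.
    value_loss = torch.nn.functional.mse_loss(v_pred, v_target)
    return policy_loss, value_loss

pl, vl = vl_dac_losses(
    logp_new=torch.tensor([-0.9, -1.1]),
    logp_old=torch.tensor([-1.0, -1.0]),
    step_adv=torch.tensor([0.5, 0.5]),   # same step's advantage on both tokens
    v_pred=torch.tensor([0.4]),
    v_target=torch.tensor([0.6]),
)
```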

[438] WSS-CL: Weight Saliency Soft-Guided Contrastive Learning for Efficient Machine Unlearning Image Classification

Thang Duc Tran, Thai Hoang Le

Main category: cs.LG

TL;DR: A two-phase method (WSS-CL) for efficient machine unlearning in image classification, using weight saliency and contrastive learning to improve precision and stability.

DetailsMotivation: Current machine unlearning methods struggle with precision, stability, and domain applicability, prompting the need for a more effective approach.

Method: Combines weight saliency with contrastive learning: a forgetting stage (KL divergence) and adversarial fine-tuning (self-supervised contrastive learning).

Result: Achieves improved unlearning efficacy with minimal performance loss, outperforming state-of-the-art methods.

Conclusion: WSS-CL is effective for supervised and self-supervised settings, narrowing the gap to “exact” unlearning.

Abstract: Machine unlearning, the efficient deletion of the impact of specific data in a trained model, remains a challenging problem. Current machine unlearning approaches that focus primarily on data-centric or weight-based strategies frequently encounter challenges in achieving precise unlearning, maintaining stability, and ensuring applicability across diverse domains. In this work, we introduce a new two-phase efficient machine unlearning method for image classification that leverages weight saliency to focus the unlearning process on critical model parameters. Our method, weight saliency soft-guided contrastive learning for efficient machine unlearning image classification (WSS-CL), significantly narrows the performance gap with “exact” unlearning. First, the forgetting stage maximizes the Kullback-Leibler divergence between output logits and aggregated pseudo-labels for efficient forgetting in logit space. Next, the adversarial fine-tuning stage introduces contrastive learning in a self-supervised manner. By using scaled feature representations, it maximizes the distance between the forgotten and retained data samples in the feature space, with the forgotten and the paired augmented samples acting as positive pairs, while the retained samples act as negative pairs in the contrastive loss computation. Experimental evaluations reveal that our proposed method yields much-improved unlearning efficacy with negligible performance loss compared to state-of-the-art approaches, indicative of its usability in supervised and self-supervised settings.
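
The forgetting-stage objective, maximizing KL divergence between outputs and aggregated pseudo-labels, might look roughly like this in PyTorch; this is a sketch under assumed shapes, with the contrastive fine-tuning stage omitted:

```python
import torch
import torch.nn.functional as F

def forgetting_loss(logits, pseudo_labels):
    # logits: [B, C] model outputs on forget-set samples;
    # pseudo_labels: [B, C] aggregated pseudo-label distributions.
    log_probs = F.log_softmax(logits, dim=-1)
    kl = F.kl_div(log_probs, pseudo_labels, reduction="batchmean")
    return -kl  # descending on -KL maximizes the divergence

logits = torch.randn(8, 10, requires_grad=True)
pseudo = torch.softmax(torch.randn(8, 10), dim=-1)
forgetting_loss(logits, pseudo).backward()
```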

[439] Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

Ali Taheri Ghahrizjani, Alireza Taban, Qizhou Wang, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han

Main category: cs.LG

TL;DR: The paper proposes categorizing tokens in supervised fine-tuning (SFT) for LLMs into positive and negative tokens to improve performance by focusing on useful information and forgetting misleading or uninformative data.

DetailsMotivation: To reduce reliance on data quality and volume in SFT by distinguishing useful tokens from harmful or uninformative ones.

Method: Tokens are classified as positive (useful) or negative (misleading/uninformative). Positive tokens are trained normally, while negative tokens are explicitly forgotten to shape a knowledge boundary.

Result: Experiments show the forgetting mechanism improves model performance and increases response diversity.

Conclusion: Token categorization and selective forgetting enhance SFT effectiveness by refining the model’s learning focus.

Abstract: Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume; otherwise it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing tokens within each corpus into two parts – positive and negative tokens – based on whether they are useful to improve model performance. Positive tokens can be trained in common ways, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization helps the model avoid learning uninformative or misleading content, and the forgetting process shapes a knowledge boundary that guides the model more precisely on what information to learn. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
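
A minimal sketch of the selective token-level objective, assuming the forgetting term is implemented as negated cross-entropy (gradient ascent) on tokens flagged negative; the paper's exact objective may differ:

```python
import torch
import torch.nn.functional as F

def sft_with_forgetting(logits, targets, is_positive, forget_weight=0.1):
    # logits: [T, V]; targets: [T]; is_positive: [T] bool mask over tokens.
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-token loss
    learn = ce[is_positive].mean()         # train useful tokens as usual
    forget = -ce[~is_positive].mean()      # push probability down on the rest
    return learn + forget_weight * forget  # assumes both token types present

logits = torch.randn(6, 100, requires_grad=True)
targets = torch.randint(0, 100, (6,))
mask = torch.tensor([True, True, False, True, False, True])
sft_with_forgetting(logits, targets, mask).backward()
```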

[440] From Split to Share: Private Inference with Distributed Feature Sharing

Zihan Liu, Jiayi Wen, Shouhong Tan, Zhirun Zheng, Cheng Huang

Main category: cs.LG

TL;DR: PrivDFS introduces a distributed feature-sharing method for private inference, balancing privacy and efficiency by partitioning client data among non-colluding servers, with extensions (PrivDFS-AT and PrivDFS-KD) enhancing robustness against attacks.

DetailsMotivation: Addressing privacy concerns in cloud-based MLaaS, where existing methods either compromise efficiency (cryptographic approaches) or privacy (split inference).

Method: Partitions client input features into shares for non-colluding servers, aggregates outputs securely, and adds adversarial training (PrivDFS-AT) and user-specific keys (PrivDFS-KD) for stronger privacy.

Result: Achieves privacy comparable to split inference with 100x less client computation and no accuracy loss; extensions resist inversion and adaptive attacks.

Conclusion: PrivDFS offers a scalable, efficient, and privacy-preserving solution for MLaaS, with robust extensions against attacks.

Abstract: Cloud-based Machine Learning as a Service (MLaaS) raises serious privacy concerns when handling sensitive client data. Existing Private Inference (PI) methods face a fundamental trade-off between privacy and efficiency: cryptographic approaches offer strong protection but incur high computational overhead, while efficient alternatives such as split inference expose intermediate features to inversion attacks. We propose PrivDFS, a new paradigm for private inference that replaces a single exposed representation with distributed feature sharing. PrivDFS partitions input features on the client into multiple balanced shares, which are distributed to non-colluding, non-communicating servers for independent partial inference. The client securely aggregates the servers’ outputs to reconstruct the final prediction, ensuring that no single server observes sufficient information to compromise input privacy. To further strengthen privacy, we propose two key extensions: PrivDFS-AT, which uses adversarial training with a diffusion-based proxy attacker to enforce inversion-resistant feature partitioning, and PrivDFS-KD, which leverages user-specific keys to diversify partitioning policies and prevent query-based inversion generalization. Experiments on CIFAR-10 and CelebA demonstrate that PrivDFS achieves privacy comparable to deep split inference while cutting client computation by up to 100 times with no accuracy loss, and that the extensions remain robust against both diffusion-based in-distribution and adaptive attacks.
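
A toy sketch of the protocol's shape, assuming a disjoint, balanced partition of feature indices; the paper's learned, inversion-resistant partitioning and secure aggregation are not modeled here:

```python
import numpy as np

def make_shares(x, num_servers, rng):
    # Split feature indices into balanced, disjoint shares.
    idx = rng.permutation(x.size)
    shares = []
    for part in np.array_split(idx, num_servers):
        share = np.zeros_like(x)
        share[part] = x[part]          # each server sees only its subset
        shares.append(share)
    return shares

rng = np.random.default_rng(0)
x = rng.normal(size=16)                          # client-side input features
shares = make_shares(x, num_servers=3, rng=rng)

# Non-colluding servers run partial inference; the client aggregates.
partial_outputs = [np.tanh(s).sum() for s in shares]   # toy per-server "model"
prediction = sum(partial_outputs)
```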

[441] Multi-Marginal Stochastic Flow Matching for High-Dimensional Snapshot Data at Irregular Time Points

Justin Lee, Behnaz Moradijamei, Heman Shakeri

Main category: cs.LG

TL;DR: MMSFM extends flow matching to multi-marginal settings for high-dimensional data alignment without dimensionality reduction, handling irregular time points robustly.

DetailsMotivation: Addressing the challenge of modeling high-dimensional system evolution from irregular snapshot observations without oversimplifying dynamics.

Method: Multi-Marginal Stochastic Flow Matching (MMSFM) uses measure-valued splines and score matching for robust alignment of high-dimensional data at non-equidistant time points.

Result: Validated on synthetic and benchmark datasets, including gene expression and image progression tasks, showing versatility.

Conclusion: MMSFM effectively handles high-dimensional, irregularly timed data, preserving transient behaviors and avoiding overfitting.

Abstract: Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets, including gene expression data collected at uneven time points and an image progression task, demonstrating the method’s versatility.

[442] FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He

Main category: cs.LG

TL;DR: FlexQ is a post-training INT6 quantization framework for LLMs, balancing accuracy and efficiency by combining algorithmic and system-level optimizations, achieving near-FP16 accuracy and significant speedups.

DetailsMotivation: LLMs face high memory and computational costs, and existing INT4/INT8 quantization methods degrade accuracy or lack efficiency. INT6 offers a better trade-off but lacks hardware support.

Method: FlexQ uses uniform 6-bit weight quantization with adaptive 8-bit activations, supported by a custom GPU kernel for W6A6/W6A8 via Binary Tensor Core equivalents.

Result: FlexQ maintains near-FP16 accuracy (≤0.05 perplexity increase) and achieves 1.39× speedup over ABQ-LLM, with 1.33× inference acceleration and 1.21× memory savings over SmoothQuant.

Conclusion: FlexQ effectively addresses LLM deployment challenges by optimizing INT6 quantization, offering a practical solution for efficient inference without sacrificing accuracy.

Abstract: Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.05. The proposed kernel achieves an average 1.39× speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33× inference acceleration and 1.21× memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
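
The W6 side of the scheme, uniform symmetric 6-bit weight quantization with per-channel scales, can be sketched in a few lines; the BTC kernel tricks are out of scope here, and the scale choice is an assumption:

```python
import torch

def quantize_int6(w, dim=1):
    # Symmetric uniform quantization to the signed INT6 range [-32, 31],
    # with one scale per output channel (rows of a linear layer's weight).
    qmax = 31
    scale = w.abs().amax(dim=dim, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -32, 31)
    return q.to(torch.int8), scale     # int8 container; only 6 bits used

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4, 8)
q, s = quantize_int6(w)
print((w - dequantize(q, s)).abs().max())   # worst-case rounding error
```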

[443] Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Md Raisul Kibria, Sébastien Lafond, Janan Arslan

Main category: cs.LG

TL;DR: A systematic review of multimodal explainable AI (XAI) research from 2020-2024 highlights gaps in explanation methods and evaluation practices, with recommendations for future improvements.

DetailsMotivation: The demand for explainable AI (XAI) in multimodal learning has grown, but current methods lack consistency and robustness in capturing cross-modal interactions.

Method: The review analyzes literature on multimodal XAI, focusing on model architecture, modalities, explanation algorithms, and evaluation methodologies.

Result: Most studies focus on vision-language and language-only models, with attention-based techniques dominating. Evaluation methods are inconsistent and lack modality-specific considerations.

Conclusion: The paper recommends standardized evaluation practices to advance interpretable and accountable multimodal AI systems.

Abstract: Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

[444] Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation

Askar Tsyganov, Evgeny Frolov, Sergey Samsonov, Maxim Rakhuba

Main category: cs.LG

TL;DR: Proposes randomized algorithms for matrix norm estimation using matrix-vector multiplications, with applications in deep learning and recommender systems.

DetailsMotivation: To efficiently estimate matrix norms in a matrix-free setting, addressing needs in deep neural network training and adversarial attack mitigation.

Method: Modifies Hutchinson’s diagonal estimator and Hutch++ for matrix-vector multiplications, with provided oracle complexity bounds.

Result: Demonstrates practical utility in Jacobian-based regularization for deep learning and adversarial attack mitigation in recommender systems.

Conclusion: The proposed algorithms are effective for matrix norm estimation and have broad applications in machine learning and security.

Abstract: In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson’s diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.
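
Since the two-to-infinity norm of A equals the largest row ℓ2 norm, i.e. sqrt(max diag(AAᵀ)), a Hutchinson-style diagonal estimate needs only products with A and Aᵀ. The sketch below illustrates that flavor and is not the authors' algorithm:

```python
import numpy as np

def two_to_inf_estimate(matvec, rmatvec, n_rows, num_probes=200, seed=0):
    # ||A||_{2->inf} = max_i ||A[i, :]||_2 = sqrt(max diag(A A^T)).
    # Hutchinson: E[v * (A A^T v)] = diag(A A^T) for Rademacher probes v.
    rng = np.random.default_rng(seed)
    diag_est = np.zeros(n_rows)
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n_rows)
        diag_est += v * matvec(rmatvec(v))   # v * (A (A^T v)), matvecs only
    return np.sqrt((diag_est / num_probes).max())

A = np.random.default_rng(1).normal(size=(50, 30))
est = two_to_inf_estimate(lambda x: A @ x, lambda y: A.T @ y, n_rows=50)
print(est, np.linalg.norm(A, axis=1).max())  # estimate vs. exact value
```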

[445] Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling

Biao Hu, Guoyin Wang

Main category: cs.LG

TL;DR: CMCFAE integrates the cloud model into WAE for better latent space regularization, outperforming existing models in reconstruction and sample diversity.

DetailsMotivation: To address the homogenization in reconstructed samples and improve latent space representation by using a cloud model prior instead of standard Gaussian.

Method: Derives the cloud model’s characteristic function and proposes a regularizer within the WAE framework.

Result: Outperforms existing models on MNIST, FashionMNIST, CIFAR-10, and CelebA in reconstruction quality and sample diversity.

Conclusion: CMCFAE offers a novel integration of cloud model theory with MMD-based regularization, enhancing autoencoder-based generative models.

Abstract: We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models.

[446] Automatic LLM Red Teaming

Roman Belaire, Arunesh Sinha, Pradeep Varakantham

Main category: cs.LG

TL;DR: A novel AI-driven red teaming approach for LLMs uses hierarchical RL to train an agent for multi-turn adversarial dialogues, outperforming existing methods.

DetailsMotivation: Current automated red teaming methods for LLMs are limited by brittle templates or single-turn attacks, missing the complexity of real-world adversarial interactions.

Method: Formalizes red teaming as an MDP and employs hierarchical RL with token-level harm rewards to train a generative agent for multi-turn attacks.

Result: The approach uncovers subtle vulnerabilities missed by baselines and sets a new state-of-the-art in LLM red teaming.

Conclusion: Reframes red teaming as a dynamic, trajectory-based process, essential for robust AI deployment.

Abstract: Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, current automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically ‘break’ another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state-of-the-art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.

[447] Small transformer architectures for task switching

Claudius Gros

Main category: cs.LG

TL;DR: The paper explores small-scale AI applications where attention-based models outperform traditional methods, focusing on task-switching scenarios. It finds standard transformers ineffective for a basic arithmetic task (IARC) but identifies improved performance with a modified transformer (cisformer) and extensive attention.

DetailsMotivation: To understand why attention-based models struggle in small-scale tasks like task-switching and to identify better-performing architectures.

Method: Evaluates transformers, LSTMs, MLPs, and modified transformers (cisformer) on a task-switching framework with arithmetic subtasks (IARC).

Result: Standard transformers, LSTMs, and MLPs perform modestly, while a combination of cisformer and extensive attention achieves ~95% accuracy.

Conclusion: Attention mechanisms can be better understood and improved by comparing different formulations in task-switching settings.

Abstract: The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of ‘task switching’. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetic, which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translational invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings.

[448] CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference

Enyu Zhou, Kai Sheng, Hao Chen, Xin He

Main category: cs.LG

TL;DR: CARD introduces a cache-based parallel speculative decoding framework to accelerate LLM inference by decoupling drafting and verification, achieving up to 4.83x speedup.

DetailsMotivation: Existing speculative decoding methods suffer from sequential execution and inefficient drafting due to the 'draft-then-verify' paradigm.

Method: CARD employs a ‘query-and-correct’ paradigm, where the draft model populates a shared cache and the target model concurrently corrects the draft’s direction.

Result: Achieves up to 4.83x speedup over vanilla decoding without model fine-tuning.

Conclusion: CARD effectively improves inference efficiency by parallelizing drafting and verification.

Abstract: Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods must adhere to the ‘draft-then-verify’ paradigm, which forces drafting and verification processes to execute sequentially during SD, resulting in inefficient inference performance and limiting the size of the draft model. Furthermore, once a single token in the candidate sequence is rejected during the drafting process, all subsequent candidate tokens must be discarded, leading to inefficient drafting. To address these challenges, we propose CARD, a cache-based parallel speculative decoding framework employing a ‘query-and-correct’ paradigm. Specifically, CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model’s generation direction. This effectively enables the target model to perform inference at speed approaching that of the draft model. Our approach achieves up to 4.83× speedup over vanilla decoding without requiring fine-tuning of either the draft or target models. Our code is available at https://github.com/hunzhizi/CARD.

[449] GFocal: A Global-Focal Neural Operator for Solving PDEs on Arbitrary Geometries

Fangzhi Fei, Jiaxin Hu, Qiaofeng Li, Zhenyu Liu

Main category: cs.LG

TL;DR: GFocal, a Transformer-based neural operator, enhances PDE solving by integrating global and local feature learning, outperforming benchmarks by 15.2%.

DetailsMotivation: Existing methods neglect the interplay between local physical details and global features, crucial for multiscale problems and physical consistency.

Method: GFocal uses Nyström attention for global blocks and slices-based focal blocks for local features, fused via convolution-based gating.

Result: GFocal achieves state-of-the-art performance, with a 15.2% gain in benchmarks and excels in industry-scale simulations.

Conclusion: GFocal effectively combines global and local learning for accurate, physics-aware PDE solutions.

Abstract: Transformer-based neural operators have emerged as promising surrogate solvers for partial differential equations, by leveraging the effectiveness of Transformers for capturing long-range dependencies and global correlations, profoundly proven in language modeling. However, existing methodologies overlook the coordinated learning of interdependencies between local physical details and global features, which are essential for tackling multiscale problems, preserving physical consistency and numerical stability in long-term rollouts, and accurately capturing transitional dynamics. In this work, we propose GFocal, a Transformer-based neural operator method that enforces simultaneous global and local feature learning and fusion. Global correlations and local features are harnessed through Nyström attention-based global blocks and slices-based focal blocks to generate physics-aware tokens, subsequently modulated and integrated via convolution-based gating blocks, enabling dynamic fusion of multiscale information. GFocal achieves accurate modeling and prediction of physical features given arbitrary geometries and initial conditions. Experiments show that GFocal achieves state-of-the-art performance with an average 15.2% relative gain in five out of six benchmarks and also excels in industry-scale simulations such as aerodynamics simulation of automotives and airfoils.

[450] FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions

Jianheng Tang, Zhirui Yang, Jingchao Wang, Kejia Fan, Jinfeng Xu, Huiping Zhuang, Anfeng Liu, Houbing Herbert Song, Leye Wang, Yunhuai Liu

Main category: cs.LG

TL;DR: FedHiP proposes a gradient-free PFL method using closed-form solutions to address non-IID data challenges, outperforming baselines by 5.79%-20.97% in accuracy.

DetailsMotivation: Existing PFL methods struggle with non-IID data, hindering convergence and performance due to reliance on gradient-based updates.

Method: FedHiP uses self-supervised pre-training for feature extraction and an analytic classifier, with three phases: local training, global aggregation, and local personalization.

Result: FedHiP achieves heterogeneity invariance and outperforms state-of-the-art baselines by 5.79%-20.97% in accuracy.

Conclusion: FedHiP effectively addresses non-IID data challenges in PFL with its gradient-free approach and closed-form solutions.

Abstract: Lately, Personalized Federated Learning (PFL) has emerged as a prevalent paradigm to deliver personalized models by collaboratively training while simultaneously adapting to each client’s local applications. Existing PFL methods typically face a significant challenge due to the ubiquitous data heterogeneity (i.e., non-IID data) across clients, which severely hinders convergence and degrades performance. We identify that the root issue lies in the long-standing reliance on gradient-based updates, which are inherently sensitive to non-IID data. To fundamentally address this issue and bridge the research gap, in this paper, we propose a Heterogeneity-invariant Personalized Federated learning scheme, named FedHiP, through analytical (i.e., closed-form) solutions to avoid gradient-based updates. Specifically, we exploit the trend of self-supervised pre-training, leveraging a foundation model as a frozen backbone for gradient-free feature extraction. Following the feature extractor, we further develop an analytic classifier for gradient-free training. To support both collective generalization and individual personalization, our FedHiP scheme incorporates three phases: analytic local training, analytic global aggregation, and analytic local personalization. The closed-form solutions of our FedHiP scheme enable its ideal property of heterogeneity invariance, meaning that each personalized model remains identical regardless of how non-IID the data are distributed across all other clients. Extensive experiments on benchmark datasets validate the superiority of our FedHiP scheme, outperforming the state-of-the-art baselines by at least 5.79%-20.97% in accuracy.
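
A minimal sketch of the closed-form flavor, assuming the analytic classifier is ridge regression on frozen foundation-model features; clients then share sufficient statistics rather than gradients, which is what makes the aggregate invariant to how data is partitioned. Names and the three-phase detail are simplified:

```python
import numpy as np

def local_statistics(feats, labels_onehot):
    # Per-client sufficient statistics for ridge regression on frozen features.
    return feats.T @ feats, feats.T @ labels_onehot

def aggregate(stats, lam=1e-2):
    # Gradient-free global step: sum statistics, solve once in closed form.
    G = sum(s[0] for s in stats)
    H = sum(s[1] for s in stats)
    return np.linalg.solve(G + lam * np.eye(G.shape[0]), H)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(40, 8)), np.eye(3)[rng.integers(0, 3, 40)])
           for _ in range(5)]
W_global = aggregate([local_statistics(F, Y) for F, Y in clients])
print(W_global.shape)  # (8, 3); identical however the rows are split up
```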

[451] Who cuts emissions, who turns up the heat? causal machine learning estimates of energy efficiency interventions

Bernardino D’Amico, Francesco Pomponi, Jay H. Arehart, Lina Khaddour

Main category: cs.LG

TL;DR: Wall insulation reduces gas demand by 19% on average, but savings vary by energy burden groups. High-burden households reallocate savings to comfort, not consumption reduction.

DetailsMotivation: To understand the heterogeneous impact of energy efficiency interventions like wall insulation on gas consumption, especially across energy burden subgroups.

Method: A causal machine learning model trained on nationally representative data of the English housing stock to estimate treatment effects of wall insulation.

Result: Low energy burden groups save significantly, while high-burden groups see little reduction due to reallocating savings to comfort.

Conclusion: Energy policies need a broader framework considering both climate impacts and equity, as behavioral responses reflect rational adjustments in deprived contexts.

Abstract: Reducing domestic energy demand is central to climate mitigation and fuel poverty strategies, yet the impact of energy efficiency interventions is highly heterogeneous. Using a causal machine learning model trained on nationally representative data of the English housing stock, we estimate average and conditional treatment effects of wall insulation on gas consumption, focusing on distributional effects across energy burden subgroups. While interventions reduce gas demand on average (by as much as 19 percent), low energy burden groups achieve substantial savings, whereas those experiencing high energy burdens see little to no reduction. This pattern reflects a behaviourally-driven mechanism: households constrained by high costs-to-income ratios (e.g. more than 0.1) reallocate savings toward improved thermal comfort rather than lowering consumption. Far from wasteful, such responses represent rational adjustments in contexts of prior deprivation, with potential co-benefits for health and well-being. These findings call for a broader evaluation framework that accounts for both climate impacts and the equity implications of domestic energy policy.
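
As a stand-in for the paper's causal-ML estimator (not specified in the summary), a simple T-learner with gradient boosting illustrates how per-household treatment effects are obtained and compared across subgroups; the data and covariates are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # dwelling / household covariates
t = rng.integers(0, 2, 1000)                   # 1 = wall insulation installed
y = 10 - 2 * t * (X[:, 0] < 0) + rng.normal(scale=0.5, size=1000)  # gas use

# T-learner: separate outcome models for treated and control homes.
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate = m1.predict(X) - m0.predict(X)           # per-household effect estimate

# Heterogeneity across an "energy burden" proxy (here: the sign of X[:, 0])
print(cate[X[:, 0] < 0].mean(), cate[X[:, 0] >= 0].mean())
```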

[452] Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach

Anushka Srivastava

Main category: cs.LG

TL;DR: A deep learning approach using cGANs for multimodal emotion detection, outperforming traditional unimodal methods.

DetailsMotivation: To improve emotion recognition by integrating text, audio, and facial expressions, enhancing human-computer interaction.

Method: Uses Conditional Generative Adversarial Networks (cGANs) to generate synthetic emotion-rich data and classify across modalities.

Result: Significant improvement in emotion recognition performance compared to baseline models.

Conclusion: cGANs show promise for nuanced emotional understanding in human-computer interaction systems.

Abstract: This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.

[453] Hierarchical Scoring for Machine Learning Classifier Error Impact Evaluation

Erin Lanus, Daniel Wolodkin, Laura J. Freeman

Main category: cs.LG

TL;DR: The paper introduces hierarchical scoring metrics for ML models, using scoring trees to evaluate misclassifications with finer granularity, moving beyond binary pass/fail metrics.

DetailsMotivation: Current classification and object detection metrics treat all misclassifications equally, ignoring hierarchical relationships between classes. This work aims to provide nuanced evaluation by considering the distance between predicted and ground truth labels in a taxonomy.

Method: Develops hierarchical scoring metrics using scoring trees to encode class relationships. Demonstrates metrics on an abstract use case with three weighting strategies, evaluating the type of errors discouraged.

Result: The metrics capture errors with finer granularity, allowing tuning via scoring trees. They rank models by error impact, not just quantity.

Conclusion: Hierarchical scoring metrics offer a refined approach to ML evaluation, enabling better understanding of misclassification impact. Python implementations will be open-sourced.

Abstract: A common use of machine learning (ML) models is predicting the class of a sample. Object detection is an extension of classification that includes localization of the object via a bounding box within the sample. Classification, and by extension object detection, is typically evaluated by counting a prediction as incorrect if the predicted label does not match the ground truth label. This pass/fail scoring treats all misclassifications as equivalent. In many cases, class labels can be organized into a class taxonomy with a hierarchical structure to either reflect relationships among the data or operator valuation of misclassifications. When such a hierarchical structure exists, hierarchical scoring metrics can return the model performance of a given prediction related to the distance between the prediction and the ground truth label. Such metrics can be viewed as giving partial credit to predictions instead of pass/fail, enabling a finer-grained understanding of the impact of misclassifications. This work develops hierarchical scoring metrics varying in complexity that utilize scoring trees to encode relationships between class labels and produce metrics that reflect distance in the scoring tree. The scoring metrics are demonstrated on an abstract use case with scoring trees that represent three weighting strategies and evaluated by the kind of errors discouraged. Results demonstrate that these metrics capture errors with finer granularity and the scoring trees enable tuning. This work demonstrates an approach to evaluating ML performance that ranks models not only by how many errors are made but by the kind or impact of errors. Python implementations of the scoring metrics will be available in an open-source repository at time of publication.
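
One concrete way to realize partial-credit scoring over a taxonomy: encode the scoring tree as a parent map and decay credit with tree distance to the ground truth. The weighting below is illustrative, not one of the paper's three strategies:

```python
# Scoring tree as a parent map; None marks the root.
parent = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "sparrow": "bird", "bird": "animal", "animal": None}

def ancestors(node):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def hierarchical_score(pred, truth):
    # 1.0 for an exact match, decaying with distance to the common ancestor.
    pa, ta = ancestors(pred), ancestors(truth)
    common = next(a for a in pa if a in ta)
    distance = pa.index(common) + ta.index(common)
    return 1.0 / (1.0 + distance)

print(hierarchical_score("cat", "cat"))      # 1.0
print(hierarchical_score("cat", "dog"))      # partial credit (distance 2)
print(hierarchical_score("cat", "sparrow"))  # less credit (distance 4)
```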

[454] Causal Reflection with Language Models

Abi Aryan, Zac Liu

Main category: cs.LG

TL;DR: A framework called Causal Reflection is introduced to enhance causal reasoning in agents by modeling causality dynamically and using a Reflect mechanism to correct mismatches.

DetailsMotivation: LLMs and traditional RL agents lack robust causal reasoning, relying on spurious correlations or reward optimization without understanding causality.

Method: Causal Reflection models causality dynamically over state, action, time, and perturbation, and uses a Reflect mechanism to generate causal hypotheses for model revision.

Result: The framework enables agents to reason about delayed, nonlinear effects and adapt by self-correcting and communicating causal understanding.

Conclusion: Causal Reflection provides a theoretical foundation for agents with improved causal reasoning, adaptability, and explanatory capabilities.

Abstract: While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent’s internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

[455] PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers

Federico Zucchi, Thomas Lampert

Main category: cs.LG

TL;DR: PRISM is a lightweight, efficient convolutional feature extractor for multivariate time-series classification, outperforming CNN and Transformer models with fewer parameters and FLOPs.

DetailsMotivation: Address the computational heaviness, limited frequency diversity, and high parameter requirements of existing Transformer- and CNN-based models in time-series classification.

Method: PRISM uses symmetric FIR filters at multiple temporal scales per channel, avoiding inter-channel convolutions to reduce complexity.

Result: PRISM matches or outperforms leading baselines in human-activity, sleep-stage, and biomedical benchmarks while using significantly fewer resources.

Conclusion: PRISM combines classical signal processing with deep learning for accurate, efficient multivariate time-series classification.

Abstract: Multivariate time-series classification is pivotal in domains ranging from wearable sensing to biomedical monitoring. Despite recent advances, Transformer- and CNN-based models often remain computationally heavy, offer limited frequency diversity, and require extensive parameter budgets. We propose PRISM (Per-channel Resolution-Informed Symmetric Module), a convolutional-based feature extractor that applies symmetric finite-impulse-response (FIR) filters at multiple temporal scales, independently per channel. This multi-resolution, per-channel design yields highly frequency-selective embeddings without any inter-channel convolutions, greatly reducing model size and complexity. Across human-activity, sleep-stage and biomedical benchmarks, PRISM, paired with lightweight classification heads, matches or outperforms leading CNN and Transformer baselines, while using roughly an order of magnitude fewer parameters and FLOPs. By uniting classical signal processing insights with modern deep learning, PRISM offers an accurate, resource-efficient solution for multivariate time-series classification.
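
A sketch of the per-channel, multi-resolution idea: symmetric (linear-phase) FIR filters of several widths applied to each channel independently, with no inter-channel mixing. The moving-average filter is an illustrative assumption, not PRISM's learned filters:

```python
import numpy as np

def symmetric_fir_bank(x, widths=(3, 7, 15)):
    # x: [channels, time]; channels are never mixed, as in the summary.
    views = []
    for w in widths:
        kernel = np.ones(w) / w           # symmetric FIR (moving average)
        filtered = np.stack([np.convolve(c, kernel, mode="same") for c in x])
        views.append(filtered)
    return np.concatenate(views, axis=0)  # [channels * len(widths), time]

x = np.random.default_rng(0).normal(size=(4, 128))
feats = symmetric_fir_bank(x)
print(feats.shape)  # (12, 128), fed to a lightweight classification head
```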

[456] Channel-Independent Federated Traffic Prediction

Mo Zhang, Xiaoyu Li, Bin Xu, Meng Chen, Yongshun Gong

Main category: cs.LG

TL;DR: The paper introduces a Channel-Independent Paradigm (CIP) for federated traffic prediction, reducing communication overhead by enabling local predictions without inter-client communication. The Fed-CI framework based on CIP improves accuracy and efficiency.

DetailsMotivation: Traffic data is distributed and privacy-constrained, making traditional federated methods communication-heavy and slow. A solution is needed to reduce overhead while maintaining accuracy.

Method: Proposes CIP for local predictions without inter-client communication and develops Fed-CI, a federated learning framework, to mitigate information loss and reduce costs.

Result: Fed-CI outperforms existing methods, improving RMSE, MAE, and MAPE by 8%, 14%, and 16%, respectively, while cutting communication costs.

Conclusion: Fed-CI offers a scalable, efficient, and privacy-compliant solution for federated traffic prediction, addressing key challenges in the field.

Abstract: In recent years, traffic prediction has achieved remarkable success and has become an integral component of intelligent transportation systems. However, traffic data is typically distributed among multiple data owners, and privacy constraints prevent the direct utilization of these isolated datasets for traffic prediction. Most existing federated traffic prediction methods focus on designing communication mechanisms that allow models to leverage information from other clients in order to improve prediction accuracy. Unfortunately, such approaches often incur substantial communication overhead, and the resulting transmission delays significantly slow down the training process. As the volume of traffic data continues to grow, this issue becomes increasingly critical, making the resource consumption of current methods unsustainable. To address this challenge, we propose a novel variable relationship modeling paradigm for federated traffic prediction, termed the Channel-Independent Paradigm (CIP). Unlike traditional approaches, CIP eliminates the need for inter-client communication by enabling each node to perform efficient and accurate predictions using only local information. Based on the CIP, we further develop Fed-CI, an efficient federated learning framework, allowing each client to process its own data independently while effectively mitigating the information loss caused by the lack of direct data sharing among clients. Fed-CI significantly reduces communication overhead, accelerates the training process, and achieves state-of-the-art performance while complying with privacy regulations. Extensive experiments on multiple real-world datasets demonstrate that Fed-CI consistently outperforms existing methods across all datasets and federated settings. It achieves improvements of 8%, 14%, and 16% in RMSE, MAE, and MAPE, respectively, while also substantially reducing communication costs.

[457] Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape

Haoran Niu, K. Suzanne Barber

Main category: cs.LG

TL;DR: The paper analyzes 5,000 identity theft cases to model privacy risks using a graph-based framework, predicting further data exposures when certain personal attributes are compromised.

DetailsMotivation: Understanding privacy risks is challenging without empirical data. This research aims to identify which personal data is exposed, how often, and the consequences, to improve protection.

Method: The study constructs an Identity Ecosystem graph where nodes are PII attributes and edges represent disclosure relationships. It uses graph theory and neural networks to predict further disclosures.

Result: The framework effectively predicts the likelihood of additional attribute disclosures when one is compromised, answering key privacy risk questions.

Conclusion: The graph-based model provides a foundational tool for assessing and mitigating privacy risks by predicting data exposure pathways.

Abstract: It is difficult for individuals and organizations to protect personal information without a fundamental understanding of relative privacy risks. By analyzing over 5,000 empirical identity theft and fraud cases, this research identifies which types of personal data are exposed, how frequently exposures occur, and what the consequences of those exposures are. We construct an Identity Ecosystem graph, a foundational, graph-based model in which nodes represent personally identifiable information (PII) attributes and edges represent empirical disclosure relationships between them (e.g., the probability that one PII attribute is exposed due to the exposure of another). Leveraging this graph structure, we develop a privacy risk prediction framework that uses graph theory and graph neural networks to estimate the likelihood of further disclosures when certain PII attributes are compromised. The results show that our approach effectively answers the core question: Can the disclosure of a given identity attribute possibly lead to the disclosure of another attribute?

[458] GraphProp: Training the Graph Foundation Models using Graph Properties

Ziheng Sun, Qi Feng, Lehao Lin, Chris Ding, Jicong Fan

Main category: cs.LG

TL;DR: GraphProp trains graph foundation models (GFMs) by focusing on structural generalization, outperforming competitors in graph classification tasks.

DetailsMotivation: Traditional GFMs lack structural cross-domain generalization, while graph structures provide consistent cross-domain information.

Method: GraphProp trains a structural GFM by predicting graph invariants, then uses these representations as positional encodings for a comprehensive GFM.

Result: GraphProp excels in supervised and few-shot learning, especially for graphs without node attributes.

Conclusion: Emphasizing structural generalization improves GFM performance in cross-domain tasks.

Abstract: This work focuses on training graph foundation models (GFMs) that have strong generalization ability in graph-level tasks such as graph classification. Effective GFM training requires capturing information consistent across different domains. We discover that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional GFMs primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce GraphProp, which emphasizes structural generalization. The training process of GraphProp consists of two main phases. First, we train a structural GFM by predicting graph invariants. Since graph invariants are properties of graphs that depend only on the abstract structure, not on particular labellings or drawings of the graph, this structural GFM has a strong ability to capture the abstract structural information and provide discriminative graph representations comparable across diverse domains. In the second phase, we use the representations given by the structural GFM as positional encodings to train a comprehensive GFM. This phase utilizes domain-specific node attributes and graph labels to further improve cross-domain node feature generalization. Our experiments demonstrate that GraphProp significantly outperforms the competitors in supervised learning and few-shot learning, especially in handling graphs without node attributes.

[459] Improved Training Strategies for Physics-Informed Neural Networks using Real Experimental Data in Aluminum Spot Welding

Jan A. Zak, Christian Weißenfels

Main category: cs.LG

TL;DR: Physics-informed neural networks predict weld nugget diameter in aluminum spot welding, using novel training strategies to integrate real-world data and ensure physical accuracy.

DetailsMotivation: Destructive testing for weld nugget diameter measurement limits quality control efficiency; a non-invasive, model-based approach is needed.

Method: Two novel training strategies: fading-in experimental losses and conditional updates of material parameters. A 2D model was used for accuracy and efficiency.

Result: The network predicts displacement and nugget growth within experimental confidence, enabling fast quality control.

Conclusion: The approach shows strong potential for industrial applications, offering efficient, non-invasive quality assessment.

Abstract: Resistance spot welding is the dominant joining process for the body-in-white in the automotive industry, where the weld nugget diameter is the key quality metric. Its measurement requires destructive testing, limiting the potential for efficient quality control. Physics-informed neural networks were investigated as a promising tool to reconstruct internal process states from experimental data, enabling model-based and non-invasive quality assessment in aluminum spot welding. A major challenge is the integration of real-world data into the network due to competing optimization objectives. To address this, we introduce two novel training strategies. First, experimental losses for dynamic displacement and nugget diameter are progressively included using a fading-in function to prevent excessive optimization conflicts. We also implement a custom learning rate scheduler and early stopping based on a rolling window to counteract premature reduction due to increased loss magnitudes. Second, we introduce a conditional update of temperature-dependent material parameters via a look-up table, activated only after a loss threshold is reached to ensure physically meaningful temperatures. An axially symmetric two-dimensional model was selected to represent the welding process accurately while maintaining computational efficiency. To reduce computational burden, the training strategies and model components were first systematically evaluated in one dimension, enabling controlled analysis of loss design and contact models. The two-dimensional network predicts dynamic displacement and nugget growth within the experimental confidence interval, supports transferring welding stages from steel to aluminum, and demonstrates strong potential for fast, model-based quality control in industrial applications.
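
The two training strategies can be sketched as a fading-in weight on the experimental losses plus a loss-gated switch for the material-parameter update; the functional forms below are assumptions, not the paper's exact schedules:

```python
import math

def fade_in(step, start=1000, duration=2000):
    # 0 before `start`, rising smoothly to 1 over `duration` steps.
    t = min(max((step - start) / duration, 0.0), 1.0)
    return 0.5 * (1 - math.cos(math.pi * t))

def total_loss(step, physics_loss, disp_loss, nugget_loss,
               current_loss, threshold=0.1):
    # Experimental losses are introduced gradually to limit conflicts
    # with the physics residuals.
    w = fade_in(step)
    loss = physics_loss + w * (disp_loss + nugget_loss)
    # Temperature-dependent material parameters (look-up table) are only
    # updated once training is stable enough.
    update_materials = current_loss < threshold
    return loss, update_materials

print(fade_in(0), fade_in(2000), fade_in(3000))  # 0.0, 0.5, 1.0
```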

[460] Multitask Learning with Stochastic Interpolants

Hugo Negrel, Florentin Coeurdoux, Michael S. Albergo, Eric Vanden-Eijnden

Main category: cs.LG

TL;DR: A framework generalizing flow and diffusion models by replacing scalar time with vectors/matrices/operators, enabling versatile generative models for multiple tasks without task-specific training.

DetailsMotivation: To generalize the dynamics of flow and diffusion models and bridge probability distributions across multiple dimensions for versatile generative modeling.

Method: Generalizes stochastic interpolants by replacing scalar time with vectors, matrices, or linear operators, enabling task-agnostic generative models.

Result: Demonstrates zero-shot efficacy in conditional generation, inpainting, fine-tuning, posterior sampling, and multiscale modeling.

Conclusion: The framework offers a unifying theoretical perspective and extends capabilities of existing generative models, serving as a generic task-agnostic alternative.

Abstract: We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.
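
The core generalization in one line: replace the scalar time t in x_t = (1 - t) x0 + t x1 with an operator-valued "time" T, so different coordinates can be bridged at different rates. A toy diagonal example:

```python
import numpy as np

d = 4
x0 = np.zeros(d)                       # sample from the source distribution
x1 = np.ones(d)                        # sample from the target distribution

T = np.diag([0.0, 0.25, 0.5, 1.0])     # operator-valued "time"
x_T = (np.eye(d) - T) @ x0 + T @ x1    # coordinates progress at their own rate
print(x_T)                             # [0.   0.25 0.5  1.  ]
```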

[461] Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning

Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta

Main category: cs.LG

TL;DR: A bio-inspired Spiking Neural Network (SNN) architecture for lifelong Network Intrusion Detection (NIDS) combines static and dynamic SNNs with adaptive learning rules, achieving 85.3% accuracy and energy efficiency.

DetailsMotivation: Inspired by the brain's hierarchical processing and energy efficiency, the paper aims to develop a lifelong NIDS using SNNs.

Method: The system uses a static SNN for intrusion detection and a dynamic SNN with GWR-inspired plasticity and Ad-STDP learning for attack classification.

Result: Tested on UNSW-NB15, it shows 85.3% accuracy, robust adaptation, and reduced catastrophic forgetting. Simulations confirm low-power potential.

Conclusion: The bio-plausible SNN architecture is effective for lifelong NIDS, offering high accuracy and energy efficiency for neuromorphic hardware.

Abstract: Inspired by the brain’s hierarchical processing and energy efficiency, this paper presents a Spiking Neural Network (SNN) architecture for lifelong Network Intrusion Detection System (NIDS). The proposed system first employs an efficient static SNN to identify potential intrusions, which then activates an adaptive dynamic SNN responsible for classifying the specific attack type. Mimicking biological adaptation, the dynamic classifier utilizes Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms enable the network to learn new threats incrementally while preserving existing knowledge. Tested on the UNSW-NB15 benchmark in a continual learning setting, the architecture demonstrates robust adaptation, reduced catastrophic forgetting, and achieves 85.3% overall accuracy. Furthermore, simulations using the Intel Lava framework confirm high operational sparsity, highlighting the potential for low-power deployment on neuromorphic hardware.

[462] CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series

Yutong Xia, Yingying Zhang, Yuxuan Liang, Lunting Fan, Qingsong Wen, Roger Zimmermann

Main category: cs.LG

TL;DR: CaPulse is a causality-based framework for time series anomaly detection, addressing data challenges like label scarcity and multi-periodicity, outperforming existing methods by 3-17% in AUROC.

DetailsMotivation: Existing methods fail to capture anomaly generation mechanisms and struggle with data challenges like label scarcity and multi-periodicity.

Method: Uses a structural causal model to understand anomaly generation and introduces Periodical Normalizing Flows with a mask mechanism and periodical learners for anomaly detection.

Result: Outperforms existing methods on seven datasets, achieving AUROC improvements of 3-17% with enhanced interpretability.

Conclusion: CaPulse effectively addresses data challenges and improves anomaly detection performance through causality and periodicity-aware methods.

Abstract: Time series anomaly detection has garnered considerable attention across diverse domains. However, existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several inherent data-related challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.

[463]

Yu Song, Zhigang Hua, Harry Shomer, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu

Main category: cs.LG

TL;DR: The paper explores pretraining for Link Prediction (LP) in graph machine learning, addressing challenges like sparse supervision and poor generalization. It introduces a late fusion strategy, a Mixture-of-Experts framework, and a parameter-efficient tuning method, achieving state-of-the-art results with low computational overhead.

DetailsMotivation: Existing LP methods using GNNs face issues like sparse supervision, initialization sensitivity, and poor generalization. Pretraining is proposed to overcome these challenges by leveraging node- and edge-level information.

Method: The study systematically examines transferability of node- and edge-level modules, proposes a late fusion strategy, and introduces a Mixture-of-Experts framework for diverse pretraining data. A parameter-efficient tuning method is developed for fast adaptation.

Result: Experiments on 16 datasets show state-of-the-art performance in low-resource LP, with competitive results compared to end-to-end methods and 10,000x lower computational overhead.

Conclusion: Pretraining with late fusion and MoE effectively improves LP performance, addressing key challenges while maintaining computational efficiency.

Abstract: Link Prediction (LP) is a critical task in graph machine learning. While Graph Neural Networks (GNNs) have significantly advanced LP performance recently, existing methods face key challenges including limited supervision from sparse connectivity, sensitivity to initialization, and poor generalization under distribution shifts. We explore pretraining as a solution to address these challenges. Unlike node classification, LP is inherently a pairwise task, which requires the integration of both node- and edge-level information. In this work, we present the first systematic study on the transferability of these distinct modules and propose a late fusion strategy to effectively combine their outputs for improved performance. To handle the diversity of pretraining data and avoid negative transfer, we introduce a Mixture-of-Experts (MoE) framework that captures distinct patterns in separate experts, facilitating seamless application of the pretrained model on diverse downstream datasets. For fast adaptation, we develop a parameter-efficient tuning strategy that allows the pretrained model to adapt to unseen datasets with minimal computational overhead. Experiments on 16 datasets across two domains demonstrate the effectiveness of our approach, achieving state-of-the-art performance on low-resource link prediction while obtaining competitive results compared to end-to-end trained methods, with over 10,000x lower computational overhead.
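
A minimal sketch of what a late-fusion link scorer can look like: the node-level and edge-level modules run separately, and only their scalar scores are combined. The layer sizes, the elementwise-product pairing, and the learned fusion weight are our assumptions, not the paper's architecture.

```python
# Hedged sketch of late fusion for link prediction: combine module *outputs*
# rather than fusing their features early.
import torch
import torch.nn as nn

class LateFusionLinkPredictor(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.node_head = nn.Sequential(nn.Linear(node_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.edge_head = nn.Sequential(nn.Linear(edge_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned fusion weight

    def forward(self, h_u, h_v, e_uv):
        s_node = self.node_head(h_u * h_v)   # pairwise node-level evidence
        s_edge = self.edge_head(e_uv)        # structural edge-level evidence
        return torch.sigmoid(self.alpha * s_node + (1 - self.alpha) * s_edge)

model = LateFusionLinkPredictor(node_dim=32, edge_dim=16)
print(model(torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 16)).shape)
```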

[464] Robustly Learning Monotone Single-Index Models

Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, Jelena Diakonikolas

Main category: cs.LG

TL;DR: A computationally efficient algorithm for learning Single-Index Models under adversarial label noise, achieving constant factor approximation for all monotone activations with bounded moment.

DetailsMotivation: Address the challenge of learning Single-Index Models efficiently under adversarial noise, extending applicability to a broader class of monotone activations.

Method: Develops an optimization framework using a novel vector field for updates, leveraging Gaussian space properties and monotone function regularity.

Result: First efficient algorithm achieving constant factor approximation for a wide class of monotone activations, including Lipschitz and discontinuous functions.

Conclusion: The proposed method advances prior work by handling a broader activation class efficiently, with potential applications in robust learning tasks.

Abstract: We consider the basic problem of learning Single-Index Models with respect to the square loss under the Gaussian distribution in the presence of adversarial label noise. Our main contribution is the first computationally efficient algorithm for this learning task, achieving a constant factor approximation, that succeeds for the class of all monotone activations with bounded moment of order $2 + \zeta,$ for $\zeta > 0.$ This class in particular includes all monotone Lipschitz functions and even discontinuous functions like (possibly biased) halfspaces. Prior work for the case of unknown activation either does not attain constant factor approximation or succeeds for a substantially smaller family of activations. The main conceptual novelty of our approach lies in developing an optimization framework that steps outside the boundaries of usual gradient methods and instead identifies a useful vector field to guide the algorithm updates by directly leveraging the problem structure, properties of Gaussian spaces, and regularity of monotone functions.
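
Written out, the learning setup in the abstract is the following. This is a hedged restatement; the symbols $C$, $\mathrm{OPT}$, and $\varepsilon$ follow the standard agnostic-learning convention rather than notation taken from the paper.

```latex
% Single-index model under a Gaussian marginal with adversarial label noise:
%   x ~ N(0, I_d),   y = f(<w*, x>) + noise,   f unknown and monotone,
% with a bounded moment of order 2 + zeta:
\[
  \mathbb{E}\!\left[\,\lvert f(\langle w^{*}, x \rangle)\rvert^{\,2+\zeta}\,\right] < \infty,
  \qquad \zeta > 0 .
\]
% Constant-factor agnostic guarantee: the learner outputs a hypothesis h with
\[
  \mathbb{E}_{(x,y)}\!\left[(h(x) - y)^{2}\right] \;\le\; C \cdot \mathrm{OPT} + \varepsilon,
  \qquad C = O(1),
\]
% where OPT is the squared loss of the best single-index model in the class.
```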

[465] From Cluster Assumption to Graph Convolution: Graph-based Semi-Supervised Learning Revisited

Zheng Wang, Hongming Ding, Li Pan, Jianhua Li, Zhiguo Gong, Philip S. Yu

Main category: cs.LG

TL;DR: The paper explores the relationship between traditional graph-based semi-supervised learning and graph convolutional networks (GCNs), proposing three new methods (OGC, GGC, GGCM) to address limitations in GCNs.

DetailsMotivation: To bridge the gap between traditional GSSL methods and GCNs, highlighting that GCNs may not effectively combine graph structure and label information at each layer.

Method: Proposes three methods: OGC (supervised, label-guided), GGC and GGCM (unsupervised, preserving graph structure).

Result: Extensive experiments demonstrate the effectiveness of the proposed methods.

Conclusion: The paper provides a unified framework for GSSL and GCNs, introducing practical solutions to improve performance.

Abstract: Graph-based semi-supervised learning (GSSL) has long been a hot research topic. Traditional methods are generally shallow learners, based on the cluster assumption. Recently, graph convolutional networks (GCNs) have become the predominant techniques for their promising performance. In this paper, we theoretically discuss the relationship between these two types of methods in a unified optimization framework. One of the most intriguing findings is that, unlike traditional ones, typical GCNs may not jointly consider the graph structure and label information at each layer. Motivated by this, we further propose three simple but powerful graph convolution methods. The first is a supervised method OGC which guides the graph convolution process with labels. The others are two unsupervised methods: GGC and its multi-scale version GGCM, both aiming to preserve the graph structure information during the convolution process. Finally, we conduct extensive experiments to show the effectiveness of our methods. Code is available at https://github.com/zhengwang100/ogc_ggcm.
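
A rough sketch of the label-guided convolution idea behind OGC: alternate neighborhood smoothing with a supervised correction on labeled nodes. The update form, the least-squares classifier, and the constants below are our assumptions, in the spirit of the method rather than its exact algorithm.

```python
# Hedged sketch: graph convolution (smoothing) interleaved with a label-guided
# correction step, illustrating the joint use of structure and labels.
import numpy as np

def label_guided_convolution(S, X, Y, labeled_mask, steps=16, beta=0.5, eta=0.1):
    """S: (n, n) row-normalized adjacency; X: (n, d) features;
    Y: (n, c) one-hot labels (rows of unlabeled nodes are ignored);
    labeled_mask: boolean array of shape (n,)."""
    # A simple linear classifier fit on the labeled rows (our choice).
    W = np.linalg.lstsq(X[labeled_mask], Y[labeled_mask], rcond=None)[0]
    for _ in range(steps):
        X = (1 - beta) * X + beta * (S @ X)          # smoothing (cluster assumption)
        resid = (X @ W - Y) * labeled_mask[:, None]  # supervised residual, labeled rows only
        X = X - eta * resid @ W.T                    # gradient-style label guidance
    return X
```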

[466] Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting

Tianxiang Zhan, Yuanpeng He, Yong Deng, Zhen Li, Wenjie Du, Qingsong Wen

Main category: cs.LG

TL;DR: The paper proposes Time Evidence Fusion Network (TEFN), a novel architecture for efficient and accurate time series forecasting, using evidence theory for uncertainty capture and multi-source fusion for improved performance.

DetailsMotivation: The need for both accuracy and efficiency in time series forecasting drives the exploration of new model architectures.

Method: Introduces the Basic Probability Assignment (BPA) Module for uncertainty capture and a multi-source fusion method to integrate channel and time dimensions.

Result: TEFN matches state-of-the-art performance with lower complexity, faster training, high robustness, and interpretability.

Conclusion: TEFN balances accuracy, efficiency, stability, and interpretability, making it ideal for time series forecasting.

Abstract: In practical scenarios, time series forecasting necessitates not only accuracy but also efficiency. Consequently, the exploration of model architectures remains a perennially trending topic in research. To address these challenges, we propose a novel backbone architecture named Time Evidence Fusion Network (TEFN) from the perspective of information fusion. Specifically, we introduce the Basic Probability Assignment (BPA) Module based on evidence theory to capture the uncertainty of multivariate time series data from both channel and time dimensions. Additionally, we develop a novel multi-source information fusion method to effectively integrate the two distinct dimensions from BPA output, leading to improved forecasting accuracy. Lastly, we conduct extensive experiments to demonstrate that TEFN achieves performance comparable to state-of-the-art methods while maintaining significantly lower complexity and reduced training time. Also, our experiments show that TEFN exhibits high robustness, with minimal error fluctuations during hyperparameter selection. Furthermore, due to the fact that BPA is derived from fuzzy theory, TEFN offers a high degree of interpretability. Therefore, the proposed TEFN balances accuracy, efficiency, stability, and interpretability, making it a desirable solution for time series forecasting.

[467] One Model, Any Conjunctive Query: Graph Neural Networks for Answering Queries over Incomplete Knowledge Graphs

Krzysztof Olejniczak, Xingyue Huang, Mikhail Galkin, İsmail İlkan Ceylan

Main category: cs.LG

TL;DR: AnyCQ is a model for classifying and retrieving answers to conjunctive queries on incomplete knowledge graphs, using a GNN trained with reinforcement learning. It generalizes well and transfers to new graphs.

DetailsMotivation: Address the incompleteness of modern knowledge graphs by predicting answers not explicitly present in the graph but in its completion.

Method: Proposes AnyCQ, a graph neural network trained with reinforcement learning to classify and retrieve answers to Boolean queries.

Result: AnyCQ generalizes to large, complex queries and transfers to novel knowledge graphs, outperforming existing methods.

Conclusion: AnyCQ is effective for querying incomplete knowledge graphs and shows promise for broader applications.

Abstract: Motivated by the incompleteness of modern knowledge graphs, a new setup for query answering has emerged, where the goal is to predict answers that do not necessarily appear in the knowledge graph, but are present in its completion. In this paper, we formally introduce and study two query answering problems, namely, query answer classification and query answer retrieval. To solve these problems, we propose AnyCQ, a model that can classify answers to any conjunctive query on any knowledge graph. At the core of our framework lies a graph neural network trained using a reinforcement learning objective to answer Boolean queries. Trained only on simple, small instances, AnyCQ generalizes to large queries of arbitrary structure, reliably classifying and retrieving answers to queries that existing approaches fail to handle. This is empirically validated through our newly proposed, challenging benchmarks. Finally, we empirically show that AnyCQ can effectively transfer to completely novel knowledge graphs when equipped with an appropriate link prediction model, highlighting its potential for querying incomplete data.

[468] Beyond Adapter Retrieval: Latent Geometry-Preserving Composition via Sparse Task Projection

Pengfei Jin, Peng Shu, Sifan Song, Sekeun Kim, Qing Xiao, Cheng Chen, Tianming Liu, Xiang Li, Quanzheng Li

Main category: cs.LG

TL;DR: A new framework for adapter reuse in transfer learning uses geometry-aware sparse reconstruction to improve task-specific adapter composition.

DetailsMotivation: Existing methods for adapter reuse rely on simple heuristics or uniform averaging, ignoring latent task relationships in representation space.

Method: Formulates adapter composition as a geometry-aware sparse reconstruction problem, using latent prototype vectors and ℓ₁-regularized optimization to blend LoRA adapters.

Result: Demonstrates effectiveness in medical image segmentation, report generation, and image synthesis, showing improved zero-shot generalization.

Conclusion: Coupling retrieval with latent geometry-aware optimization enhances adapter reuse and interpretability.

Abstract: Recent advances in parameter-efficient transfer learning have demonstrated the utility of composing LoRA adapters from libraries of pretrained modules. However, most existing approaches rely on simple retrieval heuristics or uniform averaging, which overlook the latent structure of task relationships in representation space. We propose a new framework for adapter reuse that moves beyond retrieval, formulating adapter composition as a geometry-aware sparse reconstruction problem. Specifically, we represent each task by a latent prototype vector derived from the base model’s encoder and aim to approximate the target task prototype as a sparse linear combination of retrieved reference prototypes, under an $\ell_1$-regularized optimization objective. The resulting combination weights are then used to blend the corresponding LoRA adapters, yielding a composite adapter tailored to the target task. This formulation not only preserves the local geometric structure of the task representation manifold, but also promotes interpretability and efficient reuse by selecting a minimal set of relevant adapters. We demonstrate the effectiveness of our approach across multiple domains, including medical image segmentation, medical report generation, and image synthesis. Our results highlight the benefit of coupling retrieval with latent geometry-aware optimization for improved zero-shot generalization.
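
The composition step reduces to a standard ℓ₁-regularized regression, which can be sketched with scikit-learn's Lasso. The positivity constraint, the alpha value, and the adapter data layout are our assumptions.

```python
# Hedged sketch: approximate a target task prototype as a sparse combination
# of reference prototypes, then blend LoRA adapters with the same weights.
import numpy as np
from sklearn.linear_model import Lasso

def compose_adapter(target_proto, ref_protos, ref_adapters, alpha=0.05):
    """target_proto: (d,); ref_protos: (k, d); ref_adapters: list of k dicts
    mapping parameter names to LoRA delta tensors (np.ndarray)."""
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False)
    lasso.fit(ref_protos.T, target_proto)   # target ≈ sum_i w_i * proto_i
    w = lasso.coef_                          # sparse combination weights
    blended = {}
    for name in ref_adapters[0]:
        blended[name] = sum(w_i * a[name] for w_i, a in zip(w, ref_adapters) if w_i > 0)
    return blended, w
```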

[469] Multi-task neural networks by learned contextual inputs

Anders T. Sandnes, Bjarne Grimstad, Odd Kolbjørnsen

Main category: cs.LG

TL;DR: A multi-task learning architecture using learned-context neural networks with trainable task parameters, showing universal approximation capability and robustness.

DetailsMotivation: To develop a neural network architecture that efficiently adapts to multiple tasks with minimal task parameter space and handles sparse task data well.

Method: Uses a fully shared neural network with augmented input vectors containing trainable task parameters, focusing on low-dimensional task spaces.

Result: Demonstrates competitive performance on ten datasets, with a well-behaved task parameter space and robustness to sparse task data.

Conclusion: The architecture is effective for multi-task learning, offering simplicity in updates and adaptability with minimal task parameters.

Abstract: This paper explores learned-context neural networks. It is a multi-task learning architecture based on a fully shared neural network and an augmented input vector containing trainable task parameters. The architecture is interesting due to its powerful task adaptation mechanism, which facilitates a low-dimensional task parameter space. Theoretically, we show that a scalar task parameter is sufficient for universal approximation of all tasks, which is not necessarily the case for more common architectures. Empirically it is shown that, for homogeneous tasks, the dimension of the task parameter may vary with the complexity of the tasks, but a small task parameter space is generally viable. The task parameter space is found to be well-behaved, which simplifies workflows related to updating models as new data arrives and learning new tasks while the shared parameters are frozen. Additionally, the architecture displays robustness towards datasets where tasks have few data points. The architecture’s performance is compared to similar neural network architectures on ten datasets, with competitive results.
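
A minimal PyTorch sketch of a learned-context network: a single shared MLP whose input is concatenated with a trainable per-task vector. The dimensions and the two-dimensional task space are our choices; the paper shows a scalar task parameter already suffices in theory.

```python
# Hedged sketch: one shared network, with the input augmented by a trainable
# task-parameter vector looked up per sample.
import torch
import torch.nn as nn

class LearnedContextNet(nn.Module):
    def __init__(self, in_dim, n_tasks, task_dim=2, hidden=64):
        super().__init__()
        self.task_params = nn.Embedding(n_tasks, task_dim)  # trainable task context
        self.shared = nn.Sequential(
            nn.Linear(in_dim + task_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x, task_ids):
        ctx = self.task_params(task_ids)          # each sample's task vector
        return self.shared(torch.cat([x, ctx], dim=-1))

net = LearnedContextNet(in_dim=5, n_tasks=10)
out = net(torch.randn(32, 5), torch.randint(0, 10, (32,)))
print(out.shape)  # torch.Size([32, 1])
```

Learning a new task then amounts to fitting only its context vector with the shared weights frozen, which is what makes the low-dimensional task space attractive.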

[470] DRL-ORA: Distributional Reinforcement Learning with Online Risk Adaption

Yupeng Wu, Wenyun Li, Wenjie Huang, Chin Pang Ho

Main category: cs.LG

TL;DR: Proposes DRL-ORA, a framework for dynamically adjusting epistemic risk in RL, unifying uncertainty quantification and risk adaptation for better efficiency and explainability.

DetailsMotivation: Addresses the challenge of decision-making in RL under incomplete environmental knowledge, aiming for reliable policies in safety-critical settings.

Method: Quantifies epistemic and aleatory uncertainties, adjusts risk levels via total variation minimization, and selects levels efficiently with a grid search algorithm.

Result: Outperforms fixed or manually adapted risk-level methods in various tasks.

Conclusion: DRL-ORA offers a flexible, explainable, and efficient solution for risk adaptation in RL.

Abstract: One of the main challenges in reinforcement learning (RL) is that the agent has to make decisions that would influence the future performance without having complete knowledge of the environment. Dynamically adjusting the level of epistemic risk during the learning process can help to achieve reliable policies in safety-critical settings with better efficiency. In this work, we propose a new framework, Distributional RL with Online Risk Adaption (DRL-ORA). This framework quantifies both epistemic and implicit aleatory uncertainties in a unified manner and dynamically adjusts the epistemic risk levels by solving a total variation minimization problem online. The framework unifies the existing variants of risk adaption approaches and offers better explainability and flexibility. The selection of risk levels is performed efficiently via a grid search using a Follow-The-Leader-type algorithm, where the offline oracle also corresponds to a “satisficing measure” under a specially modified loss function. We show that DRL-ORA outperforms existing methods that rely on fixed risk levels or manually designed risk level adaptation in multiple classes of tasks.

[471] NACHOS: Neural Architecture Search for Hardware Constrained Early Exit Neural Networks

Matteo Gambella, Jary Pomponi, Simone Scardapane, Manuel Roveri

Main category: cs.LG

TL;DR: NACHOS is a NAS framework for designing optimal Early Exit Neural Networks (EENNs) under hardware constraints, jointly optimizing backbone and EECs for accuracy and MAC operations.

DetailsMotivation: Manual design of EENNs is complex and time-consuming, necessitating automated solutions like NAS to optimize placement, thresholding, and computational overhead of EECs.

Method: Proposes NACHOS, a NAS framework for joint design of backbone and EECs, ensuring Pareto optimality in accuracy and MAC operations. Includes novel regularization terms for EEC optimization.

Result: NACHOS-designed models match state-of-the-art EENNs, demonstrating effectiveness in balancing accuracy and computational efficiency.

Conclusion: NACHOS successfully automates EENN design, offering competitive performance and addressing the open problem of joint backbone-EEC optimization.

Abstract: Early Exit Neural Networks (EENNs) endow a standard Deep Neural Network (DNN) with Early Exit Classifiers (EECs) to provide predictions at intermediate points of the processing when enough confidence in classification is achieved. This leads to many benefits in terms of effectiveness and efficiency. Currently, the design of EENNs is carried out manually by experts, a complex and time-consuming task that requires accounting for many aspects, including the correct placement, the thresholding, and the computational overhead of the EECs. For this reason, research is exploring the use of Neural Architecture Search (NAS) to automate the design of EENNs. Currently, few comprehensive NAS solutions for EENNs have been proposed in the literature, and a fully automated, joint design strategy taking into consideration both the backbone and the EECs remains an open problem. To this end, this work presents Neural Architecture Search for Hardware Constrained Early Exit Neural Networks (NACHOS), the first NAS framework for the design of optimal EENNs satisfying constraints on the accuracy and the number of Multiply and Accumulate (MAC) operations performed by the EENNs at inference time. In particular, this provides the joint design of backbone and EECs to select a set of admissible (i.e., respecting the constraints) Pareto Optimal Solutions in terms of the best tradeoff between accuracy and the number of MACs. The results show that the models designed by NACHOS are competitive with state-of-the-art EENNs. Additionally, this work investigates the effectiveness of two novel regularization terms designed for the optimization of the auxiliary classifiers of the EENN.

[472] Deep Exploration with PAC-Bayes

Bahareh Tasdighi, Manuel Haussmann, Nicklas Werge, Yi-Shan Wu, Melih Kandemir

Main category: cs.LG

TL;DR: The paper introduces PBAC, a PAC-Bayesian actor-critic algorithm for continuous control tasks with delayed rewards, outperforming existing methods.

DetailsMotivation: Addressing the under-explored problem of delayed rewards in RL for continuous control, especially for complex skills requiring intermediate prerequisites.

Method: Uses a PAC-Bayesian bound to quantify Bellman operator error, employing a bootstrapped critic ensemble and epsilon-soft exploration via randomly chosen actor heads.

Result: PBAC is the only algorithm to consistently discover delayed rewards in continuous control tasks of varying difficulty.

Conclusion: PBAC effectively solves deep exploration in continuous control, demonstrating superior performance in delayed reward scenarios.

Abstract: Reinforcement learning (RL) for continuous control under delayed rewards is an under-explored problem despite its significance in real-world applications. Many complex skills are based on intermediate ones as prerequisites. For instance, a humanoid locomotor must learn how to stand before it can learn to walk. To cope with delayed reward, an agent must perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and their generalization to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do this, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution, and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk and critic-specific heads. The agent performs deep exploration by acting epsilon-softly on a randomly chosen actor head. Our proposed algorithm, named PAC-Bayesian Actor-Critic (PBAC), is the only algorithm to consistently discover delayed rewards on continuous control tasks with varying difficulty.
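
The exploration mechanic is easy to sketch: a shared trunk with several actor heads, one head drawn at random and acted on epsilon-softly. The layer sizes, the Gaussian perturbation, and the epsilon below are our assumptions; the PAC-Bayes critic objective is not shown.

```python
# Hedged sketch: multi-head actor with epsilon-soft action on a random head.
import torch
import torch.nn as nn

class MultiHeadActor(nn.Module):
    def __init__(self, obs_dim, act_dim, n_heads=5, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, act_dim) for _ in range(n_heads))

    def act(self, obs, head_idx, epsilon=0.1):
        mean = self.heads[head_idx](self.trunk(obs))
        if torch.rand(()) < epsilon:          # epsilon-soft: occasional perturbation
            return mean + torch.randn_like(mean)
        return mean

actor = MultiHeadActor(obs_dim=8, act_dim=2)
head = torch.randint(0, 5, ()).item()         # pick a head, e.g. once per episode
print(actor.act(torch.randn(8), head).shape)  # torch.Size([2])
```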

[473] Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications

Yang Li, Daniel Agyei Asante, Changsheng Zhao, Ernie Chang, Yangyang Shi, Vikas Chandra

Main category: cs.LG

TL;DR: A low-rank decomposition method is introduced to compress LLMs by removing redundant components, maintaining accuracy while reducing model size.

DetailsMotivation: LLMs are computationally intensive and energy-demanding, making deployment on resource-limited devices challenging.

Method: Uses low-rank decomposition to identify and remove redundant parts of LLMs, retaining only necessary components for target applications.

Result: Significantly reduces model size (tested on Llama 2-7b and -13B) while maintaining accuracy for tasks like mathematical reasoning and code generation.

Conclusion: The method effectively compresses LLMs for specific applications, balancing efficiency and performance.

Abstract: Large language models (LLMs) significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes it challenging to deploy them on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach to effectively compress these models, tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components not needed for particular applications. Our method focuses on identifying and removing these redundant parts, retaining only the necessary elements for the target applications. Specifically, we represent the weight matrices of LLMs as a linear combination of base components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression results on the Llama 2-7b and -13B models, conducted on target applications including mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining comparable accuracy to state-of-the-art low-rank compression techniques.
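
The generic primitive underlying such methods is a truncated SVD of each weight matrix, sketched below. The paper's application-aware pruning of bases and addition of new task-specific bases are not shown here.

```python
# Hedged sketch: view a weight matrix as a sum of rank-1 bases via SVD and
# keep only the leading ones (the best rank-r approximation in Frobenius norm).
import numpy as np

def truncate_weight(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

W = np.random.randn(512, 512)
W8 = truncate_weight(W, rank=8)
print(np.linalg.norm(W - W8) / np.linalg.norm(W))  # relative compression error
```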

[474] A Survey of Controllable Learning: Methods and Applications in Information Retrieval

Chenglei Shen, Xiao Zhang, Teng Shi, Changshuo Zhang, Guofu Xie, Jun Xu

Main category: cs.LG

TL;DR: The paper defines controllable learning (CL) and explores its applications in information retrieval, categorizing it by controllability aspects, implementation methods, and challenges.

DetailsMotivation: Controllability is key for trustworthy machine learning, enabling dynamic adaptation without retraining, especially in complex, dynamic fields like information retrieval.

Method: The survey categorizes CL by controllability (objectives, user portrait, scenario), control agents (users/platforms), implementation methods (rule-based, Pareto optimization, hypernetwork), and control stages (pre-, in-, post-processing).

Result: Identifies challenges in training, evaluation, task setting, and deployment, and outlines future directions like theoretical analysis, efficient computation, and large language model integration.

Conclusion: CL is vital for adaptive ML, with promising future directions in theory, computation, and applications, though challenges remain in deployment and evaluation.

Abstract: Controllability has become a crucial aspect of trustworthy machine learning, enabling learners to meet predefined targets and adapt dynamically at test time without requiring retraining as the targets shift. We provide a formal definition of controllable learning (CL), and discuss its applications in information retrieval (IR) where information needs are often complex and dynamic. The survey categorizes CL according to what is controllable (e.g., multiple objectives, user portrait, scenario adaptation), who controls (users or platforms), how control is implemented (e.g., rule-based method, Pareto optimization, hypernetwork and others), and where to implement control (e.g., pre-processing, in-processing, post-processing methods). Then, we identify challenges faced by CL across training, evaluation, task setting, and deployment in online environments. Additionally, we outline promising directions for CL in theoretical analysis, efficient computation, empowering large language models, application scenarios and evaluation frameworks.

[475] Automatically Interpreting Millions of Features in Large Language Models

Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose

Main category: cs.LG

TL;DR: The paper introduces an automated pipeline for generating and evaluating natural language explanations of sparse autoencoder (SAE) features using LLMs, improving interpretability and introducing new scoring techniques.

DetailsMotivation: Deep neural network activations lack human-understandable interpretations, and manually interpreting millions of SAE features is impractical.

Method: An open-source pipeline generates explanations for SAE features using LLMs, tested on various SAEs. Five new scoring techniques, including intervention scoring, are introduced.

Result: SAE latents are more interpretable than neurons, even with sparsification. Independently trained SAEs on nearby layers show high similarity.

Conclusion: The framework enhances interpretability of SAEs, provides guidelines for better explanations, and identifies pitfalls in existing scoring methods.

Abstract: While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a feature, which we find explains features that are not recalled by existing methods. We propose guidelines for generating better explanations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. We use our explanations to measure the semantic similarity of independently trained SAEs, and find that SAEs trained on nearby layers of the residual stream are highly similar. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons, even when neurons are sparsified using top-$k$ postprocessing. Our code is available at https://github.com/EleutherAI/sae-auto-interp, and our explanations are available at https://huggingface.co/datasets/EleutherAI/auto_interp_explanations.

[476] Random Erasing vs. Model Inversion: A Promising Defense or a False Hope?

Viet-Hung Tran, Ngoc-Bao Nguyen, Son T. Mai, Hans Vandierendonck, Ira Assent, Alex Kot, Ngai-Man Cheung

Main category: cs.LG

TL;DR: Random Erasing (RE) is shown to effectively defend against Model Inversion (MI) attacks by disrupting feature alignment between reconstructed and private data, while maintaining model utility.

DetailsMotivation: Existing defenses focus on model-centric approaches, leaving the role of data in MI robustness unexplored. This work investigates RE's potential as a defense.

Method: Feature space analysis of models trained with RE-images reveals discrepancies in MI-reconstructed features. Properties like Partial Erasure and Random Location are explored.

Result: RE degrades MI reconstruction quality and attack accuracy without compromising natural accuracy, achieving SOTA privacy-utility trade-off in 37 setups.

Conclusion: RE is a simple, effective defense against MI attacks, easily integrable with existing privacy techniques, and outperforms current methods.

Abstract: Model Inversion (MI) attacks pose a significant privacy threat by reconstructing private training data from machine learning models. While existing defenses primarily concentrate on model-centric approaches, the impact of data on MI robustness remains largely unexplored. In this work, we explore Random Erasing (RE), a technique traditionally used for improving model generalization under occlusion, and uncover its surprising effectiveness as a defense against MI attacks. Specifically, our novel feature space analysis shows that models trained with RE-images introduce a significant discrepancy between the features of MI-reconstructed images and those of the private data. At the same time, features of private images remain distinct from other classes and well-separated from different classification regions. These effects collectively degrade MI reconstruction quality and attack accuracy while maintaining reasonable natural accuracy. Furthermore, we explore two critical properties of RE including Partial Erasure and Random Location. Partial Erasure prevents the model from observing entire objects during training. We find this has a significant impact on MI, which aims to reconstruct the entire objects. Random Location of erasure plays a crucial role in achieving a strong privacy-utility trade-off. Our findings highlight RE as a simple yet effective defense mechanism that can be easily integrated with existing privacy-preserving techniques. Extensive experiments across 37 setups demonstrate that our method achieves state-of-the-art (SOTA) performance in the privacy-utility trade-off. The results consistently demonstrate the superiority of our defense over existing methods across different MI attacks, network architectures, and attack configurations. For the first time, we achieve a significant degradation in attack accuracy without a decrease in utility for some configurations.
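
Since the defense is plain Random Erasing applied at training time, a usage sketch is a single transform call via torchvision. The hyperparameters below are the common occlusion-augmentation defaults, not necessarily the paper's settings.

```python
# Minimal usage sketch of Random Erasing on a tensor image.
import torch
from torchvision import transforms

# p=1.0 forces an erasure so the effect is visible; scale/ratio are the
# usual Random Erasing defaults.
erase = transforms.RandomErasing(p=1.0, scale=(0.02, 0.33), ratio=(0.3, 3.3), value=0)
img = torch.rand(3, 224, 224)
erased = erase(img)
print((erased != img).float().mean())  # rough fraction of pixels erased
```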

[477] Tool Unlearning for Tool-Augmented LLMs

Jiali Cheng, Hadi Amiri

Main category: cs.LG

TL;DR: ToolDelete is introduced as the first method for unlearning tools from tool-augmented LLMs, addressing unique challenges like knowledge removal and high optimization costs, with effective results.

DetailsMotivation: The need for tool unlearning arises from security, privacy, and deprecation concerns, yet it remains unexplored in unlearning literature.

Method: ToolDelete is proposed, featuring three key properties for effective tool unlearning and a new MIA model for evaluation.

Result: ToolDelete successfully unlearns tools while preserving non-deleted tool knowledge and general task performance.

Conclusion: ToolDelete bridges gaps in tool unlearning, offering a practical solution with validated effectiveness.

Abstract: Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs, which embed the ability to use tools or APIs directly into the parametric knowledge of LLMs. Tool-augmented LLMs need the ability to forget learned tools due to security vulnerabilities, privacy regulations, or tool deprecations. However, “tool unlearning” has not been investigated in unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs. It implements three key properties to address the above challenges for effective tool unlearning and introduces a new membership inference attack (MIA) model for effective evaluation. Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns randomly selected tools, while preserving the LLM’s knowledge on non-deleted tools and maintaining performance on general tasks.

[478] PAK-UCB Contextual Bandit: An Online Learning Approach to Prompt-Aware Selection of Generative Models and LLMs

Xiaoyan Hu, Ho-fung Leung, Farzan Farnia

Main category: cs.LG

TL;DR: The paper proposes an online learning framework (PAK-UCB) to dynamically select the best generative model for different text prompts, improving efficiency by avoiding sub-optimal model queries.

DetailsMotivation: Current score-based selection methods ignore varying model performance across prompt types, leading to inefficiencies.

Method: Uses a contextual bandit (CB) setting with shared context variables and kernel-based functions, accelerated by random Fourier features (RFF).

Result: PAK-UCB successfully identifies the best generative model for diverse prompts in real and simulated experiments.

Conclusion: The framework efficiently reduces costs by dynamically selecting optimal models for different prompts.

Abstract: Selecting a sample generation scheme from multiple prompt-based generative models, including large language models (LLMs) and prompt-guided image and video generation models, is typically addressed by choosing the model that maximizes an averaged evaluation score. However, this score-based selection overlooks the possibility that different models achieve the best generation performance for different types of text prompts. An online identification of the best generation model for various input prompts can reduce the costs associated with querying sub-optimal models. In this work, we explore the possibility of varying rankings of text-based generative models for different text prompts and propose an online learning framework to predict the best data generation model for a given input prompt. The proposed PAK-UCB algorithm addresses a contextual bandit (CB) setting with shared context variables across the arms, utilizing the generated data to update kernel-based functions that predict the score of each model available for unseen text prompts. Additionally, we leverage random Fourier features (RFF) to accelerate the online learning process of PAK-UCB. Our numerical experiments on real and simulated text-to-image and image-to-text generative models show that RFF-UCB performs successfully in identifying the best generation model across different sample types. The code is available at: github.com/yannxiaoyanhu/dgm-online-select.
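
A compact sketch of the two ingredients named in the abstract: random Fourier features approximating an RBF kernel, plugged into a per-arm UCB rule. The ridge-regression form, the kernel bandwidth, and the exploration constant are our assumptions rather than the paper's exact estimator.

```python
# Hedged sketch: RFF-approximated kernel regression with a UCB bonus per arm.
import numpy as np

rng = np.random.default_rng(0)
d, D = 16, 128                       # prompt-embedding dim, number of RFFs
W = rng.normal(size=(D, d))          # frequencies for a unit-bandwidth RBF kernel
b = rng.uniform(0, 2 * np.pi, size=D)

def rff(x):
    # phi(x) such that phi(x) . phi(y) ~= exp(-||x - y||^2 / 2)
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

class ArmModel:
    """Ridge regression on RFFs with a UCB exploration bonus, one per arm."""
    def __init__(self, lam=1.0, beta=1.0):
        self.A = lam * np.eye(D)     # regularized Gram matrix
        self.y = np.zeros(D)
        self.beta = beta

    def ucb(self, x):
        phi = rff(x)
        A_inv = np.linalg.inv(self.A)
        mean = phi @ (A_inv @ self.y)
        bonus = self.beta * np.sqrt(phi @ A_inv @ phi)
        return mean + bonus

    def update(self, x, reward):
        phi = rff(x)
        self.A += np.outer(phi, phi)
        self.y += reward * phi

arms = [ArmModel() for _ in range(3)]           # three candidate generative models
x = rng.normal(size=d)                          # context: the embedded prompt
choice = int(np.argmax([a.ucb(x) for a in arms]))
arms[choice].update(x, reward=1.0)              # observed evaluation score
```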

[479] Foundation Model of Electronic Medical Records for Adaptive Risk Estimation

Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek

Main category: cs.LG

TL;DR: ETHOS, an AI model, powers ARES for dynamic, personalized risk prediction in healthcare, outperforming traditional systems with superior accuracy and explainability.

DetailsMotivation: Hospitals face challenges in predicting critical outcomes due to limitations of static early warning systems like NEWS and MEWS. ETHOS and ARES aim to provide adaptable, accurate, and personalized solutions.

Method: ETHOS tokenizes patient health timelines (PHTs) from EHRs using transformer-based architectures. ARES leverages ETHOS for dynamic risk estimation and includes an explainability module. Evaluated on MIMIC-IV v2.2 dataset with 285,622 PHTs.

Result: ETHOS outperformed benchmarks in predicting admissions and prolonged stays, with robust performance across demographics. Explainability module provided patient-specific insights.

Conclusion: ARES, powered by ETHOS, offers advanced, real-time risk prediction with explainability. Future work will focus on clinical validation and real-world utility. Source code is released for research.

Abstract: Hospitals struggle to predict critical outcomes. Traditional early warning systems, like NEWS and MEWS, rely on static variables and fixed thresholds, limiting their adaptability, accuracy, and personalization. We previously developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), an AI model that tokenizes patient health timelines (PHTs) from EHRs and uses transformer-based architectures to predict future PHTs. ETHOS is a versatile framework for developing a wide range of applications. In this work, we develop the Adaptive Risk Estimation System (ARES) that leverages ETHOS to compute dynamic, personalized risk probabilities for clinician-defined critical events. ARES also features a personalized explainability module that highlights key clinical factors influencing risk estimates. We evaluated ARES using the MIMIC-IV v2.2 dataset together with its Emergency Department (ED) extension and benchmarked performance against both classical early warning systems and contemporary machine learning models. The entire dataset was tokenized resulting in 285,622 PHTs, comprising over 360 million tokens. ETHOS outperformed benchmark models in predicting hospital admissions, ICU admissions, and prolonged stays, achieving superior AUC scores. Its risk estimates were robust across demographic subgroups, with calibration curves confirming model reliability. The explainability module provided valuable insights into patient-specific risk factors. ARES, powered by ETHOS, advances predictive healthcare AI by delivering dynamic, real-time, personalized risk estimation with patient-specific explainability. Although our results are promising, the clinical impact remains uncertain. Demonstrating ARES’s true utility in real-world settings will be the focus of our future work. We release the source code to facilitate future research.

[480] Ultra Memory-Efficient On-FPGA Training of Transformers via Tensor-Compressed Optimization

Jiayi Tian, Jinming Lu, Hai Li, Xiangwei Wang, Cong Hao, Ian Young, Zheng Zhang

Main category: cs.LG

TL;DR: The paper introduces an FPGA accelerator for transformer training using low-rank tensor compression, reducing memory and energy costs compared to GPU training.

DetailsMotivation: Addressing the challenge of training transformers on resource-constrained edge devices due to computational and memory demands.

Method: Proposes a bi-directional contraction flow for tensorized training and an on-chip-memory-only framework with custom kernels and parallelism.

Result: Achieves 30× to 51× memory reduction and up to 3.6× less energy cost per epoch compared to GPU training.

Conclusion: The FPGA accelerator enables efficient transformer training on edge devices with significant resource savings.

Abstract: Transformer models have achieved state-of-the-art performance across a wide range of machine learning tasks. There is growing interest in training transformers on resource-constrained edge devices due to considerations such as privacy, domain adaptation, and on-device scientific machine learning. However, the significant computational and memory demands required for transformer training often exceed the capabilities of an edge device. Leveraging low-rank tensor compression, this paper presents the first on-FPGA accelerator for end-to-end transformer training. On the algorithm side, we present a bi-directional contraction flow for tensorized transformer training, significantly reducing the computational FLOPS and intra-layer memory costs compared to existing tensor operations. On the hardware side, we store all highly compressed model parameters and gradient information on chip, creating an on-chip-memory-only framework for each stage in training. This reduces off-chip communication and minimizes latency and energy costs. Additionally, we implement custom computing kernels for each training stage and employ intra-layer parallelism and pipelining to further enhance run-time and memory efficiency. Through experiments on transformer models within $36.7$ to $93.5$ MB using FP-32 data formats on the ATIS dataset, our tensorized FPGA accelerator could conduct single-batch end-to-end training on the AMD Alveo U50 FPGA, with a memory budget of less than $6$-MB BRAM and $22.5$-MB URAM. Compared to uncompressed training on the NVIDIA RTX 3090 GPU, our on-FPGA training achieves a memory reduction of $30\times$ to $51\times$. Our FPGA accelerator also achieves up to $3.6\times$ less energy cost per epoch compared with tensor Transformer training on an NVIDIA RTX 3090 GPU.

[481] Dual-Label Learning With Irregularly Present Labels

Mingqian Li, Qiao Han, Ruifeng Li, Yao Yang, Hongyang Chen

Main category: cs.LG

TL;DR: The paper introduces Dual-Label Learning (DLL), a framework for handling irregularly missing labels in multi-task learning, improving prediction accuracy and robustness even at high missing rates.

DetailsMotivation: Address the challenge of irregularly missing labels in multi-task learning due to experimental limitations, requiring a method to maximize label utility.

Method: Proposes DLL with a dual-tower architecture for explicit label correlation modeling, imputing missing labels during training and jointly predicting them during inference.

Result: DLL outperforms baselines by up to 9.6% in F1-score or 10.2% in MAPE reduction, maintaining robustness at up to 60% missing rates.

Conclusion: DLL effectively handles irregular label presence, leveraging label correlations for superior performance, even with high missing rates.

Abstract: In multi-task learning, labels are often missing irregularly across samples, which can be fully labeled, partially labeled or unlabeled. The irregular label presence often appears in scientific studies due to experimental limitations. It triggers a demand for a new training and inference mechanism that could accommodate irregularly present labels and maximize their utility. This work focuses on the two-label learning task and proposes a novel training and inference framework, Dual-Label Learning (DLL). The DLL framework formulates the problem into a dual-function system, in which the two functions should simultaneously satisfy standard supervision, structural duality and probabilistic duality. DLL features a dual-tower model architecture that allows for explicit information exchange between labels, aimed at maximizing the utility of partially available labels. During training, missing labels are imputed as part of the forward propagation process, while during inference, labels are predicted jointly as unknowns of a bivariate system of equations. Our theoretical analysis guarantees the feasibility of DLL, and extensive experiments are conducted to verify that by explicitly modeling label correlation and maximizing label utility, our method makes consistently better prediction than baseline approaches by up to 9.6% gain in F1-score or 10.2% reduction in MAPE. Remarkably, DLL maintains robust performance at a label missing rate of up to 60%, achieving even better results than baseline approaches at lower missing rates down to only 10%.

[482] PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang

Main category: cs.LG

TL;DR: PTQ1.61 enables 1.61-bit weight quantization with minimal overhead, outperforming existing methods in extremely low-bit PTQ.

DetailsMotivation: Address performance degradation in LLMs under sub 2-bit quantization by introducing a structured mask and preprocessing.

Method: Uses a one-dimensional structured mask and block-wise scaling for binarization, plus quantization preprocessing.

Result: Achieves state-of-the-art performance in extremely low-bit quantization.

Conclusion: PTQ1.61 sets a new benchmark for low-bit PTQ with efficient preprocessing and structured masking.

Abstract: Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods utilize a mixed-precision scheme, leveraging an unstructured fine-grained mask to explicitly distinguish salient weights, which introduces an extra 1 bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask with a negligible additional 0.0002 bits per weight, based on input activations, from the perspective of reducing the upper bound of quantization error, to allocate the corresponding salient weight channels to 4-bit. For the binarization of non-salient channels, an efficient block-wise scaling factor optimization framework is then presented to take implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty of per-channel extremely low-bit PTQ. Extensive experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Codes are available at https://github.com/zjq0455/PTQ1.61.
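
The binarization primitive with block-wise scaling can be sketched in a few lines: each block is replaced by alpha * sign(w), with the L2-optimal scale alpha = mean|w|. The block size is our choice, and the structured salient-channel mask and preprocessing step are not shown.

```python
# Hedged sketch: per-block binarization with an L2-optimal scaling factor.
import numpy as np

def binarize_blockwise(w, block=64):
    """w: 1-D weight row; returns the quantized row and per-block scales."""
    w = w.copy()
    scales = []
    for start in range(0, len(w), block):
        blk = w[start:start + block]
        alpha = np.abs(blk).mean()       # alpha = mean|w| minimizes ||w - a*sign(w)||^2
        w[start:start + block] = alpha * np.sign(blk)
        scales.append(alpha)
    return w, np.array(scales)

row = np.random.randn(256).astype(np.float32)
q, scales = binarize_blockwise(row)
print(np.linalg.norm(row - q) / np.linalg.norm(row))  # relative quantization error
```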

[483] SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He

Main category: cs.LG

TL;DR: The paper investigates zero RL training across 10 diverse base models, achieving improvements in reasoning accuracy and response length, while noting distinct training patterns and the emergence of cognitive behaviors like the ‘aha moment.’

DetailsMotivation: To explore zero RL training beyond the Qwen2.5 model series, as existing efforts may not be representative due to the base models' inherent abilities.

Method: Leveraging key design strategies like adjusting format reward and controlling query difficulty, the study trains diverse models (e.g., LLama3-8B, Mistral-7B/24B).

Result: Substantial improvements in reasoning accuracy and response length, with distinct training patterns observed, including the ‘aha moment’ in small non-Qwen models.

Conclusion: Successful zero RL training is possible across diverse models, with open-sourced code, models, and tools to aid further research.

Abstract: DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models-a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies-such as adjusting format reward and controlling query difficulty-we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the “aha moment”). Notably, we observe the “aha moment” for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.

[484] Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens

Usman Anwar, Johannes Von Oswald, Louis Kirsch, David Krueger, Spencer Frei

Main category: cs.LG

TL;DR: The paper explores adversarial robustness in transformers for in-context learning of linear models, showing vulnerability to hijacking attacks but improved robustness via adversarial training. It also compares adversarial vulnerabilities across models, revealing poor attack transferability between transformers and classical algorithms.

DetailsMotivation: To understand the adversarial robustness of in-context learning in transformers and compare vulnerabilities across different models and algorithms.

Method: Investigates hijacking attacks on transformers, tests adversarial training, and compares attack transferability between transformers and classical linear model algorithms.

Result: Transformers are vulnerable to hijacking attacks but robustness improves with adversarial training. Attack transferability is poor between transformers and classical algorithms.

Conclusion: Transformers may implement distinct in-context learning algorithms compared to classical methods, and adversarial training enhances robustness.

Abstract: In this work, we make two contributions towards understanding in-context learning of linear models by transformers. First, we investigate the adversarial robustness of in-context learning in transformers to hijacking attacks – a type of adversarial attack in which the adversary’s goal is to manipulate the prompt to force the transformer to generate a specific output. We show that both linear transformers and transformers with GPT-2 architectures are vulnerable to such hijacking attacks. However, adversarial robustness to such attacks can be significantly improved through adversarial training – done either at the pretraining or finetuning stage – and can generalize to stronger attack models. Our second main contribution is a comparative analysis of adversarial vulnerabilities across transformer models and other algorithms for learning linear models. This reveals two novel findings. First, adversarial attacks transfer poorly between larger transformer models trained from different seeds despite achieving similar in-distribution performance. This suggests that transformers of the same architecture trained according to the same recipe may implement different in-context learning algorithms for the same task. Second, we observe that attacks do not transfer well between classical learning algorithms for linear models (single-step gradient descent and ordinary least squares) and transformers. This suggests that there could be qualitative differences between the in-context learning algorithms that transformers implement and these traditional algorithms.

[485] UltraSTF: Ultra-Compact Model for Large-Scale Spatio-Temporal Forecasting

Chin-Chia Michael Yeh, Xiran Fan, Zhimeng Jiang, Yujie Fan, Huiyuan Chen, Uday Singh Saini, Vivian Lai, Xin Dai, Junpeng Wang, Zhongfang Zhuang, Liang Wang, Yan Zheng

Main category: cs.LG

TL;DR: UltraSTF improves spatio-temporal forecasting by combining cross-period dynamics with an ultra-compact shape bank, outperforming SparseTSF and other methods with minimal parameters.

DetailsMotivation: SparseTSF underperforms on spatio-temporal data due to poor intra-period dependency capture, prompting the need for a more efficient model.

Method: UltraSTF integrates cross-period forecasting with a shape bank component using attention to capture intra-period patterns.

Result: UltraSTF achieves state-of-the-art performance on LargeST with <0.2% of parameters compared to other methods.

Conclusion: UltraSTF extends the Pareto frontier in spatio-temporal forecasting by balancing model size and performance.

Abstract: Spatio-temporal data, prevalent in real-world applications such as traffic monitoring, financial transactions, and ride-share demands, represents a specialized case of multivariate time series characterized by high dimensionality. This high dimensionality necessitates computationally efficient models and benefits from applying univariate forecasting approaches through channel-independent strategies. SparseTSF, a recently proposed competitive univariate forecasting model, leverages periodicity to achieve compactness by focusing on cross-period dynamics, extending the Pareto frontier in terms of model size and predictive performance. However, it underperforms on spatio-temporal data due to limited capture of intra-period temporal dependencies. To address this limitation, we propose UltraSTF, which integrates a cross-period forecasting component with an ultra-compact shape bank component. Our model efficiently captures recurring patterns in time series using the attention mechanism of the shape bank component, significantly enhancing its capability to learn intra-period dynamics. UltraSTF achieves state-of-the-art performance on the LargeST benchmark while utilizing fewer than 0.2% of the parameters required by the second-best methods, thereby further extending the Pareto frontier of existing approaches.
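
The abstract does not spell out the shape bank's internals; the following is a hedged sketch of one plausible reading, in which each intra-period patch attends over a small bank of learnable waveform prototypes. `ShapeBankAttention`, `n_shapes`, and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ShapeBankAttention(nn.Module):
    """Each intra-period patch attends over a small learnable bank of shapes."""
    def __init__(self, patch_len, n_shapes=8):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(n_shapes, patch_len))

    def forward(self, patches):                # patches: (B, n_patches, patch_len)
        scores = patches @ self.bank.T / patches.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)   # affinity to each stored shape
        return attn @ self.bank                # reconstruct from shared shapes

sba = ShapeBankAttention(patch_len=24)
print(sba(torch.randn(2, 7, 24)).shape)        # torch.Size([2, 7, 24])
```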

[486] Efficient Unsupervised Domain Adaptation Regression for Spatial-Temporal Sensor Fusion

Keivan Faghih Niresi, Ismail Nejjar, Olga Fink

Main category: cs.LG

TL;DR: A novel unsupervised domain adaptation method for regression tasks, integrating with Spatial-Temporal Graph Neural Networks, improves sensor data quality without labeled target data.

DetailsMotivation: Address challenges like sensor drift, noise, and calibration in distributed sensor networks, which limit reliability in real-world applications.

Method: Proposes a UDA method using perturbed inverse Gram matrices alignment, inspired by Tikhonov regularization, for scalable domain adaptation.

Result: Achieves state-of-the-art performance on air quality monitoring and EEG signal reconstruction datasets.

Conclusion: The method enables robust, transferable sensor fusion models for environmental and physiological monitoring.

Abstract: The growing deployment of low-cost, distributed sensor networks in environmental and biomedical domains has enabled continuous, large-scale health monitoring. However, these systems often face challenges related to degraded data quality caused by sensor drift, noise, and insufficient calibration – factors that limit their reliability in real-world applications. Traditional machine learning methods for sensor fusion and calibration rely on extensive feature engineering and struggle to capture spatial-temporal dependencies or adapt to distribution shifts across varying deployment conditions. To address these challenges, we propose a novel unsupervised domain adaptation (UDA) method tailored for regression tasks. Our proposed method integrates effectively with Spatial-Temporal Graph Neural Networks and leverages the alignment of perturbed inverse Gram matrices between source and target domains, drawing inspiration from Tikhonov regularization. This approach enables scalable and efficient domain adaptation without requiring labeled data in the target domain. We validate our method on real-world datasets from two distinct applications: air quality monitoring and EEG signal reconstruction. Our method achieves state-of-the-art performance, which paves the way for more robust and transferable sensor fusion models in both environmental and physiological contexts. Our code is available at https://github.com/EPFL-IMOS/TikUDA.

[487] Gradient-Based Multi-Objective Deep Learning: Algorithms, Theories, Applications, and Beyond

Weiyu Chen, Baijiong Lin, Xiaoyuan Zhang, Xi Lin, Han Zhao, Qingfu Zhang, James T. Kwok

Main category: cs.LG

TL;DR: A survey of gradient-based techniques for multi-objective deep learning, categorizing methods by their outputs, covering theory, applications, and open challenges.

DetailsMotivation: Addressing the challenges of balancing conflicting objectives in deep learning, such as multi-task learning and fairness-aware learning, by adapting Multi-Objective Optimization (MOO) principles.

Method: Systematic categorization of gradient-based MOO techniques into three types: single balanced solution, finite set of Pareto-optimal solutions, and continuous Pareto set.

Result: Provides a taxonomy, theoretical insights, applications, and practical resources, alongside a GitHub repository of algorithms.

Conclusion: Highlights open challenges and future research directions in multi-objective deep learning, emphasizing the need for efficient and user-preference-aware methods.

Abstract: Many modern deep learning applications require balancing multiple objectives that are often conflicting. Examples include multi-task learning, fairness-aware learning, and the alignment of Large Language Models (LLMs). This leads to multi-objective deep learning, which tries to find optimal trade-offs or Pareto-optimal solutions by adapting mathematical principles from the field of Multi-Objective Optimization (MOO). However, directly applying gradient-based MOO techniques to deep neural networks presents unique challenges, including high computational costs, optimization instability, and the difficulty of effectively incorporating user preferences. This paper provides a comprehensive survey of gradient-based techniques for multi-objective deep learning. We systematically categorize existing algorithms based on their outputs: (i) methods that find a single, well-balanced solution, (ii) methods that generate a finite set of diverse Pareto-optimal solutions, and (iii) methods that learn a continuous Pareto set of solutions. In addition to this taxonomy, the survey covers theoretical analyses, key applications, practical resources, and highlights open challenges and promising directions for future research. A comprehensive list of multi-objective deep learning algorithms is available at https://github.com/Baijiong-Lin/Awesome-Multi-Objective-Deep-Learning.

[488] Thought Anchors: Which LLM Reasoning Steps Matter?

Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy

Main category: cs.LG

TL;DR: The paper introduces three sentence-level attribution methods to analyze reasoning in large language models, identifying “thought anchors” that heavily influence reasoning processes.

DetailsMotivation: To address interpretability challenges in long-form chain-of-thought reasoning by decomposing the computation at the sentence level.

Method: Three attribution methods: (1) black-box counterfactual importance, (2) white-box attention pattern aggregation, and (3) causal attribution via attention suppression.

Result: Identification of “thought anchors”—key reasoning steps (e.g., planning or backtracking sentences) with disproportionate influence.

Conclusion: Sentence-level analysis enhances understanding of reasoning models, supported by consistent findings across methods and an open-source visualization tool.

Abstract: Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence’s counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identified “broadcasting” sentences that receive disproportionate attention from all future sentences via “receiver” attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence’s tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
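
A minimal sketch of the black-box counterfactual measure, assuming a hypothetical `sample_rollout` callable that stands in for querying the model; the toy model below is purely illustrative.

```python
import random
from collections import Counter

def counterfactual_importance(prefix, sentence, alt_sentence, sample_rollout, n=100):
    """Total-variation distance between final-answer distributions when the
    trace keeps `sentence` vs. a different-meaning replacement."""
    keep = Counter(sample_rollout(prefix + sentence) for _ in range(n))
    swap = Counter(sample_rollout(prefix + alt_sentence) for _ in range(n))
    answers = set(keep) | set(swap)
    return 0.5 * sum(abs(keep[a] / n - swap[a] / n) for a in answers)

# toy stand-in for the model: answers "yes" more often if the anchor is kept
toy = lambda p: random.choices(["yes", "no"], [3 if "anchor" in p else 1, 1])[0]
print(counterfactual_importance("premise ", "anchor", "filler", toy))
```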

[489] DBSCAN in domains with periodic boundary conditions

Xander M. de Wit, Alessandro Gabbana

Main category: cs.LG

TL;DR: A method for clustering data with periodic boundary conditions using DBSCAN, retaining optimized runtime and compatibility with existing implementations, demonstrated on synthetic and real-world data.

DetailsMotivation: Many scientific problems involve periodic data, requiring tailored clustering methods that respect boundary conditions.

Method: Extends DBSCAN to periodic domains while leveraging existing optimized implementations for open domains, maintaining $O(N\log N)$ complexity.

Result: Successfully applied to synthetic 1D, 2D, and 3D data and a real-world example of bubble clustering in turbulent flow.

Conclusion: The method is effective for periodic data and is available as a Python package for practical use.

Abstract: Many scientific problems involve data that is embedded in a space with periodic boundary conditions. This can for instance be related to an inherent cyclic or rotational symmetry in the data or a spatially extended periodicity. When analyzing such data, well-tailored methods are needed to obtain efficient approaches that obey the periodic boundary conditions of the problem. In this work, we present a method for applying a clustering algorithm to data embedded in a periodic domain based on the DBSCAN algorithm, a widely used unsupervised machine learning method that identifies clusters in data. The proposed method internally leverages the conventional DBSCAN algorithm for domains with open boundaries, such that it remains compatible with all optimized implementations for neighborhood searches in open domains. In this way, it retains the same optimized runtime complexity of $O(N\log N)$. We demonstrate the workings of the proposed method using synthetic data in one, two and three dimensions and also apply it to a real-world example involving the clustering of bubbles in a turbulent flow. The proposed approach is implemented in a ready-to-use Python package that we make publicly available.
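
For intuition, a brute-force O(N^2) version of the same geometry: compute minimum-image distances and hand them to a stock DBSCAN. The paper's contribution is obtaining the same clustering while reusing optimized open-domain neighborhood searches, keeping $O(N\log N)$.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
L = 1.0                                        # domain [0, L)^2, periodic
pts = np.concatenate([
    rng.normal(0.02, 0.03, (50, 2)),           # cluster straddling the boundary
    rng.normal(0.50, 0.03, (50, 2)),           # interior cluster
]) % L

diff = pts[:, None, :] - pts[None, :, :]
diff -= L * np.round(diff / L)                 # minimum-image convention
dist = np.linalg.norm(diff, axis=-1)

labels = DBSCAN(eps=0.1, min_samples=5, metric="precomputed").fit_predict(dist)
print(np.unique(labels))                       # the wrapped cluster stays whole
```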

[490] Learning richness modulates equality reasoning in neural networks

William L. Tong, Cengiz Pehlevan

Main category: cs.LG

TL;DR: The paper explores equality reasoning in neural networks, proposing a spectrum from conceptual to perceptual behavior based on learning richness, validated through MLP theory and vision experiments.

DetailsMotivation: To clarify the principles of equality reasoning in neural networks, given the lack of consensus despite extensive study, and to draw parallels with human and animal reasoning.

Method: Develops a theory of equality reasoning in MLPs, categorizing behavior into conceptual (task-specific, efficient) and perceptual (detail-sensitive, exhaustive training). Validates with vision SD experiments.

Result: Rich-regime MLPs show conceptual behavior, while lazy-regime MLPs exhibit perceptual behavior. Feature learning richness is key to successful equality reasoning.

Conclusion: Learning richness modulates equality reasoning in MLPs, suggesting similar dependencies in human and animal neural circuits.

Abstract: Equality reasoning is ubiquitous and purely abstract: sameness or difference may be evaluated no matter the nature of the underlying objects. As a result, same-different (SD) tasks have been extensively studied as a starting point for understanding abstract reasoning in humans and across animal species. With the rise of neural networks that exhibit striking apparent proficiency for abstractions, equality reasoning in these models has also gained interest. Yet despite extensive study, conclusions about equality reasoning vary widely and with little consensus. To clarify the underlying principles in learning SD tasks, we develop a theory of equality reasoning in multi-layer perceptrons (MLP). Following observations in comparative psychology, we propose a spectrum of behavior that ranges from conceptual to perceptual outcomes. Conceptual behavior is characterized by task-specific representations, efficient learning, and insensitivity to spurious perceptual details. Perceptual behavior is characterized by strong sensitivity to spurious perceptual details, accompanied by the need for exhaustive training to learn the task. We develop a mathematical theory to show that an MLP’s behavior is driven by learning richness. Rich-regime MLPs exhibit conceptual behavior, whereas lazy-regime MLPs exhibit perceptual behavior. We validate our theoretical findings in vision SD experiments, showing that rich feature learning promotes success by encouraging hallmarks of conceptual behavior. Overall, our work identifies feature learning richness as a key parameter modulating equality reasoning, and suggests that equality reasoning in humans and animals may similarly depend on learning richness in neural circuits.
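
A hedged sketch of the lazy/rich knob in the style of output-scaling analyses (the paper's richness parameterization may differ): scaling a centered network's output by alpha while rescaling the loss pushes training toward the lazy (kernel) regime, where internal features barely move.

```python
import copy
import torch
import torch.nn as nn

def first_layer_movement(alpha, steps=300):
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))
    net0 = copy.deepcopy(net)                  # frozen copy centers output at 0
    w0 = net[0].weight.detach().clone()
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(steps):
        out = alpha * (net(x) - net0(x).detach())
        loss = ((out - y) ** 2).mean() / alpha ** 2
        opt.zero_grad(); loss.backward(); opt.step()
    return (net[0].weight - w0).norm().item()  # how far features moved

print("rich (alpha=1):   ", first_layer_movement(1.0))
print("lazy (alpha=100): ", first_layer_movement(100.0))
```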

[491] EcoTransformer: Attention without Multiplication

Xin Gao, Xingming Xu, Shirin Amiraslani, Hong Xu

Main category: cs.LG

TL;DR: EcoTransformer replaces dot-product attention with a Laplacian kernel convolution for energy-efficient performance, matching or surpassing traditional Transformers in tasks.

DetailsMotivation: The scaled dot-product attention in Transformers is computationally intensive and energy-consuming, prompting the need for a more efficient alternative.

Method: EcoTransformer constructs the output context vector using a Laplacian kernel convolution with L1 metric distances between queries and keys, eliminating matrix multiplication.

Result: EcoTransformer performs comparably or better than traditional Transformers in NLP, bioinformatics, and vision tasks while using less energy.

Conclusion: EcoTransformer offers a more energy-efficient alternative to traditional Transformers without sacrificing performance.

Abstract: The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture, EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy.
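
A minimal sketch of the attention described above: scores come from a Laplacian kernel over L1 query-key distances rather than from dot products. The bandwidth `sigma` and the row normalization are assumptions; the paper's exact normalization may differ.

```python
import torch

def laplacian_attention(Q, K, V, sigma=1.0):
    # Q: (n, d), K: (m, d), V: (m, dv)
    l1 = (Q[:, None, :] - K[None, :, :]).abs().sum(-1)   # pairwise L1 distances
    w = torch.exp(-l1 / sigma)                           # Laplacian kernel scores
    w = w / w.sum(dim=-1, keepdim=True)                  # row-normalize
    return w @ V                                         # context vectors

Q, K, V = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 8)
print(laplacian_attention(Q, K, V).shape)                # torch.Size([4, 8])
```

Note that only the score calculation avoids matrix multiplication; aggregating the values is unchanged.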

[492] Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants

Yorick Estievenart, Sukanya Patra, Souhaib Ben Taieb

Main category: cs.LG

TL;DR: A framework for reliable anomaly detection in CSP plants using risk control and density forecasting, with deployment insights and a simulated dataset.

DetailsMotivation: High-temperature solar receivers in CSP plants face operational risks like freezing and corrosion, leading to costly downtime. Current monitoring methods lack reliable decision thresholds and coverage guarantees.

Method: Proposes a risk-control framework for thresholding anomaly scores, an abstention mechanism for high-risk cases, and a density forecasting method for anomaly detection.

Result: Deployed across multiple CSP plants, providing insights for maintenance optimization. A simulated dataset is also provided due to data confidentiality.

Conclusion: The framework enhances reliability in anomaly detection for CSP plants, with practical deployment benefits and a publicly available simulated dataset.

Abstract: Efficient and reliable operation of Concentrated Solar Power (CSP) plants is essential for meeting the growing demand for sustainable energy. However, high-temperature solar receivers face severe operational risks, such as freezing, deformation, and corrosion, resulting in costly downtime and maintenance. To monitor CSP plants, cameras mounted on solar receivers record infrared images at irregular intervals ranging from one to five minutes throughout the day. Anomalous images can be detected by thresholding an anomaly score, where the threshold is chosen to optimize metrics such as the F1-score on a validation set. This work proposes a framework, using risk control, for generating more reliable decision thresholds with finite-sample coverage guarantees on any chosen risk function. Our framework also incorporates an abstention mechanism, allowing high-risk predictions to be deferred to domain experts. Second, we propose a density forecasting method to estimate the likelihood of an observed image given a sequence of previously observed images, using this likelihood as its anomaly score. Third, we analyze the deployment results of our framework across multiple training scenarios over several months for two CSP plants. This analysis provides valuable insights to our industry partner for optimizing maintenance operations. Finally, given the confidential nature of our dataset, we provide an extended simulated dataset, leveraging recent advancements in generative modeling to create diverse thermal images that simulate multiple CSP plants. Our code is publicly available.
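
A hedged sketch of risk-controlled thresholding in this spirit: choose the largest threshold whose anomaly miss rate is bounded with finite-sample confidence, here via a Hoeffding correction. The paper's risk functions and bounds may differ; returning `None` plays the role of abstention.

```python
import numpy as np

def risk_controlled_threshold(scores, labels, alpha=0.2, delta=0.05):
    """Largest threshold (flag if score >= t) whose anomaly miss rate stays
    below alpha with probability >= 1 - delta on the validation set."""
    anom = scores[labels == 1]
    slack = np.sqrt(np.log(1 / delta) / (2 * len(anom)))  # Hoeffding correction
    ok = [t for t in np.sort(scores) if (anom < t).mean() + slack <= alpha]
    return max(ok) if ok else None             # None -> abstain / defer to expert

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 100)])
labels = np.array([0] * 500 + [1] * 100)
print(risk_controlled_threshold(scores, labels))
```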

[493] CITRAS: Covariate-Informed Transformer for Time Series Forecasting

Yosuke Yamaguchi, Issei Suemitsu, Wenpeng Wei

Main category: cs.LG

TL;DR: CITRAS, a decoder-only Transformer, improves time series forecasting by leveraging covariates with novel mechanisms like KV Shift and Attention Score Smoothing.

DetailsMotivation: Existing models struggle with leveraging future covariates and capturing dependencies between targets and covariates.

Method: CITRAS uses patch-wise cross-variate attention with KV Shift and Attention Score Smoothing to incorporate future covariates and refine dependencies.

Result: CITRAS outperforms state-of-the-art models on 13 real-world benchmarks.

Conclusion: CITRAS effectively leverages cross-variate and cross-time dependencies for superior forecasting accuracy.

Abstract: In practical time series forecasting, covariates provide rich contextual information that can potentially enhance the forecast of target variables. Although some covariates extend into the future forecasting horizon (e.g., calendar events, discount schedules), most multivariate models fail to leverage this pivotal insight due to the length discrepancy with target variables. Additionally, capturing the dependency between target variables and covariates is non-trivial, as models must precisely reflect the local impact of covariates while also capturing global cross-variate dependencies. To overcome these challenges, we propose CITRAS, a decoder-only Transformer that flexibly leverages multiple targets, past covariates, and future covariates. While preserving strong autoregressive capabilities, CITRAS introduces two novel mechanisms in patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates future covariates into the forecasting of target variables based on their concurrent dependencies. Additionally, Attention Score Smoothing refines locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the past series of attention scores. Experimentally, CITRAS outperforms state-of-the-art models on thirteen real-world benchmarks from both covariate-informed and multivariate settings, demonstrating its versatile ability to leverage cross-variate and cross-time dependencies for improved forecasting accuracy.

[494] ProtoECGNet: Case-Based Interpretable Deep Learning for Multi-Label ECG Classification with Contrastive Learning

Sahil Sethi, David Chen, Thomas Statchen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones

Main category: cs.LG

TL;DR: ProtoECGNet is a prototype-based deep learning model for interpretable ECG classification, combining 1D and 2D CNNs with prototype learning for transparent, case-based explanations.

DetailsMotivation: Clinical adoption of deep learning ECG models is hindered by lack of transparency. Post hoc methods like saliency maps may not reflect true decision processes.

Method: ProtoECGNet uses a multi-branch architecture (1D CNN for rhythm, 2D CNNs for morphology and abnormalities) with prototype loss for multi-label learning, including clustering, separation, and contrastive loss.

Result: Competitive performance on PTB-XL dataset (71 labels) with structured explanations. Clinician review confirms prototype representativeness and clarity.

Conclusion: ProtoECGNet demonstrates scalable prototype learning for complex, multi-label ECG classification, advancing transparent and trustworthy clinical AI.

Abstract: Deep learning-based electrocardiogram (ECG) classification has shown impressive performance but clinical adoption has been slowed by the lack of transparent and faithful explanations. Post hoc methods such as saliency maps may fail to reflect a model’s true decision process. Prototype-based reasoning offers a more transparent alternative by grounding decisions in similarity to learned representations of real ECG segments, enabling faithful, case-based explanations. We introduce ProtoECGNet, a prototype-based deep learning model for interpretable, multi-label ECG classification. ProtoECGNet employs a structured, multi-branch architecture that reflects clinical interpretation workflows: it integrates a 1D CNN with global prototypes for rhythm classification, a 2D CNN with time-localized prototypes for morphology-based reasoning, and a 2D CNN with global prototypes for diffuse abnormalities. Each branch is trained with a prototype loss designed for multi-label learning, combining clustering, separation, diversity, and a novel contrastive loss that encourages appropriate separation between prototypes of unrelated classes while allowing clustering for frequently co-occurring diagnoses. We evaluate ProtoECGNet on all 71 diagnostic labels from the PTB-XL dataset, demonstrating competitive performance relative to state-of-the-art black-box models while providing structured, case-based explanations. To assess prototype quality, we conduct a structured clinician review of the final model’s projected prototypes, finding that they are rated as representative and clear. ProtoECGNet shows that prototype learning can be effectively scaled to complex, multi-label time-series classification, offering a practical path toward transparent and trustworthy deep learning models for clinical decision support.
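
A minimal sketch of the prototype idea: logits are linear combinations of similarities to learned prototypes, so every prediction traces back to concrete prototype matches. The multi-branch structure and the paper's clustering, separation, and contrastive losses are omitted.

```python
import torch
import torch.nn as nn

class PrototypeHead(nn.Module):
    def __init__(self, dim, n_protos, n_classes):
        super().__init__()
        self.protos = nn.Parameter(torch.randn(n_protos, dim))
        self.fc = nn.Linear(n_protos, n_classes, bias=False)

    def forward(self, z):
        # cosine similarity of each embedding z (B, dim) to each prototype
        sim = nn.functional.normalize(z, dim=1) @ \
              nn.functional.normalize(self.protos, dim=1).T
        return self.fc(sim)                    # multi-label logits

head = PrototypeHead(dim=128, n_protos=32, n_classes=71)
print(head(torch.randn(4, 128)).shape)         # torch.Size([4, 71])
```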

[495] GenEDA: Towards Generative Netlist Functional Reasoning via Cross-Modal Circuit Encoder-Decoder Alignment

Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie

Main category: cs.LG

TL;DR: GenEDA is a framework aligning circuit encoders and decoders in a shared latent space, enabling generative reasoning tasks for netlists.

DetailsMotivation: Existing circuit foundation models operate independently, limiting advanced capabilities. GenEDA bridges this gap.

Method: GenEDA aligns graph-based circuit representation with text-based LLMs, supporting both trainable and frozen LLMs.

Result: GenEDA enables generative netlist tasks, outperforming advanced LLMs like GPT and DeepSeek.

Conclusion: GenEDA advances circuit design by unifying encoders and decoders, enabling novel generative tasks.

Abstract: The success of foundation AI has motivated the research of circuit foundation models, which are customized to assist the integrated circuit (IC) design process. However, existing pre-trained circuit foundation models are typically limited to standalone encoders for predictive tasks or decoders for generative tasks. These two model types are developed independently, operate on different circuit modalities, and reside in separate latent spaces. This restricts their ability to complement each other for more advanced capabilities. In this work, we present GenEDA, the first framework that cross-modally aligns circuit encoders with decoders within a shared latent space. GenEDA bridges the gap between graph-based circuit representation learning and text-based large language models (LLMs), enabling communication between their respective latent spaces. To achieve the alignment, we propose two paradigms to support both open-source trainable LLMs and commercial frozen LLMs. We leverage this aligned architecture to develop the first generative foundation model for netlists, unleashing LLMs’ generative reasoning capability on the low-level and bit-blasted netlists. GenEDA enables three unprecedented generative netlist functional reasoning tasks, where it reversely generates high-level functionalities such as specifications and RTL code from low-level netlists. These tasks move beyond traditional gate function classification to direct generation of full-circuit functionality. Experiments demonstrate that GenEDA significantly boosts advanced LLMs’ (e.g., GPT and DeepSeek series) performance in all tasks.

[496] Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment

You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim

Main category: cs.LG

TL;DR: SkipAlign introduces selective non-alignment in contrastive learning to improve OOD detection and ID classification by skipping alignment for uncertain samples.

DetailsMotivation: Existing OSSL methods discard uncertain samples or force-align them, leading to geometric collapse and overconfidence on seen OODs.

Method: SkipAlign adds a ‘skip’ operator to contrastive learning, selectively skipping alignment for low-confidence samples and using gentle repulsion against ID prototypes.

Result: SkipAlign achieves tighter ID clusters and dispersed OOD features, outperforming state-of-the-art methods in OOD detection without compromising ID accuracy.

Conclusion: SkipAlign effectively addresses limitations of existing OSSL methods, enhancing both OOD detection and ID classification.

Abstract: Open-set semi-supervised learning (OSSL) leverages unlabeled data containing both in-distribution (ID) and unknown out-of-distribution (OOD) samples, aiming simultaneously to improve closed-set accuracy and detect novel OOD instances. Existing methods either discard valuable information from uncertain samples or force-align every unlabeled sample into one or a few synthetic “catch-all” representations, resulting in geometric collapse and overconfidence on only seen OODs. To address the limitations, we introduce selective non-alignment, adding a novel “skip” operator into conventional pull and push operations of contrastive learning. Our framework, SkipAlign, selectively skips alignment (pulling) for low-confidence unlabeled samples, retaining only gentle repulsion against ID prototypes. This approach transforms uncertain samples into a pure repulsion signal, resulting in tighter ID clusters and naturally dispersed OOD features. Extensive experiments demonstrate that SkipAlign significantly outperforms state-of-the-art methods in detecting unseen OOD data without sacrificing ID classification accuracy.
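
A hedged sketch of selective non-alignment as a loss term (thresholds and weights are illustrative): confident samples are pulled toward their augmented view as usual, while low-confidence samples skip the pull and receive only a gentle push away from ID prototypes.

```python
import torch
import torch.nn.functional as F

def skipalign_loss(z, z_aug, prototypes, conf, tau=0.9, repel_w=0.1):
    """z, z_aug: (B, d) embeddings of two views; prototypes: (C, d) ID
    prototypes; conf: (B,) classifier confidence. Inputs L2-normalized."""
    pull = 1 - F.cosine_similarity(z, z_aug)          # alignment (pulling)
    repel = (z @ prototypes.T).max(dim=1).values      # sim to nearest ID proto
    confident = (conf >= tau).float()
    # confident: align as usual; uncertain: skip the pull, gently repel
    return (confident * pull + (1 - confident) * repel_w * repel).mean()

z = F.normalize(torch.randn(8, 16), dim=1)
z_aug = F.normalize(z + 0.1 * torch.randn(8, 16), dim=1)
protos = F.normalize(torch.randn(4, 16), dim=1)
print(skipalign_loss(z, z_aug, protos, torch.rand(8)).item())
```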

[497] Mjölnir: A Deep Learning Parametrization Framework for Global Lightning Flash Density

Minjong Cheon

Main category: cs.LG

TL;DR: Mjölnir is a deep learning framework for global lightning flash density prediction, trained on ERA5 and WWLLN data, achieving high accuracy in capturing lightning patterns.

DetailsMotivation: To leverage AI for accurate global lightning parameterization, addressing the need for better lightning prediction in Earth system models.

Method: Uses InceptionNeXt backbone with SENet and multi-task learning to predict lightning occurrence and magnitude from ERA5 and WWLLN data.

Result: Achieves a global Pearson correlation of 0.96 for annual mean fields, accurately reproducing lightning distribution and variability.

Conclusion: Mjölnir is an effective AI-based lightning parameterization tool with potential for integration into next-generation Earth system models.

Abstract: Recent advances in AI-based weather forecasting models, such as FourCastNet, Pangu-Weather, and GraphCast, have demonstrated the remarkable ability of deep learning to emulate complex atmospheric dynamics. Building on this momentum, we propose Mjölnir, a novel deep learning-based framework for global lightning flash density parameterization. Trained on ERA5 atmospheric predictors and World Wide Lightning Location Network (WWLLN) observations at a daily temporal resolution and 1 degree spatial resolution, Mjölnir captures the nonlinear mapping between large-scale environmental conditions and lightning activity. The model architecture is based on an InceptionNeXt backbone with SENet, and a multi-task learning strategy is used to simultaneously predict lightning occurrence and magnitude. Extensive evaluations show that Mjölnir accurately reproduces the global distribution, seasonal variability, and regional characteristics of lightning activity, achieving a global Pearson correlation coefficient of 0.96 for annual mean fields. These results suggest that Mjölnir serves not only as an effective data-driven global lightning parameterization but also as a promising AI-based scheme for next-generation Earth system models (AI-ESMs).

[498] Approximation Rates in Besov Norms and Sample-Complexity of Kolmogorov-Arnold Networks with Residual Connections

Anastasis Kratsios, Bum Jun Kim, Takashi Furuya

Main category: cs.LG

TL;DR: Kolmogorov-Arnold Networks (KANs) improve deep learning by using trainable spline-based activation functions. This paper proves KANs optimally approximate Besov functions and provides statistical guarantees, showing their effectiveness in learning smooth maps.

DetailsMotivation: To explore the theoretical foundations of KANs, demonstrating their superior approximation capabilities and statistical guarantees compared to traditional MLPs.

Method: Theoretical analysis of KANs’ approximation rates for Besov functions on bounded or fractal domains, complemented by bounding the pseudodimension of Res-KANs for statistical guarantees.

Result: KANs achieve optimal approximation rates for Besov functions and provide dimension-free sample complexity estimates for learning smooth maps.

Conclusion: KANs are theoretically validated as powerful tools for approximating and learning smooth functions, outperforming traditional MLPs.

Abstract: Inspired by the Kolmogorov-Arnold superposition theorem, Kolmogorov-Arnold Networks (KANs) have recently emerged as an improved backbone for most deep learning frameworks, promising more adaptivity than their multilayer perceptron (MLP) predecessor by allowing for trainable spline-based activation functions. In this paper, we probe the theoretical foundations of the KAN architecture by showing that it can optimally approximate any Besov function in $B^{s}_{p,q}(\mathcal{X})$ on a bounded open, or even fractal, domain $\mathcal{X}$ in $\mathbb{R}^d$ at the optimal approximation rate with respect to any weaker Besov norm $B^{\alpha}_{p,q}(\mathcal{X})$, where $\alpha < s$. We complement our approximation result with a statistical guarantee by bounding the pseudodimension of the relevant class of Res-KANs. As an application of the latter, we directly deduce a dimension-free estimate on the sample complexity of a residual KAN model when learning a function of Besov regularity from $N$ i.i.d. noiseless samples, showing that KANs can learn the smooth maps which they can approximate.

[499] GRILL: Gradient Signal Restoration in Ill-Conditioned Layers to Enhance Adversarial Attacks on Autoencoders

Chethan Krishnamurthy Ramanaik, Arjun Roy, Tobias Callies, Eirini Ntoutsi

Main category: cs.LG

TL;DR: GRILL improves adversarial attacks on autoencoders by addressing gradient vanishing in ill-conditioned layers, enhancing robustness evaluation.

DetailsMotivation: Adversarial robustness of autoencoders is understudied, with existing attacks often sub-optimal due to gradient vanishing in ill-conditioned layers.

Method: Introduces GRILL to locally restore gradient signals in ill-conditioned layers, optimizing norm-bounded adversarial perturbations.

Result: GRILL significantly boosts attack effectiveness across various autoencoder architectures and attack settings.

Conclusion: GRILL enables more rigorous evaluation of autoencoder robustness by overcoming gradient vanishing issues.

Abstract: Adversarial robustness of deep autoencoders (AEs) remains relatively unexplored, even though their non-invertible nature poses distinct challenges. Existing attack algorithms, which optimize imperceptible, norm-bounded adversarial perturbations to maximize output damage in AEs, often stop at sub-optimal attacks. We observe that the adversarial loss gradient vanishes when backpropagated through ill-conditioned layers. This issue arises from near-zero singular values in the Jacobians of these layers, which weaken the gradient signal during optimization. We introduce GRILL, a technique that locally restores gradient signals in ill-conditioned layers, enabling more effective norm-bounded attacks. Through extensive experiments on different architectures of popular AEs, under both sample-specific and universal attack setups, and across standard and adaptive attack settings, we show that our method significantly increases the effectiveness of our adversarial attacks, enabling a more rigorous evaluation of AE robustness.
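
A hedged sketch of the gradient-restoration idea: during backpropagation through an ill-conditioned linear map, floor its singular values so the adversarial loss gradient is not annihilated. The flooring rule is an assumption, not the paper's exact procedure.

```python
import torch

class FlooredLinear(torch.autograd.Function):
    """y = x @ W.T forward; backward routes the gradient through a copy of W
    whose singular values are floored, restoring signal in tiny directions."""
    @staticmethod
    def forward(ctx, x, W, floor=0.1):
        ctx.save_for_backward(W)
        ctx.floor = floor
        return x @ W.T

    @staticmethod
    def backward(ctx, grad_out):
        (W,) = ctx.saved_tensors
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        W_floored = U @ torch.diag(S.clamp(min=ctx.floor * S.max())) @ Vh
        return grad_out @ W_floored, None, None

x = torch.randn(4, 16, requires_grad=True)
W = torch.randn(8, 16); W[1:] *= 1e-6          # severely ill-conditioned layer
FlooredLinear.apply(x, W).sum().backward()
print(x.grad.norm())                           # gradient no longer vanishes
```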

[500] Efficient Training of Physics-enhanced Neural ODEs via Direct Collocation and Nonlinear Programming

Linus Langenkamp, Philip Hannebohm, Bernhard Bachmann

Main category: cs.LG

TL;DR: A novel method trains Physics-enhanced Neural ODEs (PeN-ODEs) via dynamic optimization, using high-order implicit Runge-Kutta discretization and NLP solvers for improved stability, speed, and accuracy.

DetailsMotivation: Address limitations of ODE solver-based training (stability, runtime, accuracy) and extend Neural ODEs to incorporate physical constraints.

Method: Discretize the model with high-order implicit Runge-Kutta, solve as NLP using solvers like Ipopt, optimize parameters and trajectories simultaneously.

Result: Superior accuracy, speed, and generalization with smaller networks, demonstrated on benchmarks like Quarter Vehicle Model and Van-der-Pol oscillator.

Conclusion: The approach is effective for PeN-ODEs, with plans to integrate into OpenModelica for broader accessibility in training Neural DAEs.

Abstract: We propose a novel approach for training Physics-enhanced Neural ODEs (PeN-ODEs) by expressing the training process as a dynamic optimization problem. The full model, including neural components, is discretized using a high-order implicit Runge-Kutta method with flipped Legendre-Gauss-Radau points, resulting in a large-scale nonlinear program (NLP) efficiently solved by state-of-the-art NLP solvers such as Ipopt. This formulation enables simultaneous optimization of network parameters and state trajectories, addressing key limitations of ODE solver-based training in terms of stability, runtime, and accuracy. Extending a recent direct collocation-based method for Neural ODEs, we generalize to PeN-ODEs, incorporate physical constraints, and present a custom, parallelized, open-source implementation. Benchmarks on a Quarter Vehicle Model and a Van-der-Pol oscillator demonstrate superior accuracy, speed, and generalization with smaller networks compared to other training techniques. We also outline a planned integration into OpenModelica to enable accessible training of Neural DAEs.
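
A minimal sketch of the simultaneous (collocation) formulation: optimize parameters and the discretized trajectory together, with collocation defects enforced by penalty. The paper uses high-order Radau collocation and Ipopt with hard constraints; SciPy, the trapezoid rule, the penalty weight, and the toy Van der Pol setup below are stand-ins.

```python
import numpy as np
from scipy.optimize import minimize

T, N = 5.0, 50
h = T / N
mu_true = 1.5

def f(x, mu):                                  # Van der Pol dynamics
    return np.stack([x[..., 1],
                     mu * (1 - x[..., 0] ** 2) * x[..., 1] - x[..., 0]], axis=-1)

x_data = np.zeros((N + 1, 2)); x_data[0] = [2.0, 0.0]
for k in range(N):                             # synthetic "measurements"
    x_data[k + 1] = x_data[k] + h * f(x_data[k], mu_true)

def objective(z):                              # decision vars: [mu, states]
    mu, X = z[0], z[1:].reshape(N + 1, 2)
    fit = ((X - x_data) ** 2).sum()            # data-matching loss
    F = f(X, mu)
    defect = X[1:] - X[:-1] - 0.5 * h * (F[:-1] + F[1:])   # trapezoid rule
    return fit + 1e3 * (defect ** 2).sum()     # penalized collocation defects

z0 = np.concatenate([[0.5], x_data.ravel()])   # warm start at the data
res = minimize(objective, z0, method="L-BFGS-B")
print("recovered mu:", res.x[0])               # approximately mu_true
```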

[501] A Generative Neural Annealer for Black-Box Combinatorial Optimization

Yuan-Hang Zhang, Massimiliano Di Ventra

Main category: cs.LG

TL;DR: A generative, end-to-end solver for black-box combinatorial optimization, inspired by annealing, models the Boltzmann distribution to improve sample efficiency and solution quality.

DetailsMotivation: Addressing NP problems with a focus on sample efficiency and solution quality, leveraging annealing-based algorithms to model the energy landscape.

Method: Trains a neural network to model the Boltzmann distribution of the black-box objective, conditioned on temperature, to capture varying distributions from uniform to peaked around optima.

Result: Competitive performance against state-of-the-art black-box optimizers in challenging combinatorial tasks under varying query budgets.

Conclusion: The approach effectively learns the energy landscape structure, enabling global optimization and improving sample efficiency or problem understanding depending on query cost.

Abstract: We propose a generative, end-to-end solver for black-box combinatorial optimization that emphasizes both sample efficiency and solution quality on NP problems. Drawing inspiration from annealing-based algorithms, we treat the black-box objective as an energy function and train a neural network to model the associated Boltzmann distribution. By conditioning on temperature, the network captures a continuum of distributions–from near-uniform at high temperatures to sharply peaked around global optima at low temperatures–thereby learning the structure of the energy landscape and facilitating global optimization. When queries are expensive, the temperature-dependent distributions naturally enable data augmentation and improve sample efficiency. When queries are cheap but the problem remains hard, the model learns implicit variable interactions, effectively “opening” the black box. We validate our approach on challenging combinatorial tasks under both limited and unlimited query budgets, showing competitive performance against state-of-the-art black-box optimizers.
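
A minimal sketch of the temperature-conditioned idea on an exhaustively enumerable toy problem: a tabular distribution replaces the neural sampler, and evaluating the energy everywhere is only for illustration of the annealing schedule.

```python
import torch

def energy(x):                                 # black-box objective (toy example)
    return ((x.sum(dim=1) - 3) ** 2).float()   # minimized when exactly 3 bits set

n_bits = 10
states = torch.tensor([[(s >> i) & 1 for i in range(n_bits)]
                       for s in range(1 << n_bits)])
E = energy(states)

logits = torch.zeros(1 << n_bits, requires_grad=True)   # tabular "model"
opt = torch.optim.Adam([logits], lr=0.1)
for T in [2.0, 1.0, 0.5, 0.25]:                # annealing: sharpen gradually
    target = torch.softmax(-E / T, dim=0)      # Boltzmann distribution at T
    for _ in range(200):
        loss = -(target * torch.log_softmax(logits, dim=0)).sum()
        opt.zero_grad(); loss.backward(); opt.step()

best = states[logits.argmax()]
print(best.tolist(), energy(best[None]).item())  # a low-energy sample
```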

[502] Reconstructing Physics-Informed Machine Learning for Traffic Flow Modeling: a Multi-Gradient Descent and Pareto Learning Approach

Yuan-Zheng Lei, Yaobang Gong, Dianwei Chen, Yao Cheng, Xianfeng Terry Yang

Main category: cs.LG

TL;DR: The paper introduces a multi-objective optimization approach for physics-informed machine learning (PIML) in traffic flow modeling, outperforming traditional linear scalarization methods.

DetailsMotivation: Linear scalarization in PIML limits trade-off solutions and requires tedious tuning. A multi-objective approach can overcome these limitations.

Method: Reformulates PIML training as multi-objective optimization, using multi-gradient descent algorithms (MGDAs) like TMGD and DCGD to explore the Pareto front.

Result: MGDAs matched traditional methods in macroscopic traffic models and significantly outperformed them in microscopic models.

Conclusion: Multi-objective optimization enhances PIML performance, especially in complex scenarios, offering a superior alternative to linear scalarization.

Abstract: Physics-informed machine learning (PIML) is crucial in modern traffic flow modeling because it combines the benefits of both physics-based and data-driven approaches. In conventional PIML, physical information is typically incorporated by constructing a hybrid loss function that combines data-driven loss and physics loss through linear scalarization. The goal is to find a trade-off between these two objectives to improve the accuracy of model predictions. However, from a mathematical perspective, linear scalarization is limited to identifying only the convex region of the Pareto front, as it treats data-driven and physics losses as separate objectives. Given that most PIML loss functions are non-convex, linear scalarization restricts the achievable trade-off solutions. Moreover, tuning the weighting coefficients for the two loss components can be both time-consuming and computationally challenging. To address these limitations, this paper introduces a paradigm shift in PIML by reformulating the training process as a multi-objective optimization problem, treating data-driven loss and physics loss independently. We apply several multi-gradient descent algorithms (MGDAs), including traditional multi-gradient descent (TMGD) and dual cone gradient descent (DCGD), to explore the Pareto front in this multi-objective setting. These methods are evaluated on both macroscopic and microscopic traffic flow models. In the macroscopic case, MGDAs achieved comparable performance to traditional linear scalarization methods. Notably, in the microscopic case, MGDAs significantly outperformed their scalarization-based counterparts, demonstrating the advantages of a multi-objective optimization approach in complex PIML scenarios.
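
For two objectives, the multi-gradient descent direction has a closed form: the minimum-norm convex combination of the data-loss and physics-loss gradients. This is a sketch of the core idea, not of the paper's TMGD or DCGD variants.

```python
import torch

def mgda_direction(g1, g2):
    """Min-norm point of the segment {a*g1 + (1-a)*g2 : a in [0, 1]}."""
    diff = g1 - g2
    alpha = (torch.dot(g2, g2 - g1) / (diff.dot(diff) + 1e-12)).clamp(0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2

g_data, g_phys = torch.randn(100), torch.randn(100)
d = mgda_direction(g_data, g_phys)
# stepping along -d improves both losses unless already Pareto-stationary
print(torch.dot(d, g_data) >= 0, torch.dot(d, g_phys) >= 0)
```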

[503] Stepsize anything: A unified learning rate schedule for budgeted-iteration training

Anda Tang, Yiming Dong, Yutao Zeng, zhou Xun, Zhouchen Lin

Main category: cs.LG

TL;DR: The paper introduces the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that outperforms heuristic-based schedules under constrained training budgets.

DetailsMotivation: The need for efficient learning rate schedules in budgeted-iteration training, given the heuristic and trial-and-error nature of existing methods.

Method: Proposes UBA, derived from a budget-aware optimization framework, with a single hyper-parameter for flexibility and simplicity.

Result: UBA consistently outperforms common schedules across diverse tasks and architectures under constrained budgets.

Conclusion: UBA provides a robust, interpretable, and efficient solution for learning rate scheduling in budgeted-iteration training.

Abstract: The expanding computational costs and limited resources underscore the critical need for budgeted-iteration training, which aims to achieve optimal learning within predetermined iteration budgets. While learning rate schedules fundamentally govern the performance of different networks and tasks, particularly in budgeted-iteration scenarios, their design remains largely heuristic, lacking theoretical foundations. In addition, the optimal learning rate schedule requires extensive trial-and-error selection, making the training process inefficient. In this work, we propose the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that consistently outperforms commonly-used schedules among diverse architectures and tasks under different constrained training budgets. First, we bridge the gap by constructing a novel training budget-aware optimization framework, which explicitly accounts for the robustness to landscape curvature variations. From this framework, we derive the UBA schedule, controlled by a single hyper-parameter $\varphi$ that provides a trade-off between flexibility and simplicity, eliminating the need for per-network numerical optimization. Moreover, we establish a theoretical connection between $\varphi$ and the condition number, adding interpretation and justification to our approach. Besides, we prove the convergence for different values of $\varphi$. We offer practical guidelines for its selection via theoretical analysis and empirical results. Extensive experimental results show that UBA consistently surpasses the commonly-used schedules across diverse vision and language tasks, spanning network architectures (e.g., ResNet, OLMo) and scales, under different training-iteration budgets.

[504] AtmosMJ: Revisiting Gating Mechanism for AI Weather Forecasting Beyond the Year Scale

Minjong Cheon

Main category: cs.LG

TL;DR: AtmosMJ challenges the need for non-standard spatial domains in long-range weather forecasting by achieving stable 500-day forecasts on a standard latitude-longitude grid using a novel Gated Residual Fusion mechanism.

DetailsMotivation: The paper questions the assumption that non-standard spatial domains (e.g., spherical harmonics) are necessary for stable long-range weather forecasts, aiming to prove comparable performance on standard grids.

Method: Introduces AtmosMJ, a deep convolutional network operating directly on ERA5 data without spherical remapping, using Gated Residual Fusion (GRF) to prevent error accumulation.

Result: AtmosMJ achieves stable 500-day forecasts, competitive 10-day accuracy against leading models, and requires only 5.7 days of training on a V100 GPU.

Conclusion: Efficient architectural design (like GRF) can enable stable long-range forecasting on standard grids, reducing reliance on non-standard data representations.

Abstract: The advent of Large Weather Models (LWMs) has marked a turning point in data-driven forecasting, with many models now outperforming traditional numerical systems in the medium range. However, achieving stable, long-range autoregressive forecasts beyond a few weeks remains a significant challenge. Prevailing state-of-the-art models that achieve year-long stability, such as SFNO and DLWP-HPX, have relied on transforming input data onto non-standard spatial domains like spherical harmonics or HEALPix meshes. This has led to the prevailing assumption that such representations are necessary to enforce physical consistency and long-term stability. This paper challenges that assumption by investigating whether comparable long-range performance can be achieved on the standard latitude-longitude grid. We introduce AtmosMJ, a deep convolutional network that operates directly on ERA5 data without any spherical remapping. The model’s stability is enabled by a novel Gated Residual Fusion (GRF) mechanism, which adaptively moderates feature updates to prevent error accumulation over long recursive simulations. Our results demonstrate that AtmosMJ produces stable and physically plausible forecasts for about 500 days. In quantitative evaluations, it achieves competitive 10-day forecast accuracy against models like Pangu-Weather and GraphCast, all while requiring a remarkably low training budget of 5.7 days on a V100 GPU. Our findings suggest that efficient architectural design, rather than non-standard data representation, can be the key to unlocking stable and computationally efficient long-range weather prediction.
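
A minimal sketch of a Gated Residual Fusion update as described: a learned sigmoid gate moderates how much of each feature update enters the state, damping error accumulation over long recursive rollouts. Layer shapes and gate placement are assumptions.

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x, update):
        # x, update: (B, C, H, W); g in (0, 1) blends per pixel and channel
        g = torch.sigmoid(self.gate(torch.cat([x, update], dim=1)))
        return x + g * update                  # moderated residual step

grf = GatedResidualFusion(8)
x, u = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
print(grf(x, u).shape)                         # torch.Size([1, 8, 16, 16])
```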

[505] Boost Post-Training Quantization via Null Space Optimization for Large Language Models

Jiaqi Zhao, Weili Guan, Ming Li, Miao Zhang

Main category: cs.LG

TL;DR: The paper introduces null space optimization for LLM quantization, proposing Q2N, a plug-and-play module to reduce quantization error by constraining weight perturbations within the null space of input activations.

DetailsMotivation: Existing quantization methods for LLMs show diminishing returns, suggesting a need for new strategies to achieve higher compression without performance loss.

Method: The paper proposes Q2N, a null space projection module, with an efficient approximation method and a closed-form solution to avoid extra memory overhead.

Result: Experiments on LLMs like LLaMA3, DeepSeek, and Qwen3 validate Q2N’s effectiveness and the null space optimization perspective.

Conclusion: This work pioneers null space insights for quantization, aiming to inspire advanced methods for further error reduction.

Abstract: Existing post-training quantization methods for large language models (LLMs) offer remarkable success. However, the increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. To inspire new directions for future research, this paper introduces the concept of null space into LLMs quantization. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight perturbation to lie within the null space of input activations. To prove this idea, we propose a plug-and-play null space projection module for existing milestone PTQ baselines named Q2N. Specifically, we first design an efficient and accurate null space projection approximation method tailored to the characteristics of LLMs. Subsequently, we theoretically derive a closed-form solution for an equivalent vector of the obtained projection matrix, which satisfies practical inference conditions while avoiding additional memory overhead. Extensive experiments are conducted on various state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) and baselines, demonstrating the effectiveness of both our Q2N and the perspective of null space optimization for LLMs quantization. We view this paper as the first step toward further alleviating the quantization error based on the insights of null space, hoping it inspires future researchers to design more advanced quantization methods. Code is available at https://github.com/zjq0455/q2n.
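
A minimal sketch of the null-space idea: given calibration activations X, project the quantization perturbation onto null(X) so the layer's outputs on those activations are unchanged. The SVD projector is shown for clarity; the paper instead derives an efficient approximation and a closed-form equivalent vector.

```python
import torch

def nullspace_projector(X, rel_tol=1e-3):
    """Orthogonal projector onto null(X), so X @ P ~ 0 for activations X."""
    U, S, Vh = torch.linalg.svd(X, full_matrices=True)
    rank = int((S > rel_tol * S[0]).sum())
    V_null = Vh[rank:].T                       # basis of the null space
    return V_null @ V_null.T

d, n = 64, 32                                  # n < d, so a null space exists
X = torch.randn(n, d)                          # calibration activations
dW = 0.01 * torch.randn(d, d)                  # stand-in quantization error
P = nullspace_projector(X)
print((X @ (P @ dW)).norm() / (X @ dW).norm()) # ~0: error hidden from X
```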

[506] 15,500 Seconds: Lean UAV Classification Using EfficientNet and Lightweight Fine-Tuning

Andrew P. Berg, Qian Zhang, Mia Y. Wang

Main category: cs.LG

TL;DR: The paper tackles data scarcity in UAV audio classification using efficient fine-tuning, data augmentation, and pre-trained networks, achieving 95% accuracy with EfficientNet-B0.

DetailsMotivation: Addressing the growing security concerns of UAVs by improving deep UAV audio classification despite data scarcity.

Method: Utilizes parameter-efficient fine-tuning, data augmentation, and pre-trained networks (EfficientNet-B0).

Result: Achieves over 95% validation accuracy.

Conclusion: The proposed methods effectively overcome data scarcity and enhance UAV audio classification performance.

Abstract: Unmanned Aerial Vehicles (UAVs) pose escalating security concerns as the market for consumer and military UAVs grows. This paper addresses the critical data scarcity challenges in deep UAV audio classification. We build upon our previous work by expanding novel approaches such as parameter-efficient fine-tuning, data augmentation, and pre-trained networks. We achieve performance upwards of 95% validation accuracy with EfficientNet-B0.

[507] Zero-Shot Neural Architecture Search with Weighted Response Correlation

Kun Jing, Luoyu Chen, Jungang Xu, Jianwei Tai, Yiyu Wang, Shuaimin Li

Main category: cs.LG

TL;DR: A novel training-free proxy, WRCor, is introduced for efficient neural architecture search (NAS), outperforming existing methods in speed and accuracy.

DetailsMotivation: To address the computational expense and instability of current NAS methods, the paper proposes a training-free proxy for faster and more reliable architecture estimation.

Method: The method uses weighted response correlation (WRCor) to calculate proxy scores from correlation coefficient matrices of responses across input samples, measuring expressivity and generalizability.

Result: WRCor and its voting proxies outperform existing methods in efficiency and accuracy, achieving a 22.1% test error on ImageNet-1k in just 4 GPU hours.

Conclusion: The proposed zero-shot NAS algorithm with WRCor is a highly efficient and effective alternative to traditional NAS methods, with publicly available code.

Abstract: Neural architecture search (NAS) is a promising approach for automatically designing neural network architectures. However, the architecture estimation of NAS is computationally expensive and time-consuming because of training multiple architectures from scratch. Although existing zero-shot NAS methods use training-free proxies to accelerate the architecture estimation, their effectiveness, stability, and generality are still lacking. We present a novel training-free estimation proxy called weighted response correlation (WRCor). WRCor utilizes correlation coefficient matrices of responses across different input samples to calculate the proxy scores of estimated architectures, which can measure their expressivity and generalizability. Experimental results on proxy evaluation demonstrate that WRCor and its voting proxies are more efficient estimation strategies than existing proxies. We also apply them with different search strategies in architecture search. Experimental results on architecture search show that our zero-shot NAS algorithm outperforms most existing NAS algorithms in different search spaces. Our NAS algorithm can discover an architecture with a 22.1% test error on the ImageNet-1k dataset within 4 GPU hours. All code is publicly available at https://github.com/kunjing96/ZSNAS-WRCor.git.
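
A hedged sketch of a response-correlation proxy in this spirit: score an untrained architecture by how decorrelated its responses are across a batch of inputs. The paper's weighting and aggregation across layers are not reproduced here.

```python
import torch
import torch.nn as nn

def response_correlation_score(model, x):
    feats = model(x)                           # (B, d) responses per sample
    feats = feats - feats.mean(dim=1, keepdim=True)
    feats = feats / (feats.norm(dim=1, keepdim=True) + 1e-8)
    corr = feats @ feats.T                     # (B, B) correlation matrix
    B = corr.shape[0]
    off = corr - torch.eye(B)                  # off-diagonal correlations
    return -off.abs().sum() / (B * (B - 1))    # more distinct responses = better

net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
print(response_correlation_score(net, torch.randn(16, 32)).item())
```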

[508] Confounder-Free Continual Learning via Recursive Feature Normalization

Yash Shah, Camila Gonzalez, Mohammad H. Abbasi, Qingyu Zhao, Kilian M. Pohl, Ehsan Adeli

Main category: cs.LG

TL;DR: The paper introduces Recursive MDN (R-MDN) to address confounder effects in continual learning, ensuring equitable predictions by updating feature representations dynamically.

DetailsMotivation: Confounders create spurious correlations in predictions, and existing methods like MDN don't adapt well to continual learning. R-MDN aims to solve this.

Method: R-MDN integrates into any deep learning model, using recursive least squares to update feature representations dynamically, adapting to changing confounder distributions.

Result: R-MDN reduces biased predictions and catastrophic forgetting in continual learning, promoting fairness across population groups.

Conclusion: R-MDN effectively mitigates confounder effects in continual learning, enhancing model fairness and adaptability.

Abstract: Confounders are extraneous variables that affect both the input and the target, resulting in spurious correlations and biased predictions. There are recent advances in dealing with or removing confounders in traditional models, such as metadata normalization (MDN), where the distribution of the learned features is adjusted based on the study confounders. However, in the context of continual learning, where a model learns continuously from new data over time without forgetting, learning feature representations that are invariant to confounders remains a significant challenge. To remove their influence from intermediate feature representations, we introduce the Recursive MDN (R-MDN) layer, which can be integrated into any deep learning architecture, including vision transformers, and at any model stage. R-MDN performs statistical regression via the recursive least squares algorithm to maintain and continually update an internal model state with respect to changing distributions of data and confounding variables. Our experiments demonstrate that R-MDN promotes equitable predictions across population groups, both within static learning and across different stages of continual learning, by reducing catastrophic forgetting caused by confounder effects changing over time.
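
A minimal sketch of the recursive-least-squares residualization behind an R-MDN-style layer: maintain running regression coefficients of features on confounders and subtract the explained component. A per-sample update is shown; the paper integrates this as a differentiable layer.

```python
import torch

class RLSResidualizer:
    """Recursive least squares of features on confounders; returns residuals."""
    def __init__(self, n_conf, n_feat, lam=1e2):
        self.P = lam * torch.eye(n_conf)       # running inverse covariance
        self.B = torch.zeros(n_conf, n_feat)   # running regression coefficients

    def update_and_remove(self, c, f):
        # c: (n_conf,) confounders, f: (n_feat,) features for one sample
        Pc = self.P @ c
        k = Pc / (1.0 + c @ Pc)                # RLS gain
        self.B += torch.outer(k, f - c @ self.B)
        self.P -= torch.outer(k, Pc)
        return f - c @ self.B                  # confounder-free residual

r = RLSResidualizer(n_conf=3, n_feat=8)
for _ in range(500):                           # stream of confounded samples
    c = torch.randn(3)
    resid = r.update_and_remove(c, torch.randn(8) + 2.0 * c[0])
print(r.B[0])                                  # first row converges toward 2.0
```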

[509] Algorithm Development in Neural Networks: Insights from the Streaming Parity Task

Loek van Rossem, Andrew M. Saxe

Main category: cs.LG

TL;DR: RNNs trained on the streaming parity task achieve infinite generalization via a phase transition, revealing a mechanism for learning algorithms from finite data.

DetailsMotivation: Understanding how neural networks generalize beyond training data, especially in tasks like streaming parity, to uncover mechanisms of algorithm learning.

Method: Case study of RNNs on the streaming parity task, analyzing learning dynamics and representational changes.

Result: RNNs exhibit a phase transition to perfect infinite generalization, forming a finite automaton for the task.

Conclusion: The study reveals a mechanism for infinite generalization in neural networks through implicit representational merger.

Abstract: Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks (RNNs) trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.
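
For reference, the streaming parity task itself, together with the two-state automaton that solves it exactly:

```python
import torch

def streaming_parity_batch(batch, length, seed=0):
    g = torch.Generator().manual_seed(seed)
    x = torch.randint(0, 2, (batch, length), generator=g)
    y = torch.cumsum(x, dim=1) % 2             # parity of all bits seen so far
    return x, y

# A two-state automaton solves the task (state ^= bit at each step); the
# paper finds that, after the phase transition, trained RNNs implicitly
# merge hidden states into just such an automaton.
x, y = streaming_parity_batch(2, 8)
print(x[0].tolist(), y[0].tolist())
```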

[510] Gauge Flow Models

Alexander Strunk, Roland Assam

Main category: cs.LG

TL;DR: Gauge Flow Models, a new class of Generative Flow Models, outperform traditional Flow Models by incorporating a learnable Gauge Field in the Flow ODE.

DetailsMotivation: To enhance the performance of Generative Flow Models by integrating a learnable Gauge Field into the Flow ODE framework.

Method: Introduces Gauge Flow Models with a mathematical framework and validates them using Flow Matching on Gaussian Mixture Models.

Result: Gauge Flow Models show significantly better performance than traditional Flow Models, even when smaller in size.

Conclusion: Gauge Flow Models are promising for generative tasks, with potential for broader applications beyond the tested scenarios.

Abstract: This paper introduces Gauge Flow Models, a novel class of Generative Flow Models. These models incorporate a learnable Gauge Field within the Flow Ordinary Differential Equation (ODE). A comprehensive mathematical framework for these models, detailing their construction and properties, is provided. Experiments using Flow Matching on Gaussian Mixture Models demonstrate that Gauge Flow Models yield significantly better performance than traditional Flow Models of comparable or even larger size. Additionally, unpublished research indicates a potential for enhanced performance across a broader range of generative tasks.

[511] The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection

Steven A. Frank

Main category: cs.LG

TL;DR: The paper introduces a universal FMB law (force-metric-bias) using the Price equation, unifying diverse learning algorithms, optimization methods, and natural selection under a common mathematical framework.

DetailsMotivation: To reveal the shared mathematical structure among seemingly disparate learning and optimization processes, providing a unified understanding.

Method: The Price equation is used to partition change, deriving the FMB law: Δθ = Mf + b + ξ, where force, metric, bias, and noise components explain various algorithms.

Result: The FMB law unifies natural selection, Bayesian updating, Newton’s method, gradient descent, and others as special cases, also explaining the emergence of Fisher information and KL divergence.

Conclusion: The FMB law offers a principled foundation for comparing and designing learning algorithms across disciplines by exposing their common structure.

Abstract: Diverse learning algorithms, optimization methods, and natural selection share a common mathematical structure, despite their apparent differences. Here I show that a simple notational partitioning of change by the Price equation reveals a universal force-metric-bias (FMB) law: $\Delta\boldsymbol{\theta} = \mathbf{M}\,\mathbf{f} + \mathbf{b} + \boldsymbol{\xi}$. The force $\mathbf{f}$ drives improvement in parameters, $\Delta\boldsymbol{\theta}$, in proportion to the slope of performance with respect to the parameters. The metric $\mathbf{M}$ rescales movement by inverse curvature. The bias $\mathbf{b}$ adds momentum or changes in the frame of reference. The noise $\boldsymbol{\xi}$ enables exploration. This framework unifies natural selection, Bayesian updating, Newton’s method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms as special cases of the same underlying process. The Price equation also reveals why Fisher information, Kullback-Leibler divergence, and d’Alembert’s principle arise naturally in learning dynamics. By exposing this common structure, the FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines.
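The FMB decomposition is concrete enough to instantiate in a few lines. The numpy sketch below (ours, with an arbitrary quadratic objective) writes momentum SGD with Langevin noise in the paper's Δθ = M f + b + ξ form; swapping M for the inverse Hessian would recover Newton's method.

```python
import numpy as np

def fmb_update(theta, f, M, b, xi):
    """One step of the FMB law: delta-theta = M f + b + xi."""
    return theta + M @ f + b + xi

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, so the force is -A theta.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
theta = np.array([1.0, -2.0])
momentum = np.zeros_like(theta)
rng = np.random.default_rng(0)
eta, beta, temp = 0.1, 0.9, 1e-4

for _ in range(200):
    f = -A @ theta                       # force: slope of performance w.r.t. theta
    M = eta * np.eye(2)                  # metric: flat rescaling (Newton would use inv(A))
    b = eta * beta * momentum            # bias: momentum carried from past steps
    xi = np.sqrt(2 * eta * temp) * rng.normal(size=2)  # noise: Langevin-style exploration
    theta = fmb_update(theta, f, M, b, xi)
    momentum = beta * momentum + f       # update the bias carrier

print(theta)  # close to the optimum at the origin, up to the injected noise
```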

[512] From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation

Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang

Main category: cs.LG

TL;DR: DARSD is a novel UDA framework addressing domain shift in time series by decomposing representations into transferable and domain-specific components, outperforming 12 existing methods.

DetailsMotivation: Current UDA methods treat features as indivisible, ignoring their intrinsic compositions, leading to poor performance in domain adaptation tasks.

Method: DARSD decomposes representations into domain-invariant and domain-specific parts using adversarial learning, pseudo-labeling, and hybrid contrastive optimization.

Result: DARSD outperforms 12 UDA algorithms, achieving top performance in 35 out of 53 scenarios across four benchmarks.

Conclusion: DARSD provides a principled approach to domain adaptation by disentangling transferable knowledge, demonstrating superior performance in time series analysis.

Abstract: Domain shift poses a fundamental challenge in time series analysis, where models trained on a source domain often fail dramatically when applied to a target domain with a different yet similar distribution. While current unsupervised domain adaptation (UDA) methods attempt to align cross-domain feature distributions, they typically treat features as indivisible entities, ignoring their intrinsic compositions that govern domain adaptation. We introduce DARSD, a novel UDA framework with theoretical explainability that explicitly realizes UDA tasks from the perspective of representation space decomposition. Our core insight is that effective domain adaptation requires not just alignment, but principled disentanglement of transferable knowledge from mixed representations. DARSD consists of three synergistic components: (I) An adversarial learnable common invariant basis that projects original features into a domain-invariant subspace while preserving semantic content; (II) A prototypical pseudo-labeling mechanism that dynamically separates target features based on confidence, hindering error accumulation; (III) A hybrid contrastive optimization strategy that simultaneously enforces feature clustering and consistency while mitigating emerging distribution gaps. Comprehensive experiments conducted on four benchmarks (WISDM, HAR, HHAR, and MFD) demonstrate DARSD’s superiority against 12 UDA algorithms, achieving optimal performance in 35 out of 53 scenarios and ranking first across all benchmarks.
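Component (I), the learnable common invariant basis, can be sketched compactly. The snippet below (ours) shows a projection onto a learned subspace plus an orthogonality regularizer that keeps the basis well-conditioned; DARSD additionally trains this basis adversarially against a domain discriminator, which is omitted here.

```python
import torch

d, k = 128, 32
B = torch.randn(d, k, requires_grad=True)    # learnable common invariant basis

def project(feats, B):
    """Project features onto the subspace spanned by B's columns."""
    return feats @ B                          # (n, d) -> (n, k)

def basis_penalty(B):
    """Push B^T B toward the identity so the basis stays well-conditioned
    (our regularizer; the adversarial domain loss of DARSD is omitted)."""
    g = B.T @ B
    return ((g - torch.eye(g.shape[0])) ** 2).mean()

feats = torch.randn(16, d)                    # a batch of source or target features
loss = basis_penalty(B)                       # add task + adversarial domain losses here
loss.backward()
print(project(feats, B).shape)                # torch.Size([16, 32])
```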

[513] SPADE-S: A Sparsity-Robust Foundational Forecaster

Malcolm Wolff, Matthew Li, Ravi Kiran Selvam, Hanjing Zhu, Kin G. Olivares, Ruijun Ma, Abhinav Katoch, Shankar Ramasubramanian, Mengfei Cao, Roberto Bandarra, Rahul Gopalsamy, Stefania La Vattiata, Sitan Yang, Michael W. Mahoney

Main category: cs.LG

TL;DR: SPADE-S is a forecasting architecture addressing biases in time series with low magnitude and sparsity, improving accuracy by up to 15% in demand forecasting.

DetailsMotivation: Existing models underperform on low-magnitude and sparse time series due to biased loss functions, training methods, and encoding limitations.

Method: SPADE-S reduces biases through robust architecture design, addressing magnitude and sparsity issues.

Result: SPADE-S outperforms state-of-the-art models, achieving up to 15% better accuracy, with notable gains in P90 and P50 forecasts across datasets.

Conclusion: SPADE-S effectively mitigates biases and enhances forecasting accuracy for heterogeneous time series, proving its robustness in real-world demand forecasting.

Abstract: Despite significant advancements in time series forecasting, accurate modeling of time series with strong heterogeneity in magnitude and/or sparsity patterns remains challenging for state-of-the-art deep learning architectures. We identify several factors that lead existing models to systematically underperform on low-magnitude and sparse time series, including loss functions with implicit biases toward high-magnitude series, training-time sampling methods, and limitations of time series encoding methods. SPADE-S is a robust forecasting architecture that significantly reduces magnitude- and sparsity-based systematic biases and improves overall prediction accuracy. Empirical results demonstrate that SPADE-S outperforms existing state-of-the-art approaches across a diverse set of use cases in demand forecasting. In particular, we show that, depending on the quantile forecast and magnitude of the series, SPADE-S can improve forecast accuracy by up to 15%. This results in P90 overall forecast accuracy gains of 2.21%, 6.58%, and 4.28%, and P50 forecast accuracy gains of 0.92%, 0.77%, and 1.95%, respectively, for each of three distinct datasets, ranging from 3 million to 700 million series, from a large online retailer.
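The magnitude bias the paper identifies is easy to see in the standard quantile (pinball) loss, which weights series in proportion to their absolute errors. The sketch below contrasts it with a hypothetical per-series rescaling; the rescaling is our illustration of the failure mode, not necessarily the fix SPADE-S uses.

```python
import numpy as np

def pinball(y, yhat, q):
    """Standard quantile (pinball) loss; summed over series, it implicitly
    weights high-magnitude series more, the bias the paper identifies."""
    d = y - yhat
    return np.mean(np.maximum(q * d, (q - 1) * d))

def scaled_pinball(y, yhat, q, eps=1e-8):
    """Hypothetical per-series rescaling (ours): divide by each series' mean
    absolute level so sparse and low-magnitude series contribute comparably."""
    scale = np.mean(np.abs(y), axis=-1, keepdims=True) + eps
    d = (y - yhat) / scale
    return np.mean(np.maximum(q * d, (q - 1) * d))

# Two series with identical 10% relative error: one large, one sparse.
y = np.array([[100.0, 120.0, 90.0], [0.0, 1.0, 0.0]])
yhat = y * 0.9
print(pinball(y, yhat, 0.5))         # dominated by the large-magnitude series
print(scaled_pinball(y, yhat, 0.5))  # both series weigh in comparably
```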

[514] KFS: KAN based adaptive Frequency Selection learning architecture for long term time series forecasting

Changning Wu, Gao Wu, Rongyao Cai, Yong Liu, Kexin Zhang

Main category: cs.LG

TL;DR: The paper introduces KFS, a KAN-based adaptive frequency selection architecture, to improve multi-scale time series forecasting by addressing noise interference and heterogeneous information distribution.

DetailsMotivation: Real-world time series suffer from noise interference and suboptimal multi-scale representation due to heterogeneous frequency information.

Method: Proposes KFS with a FreK module for dominant frequency selection, KAN for pattern representation, and feature mixing for scale-specific fusion.

Result: KFS achieves state-of-the-art performance on multiple real-world datasets.

Conclusion: KFS is a simple yet effective solution for multi-scale time series forecasting challenges.

Abstract: Multi-scale decomposition architectures have emerged as predominant methodologies in time series forecasting. However, real-world time series exhibit noise interference across different scales, while heterogeneous information distribution among frequency components at varying scales leads to suboptimal multi-scale representation. Inspired by Kolmogorov-Arnold Networks (KAN) and Parseval’s theorem, we propose a KAN based adaptive Frequency Selection learning architecture (KFS) to address these challenges. This framework tackles prediction challenges stemming from cross-scale noise interference and complex pattern modeling through its FreK module, which performs energy-distribution-based dominant frequency selection in the spectral domain. Simultaneously, KAN enables sophisticated pattern representation while timestamp embedding alignment synchronizes temporal representations across scales. The feature mixing module then fuses scale-specific patterns with aligned temporal features. Extensive experiments across multiple real-world time series datasets demonstrate that KFS achieves state-of-the-art performance as a simple yet effective architecture.
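Energy-based dominant frequency selection of the FreK flavor can be sketched directly with an rFFT. The snippet below (ours; the learned, per-scale selection in KFS is more elaborate) keeps the k highest-energy bins and reconstructs, with Parseval's theorem guaranteeing the kept bins account for the reported energy fraction.

```python
import numpy as np

def frek_select(x, k):
    """Minimal sketch of energy-based dominant frequency selection: keep the
    k rFFT bins with the highest energy, zero the rest, and reconstruct."""
    spec = np.fft.rfft(x)
    energy = np.abs(spec) ** 2
    keep = np.argsort(energy)[-k:]          # indices of the dominant frequencies
    mask = np.zeros_like(spec)
    mask[keep] = spec[keep]
    kept_ratio = energy[keep].sum() / energy.sum()
    return np.fft.irfft(mask, n=len(x)), kept_ratio

t = np.arange(256)
x = np.sin(2 * np.pi * t / 16) + 0.5 * np.sin(2 * np.pi * t / 64) \
    + 0.3 * np.random.default_rng(0).normal(size=256)   # two tones plus noise
denoised, ratio = frek_select(x, k=4)
print(f"kept {ratio:.1%} of spectral energy in 4 bins")
```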

[515] Proactive Constrained Policy Optimization with Preemptive Penalty

Ning Yang, Pengyu Wang, Guoqing Liu, Haifeng Zhang, Pin Lv, Jun Wang

Main category: cs.LG

TL;DR: PCPO introduces a proactive penalty mechanism and constraint-aware intrinsic reward to improve stability and safety in RL, outperforming traditional Lagrangian methods.

DetailsMotivation: Addressing issues like constraint violations and instability in Safe RL, PCPO aims to preemptively manage constraints rather than react post-violation.

Method: PCPO integrates barrier items into the objective function and uses a constraint-aware intrinsic reward for boundary-aware exploration, supported by policy iteration.

Result: Theoretical bounds for duality gap and performance are established, with experiments showing PCPO’s stability and robustness.

Conclusion: PCPO offers a promising solution for constrained policy optimization, with potential for future research and practical use.

Abstract: Safe Reinforcement Learning (RL) often faces significant issues such as constraint violations and instability, necessitating the use of constrained policy optimization, which seeks optimal policies while ensuring adherence to specific constraints like safety. Typically, constrained optimization problems are addressed by the Lagrangian method, a post-violation remedial approach that may result in oscillations and overshoots. Motivated by this, we propose a novel method named Proactive Constrained Policy Optimization (PCPO) that incorporates a preemptive penalty mechanism. This mechanism integrates barrier items into the objective function as the policy nears the boundary, imposing a cost. Meanwhile, we introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary. We establish theoretical upper and lower bounds for the duality gap and the performance of the PCPO update, shedding light on the method’s convergence characteristics. Additionally, to enhance the optimization performance, we adopt a policy iteration approach. An interesting finding is that PCPO demonstrates significant stability in experiments. Experimental results indicate that the PCPO framework provides a robust solution for policy optimization under constraints, with important implications for future research and practical applications.
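One plausible form of the preemptive mechanism is sketched below (the exact barrier and intrinsic-reward shapes are not given in the abstract, so the margin, log-barrier, and bonus here are our assumptions): the penalty is zero far from the limit and grows steeply once the expected constraint cost enters a margin below the boundary, before any violation occurs.

```python
import numpy as np

def preemptive_penalty(cost, limit, margin=0.1, scale=1.0):
    """Hedged sketch of a preemptive barrier: inactive while the expected
    constraint cost is far from the limit, then growing like a log-barrier
    inside a margin below the boundary."""
    threshold = (1.0 - margin) * limit
    if cost <= threshold:
        return 0.0
    # Blows up as cost -> limit, discouraging approach before violation.
    return -scale * np.log(max(limit - cost, 1e-8) / (limit - threshold))

def boundary_intrinsic_reward(cost, limit, margin=0.1, bonus=0.05):
    """Constraint-aware intrinsic reward, activated only near the boundary,
    to drive boundary-aware exploration (our illustrative form)."""
    return bonus if cost > (1.0 - margin) * limit else 0.0

for c in (0.5, 0.92, 0.99):   # expected episode costs against a limit of 1.0
    print(c, round(preemptive_penalty(c, 1.0), 3), boundary_intrinsic_reward(c, 1.0))
```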

[516] Stochastic Encodings for Active Feature Acquisition

Alexander Norcliffe, Changhee Lee, Fergus Imrie, Mihaela van der Schaar, Pietro Lio

Main category: cs.LG

TL;DR: A supervised latent variable model is introduced for Active Feature Acquisition, outperforming RL and greedy methods by reasoning over unobserved feature realizations.

DetailsMotivation: Addressing the limitations of Reinforcement Learning (training difficulties) and greedy conditional mutual information methods (myopic acquisitions) in Active Feature Acquisition.

Method: Uses a latent variable model trained supervisedly, reasoning about features across unobserved realizations in a stochastic latent space.

Result: Outperforms diverse baselines on synthetic and real datasets.

Conclusion: The proposed approach effectively addresses shortcomings of existing methods, demonstrating superior performance.

Abstract: Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.
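The acquisition principle, scoring each candidate feature by its expected uncertainty reduction averaged over plausible completions, can be shown on a toy problem. Everything below is an illustrative stand-in (the sampler, the classifier, and the weights are ours); the paper's model samples from a learned stochastic latent space instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_completion(x_obs):
    """Sample a full feature vector consistent with the observed entries,
    standing in for a draw from the model's stochastic latent space."""
    z = rng.normal(size=3)
    for j, v in x_obs.items():
        z[j] = v
    return z

def predict_proba(x_obs, extra=None):
    """Tiny logistic 'classifier' over whichever features are available."""
    feats = dict(x_obs)
    if extra is not None:
        feats[extra[0]] = extra[1]
    s = sum(w * feats.get(j, 0.0) for j, w in enumerate([2.0, 0.1, 1.0]))
    p1 = 1.0 / (1.0 + np.exp(-s))
    return np.array([1.0 - p1, p1])

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

x_obs = {1: 0.3}                  # feature 1 observed; features 0 and 2 are candidates
base = entropy(predict_proba(x_obs))
scores = {}
for j in (0, 2):
    gains = [base - entropy(predict_proba(x_obs, (j, sample_completion(x_obs)[j])))
             for _ in range(256)]  # average over many plausible realizations
    scores[j] = float(np.mean(gains))
print(scores)   # the heavily weighted feature 0 promises the larger entropy drop
```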

[517] Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction: A Pruning Framework for LLMs

Zuxin Ma, Yunhe Cui, Yongbin Qin

Main category: cs.LG

TL;DR: PPF is a novel pruning framework for LLMs that uses second-level performance prediction to automate pruning decisions, outperforming manual methods in speed and accuracy.

DetailsMotivation: Existing non-uniform pruning methods rely on manual policies and slow evaluations, limiting adaptability to dynamic pruning needs.

Method: PPF employs an agent for real-time pruning actions and a lightweight predictor for fast policy evaluation.

Result: PPF reduces perplexity by up to 33.4% (dynamic) and 84.78% (static) and speeds up evaluation by 64x.

Conclusion: PPF effectively automates and accelerates pruning, outperforming manual methods in dynamic and static scenarios.

Abstract: Non-uniform structured network pruning methods can effectively reduce Large Language Model (LLM) size by eliminating redundant channels or layers, offering lower performance degradation than uniform strategies. However, existing non-uniform methods rely heavily on manually designed pruning policies (e.g., layer importance and scaling factors), and therefore cannot efficiently adapt to scenarios with dynamic pruning ratio requirements. Additionally, a critical bottleneck – the time-consuming evaluation of pruning policies – further limits the feasibility of iteratively and dynamically finding optimal pruning policies. To address these limitations, we propose PPF (Predictive Pruning Framework), a novel pruning framework for LLMs that eliminates manual design dependencies via second-level performance prediction. PPF not only supports real-time pruning decisions under dynamic pruning ratios but is also applicable to static pruning scenarios. It employs an agent for producing adaptive and real-time pruning actions, while a lightweight performance predictor can evaluate a pruning policy in seconds, significantly speeding up the iterative optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can generate dynamic/static pruning policies and it reduces perplexity by up to 33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods, outperforming manually designed pruning policies. The performance predictor achieves second-level performance prediction with high accuracy (prediction error < 0.0011). It reduces the mean evaluation latency from minute-level (1 minute and 38.02 seconds for test-set evaluation) to second-level (1.52 seconds), achieving over 64 times speedup. Our code will be available at https://github.com/Ma-zx/PPF .
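Why a seconds-level predictor matters is easiest to see in a search loop. The sketch below (entirely ours: the surrogate, the random search, and the sensitivity profile are stand-ins for PPF's learned predictor and agent) evaluates thousands of candidate per-layer pruning policies in the time a single test-set evaluation would take.

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_ppl(policy):
    """Stand-in for PPF's lightweight performance predictor: maps a per-layer
    pruning-ratio vector to a predicted perplexity in milliseconds, replacing
    a minutes-long test-set evaluation. (Toy functional form, ours.)"""
    sens = np.linspace(0.5, 2.0, len(policy))   # pretend deeper layers are more sensitive
    return 5.0 + float(np.sum(sens * policy ** 2) * 10.0)

def search_policy(target_ratio, n_layers=32, iters=2000):
    """Random search over per-layer policies (PPF uses a learned agent; this
    loop just shows why cheap evaluation makes iteration feasible)."""
    best, best_ppl = None, np.inf
    for _ in range(iters):
        p = np.clip(rng.dirichlet(np.ones(n_layers)) * n_layers * target_ratio, 0.0, 0.9)
        if abs(p.mean() - target_ratio) > 0.05:
            continue                             # keep only policies near the target ratio
        ppl = surrogate_ppl(p)                   # fast predicted evaluation
        if ppl < best_ppl:
            best, best_ppl = p, ppl
    return best, best_ppl

policy, ppl = search_policy(target_ratio=0.3)
print(f"predicted perplexity {ppl:.2f} at mean pruning ratio {policy.mean():.2f}")
```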

[518] Entity Representation Learning Through Onsite-Offsite Graph for Pinterest Ads

Jiayin Jin, Zhimeng Pan, Yang Tang, Jiarui Feng, Kungang Li, Chongyuan Xiang, Jiacheng Li, Runze Su, Siping Ji, Han Sun, Ling Leng, Prathibha Deshikachar

Main category: cs.LG

TL;DR: The paper introduces TransRA, a novel KGE model, to integrate graph embeddings into Ads ranking models, overcoming initial challenges with a Large ID Embedding Table and attention-based finetuning, achieving significant CTR and CVR improvements.

DetailsMotivation: To leverage offsite conversion data and explore connections between onsite and offsite user activities for better Ads ranking models.

Method: Constructed a large-scale heterogeneous graph from user activities and introduced TransRA, combined with Large ID Embedding Table and attention-based KGE finetuning.

Result: Significant AUC lift in CTR and CVR predictions, with deployed framework contributing to 2.69% CTR lift and 1.34% CPC reduction.

Conclusion: The techniques can benefit other large-scale industrial models, demonstrating practical impact in Ads ranking.

Abstract: Graph Neural Networks (GNN) have been extensively applied to industry recommendation systems, as seen in models like GraphSage, TwHIM, LiGNN etc. In these works, graphs were constructed based on users’ activities on the platforms, and various graph models were developed to effectively learn node embeddings. In addition to users’ onsite activities, their offsite conversions are crucial for Ads models to capture their shopping interest. To better leverage offsite conversion data and explore the connection between onsite and offsite activities, we constructed a large-scale heterogeneous graph based on users’ onsite ad interactions and opt-in offsite conversion activities. Furthermore, we introduced TransRA (TransR with Anchors), a novel Knowledge Graph Embedding (KGE) model, to more efficiently integrate graph embeddings into Ads ranking models. However, our Ads ranking models initially struggled to directly incorporate Knowledge Graph Embeddings (KGE), and only modest gains were observed during offline experiments. To address this challenge, we employed the Large ID Embedding Table technique and innovated an attention based KGE finetuning approach within the Ads ranking models. As a result, we observed a significant AUC lift in Click-Through Rate (CTR) and Conversion Rate (CVR) prediction models. Moreover, this framework has been deployed in Pinterest’s Ads Engagement Model and contributed to 2.69% CTR lift and 1.34% CPC reduction. We believe the techniques presented in this paper can be leveraged by other large-scale industrial models.
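TransRA extends TransR, whose scoring function is standard and worth recalling. The sketch below shows the plain TransR backbone only; the abstract does not specify the anchor mechanism, so it is not modeled here.

```python
import torch

def transr_score(h, t, r, M_r):
    """Plain TransR score: project head/tail entity embeddings into the
    relation space with a relation-specific matrix M_r, then measure the
    translation error ||h M_r + r - t M_r||. Lower is more plausible."""
    h_r = h @ M_r                  # (d_e,) @ (d_e, d_r) -> (d_r,)
    t_r = t @ M_r
    return torch.norm(h_r + r - t_r, p=2)

d_e, d_r = 64, 32
h, t = torch.randn(d_e), torch.randn(d_e)   # e.g., a user node and an ad node
r = torch.randn(d_r)                        # e.g., a "converted on" relation
M_r = torch.randn(d_e, d_r)
print(transr_score(h, t, r, M_r))
```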

[519] DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting

Haonan Yang, Jianchao Tang, Zhuo Li, Long Lan

Main category: cs.LG

TL;DR: The paper introduces DMSC, a dynamic multi-scale coordination framework for time series forecasting, addressing static decomposition, fragmented dependencies, and inflexible fusion with novel components like EMPD, TIB, and ASR-MoE.

DetailsMotivation: Existing methods struggle with static decomposition, fragmented dependency modeling, and inflexible fusion, limiting their ability to capture intricate temporal dependencies in time series forecasting.

Method: Proposes DMSC with EMPD for dynamic multi-scale patch decomposition, TIB for joint dependency modeling, and ASR-MoE for adaptive fusion. These components are integrated into a multi-layer progressive cascade architecture.

Result: DMSC achieves state-of-the-art performance and superior computational efficiency on thirteen real-world benchmarks.

Conclusion: DMSC effectively addresses the limitations of existing methods, demonstrating robust performance and efficiency in time series forecasting tasks.

Abstract: Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer’s decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC.
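The exponential-scale skeleton of EMPD is easy to picture in code. The sketch below (ours) uses static patch sizes base · 2^l per scale; DMSC's EMPD additionally adapts patch boundaries to the input, which is not modeled here.

```python
import numpy as np

def empd_patches(x, base=8, n_scales=4):
    """Sketch of exponentially scaled patch decomposition: at scale l, cut the
    series into non-overlapping patches of length base * 2**l, so early layers
    see fine granularity and later layers see coarse context."""
    out = []
    for l in range(n_scales):
        p = base * 2 ** l
        usable = (len(x) // p) * p              # drop the ragged tail
        out.append(x[:usable].reshape(-1, p))   # (n_patches, patch_len)
    return out

x = np.sin(np.linspace(0, 20, 512))
for l, patches in enumerate(empd_patches(x)):
    print(f"scale {l}: {patches.shape[0]} patches of length {patches.shape[1]}")
```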

[520] Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment

Dahun Kim, Anelia Angelova

Main category: cs.LG

TL;DR: A novel method, Context-Adaptive Multi-Prompt Embedding, enhances semantic representations in vision-language contrastive learning by using multiple adaptive prompts and a pretrained LLM, achieving better retrieval performance.

DetailsMotivation: Standard CLIP-style models use a single text embedding, limiting semantic richness. This work aims to capture diverse semantic aspects of text for better alignment with visual features.

Method: Introduces multiple structured prompts with adaptive tokens, processed jointly by a pretrained LLM. Combines embeddings into a unified text representation and uses diversity and negation-aware losses for improved contrastive learning.

Result: Consistent improvements on image-text and video-text retrieval benchmarks.

Conclusion: The method effectively enriches semantic representations and enhances retrieval performance through multi-prompt embeddings and specialized losses.

Abstract: We propose Context-Adaptive Multi-Prompt Embedding, a novel approach to enrich semantic representations in vision-language contrastive learning. Unlike standard CLIP-style models that rely on a single text embedding, our method introduces multiple structured prompts, each containing a distinct adaptive token that captures diverse semantic aspects of the input text. We leverage a pretrained LLM as the text encoder within the CLIP framework, processing all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features. To further promote semantic diversity and representation quality, we incorporate a diversity regularization loss and a negation-aware loss, encouraging specialization across prompts and improving contrastive discrimination. Our method achieves consistent improvements on both image-text and video-text retrieval benchmarks.
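The diversity regularizer is the most self-contained piece to sketch: penalize pairwise cosine similarity among the K prompt embeddings of one input so each prompt specializes. The functional form below is our guess at a natural choice; the negation-aware loss is omitted.

```python
import torch
import torch.nn.functional as F

def diversity_loss(prompt_embs):
    """Penalize pairwise cosine similarity between the K prompt embeddings of
    one input, encouraging each adaptive prompt to capture a different
    semantic aspect (our form of the regularizer)."""
    z = F.normalize(prompt_embs, dim=-1)         # (K, d)
    sim = z @ z.T                                # (K, K) cosine similarities
    K = z.shape[0]
    off_diag = sim - torch.eye(K)                # zero out the self-similarities
    return off_diag.pow(2).sum() / (K * (K - 1))

prompts = torch.randn(4, 256, requires_grad=True)   # K = 4 prompt embeddings
loss = diversity_loss(prompts)
loss.backward()                                      # gradients push the prompts apart
print(loss.item())
```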

[521] CauKer: classification time series foundation models can be pretrained on synthetic data only

Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, Ievgen Redko

Main category: cs.LG

TL;DR: CauKer is a novel algorithm for generating synthetic time series to efficiently pretrain Time Series Foundation Models (TSFMs), improving scalability and performance.

DetailsMotivation: Current TSFMs require costly pretraining on large real-world datasets. CauKer aims to provide a sample-efficient alternative with synthetic data.

Method: CauKer combines Gaussian Process kernel composition and Structural Causal Models to generate diverse, causally coherent synthetic time series.

Result: CauKer-generated datasets show clear scaling laws for dataset size and model capacity, unlike irregular real-world datasets.

Conclusion: CauKer enables efficient pretraining of TSFMs with synthetic data, offering scalability and performance benefits.

Abstract: Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.
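The two ingredients, GP kernel composition and a structural causal model, can be combined in a few lines. The sketch below (ours; CauKer's sampling grammar for kernels and graphs is richer) draws a cause series from an additive trend-plus-seasonality kernel and derives an effect series through a lagged nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)[:, None]

def rbf(a, b, ls=1.5):                 # smooth trend component
    return np.exp(-0.5 * ((a - b.T) / ls) ** 2)

def periodic(a, b, p=2.0, ls=0.7):     # seasonality component
    return np.exp(-2 * np.sin(np.pi * np.abs(a - b.T) / p) ** 2 / ls ** 2)

def sample_gp(K, jitter=1e-6):
    return rng.multivariate_normal(np.zeros(len(t)), K + jitter * np.eye(len(t)))

# Kernel composition: additive trend + seasonality (one of many compositions
# such a generator could draw).
cause = sample_gp(rbf(t, t) + periodic(t, t))

# Tiny structural causal model: the effect is a lagged, nonlinearly
# transformed function of the cause, plus independent GP noise.
lag = 5
effect = np.tanh(np.roll(cause, lag)) + 0.3 * sample_gp(rbf(t, t, ls=0.5))

print(cause.shape, effect.shape)       # two causally linked synthetic series
```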

[522] HALO: Hindsight-Augmented Learning for Online Auto-Bidding

Pusen Dong, Chenglong Cao, Xinyu Zhou, Jirong You, Linhe Xu, Feifan Xu, Shuo Yuan

Main category: cs.LG

TL;DR: HALO, a new auto-bidding method, addresses inefficiencies in traditional solutions by using hindsight learning and B-spline representation for robust adaptation to diverse advertiser constraints.

DetailsMotivation: Traditional auto-bidding solutions struggle with sample inefficiency and poor generalization under varying budget-ROI constraints due to advertiser heterogeneity.

Method: HALO employs hindsight-augmented learning to repurpose exploration data and B-spline functional representation for continuous bid mapping across constraint spaces.

Result: Evaluations show HALO outperforms traditional methods, reducing constraint violations and improving Gross Merchandise Value (GMV).

Conclusion: HALO provides a robust solution for multi-constraint bidding in digital advertising, adapting effectively to diverse advertiser requirements.

Abstract: Digital advertising platforms operate millisecond-level auctions through Real-Time Bidding (RTB) systems, where advertisers compete for ad impressions through algorithmic bids. This dynamic mechanism enables precise audience targeting but introduces profound operational complexity due to advertiser heterogeneity: budgets and ROI targets span orders of magnitude across advertisers, from individual merchants to multinational brands. This diversity creates a demanding adaptation landscape for Multi-Constraint Bidding (MCB). Traditional auto-bidding solutions fail in this environment due to two critical flaws: 1) severe sample inefficiency, where failed explorations under specific constraints yield no transferable knowledge for new budget-ROI combinations, and 2) limited generalization under constraint shifts, as they ignore physical relationships between constraints and bidding coefficients. To address this, we propose HALO: Hindsight-Augmented Learning for Online Auto-Bidding. HALO introduces a theoretically grounded hindsight mechanism that repurposes all explorations into training data for arbitrary constraint configuration via trajectory reorientation. Further, it employs B-spline functional representation, enabling continuous, derivative-aware bid mapping across constraint spaces. HALO ensures robust adaptation even when budget/ROI requirements differ drastically from training scenarios. Industrial dataset evaluations demonstrate the superiority of HALO in handling multi-scale constraints, reducing constraint violations while improving GMV.
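The B-spline representation is what makes the bid mapping continuous and derivative-aware across constraint configurations. The scipy sketch below (knot placement and coefficients are our illustrative stand-ins for what HALO would learn) shows a bidding coefficient as a smooth function of a normalized ROI constraint.

```python
import numpy as np
from scipy.interpolate import BSpline

# Clamped cubic B-spline: 5 coefficients need len(knots) = 5 + 3 + 1 = 9.
degree = 3
knots = np.concatenate([[0, 0, 0, 0], [0.5], [1, 1, 1, 1]])
coeffs = np.array([0.2, 0.5, 1.0, 1.4, 1.6])   # learned in HALO; illustrative here
bid_map = BSpline(knots, coeffs, degree)

roi_targets = np.linspace(0.0, 1.0, 5)         # normalized ROI constraint values
print(bid_map(roi_targets))                    # bidding coefficient per target
print(bid_map.derivative()(roi_targets))       # smooth sensitivity to the constraint
```

Because the mapping is smooth, a hindsight-relabeled trajectory collected under one budget-ROI pair informs the bid coefficients of all nearby pairs rather than a single table entry.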

[523] Heterogeneity-Oblivious Robust Federated Learning

Weiyao Zhang, Jinyang Li, Qi Song, Miao Wang, Chungang Lin, Haitong Luo, Xuying Meng, Yujun Zhang

Main category: cs.LG

TL;DR: Horus is a robust Federated Learning framework using low-rank adaptations (LoRAs) to counter poisoning attacks in hyper-heterogeneous environments.

DetailsMotivation: Federated Learning is vulnerable to poisoning attacks, especially in hyper-heterogeneous settings with diverse clients, making attacks hard to detect.

Method: Horus inserts LoRAs into stable layers, aggregates only LoRAs, and uses a Heterogeneity-Oblivious Poisoning Score to filter malicious clients. It also employs projection-aware aggregation for benign clients.

Result: Horus outperforms state-of-the-art baselines in robustness and accuracy across diverse datasets, models, and attacks.

Conclusion: Horus effectively mitigates poisoning attacks in hyper-heterogeneous FL settings by leveraging LoRAs and innovative aggregation strategies.

Abstract: Federated Learning (FL) remains highly vulnerable to poisoning attacks, especially under real-world hyper-heterogeneity, where clients differ significantly in data distributions, communication capabilities, and model architectures. Such heterogeneity not only undermines the effectiveness of aggregation strategies but also makes attacks more difficult to detect. Furthermore, high-dimensional models expand the attack surface. To address these challenges, we propose Horus, a heterogeneity-oblivious robust FL framework centered on low-rank adaptations (LoRAs). Rather than aggregating full model parameters, Horus inserts LoRAs into empirically stable layers and aggregates only LoRAs to reduce the attack surface. We uncover a key empirical observation that the input projection (LoRA-A) is markedly more stable than the output projection (LoRA-B) under heterogeneity and poisoning. Leveraging this, we design a Heterogeneity-Oblivious Poisoning Score using the features from LoRA-A to filter poisoned clients. For the remaining benign clients, we propose a projection-aware aggregation mechanism to preserve collaborative signals while suppressing drifts, which reweights client updates by consistency with the global directions. Extensive experiments across diverse datasets, model architectures, and attacks demonstrate that Horus consistently outperforms state-of-the-art baselines in both robustness and accuracy.
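A minimal version of the LoRA-A-based filtering idea is sketched below. The scoring form (distance from the median LoRA-A direction) is our stand-in; Horus's actual Heterogeneity-Oblivious Poisoning Score is not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisoning_scores(lora_A_updates):
    """Sketch of a heterogeneity-oblivious filter: because LoRA-A is
    empirically stable across clients, flag clients whose flattened LoRA-A
    update points far from the coordinate-wise median direction."""
    flat = np.stack([u.ravel() for u in lora_A_updates])      # (n_clients, d)
    ref = np.median(flat, axis=0)
    cos = flat @ ref / (np.linalg.norm(flat, axis=1) * np.linalg.norm(ref) + 1e-12)
    return 1.0 - cos            # higher = more suspicious

# 8 benign clients share a direction; 2 poisoned ones push the opposite way.
benign = [rng.normal(0.5, 0.1, size=(16, 4)) for _ in range(8)]
poisoned = [rng.normal(-0.5, 0.1, size=(16, 4)) for _ in range(2)]
scores = poisoning_scores(benign + poisoned)
print(np.round(scores, 2))      # the last two clients stand out; drop them before aggregating
```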

[524] Streaming Generated Gaussian Process Experts for Online Learning and Control

Zewen Yang, Dongfa Zhang, Xiaobing Dai, Fengyi Yu, Chi Zhang, Bingkun Huang, Hamid Sadeghian, Sami Haddadin

Main category: cs.LG

TL;DR: The paper introduces SkyGP, a streaming kernel-induced expert framework for Gaussian Processes, addressing computational and memory constraints while maintaining performance guarantees. Two variants, SkyGP-Dense and SkyGP-Fast, are proposed for accuracy and efficiency, respectively, with validated superior performance.

DetailsMotivation: Exact Gaussian Processes (GPs) face scalability issues in real-time settings due to cubic computation time and quadratic memory complexity when processing streaming data.

Method: The proposed SkyGP framework maintains a bounded set of experts to handle computational and memory constraints, inheriting exact GP guarantees. Two variants, SkyGP-Dense (accuracy-focused) and SkyGP-Fast (efficiency-focused), are introduced.

Result: SkyGP demonstrates superior performance in benchmarks and real-time control experiments compared to state-of-the-art methods.

Conclusion: SkyGP effectively addresses scalability challenges of exact GPs in streaming data scenarios, offering flexible and efficient solutions with validated performance.

Abstract: Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a streaming kernel-induced progressively generated expert framework of Gaussian processes (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.
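The bounded-expert bookkeeping can be illustrated without the GP inference itself. The sketch below (routing rule, thresholds, and eviction policy are all our assumptions) routes each streaming point to the most kernel-similar expert, spawns a new expert when similarity is low, and caps the expert count.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * np.sum((a - b) ** 2) / ls ** 2)

class BoundedExperts:
    """Sketch of a bounded expert set: each expert owns a data buffer (on
    which it would fit a local exact GP, omitted here); the set never grows
    past max_experts, keeping per-point cost bounded."""
    def __init__(self, max_experts=5, sim_thresh=0.5):
        self.experts, self.max_experts, self.sim_thresh, self.t = [], max_experts, sim_thresh, 0

    def add(self, x, y):
        self.t += 1
        sims = [rbf(x, e["center"]) for e in self.experts]
        if sims and max(sims) >= self.sim_thresh:
            e = self.experts[int(np.argmax(sims))]     # route to the closest expert
        else:
            if len(self.experts) == self.max_experts:  # evict the stalest expert
                self.experts.remove(min(self.experts, key=lambda e: e["last_used"]))
            e = {"center": x.copy(), "X": [], "y": []}
            self.experts.append(e)
        e["X"].append(x); e["y"].append(y); e["last_used"] = self.t

rng = np.random.default_rng(0)
pool = BoundedExperts()
for _ in range(200):
    x = rng.choice([0.0, 4.0, 8.0]) + 0.3 * rng.normal(size=1)   # three data regimes
    pool.add(x, np.sin(x))
print(len(pool.experts), [len(e["X"]) for e in pool.experts])
```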

[525] Self-Questioning Language Models

Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak

Main category: cs.LG

TL;DR: Large language models can self-improve by generating and solving their own questions without external data, using an asymmetric self-play framework called SQLM.

DetailsMotivation: To explore if language models can enhance reasoning skills autonomously by generating and solving their own questions, eliminating the need for curated datasets.

Method: Proposes SQLM: an asymmetric self-play framework with a proposer generating questions and a solver answering them, both trained via reinforcement learning. Rewards are based on question difficulty and correctness (via majority voting or unit tests).

Result: Tested on three benchmarks (multiplication, algebra, programming), showing improvement without external data.

Conclusion: Language models can autonomously improve reasoning skills through self-generated questions and solutions.

Abstract: Can large language models improve without external data – by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.
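The reward machinery is simple enough to state exactly. The sketch below (thresholds are our illustrative choices; the RL training loop is omitted) shows the majority-vote pseudo-label for the solver and the difficulty band for the proposer.

```python
from collections import Counter

def majority_vote_reward(answers, solver_answer):
    """Ground-truth-free reward: the consensus across solver samples acts as
    a pseudo-label, and a sample is rewarded for agreeing with it."""
    consensus, count = Counter(answers).most_common(1)[0]
    reward = 1.0 if solver_answer == consensus else 0.0
    return reward, consensus, count / len(answers)

def proposer_reward(solve_rate, low=0.2, high=0.8):
    """Proposer is rewarded for questions that are neither trivial nor
    hopeless (band edges are our illustrative choice)."""
    return 1.0 if low <= solve_rate <= high else 0.0

samples = ["472", "472", "468", "472", "471"]   # 5 solver attempts at one question
r, consensus, agreement = majority_vote_reward(samples, samples[0])
print(r, consensus, agreement)                  # 1.0 '472' 0.6
print(proposer_reward(agreement))               # difficulty sits in the sweet spot
```

For coding problems, the majority vote is replaced by proposer-generated unit tests, which give a harder verification signal.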

cs.MA

[526] Forgive and Forget? An Industry 5.0 Approach to Trust-Fatigue Co-regulation in Human-Cobot Order Picking

Soumyadeep Dhar

Main category: cs.MA

TL;DR: The paper explores trust and fatigue in human-cobot collaboration in Logistics 5.0, proposing a Stackelberg game model. Simulations show a refined trust model boosts productivity by 100%, and a proactive Trust-Repair Protocol reduces trust recovery time by 75%.

DetailsMotivation: To address challenges in human-robot symbiosis in smart logistics, focusing on trust and fatigue dynamics.

Method: Uses a dynamic leader-follower Stackelberg game with utility functions for fatigue and trust, validated through agent-based simulations.

Result: Refined trust model increases productivity by 100%; proactive Trust-Repair Protocol cuts trust recovery time by 75%.

Conclusion: Provides a framework for human-centric, sustainable, and resilient cobot behaviors in Industry 5.0.

Abstract: This paper investigates the critical role of trust and fatigue in human-cobot collaborative order picking, framing the challenge within the scope of Logistics 5.0 – the implementation of human-robot symbiosis in smart logistics. We propose a dynamic, leader-follower Stackelberg game to model this interaction, where utility functions explicitly account for human fatigue and trust. Through agent-based simulations, we demonstrate that while a naive model leads to a “trust death spiral,” a refined trust model creates a “trust synergy cycle,” increasing productivity by nearly 100 percent. Finally, we show that a cobot equipped with a proactive Trust-Repair Protocol can overcome system brittleness, reducing trust recovery time after a severe failure by over 75 percent compared to a non-adaptive model. Our findings provide a framework for designing intelligent cobot behaviors that fulfill the Industry 5.0 pillars of human-centricity, sustainability, and resilience.
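The qualitative gap between a brittle and a repair-equipped cobot can be reproduced with toy trust dynamics. All rates below are our assumptions, not the paper's calibrated utilities; the point is only that a proactive repair action shortens recovery after a severe failure.

```python
def trust_step(trust, success, repaired=False, up=0.05, down=0.4, repair=0.2):
    """Toy trust dynamics: trust climbs slowly with successful handoffs,
    collapses after a severe failure, and a proactive repair action (e.g., an
    explanation plus a conservative next action) restores part of the loss."""
    if success:
        return trust + up * (1.0 - trust)
    trust *= (1.0 - down)
    return trust + repair * (1.0 - trust) if repaired else trust

for label, use_repair in (("no repair", False), ("repair protocol", True)):
    trust, history = 0.8, []
    for step in range(12):
        trust = trust_step(trust, success=(step != 3), repaired=use_repair)
        history.append(round(trust, 2))
    print(label, history)   # the repair run re-crosses its pre-failure level sooner
```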

[527] When Agents Break Down in Multiagent Path Finding

Foivos Fioravantes, Dušan Knop, Nikolaos Melissinos, Michal Opler

Main category: cs.MA

TL;DR: The paper introduces a dynamic adaptation framework for Multiagent Path Finding (MAPF) to handle agent malfunctions without full replanning, ensuring bounded delays and scalability.

DetailsMotivation: Addressing the challenge of maintaining optimal schedules in MAPF when agents experience malfunctions, which makes full replanning infeasible.

Method: Proposes two protocols: one for local agent coordination to adjust paths dynamically, and another offloading computations to network nodes for limited-capability agents.

Result: Proves that the primary protocol bounds makespan increase by k turns after k malfunctions, while the secondary protocol ensures robustness without extra agent processing.

Conclusion: The protocols offer a practical, scalable solution for resilient multiagent navigation despite agent failures.

Abstract: In Multiagent Path Finding (MAPF), the goal is to compute efficient, collision-free paths for multiple agents navigating a network from their sources to targets, minimizing the schedule’s makespan-the total time until all agents reach their destinations. We introduce a new variant that formally models scenarios where some agents may experience delays due to malfunctions, posing significant challenges for maintaining optimal schedules. Recomputing an entirely new schedule from scratch after each malfunction is often computationally infeasible. To address this, we propose a framework for dynamic schedule adaptation that does not rely on full replanning. Instead, we develop protocols enabling agents to locally coordinate and adjust their paths on the fly. We prove that following our primary communication protocol, the increase in makespan after k malfunctions is bounded by k additional turns, effectively limiting the impact of malfunctions on overall efficiency. Moreover, recognizing that agents may have limited computational capabilities, we also present a secondary protocol that shifts the necessary computations onto the network’s nodes, ensuring robustness without requiring enhanced agent processing power. Our results demonstrate that these protocols provide a practical, scalable approach to resilient multiagent navigation in the face of agent failures.
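The k-turn bound has a simple intuition, visible even in the crudest protocol: if every agent pauses whenever any agent stalls, the whole plan shifts by exactly one turn per malfunction and no new collisions can arise. The paper's protocol coordinates locally rather than freezing globally, but the toy simulation below (ours) shows why the makespan bound is achievable at all.

```python
def execute_with_freeze(paths, stall_turns):
    """paths: per-agent vertex lists (a precomputed collision-free plan).
    stall_turns: set of global turn indices at which some agent malfunctions.
    Global-freeze protocol: on a stall turn, every agent waits in place, so
    relative positions within the plan never change."""
    makespan = max(len(p) for p in paths) - 1
    step, turns = 0, 0
    while step < makespan:
        if turns not in stall_turns:
            step += 1            # everyone advances one plan step together
        turns += 1
    return turns

plan = [["A", "B", "C", "D"], ["X", "Y", "Z", "Z"]]   # plan makespan = 3
print(execute_with_freeze(plan, set()))                # 3 turns without failures
print(execute_with_freeze(plan, {1, 2}))               # 5 turns: +1 per malfunction
```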

[528] DRAMA: A Dynamic and Robust Allocation-based Multi-Agent System for Changing Environments

Naibo Wang, Yifan Zhang, Sai Liu, Xinkui Zhao, Guanjie Cheng, Yueshen Xu

Main category: cs.MA

TL;DR: DRAMA is a dynamic multi-agent system designed for resilient collaboration in changing environments, featuring modular architecture and flexible task allocation.

DetailsMotivation: Existing MAS frameworks lack adaptability to dynamic environments due to static architectures and rigid task allocation.

Method: DRAMA uses a modular architecture with separate control and worker planes, abstracting agents and tasks as resource objects with affinity-based task allocation.

Result: The system enables real-time monitoring, flexible task reassignment, and robust execution in dynamic scenarios.

Conclusion: DRAMA enhances adaptability and robustness in multi-agent systems for dynamic environments.

Abstract: Multi-agent systems (MAS) have demonstrated significant effectiveness in addressing complex problems through coordinated collaboration among heterogeneous agents. However, real-world environments and task specifications are inherently dynamic, characterized by frequent changes, uncertainty, and variability. Despite this, most existing MAS frameworks rely on static architectures with fixed agent capabilities and rigid task allocation strategies, which greatly limits their adaptability to evolving conditions. This inflexibility poses substantial challenges for sustaining robust and efficient multi-agent cooperation in dynamic and unpredictable scenarios. To address these limitations, we propose DRAMA: a Dynamic and Robust Allocation-based Multi-Agent System designed to facilitate resilient collaboration in rapidly changing environments. DRAMA features a modular architecture with a clear separation between the control plane and the worker plane. Both agents and tasks are abstracted as resource objects with well-defined lifecycles, while task allocation is achieved via an affinity-based, loosely coupled mechanism. The control plane enables real-time monitoring and centralized planning, allowing flexible and efficient task reassignment as agents join, depart, or become unavailable, thereby ensuring continuous and robust task execution. The worker plane comprises a cluster of autonomous agents, each with local reasoning, task execution, the ability to collaborate, and the capability to take over unfinished tasks from other agents when needed.
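The affinity-based, loosely coupled allocation is the most concrete piece of the design. The sketch below (greedy rule and all names are our illustration) binds each task to the highest-affinity live agent and shows how the control plane would re-run allocation when an agent departs.

```python
def allocate(tasks, agents, affinity):
    """Greedily bind each task to the live agent with the highest affinity;
    on agent departure, the control plane re-allocates the orphaned tasks."""
    return {t: max(agents, key=lambda a: affinity[t][a]) for t in tasks}

affinity = {
    "scrape":    {"a1": 0.9, "a2": 0.2},
    "summarize": {"a1": 0.3, "a2": 0.8},
    "report":    {"a1": 0.6, "a2": 0.5},
}
live = ["a1", "a2"]
plan = allocate(list(affinity), live, affinity)
print(plan)                        # {'scrape': 'a1', 'summarize': 'a2', 'report': 'a1'}

live.remove("a1")                  # a1 becomes unavailable mid-execution
orphaned = [t for t, a in plan.items() if a == "a1"]
print(allocate(orphaned, live, affinity))   # its tasks flow to the survivors
```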

[529] Position-Based Flocking for Robust Alignment

Hossein B. Jond

Main category: cs.MA

TL;DR: A position-based flocking model improves alignment and formation stability compared to velocity-based methods, demonstrated in 2D simulations with 50 agents.

DetailsMotivation: To enhance stable collective motion by balancing cohesion-separation and alignment in interacting agents.

Method: Modifies velocity-based flocking by using position differences and a threshold weight for sustained alignment.

Result: Produces stronger alignment, more rigid formations, and robust flocking behavior.

Conclusion: The position-based model ensures robust alignment, with potential applications in robotics and collective dynamics.

Abstract: This paper presents a position-based flocking model for interacting agents, balancing cohesion-separation and alignment to achieve stable collective motion. The model modifies a velocity-based approach by approximating velocity differences using initial and current positions, introducing a threshold weight to ensure sustained alignment. Simulations with 50 agents in 2D demonstrate that the position-based model produces stronger alignment and more rigid and compact formations compared to the velocity-based model. The alignment metric and separation distances highlight the efficacy of the proposed model in achieving robust flocking behavior. The model’s use of positions ensures robust alignment, with applications in robotics and collective dynamics.
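The central substitution, velocity differences approximated from initial and current positions, fits in a few lines. The sketch below (ours) computes one step of the resulting control force; the abstract does not give the exact form of the threshold weight, so here it simply gates small alignment terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 50, 4.0                               # 50 agents, elapsed time t
x0 = rng.uniform(-5, 5, size=(n, 2))         # initial positions (stored once)
x = x0 + rng.normal(0, 1, size=(n, 2)) * t   # current positions

# Position-based alignment: approximate each agent's velocity by its
# displacement since t = 0, so no velocity communication is needed.
v_est = (x - x0) / t                                 # displacement-based velocity proxy
align = v_est[None, :, :] - v_est[:, None, :]        # pairwise alignment terms

# Cohesion-separation from current positions, with preferred spacing d0 = 1.
dx = x[None, :, :] - x[:, None, :]
dist = np.linalg.norm(dx, axis=-1) + np.eye(n)       # guard the self-distance
coh_sep = ((dist - 1.0) / dist)[:, :, None] * dx     # attract beyond d0, repel within

w = (np.abs(align).sum(-1, keepdims=True) > 0.1)     # threshold weight (our gating)
force = (0.05 * coh_sep + 0.5 * align * w).mean(axis=1)
print(force.shape)                                   # (50, 2) per-agent control input
```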

[530] A Value Based Parallel Update MCTS Method for Multi-Agent Cooperative Decision Making of Connected and Automated Vehicles

Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang, Xingyu Hu, Songyu Weng

Main category: cs.MA

TL;DR: Proposes a parallel-update MCTS method for multi-vehicle cooperative driving in CAVs, improving search depth and safety.

DetailsMotivation: Addresses lateral and longitudinal joint decision-making for CAVs in partial-steady-state traffic.

Method: Uses MCTS with parallel updates to exclude dangerous actions and enhance search depth.

Result: Outperforms SOTA reinforcement learning and heuristic methods in robustness, efficiency, and safety.

Conclusion: The algorithm demonstrates superior rationality, traffic efficiency, and safety compared to human drivers.

Abstract: To solve the problem of lateral and longitudinal joint decision-making in multi-vehicle cooperative driving for connected and automated vehicles (CAVs), this paper proposes a Monte Carlo tree search (MCTS) method with parallel update for multi-agent Markov games with a limited horizon and time-discounted setting. By analyzing the parallel actions in the multi-vehicle joint action space in partial-steady-state traffic flow, the parallel update method can quickly exclude potential dangerous actions, thereby increasing the search depth without sacrificing the search breadth. The proposed method is tested on a large number of randomly generated traffic flows. The experimental results show that the algorithm has good robustness and better performance than SOTA reinforcement learning algorithms and heuristic methods. The vehicle driving strategy using the proposed algorithm shows rationality beyond human drivers, and has advantages in traffic efficiency and safety in the coordinating zone.


[531] Accelerating Focal Search in Multi-Agent Path Finding with Tighter Lower Bounds

Yimin Tang, Zhenghong Yu, Jiaoyang Li, Sven Koenig

Main category: cs.MA

TL;DR: DECBS improves bounded suboptimal MAPF by addressing slow LB value increases in focal search, outperforming ECBS in efficiency and solution quality.

DetailsMotivation: Traditional focal search in MAPF methods like ECBS suffers from slow LB value increases early in the search, limiting efficiency.

Method: DECBS introduces a two-phase approach: first determining the maximum LB value, then using it to guide a best-first search for collision-free paths.

Result: DECBS reduces high-level CT nodes by 30%, low-level focal search nodes by 50%, and improves runtime by 23.5% over ECBS in dense scenarios.

Conclusion: DECBS is a more efficient bounded suboptimal MAPF algorithm, compatible with existing optimizations and superior to ECBS.

Abstract: Multi-Agent Path Finding (MAPF) involves finding collision-free paths for multiple agents while minimizing a cost function–an NP-hard problem. Bounded suboptimal methods like Enhanced Conflict-Based Search (ECBS) and Explicit Estimation CBS (EECBS) balance solution quality with computational efficiency using focal search mechanisms. While effective, traditional focal search faces a limitation: the lower bound (LB) value determining which nodes enter the FOCAL list often increases slowly in early search stages, resulting in a constrained search space that delays finding valid solutions. In this paper, we propose a novel bounded suboptimal algorithm, double-ECBS (DECBS), to address this issue by first determining the maximum LB value and then employing a best-first search guided by this LB to find a collision-free path. Experimental results demonstrate that DECBS outperforms ECBS in most test cases and is compatible with existing optimization techniques. DECBS reduces high-level CT nodes by nearly 30% and low-level focal search nodes by nearly 50%. When agent density is moderate to high, DECBS achieves a 23.5% average runtime improvement over ECBS with identical suboptimality bounds and optimizations.

cs.MM

[532] LUST: A Multi-Modal Framework with Hierarchical LLM-based Scoring for Learned Thematic Significance Tracking in Multimedia Content

Anderson de Lima Luiz

Main category: cs.MM

TL;DR: LUST is a framework for analyzing video segments’ thematic relevance to user-provided text, using multi-modal analysis and a two-stage LLM scoring system.

DetailsMotivation: To quantify and visualize the thematic relevance of video segments based on user-defined significance, enhancing content analysis.

Method: Multi-modal pipeline integrating visual and ASR-extracted text, with hierarchical two-stage LLM scoring (direct and contextual relevance).

Result: Annotated video with visualized relevance scores and analytical logs, providing nuanced, temporally-aware significance measures.

Conclusion: LUST effectively quantifies and visualizes thematic relevance in videos, leveraging multi-modal data and advanced scoring.

Abstract: This paper introduces the Learned User Significance Tracker (LUST), a framework designed to analyze video content and quantify the thematic relevance of its segments in relation to a user-provided textual description of significance. LUST leverages a multi-modal analytical pipeline, integrating visual cues from video frames with textual information extracted via Automatic Speech Recognition (ASR) from the audio track. The core innovation lies in a hierarchical, two-stage relevance scoring mechanism employing Large Language Models (LLMs). An initial “direct relevance” score, $S_{d,i}$, assesses individual segments based on immediate visual and auditory content against the theme. This is followed by a “contextual relevance” score, $S_{c,i}$, that refines the assessment by incorporating the temporal progression of preceding thematic scores, allowing the model to understand evolving narratives. The LUST framework aims to provide a nuanced, temporally-aware measure of user-defined significance, outputting an annotated video with visualized relevance scores and comprehensive analytical logs.
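The two-stage scoring can be illustrated with the simplest possible temporal aggregator. The recurrence below (ours; LUST derives the contextual score with an LLM over preceding scores rather than a fixed formula) shows the intended effect: a briefly off-theme segment inside a strongly thematic stretch is not scored in isolation.

```python
def contextual_scores(direct, alpha=0.7):
    """Blend each segment's direct relevance S_d,i with a running context of
    preceding scores to produce a contextual score S_c,i (our recurrence,
    standing in for LUST's LLM-based second stage)."""
    ctx, out = None, []
    for s_d in direct:
        ctx = s_d if ctx is None else alpha * s_d + (1 - alpha) * ctx
        out.append(ctx)
    return out

direct = [0.1, 0.2, 0.9, 0.3, 0.8, 0.85]   # per-segment direct relevance S_d,i
print([round(s, 2) for s in contextual_scores(direct)])
# The dip at segment 4 is softened by the thematic context around it.
```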

[533] Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Jinxing Zhou, Yanghao Zhou, Mingfei Han, Tong Wang, Xiaojun Chang, Hisham Cholakkal, Rao Muhammad Anwer

Main category: cs.MM

TL;DR: TGS-Agent decomposes Ref-AVS into Think-Ground-Segment steps, mimicking human reasoning, achieving SOTA results without pixel-level supervision.

DetailsMotivation: Prior methods lack interpretability and require strong supervision. TGS-Agent aims to improve this by explicit reference understanding.

Method: Proposes Ref-Thinker for multimodal reasoning, followed by Grounding-DINO and SAM2 for grounding and segmentation. Uses instruction-tuning dataset for fine-tuning.

Result: Achieves state-of-the-art performance on Ref-AVSBench and new R²-AVSBench.

Conclusion: TGS-Agent offers a more interpretable and effective approach for Ref-AVS, reducing reliance on pixel-level supervision.

Abstract: Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R\textsuperscript{2}-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R\textsuperscript{2}-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.

[534] Iola Walker: A Mobile Footfall Detection System for Music Composition

William B James

Main category: cs.MM

TL;DR: The paper introduces ‘Iola Walker,’ an app that enhances music by syncing it to a listener’s walking pace using a foot-mounted accelerometer and a recurrent neural network.

DetailsMotivation: To create a new form of music preferred by listeners, helping musicians reclaim live performances from digital advertising and promote equitable thriving in music.

Method: Uses a foot-mounted accelerometer and LSTM neural network to detect footfalls in real time, training on annotated accelerometer data.

Result: Successful detection of footfalls using an LSTM model, with the app outputting MIDI events to sync music to walking pace.

Conclusion: The project demonstrates a viable method for real-time footfall detection and music enhancement, with potential for further development.

Abstract: This paper is part of a larger music technology research project. http://willbjames.github.io The goal of this research is to find a method of materially enhancing music using hardware and software. Why might one want to do this, you might ask? Because if it was possible to create a new form of music that was preferred by listeners, that would be a great way for musicians to reclaim live musical performance from the digital advertising industry. This project is an initial iteration towards the broader research goal of promoting equitable human thriving in the music field. The project is dubbed “iola walker” in reference to a common polyrhythm, the hemiola. A listener goes for a walk, and the Iola Walker app detects their walking pace. Iola Walker picks up footfalls using a foot-mounted accelerometer, processing the signals in real time using a recurrent neural network in an android app. The android app outputs a midi event for each footfall. The iola walker player plays the version of the next music passage with underlying polyrhythms closest to the listener’s walking pace, as determined by the composer. This paper documents the process of training the model to detect a walking listener’s footfalls in real time. The model is trained on accelerometer data from an Mbient Labs foot-mounted IMU at 200 Hz, with the ground truth for footfalls annotated by pressing the volume up button on the android device when the foot hits the ground. To collect training data, I walked around my neighborhood clicking the volume up button each time my foot hit the ground. I tried several methods for detecting footfalls in real time from sensor data, with the most success from an LSTM. Artifacts for this paper are available here: https://github.com/willbjames/iolawalker
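The detection model is a small streaming LSTM, which is easy to sketch. Hidden size, chunking, and the thresholding step below are our illustrative choices; the paper's trained model and Android integration live in the linked repository.

```python
import torch
import torch.nn as nn

class FootfallLSTM(nn.Module):
    """Sketch of a real-time footfall detector: consume 3-axis accelerometer
    frames at 200 Hz and emit a per-frame footfall probability; a threshold
    crossing would trigger the MIDI event."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)     # keep state across streaming chunks
        return torch.sigmoid(self.head(out)).squeeze(-1), state

model = FootfallLSTM()
chunk = torch.randn(1, 20, 3)                # 100 ms of 200 Hz accelerometer data
probs, state = model(chunk)                  # carry `state` into the next chunk
print(probs.shape)                           # (1, 20) per-frame footfall probability
```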

[535] Can Sound Replace Vision in LLaVA With Token Substitution?

Ali Vosoughi, Jing Bi, Pinxin Liu, Yunlong Tang, Chenliang Xu

Main category: cs.MM

TL;DR: The paper investigates extreme audio-visual alignment, introduces a dataset with granular alignment scores, and studies its impact on model behavior, revealing differences between image-centric and text-centric encoders.

DetailsMotivation: Existing datasets treat audio-visual alignment as binary, lacking granularity. The study aims to explore the limits of alignment and its effects on model performance.

Method: Developed a dataset with detailed alignment scores, trained models on perfectly matched pairs, and evaluated performance changes in retrieval and generation tasks.

Result: Image-centric encoders excel in retrieval but lose linguistic quality in generation, while text-centric encoders balance both tasks better.

Conclusion: Encoder architecture determines alignment impact, with text-centric models maintaining better linguistic authenticity.

Abstract: What happens when we push audio-visual alignment to its absolute limits? To systematically investigate this question, we needed datasets with granular alignment quality annotations, but existing datasets treat alignment as binary, either synchronized or not. To address this limitation, we developed a comprehensive dataset featuring detailed alignment scores that reveal the hidden spectrum of audio-visual perceptual correspondence. Using these precise scores, we create “superaligned” representations by training exclusively on the most perfectly matched audio-visual pairs, then conduct our systematic investigation into how this extreme alignment transforms perceptual model behavior across retrieval and generation tasks. The encoders under study fall into two main groups consisting of image-centric encoders that were pretrained using visual modalities as intermediary hubs for connecting modalities, and text-centric encoders that were pretrained with direct audio-language alignment. We first measure the baseline performance of these encoders on two key tasks, namely cross-modal retrieval and text description generation in vision-language models. Subsequently, we realign all encoders with the CLIP space using highly coherent audio-visual data and observe the performance changes. Our findings reveal that the initial architectural type of the encoder determines how it responds to the alignment process. Image-centric encoders, which are inherently designed for alignment, demonstrate exceptional performance in cross-modal retrieval, but this intensive alignment causes compression of unique linguistic information and reduces the quality of their text description generation in vision-language models. In contrast, text-centric encoders, which possess stronger linguistic authenticity, are able to maintain a better balance between the two objectives.
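The realignment step described in the abstract amounts to contrastive training that pulls each audio embedding toward its paired CLIP-space visual embedding. Below is a sketch of a standard symmetric InfoNCE objective for this, under the assumption that the paper uses a loss of this family (its exact formulation may differ).

```python
import torch
import torch.nn.functional as F

def infonce_alignment_loss(audio_emb, clip_emb, temperature=0.07):
    """Symmetric contrastive loss pulling paired audio/CLIP embeddings together.

    audio_emb, clip_emb: (batch, dim) embeddings of matched audio-visual pairs;
    off-diagonal entries of the similarity matrix act as negatives.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(clip_emb, dim=-1)
    logits = a @ v.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```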

eess.AS

[536] LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness

Zongli Ye, Jiachen Lian, Akshaj Gupta, Xuanru Zhou, Krish Patel, Haodong Li, Hwi Joo Park, Chenxu Guo, Shuhe Li, Sam Wang, Cheol Jun Cho, Zoe Ezzes, Jet M. J. Vonk, Brittany T. Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

Main category: eess.AS

TL;DR: LCS-CTC improves phoneme-level speech recognition by combining similarity-aware local alignment with constrained CTC training, outperforming vanilla CTC.

DetailsMotivation: CTC is efficient but underperforms on unclear or non-fluent speech. LCS-CTC aims to enhance both recognition and alignment.

Method: Uses a two-stage framework: predicts frame-phoneme cost matrices and applies a modified LCS algorithm to constrain CTC decoding paths.

Result: Outperforms vanilla CTC on LibriSpeech and PPA datasets, improving robustness and generalization.

Conclusion: LCS-CTC unifies phoneme modeling for fluent and non-fluent speech, offering better recognition and alignment.

Abstract: Phonetic speech transcription is crucial for fine-grained linguistic analysis and downstream speech applications. While Connectionist Temporal Classification (CTC) is a widely used approach for such tasks due to its efficiency, it often falls short in recognition performance, especially under unclear and nonfluent speech. In this work, we propose LCS-CTC, a two-stage framework for phoneme-level speech recognition that combines a similarity-aware local alignment algorithm with a constrained CTC training objective. By predicting fine-grained frame-phoneme cost matrices and applying a modified Longest Common Subsequence (LCS) algorithm, our method identifies high-confidence alignment zones which are used to constrain the CTC decoding path space, thereby reducing overfitting and improving generalization ability, which enables both robust recognition and text-free forced alignment. Experiments on both LibriSpeech and PPA demonstrate that LCS-CTC consistently outperforms vanilla CTC baselines, suggesting its potential to unify phoneme modeling across fluent and non-fluent speech.
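To make the alignment idea concrete, here is a small NumPy sketch of a similarity-aware LCS dynamic program over a frame-phoneme similarity matrix. The match threshold is a hypothetical stand-in for the paper's confidence criterion, and the returned matched cells play the role of the high-confidence zones used to constrain CTC decoding paths.

```python
import numpy as np

def lcs_alignment_zones(sim, threshold=0.7):
    """Similarity-aware LCS over a (frames, phonemes) similarity matrix.

    A cell counts as a soft "match" when its similarity exceeds the threshold;
    the backtraced matched cells approximate high-confidence alignment zones.
    """
    T, P = sim.shape
    dp = np.zeros((T + 1, P + 1))
    for t in range(1, T + 1):
        for p in range(1, P + 1):
            best = max(dp[t - 1, p], dp[t, p - 1])
            if sim[t - 1, p - 1] >= threshold:
                best = max(best, dp[t - 1, p - 1] + sim[t - 1, p - 1])
            dp[t, p] = best
    # Backtrace the matched cells along the optimal path.
    zones, t, p = [], T, P
    while t > 0 and p > 0:
        if (sim[t - 1, p - 1] >= threshold
                and dp[t, p] == dp[t - 1, p - 1] + sim[t - 1, p - 1]):
            zones.append((t - 1, p - 1)); t -= 1; p -= 1
        elif dp[t - 1, p] >= dp[t, p - 1]:
            t -= 1
        else:
            p -= 1
    return zones[::-1]
```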

[537] Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Jingyuan Xing, Zhipeng Li, Jialong Mai, Xiaofen Xing, Xiangmin Xu

Main category: eess.AS

TL;DR: The paper introduces a hybrid AR-NAR TTS framework to improve zero-shot text-to-speech synthesis by better capturing acoustic-semantic correlations.

DetailsMotivation: Existing zero-shot TTS models struggle with expressiveness and similarity due to complex acoustic-semantic feature relationships.

Method: Combines AR and NAR modules: AR with Parallel Tokenizer for simultaneous semantic-acoustic token synthesis, and NAR for detailed token prediction.

Result: Outperforms existing zero-shot TTS models in quality and efficiency on English and Chinese datasets.

Conclusion: The proposed Parallel GPT framework enhances zero-shot TTS by harmonizing acoustic-semantic independence and interdependence.

Abstract: Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models face challenges in capturing the complex correlations between acoustic and semantic features, resulting in a lack of expressiveness and similarity. The primary reason lies in the complex relationship between semantic and acoustic features, which manifests both independent and interdependent aspects. This paper introduces a TTS framework that combines both autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. In contrast, considering the interdependence, the Coupled NAR model predicts detailed tokens based on the general AR model’s output. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in synthesis quality and efficiency. Speech demos are available at https://t1235-ch.github.io/pgpt/.

[538] Multilingual Source Tracing of Speech Deepfakes: A First Benchmark

Xi Xuan, Yang Xiao, Rohan Kumar Das, Tomi Kinnunen

Main category: eess.AS

TL;DR: The paper introduces a benchmark for tracing the source models of multilingual deepfake speech, addressing gaps in current research focused on detection rather than source tracing.

DetailsMotivation: The ease of creating deepfake speech raises concerns, but current research lacks focus on tracing the source models, especially in multilingual contexts.

Method: The study compares DSP- and SSL-based modeling, examines SSL representations fine-tuned on different languages, and evaluates generalization to unseen languages and speakers.

Result: The findings provide insights into the challenges of identifying speech generation models when training and inference languages differ.

Conclusion: The work offers a comprehensive benchmark for multilingual speech deepfake source tracing, with publicly available dataset, protocol, and code.

Abstract: Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual-Source-Tracing.

[539] Towards interpretable emotion recognition: Identifying key features with machine learning

Yacouba Kaloga, Ina Kodrasi

Main category: eess.AS

TL;DR: The paper addresses the lack of interpretability in unsupervised audio models like wav2vec2 and HuBERT, focusing on emotion recognition to identify and generalize key interpretable features.

DetailsMotivation: Unsupervised models lack interpretability, limiting their use in critical domains like medicine. Understanding feature relevance is essential.

Method: Uses machine learning algorithms to identify and generalize interpretable features for emotion recognition.

Result: Aims to provide a broader, more robust framework for identifying important interpretable features, overcoming limitations of prior studies.

Conclusion: The work advances interpretability in unsupervised models, particularly for emotion recognition, addressing gaps in feature relevance understanding.

Abstract: Unsupervised methods, such as wav2vec2 and HuBERT, have achieved state-of-the-art performance in audio tasks, leading to a shift away from research on interpretable features. However, the lack of interpretability in these methods limits their applicability in critical domains like medicine, where understanding feature relevance is crucial. To better understand the features of unsupervised models, it remains critical to identify the interpretable features relevant to a given task. In this work, we focus on emotion recognition and use machine learning algorithms to identify and generalize the most important interpretable features for this task. While previous studies have explored feature relevance in emotion recognition, they are often constrained by narrow contexts and present inconsistent findings. Our approach aims to overcome these limitations, providing a broader and more robust framework for identifying the most important interpretable features.
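One common way to identify and rank interpretable features, in the spirit of this work, is permutation importance on a held-out set. The sketch below uses scikit-learn with randomly generated stand-in data; the feature set (e.g., pitch, energy, jitter statistics) and the classifier choice are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical interpretable features and 4-class emotion labels.
X = np.random.randn(500, 12)
y = np.random.randint(0, 4, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance: how much does shuffling each feature hurt accuracy?
result = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("most important feature indices:", ranking[:5])
```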

[540] A Multi-stage Low-latency Enhancement System for Hearing Aids

Chengwei Ouyang, Kexin Fei, Haoshuai Zhou, Congxi Lu, Linkai Li

Main category: eess.AS

TL;DR: Proposes an end-to-end system for ICASSP 2023 Clarity Challenge with four innovations: multi-stage processing, asymmetric window pair, head rotation integration, and a post-processing module.

DetailsMotivation: To enhance speech perception in hearing aids by leveraging phase information, higher frequency resolution, and contextual cues like head rotation.

Method: Introduces a multi-stage system in magnitude and complex domains, asymmetric window pairs, head rotation integration, and a post-processing module.

Result: Achieves better enhancement and higher HASPI scores with the provided baseline hearing aid amplification.

Conclusion: The proposed system effectively improves speech perception in hearing aids through novel techniques and integration of contextual information.

Abstract: This paper proposes an end-to-end system for the ICASSP 2023 Clarity Challenge. In this work, we introduce four major novelties: (1) a novel multi-stage system in both the magnitude and complex domains to better utilize phase information; (2) an asymmetric window pair to achieve higher frequency resolution under the 5 ms latency constraint; (3) the integration of head rotation information and the mixture signals to achieve better enhancement; (4) a post-processing module that achieves higher hearing aid speech perception index (HASPI) scores with the hearing aid amplification stage provided by the baseline system.
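Novelty (2) relies on a well-known low-latency trick: a long, slowly rising analysis window paired with a short synthesis window, so that the look-ahead is governed by the short segment. Below is a NumPy sketch of one such construction; the window lengths are illustrative, and the perfect-reconstruction constraints (which depend on the hop size) are omitted.

```python
import numpy as np

def asymmetric_window_pair(n_long=512, n_short=64):
    """One construction of an asymmetric analysis/synthesis window pair.

    The analysis window rises over n_long - n_short samples (good frequency
    resolution) and falls over only n_short samples, so algorithmic latency
    is governed by the short segment. Lengths are illustrative assumptions.
    """
    rise = np.sqrt(np.hanning(2 * (n_long - n_short)))[: n_long - n_short]
    fall = np.sqrt(np.hanning(2 * n_short))[n_short:]
    analysis = np.concatenate([rise, fall])
    # Synthesis window is nonzero only over the short tail; it compensates the
    # analysis taper so their product equals a short falling Hann half there.
    target = np.hanning(2 * n_short)[n_short:]
    synthesis = np.zeros(n_long)
    synthesis[-n_short:] = target / np.maximum(analysis[-n_short:], 1e-8)
    return analysis, synthesis
```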

[541] Binaural Sound Event Localization and Detection Neural Network based on HRTF Localization Cues for Humanoid Robots

Gyeong-Tae Lee

Main category: eess.AS

TL;DR: A binaural sound event localization and detection (BiSELD) neural network is proposed to improve elevation estimation and reduce front-back confusion in humanoid robots using an eight-channel binaural time-frequency feature (BTFF).

DetailsMotivation: Conventional two-channel input struggles with elevation estimation and front-back confusion in sound event localization for humanoid robots.

Method: BiSELDnet uses BTFF, which includes left/right mel-spectrograms, V-maps, ITD, ILD, and SC maps, to learn time-frequency patterns and HRTF cues. A Trinity module-based implementation outputs direction vectors for each sound event class.

Result: BiSELDnet outperforms SOTA SELD models in urban noise, focusing on N1 notch frequency for elevation estimation.

Conclusion: The proposed BiSELD model effectively addresses elevation and front-back confusion, enhancing situational awareness for humanoid robots.

Abstract: Humanoid robots require simultaneous sound event type and direction estimation for situational awareness, but conventional two-channel input struggles with elevation estimation and front-back confusion. This paper proposes a binaural sound event localization and detection (BiSELD) neural network to address these challenges. BiSELDnet learns time-frequency patterns and head-related transfer function (HRTF) localization cues from binaural input features. A novel eight-channel binaural time-frequency feature (BTFF) is introduced, comprising left/right mel-spectrograms, V-maps, an interaural time difference (ITD) map (below 1.5 kHz), an interaural level difference (ILD) map (above 5 kHz with front-back asymmetry), and spectral cue (SC) maps (above 5 kHz for elevation). The effectiveness of BTFF was confirmed across omnidirectional, horizontal, and median planes. BiSELDnets, particularly one based on the efficient Trinity module, were implemented to output time series of direction vectors for each sound event class, enabling simultaneous detection and localization. Vector activation map (VAM) visualization was proposed to analyze network learning, confirming BiSELDnet’s focus on the N1 notch frequency for elevation estimation. Comparative evaluations under urban background noise conditions demonstrated that the proposed BiSELD model significantly outperforms state-of-the-art (SOTA) SELD models with binaural input.
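The ITD and ILD channels of the BTFF can be sketched directly from binaural STFTs: interaural phase difference converted to time for the low band, and a level ratio in dB for the high band. The band edges below follow the abstract; the sampling rate and FFT size are illustrative assumptions.

```python
import numpy as np

def itd_ild_maps(stft_l, stft_r, sr=16000, n_fft=512):
    """Simple ITD and ILD time-frequency maps from binaural STFTs.

    stft_l, stft_r: complex (freq, time) spectrograms of the left/right ears,
    with freq = n_fft // 2 + 1 bins. ITD is derived from interaural phase
    difference (kept below 1.5 kHz); ILD is a level ratio in dB (kept above
    5 kHz), mirroring the BTFF band assignments in the abstract.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)[:, None]        # (freq, 1)
    ipd = np.angle(stft_l * np.conj(stft_r))                   # interaural phase diff
    with np.errstate(divide="ignore", invalid="ignore"):
        itd = np.where(freqs > 0, ipd / (2 * np.pi * freqs), 0.0)  # seconds
    ild = 20 * np.log10((np.abs(stft_l) + 1e-8) / (np.abs(stft_r) + 1e-8))
    itd[freqs[:, 0] > 1500, :] = 0.0    # keep only the low-frequency band
    ild[freqs[:, 0] < 5000, :] = 0.0    # keep only the high-frequency band
    return itd, ild
```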

[542] Text adaptation for speaker verification with speaker-text factorized embeddings

Yexin Yang, Shuai Wang, Xun Gong, Yanmin Qian, Kai Yu

Main category: eess.AS

TL;DR: A novel text adaptation framework improves text-dependent speaker verification by addressing text mismatch issues through speaker-text factorization and adaptation.

DetailsMotivation: Text mismatch between training/enrollment and test data harms speaker verification performance. Traditional data collection is costly and inflexible.

Method: Proposes a speaker-text factorization network to separate and integrate speaker and text embeddings, adapting text-independent embeddings to target content using a small adaptation dataset.

Result: Experiments on RSR2015 show significant performance improvement in text mismatch conditions.

Conclusion: The proposed framework effectively mitigates text mismatch issues, enhancing speaker verification without costly data collection.

Abstract: Text mismatch between pre-collected data, either training data or enrollment data, and the actual test data can significantly hurt text-dependent speaker verification (SV) system performance. Although this problem can be solved by carefully collecting data with the target speech content, such data collection could be costly and inflexible. In this paper, we propose a novel text adaptation framework to address the text mismatch issue. Here, a speaker-text factorization network is proposed to factorize the input speech into speaker embeddings and text embeddings and then integrate them into a single representation in the later stage. Given a small amount of speaker-independent adaptation utterances, text embeddings of target speech content can be extracted and used to adapt the text-independent speaker embeddings to text-customized speaker embeddings. Experiments on RSR2015 show that text adaptation can significantly improve performance under text mismatch conditions.
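A minimal sketch of the factorize-then-fuse idea follows, under the assumption that simple linear branches separate the two factors; the adaptation step averages text embeddings extracted from a few speaker-independent utterances of the target phrase, as the abstract describes. All layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SpeakerTextFactorizer(nn.Module):
    """Factorizes an utterance embedding into speaker and text components,
    then fuses them back into a single verification embedding."""
    def __init__(self, in_dim=256, spk_dim=128, txt_dim=128):
        super().__init__()
        self.spk_branch = nn.Linear(in_dim, spk_dim)
        self.txt_branch = nn.Linear(in_dim, txt_dim)
        self.fuse = nn.Linear(spk_dim + txt_dim, in_dim)

    def forward(self, utt_emb, txt_emb=None):
        spk = self.spk_branch(utt_emb)
        txt = self.txt_branch(utt_emb) if txt_emb is None else txt_emb
        return self.fuse(torch.cat([spk, txt], dim=-1)), spk, txt

# Text adaptation: average text embeddings from a few speaker-independent
# utterances of the target phrase, then pair them with any enrollment speaker.
model = SpeakerTextFactorizer()
adapt_utts = torch.randn(5, 256)                  # hypothetical adaptation set
_, _, txt_embs = model(adapt_utts)
target_txt = txt_embs.mean(dim=0, keepdim=True)   # text embedding of target content
enroll = torch.randn(1, 256)
adapted_emb, _, _ = model(enroll, txt_emb=target_txt)
```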

[543] Melodic and Metrical Elements of Expressiveness in Hindustani Vocal Music

Yash Bhake, Ankit Anand, Preeti Rao

Main category: eess.AS

TL;DR: Study of North Indian Khayal music aesthetics focusing on expressive timing and pitch variations in performances, proposing computational representations to differentiate expressive styles.

DetailsMotivation: To understand the flexibility and artistic expression in Khayal music performances.

Method: Analyzed expressive timing and pitch variations, developed computational representations, and processed audio data from ten artists performing two songs in two ragas.

Result: Proposed computational models can discriminate between performances based on expression.

Conclusion: The study provides insights into the expressive nuances of Khayal music and offers tools for performance analysis.

Abstract: This paper presents an attempt to study the aesthetics of North Indian Khayal music with reference to the flexibility exercised by artists in performing popular compositions. We study expressive timing and pitch variations of the given lyrical content within and across performances and propose computational representations that can discriminate between different performances of the same song in terms of expression. We present the necessary audio processing and annotation procedures, and discuss our observations and insights from the analysis of a dataset of two songs in two ragas each rendered by ten prominent artists.

[544] Pitfalls and Limits in Automatic Dementia Assessment

Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

Main category: eess.AS

TL;DR: The paper analyzes automated dementia assessment using speech, highlighting pitfalls like biased correlations and artifacts in test design, especially for healthy or mildly impaired individuals.

DetailsMotivation: To critically evaluate automated dementia assessments, focusing on the Syndrom-Kurz-Test, and identify biases in current methods.

Method: In-depth analysis of the Syndrom-Kurz-Test, examining correlations with human annotators and artifacts in test scoring.

Result: High correlations for severely impaired individuals but less so for healthy/mildly impaired ones; biases arise from test design and fallback handling.

Conclusion: Differentiated analysis of target groups is needed to address biases, independent of dataset group distributions.

Abstract: Current work on speech-based dementia assessment focuses on either feature extraction to predict assessment scales, or on the automation of existing test procedures. Most research uses public data unquestioningly and rarely performs a detailed error analysis, focusing primarily on numerical performance. We perform an in-depth analysis of an automated standardized dementia assessment, the Syndrom-Kurz-Test. We find that while there is a high overall correlation with human annotators, this is driven by certain artifacts: correlations are high for the severely impaired individuals but markedly weaker for the healthy or mildly impaired ones. Speech production decreases with cognitive decline, leading to overoptimistic correlations when test scoring relies on word naming. Depending on the test design, fallback handling introduces further biases that favor certain groups. These pitfalls remain independent of group distributions in datasets and require differentiated analysis of target groups.

[545] UniTalker: Conversational Speech-Visual Synthesis

Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li

Main category: eess.AS

TL;DR: The paper introduces Conversational Speech-Visual Synthesis (CSVS) as an extension of traditional CSS, proposing UniTalker, a unified model for generating empathetic speech and natural talking-face animations using multimodal dialogue context.

DetailsMotivation: Existing CSS lacks multimodal perception (e.g., eye contact) and speech-only responses limit interactivity. CSVS aims to enhance user-agent interaction with coherent audiovisual responses.

Method: UniTalker integrates multimodal perception (text, speech, talking-face animations) using a large-scale language model and employs multi-task sequence prediction for emotion inference and generation. Key optimizations include a neural landmark codec, bimodal alignment decoding, and emotion-guided rendering.

Result: The model synthesizes empathetic speech and natural talking-face animations, with experiments confirming emotional consistency and improved user experience.

Conclusion: CSVS and UniTalker address CSS limitations by leveraging multimodal context, offering a more expressive and interactive conversational experience.

Abstract: Conversational Speech Synthesis (CSS) is a key task in the user-agent interaction area, aiming to generate more expressive and empathetic speech for users. However, it is well-known that “listening” and “eye contact” play crucial roles in conveying emotions during real-world interpersonal communication. Existing CSS research is limited to perceiving only text and speech within the dialogue context, which restricts its effectiveness. Moreover, speech-only responses further constrain the interactive experience. To address these limitations, we introduce a Conversational Speech-Visual Synthesis (CSVS) task as an extension of traditional CSS. By leveraging multimodal dialogue context, it provides users with coherent audiovisual responses. To this end, we develop a CSVS system named UniTalker, which is a unified model that seamlessly integrates multimodal perception and multimodal rendering capabilities. Specifically, it leverages a large-scale language model to comprehensively understand multimodal cues in the dialogue context, including speaker, text, speech, and the talking-face animations. After that, it employs multi-task sequence prediction to first infer the target utterance’s emotion and then generate empathetic speech and natural talking-face animations. To ensure that the generated speech-visual content remains consistent in terms of emotion, content, and duration, we introduce three key optimizations: 1) Designing a specialized neural landmark codec to tokenize and reconstruct facial expression sequences. 2) Proposing a bimodal speech-visual hard alignment decoding strategy. 3) Applying emotion-guided rendering during the generation stage. Comprehensive objective and subjective experiments demonstrate that our model synthesizes more empathetic speech and provides users with more natural and emotionally consistent talking-face animations.

[546] ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

He Wang, Linhan Ma, Dake Guo, Xiong Wang, Lei Xie, Jin Xu, Junyang Lin

Main category: eess.AS

TL;DR: The paper introduces ContextASR-Bench, a benchmark to evaluate the linguistic capabilities of ASR systems, highlighting the superiority of LALMs over traditional models due to their world knowledge and context modeling.

DetailsMotivation: Existing ASR benchmarks focus on acoustic robustness, neglecting linguistic competence, especially in recognizing named entities across diverse domains.

Method: Proposes ContextASR-Bench, a large-scale benchmark with 40,000 entries and 300,000 named entities across 10+ domains, including domain and entity context for evaluation.

Result: LALMs outperform conventional ASR models significantly, leveraging LLMs’ world knowledge and context modeling, but further improvement is needed.

Conclusion: ContextASR-Bench addresses the gap in evaluating linguistic capabilities of ASR systems, demonstrating the potential of LALMs and encouraging further research.

Abstract: Automatic Speech Recognition (ASR) has been extensively investigated, yet prior benchmarks have largely focused on assessing the acoustic robustness of ASR models, leaving evaluations of their linguistic capabilities relatively underexplored. This largely stems from the limited parameter sizes and training corpora of conventional ASR models, leaving them with insufficient world knowledge, which is crucial for accurately recognizing named entities across diverse domains, for instance, drug and treatment names in medicine or specialized technical terms in engineering. Recent breakthroughs in Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of advanced context modeling and general artificial intelligence capabilities. Leveraging LLMs, we envision a unified system capable of robust speech recognition across diverse real-world domains, yet existing benchmarks are inadequate for evaluating this objective. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess the linguistic competence of ASR systems using corpora that feature numerous named entities across multiple domains. It encompasses up to 40,000 data entries with more than 300,000 named entities across over 10 domains. Beyond the audio and its transcription, each sample provides the domain it belongs to and a list of named entities it contains, which are referred to as the context. Based on this, we introduce three evaluation modes to assess how effectively models can exploit such context to improve ASR accuracy. Extensive evaluation on ContextASR-Bench highlights that LALMs outperform conventional ASR models by a large margin thanks to the strong world knowledge and context modeling of LLMs, yet there remains ample room for further improvement. The dataset and evaluation code have been released.

eess.IV

[547] Technical specification of a framework for the collection of clinical images and data

Alistair Mackenzie, Mark Halling-Brown, Ruben van Engen, Carlijn Roozemond, Lucy Warren, Dominic Ward, Nadia Smith

Main category: eess.IV

TL;DR: A framework for collecting clinical images and data for AI training and validation, emphasizing ethics, governance, and infrastructure for safe and shared data use.

DetailsMotivation: To ensure AI tools are trained and validated on up-to-date, representative data, combining older cases for long-term outcomes and current data for relevance.

Method: Describes an automated, ongoing collection framework for large-scale, long-term data, with alternatives like semi-automated methods for initial stages.

Result: Enables creation of datasets that reflect current practice and historical outcomes, improving AI tool validation.

Conclusion: Automated frameworks are ideal for large-scale AI data collection, but semi-automated methods can be practical starting points.

Abstract: In this report, a framework for the collection of clinical images and data for use when training and validating artificial intelligence (AI) tools is described. The report contains not only information about the collection of the images and clinical data, but also the ethics and information governance processes to consider to ensure the data is collected safely, and the infrastructure and agreements required to allow for the sharing of data with other groups. A key characteristic of the main collection framework described here is that it can enable automated and ongoing collection of datasets to ensure that the data is up-to-date and representative of current practice. This is important in the context of training and validating AI tools, as it is vital that datasets have a mix of older cases with long-term follow-up, such that the clinical outcome is as accurate as possible, and current data. Validations run on old data will provide findings and conclusions relative to the status of the imaging units when that data was generated. It is important that a validation dataset can assess the AI tools with the data they would see if deployed and active now. Other types of collection frameworks, which do not follow a fully automated approach, are also described. Whilst the fully automated method is recommended for large-scale, long-term image collection, there may be reasons to start data collection using semi-automated methods, and indications of how to do that are provided.

[548] A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models

Xiaoling Luo, Ruli Zheng, Qiaojian Zheng, Zibo Du, Shuo Yang, Meidan Ding, Qihao Xu, Chengliang Liu, Linlin Shen

Main category: eess.IV

TL;DR: A survey of multimodal deep learning in ophthalmology, covering task-specific methods and foundation models, with insights into datasets, challenges, and future directions like ultra-widefield imaging and reinforcement learning.

DetailsMotivation: To address the global health challenge of visual impairment by leveraging multimodal imaging for accurate diagnosis through advanced deep learning techniques.

Method: Systematic review of multimodal deep learning methods, categorized into task-specific approaches (e.g., lesion detection, disease diagnosis) and foundation models (e.g., vision-language architectures).

Result: Identifies key datasets, metrics, and innovations (e.g., self-supervised learning, attention-based fusion) while highlighting challenges like data variability and interpretability.

Conclusion: Future directions include ultra-widefield imaging and reinforcement learning to develop intelligent, interpretable AI systems for ophthalmology.

Abstract: Visual impairment represents a major global health challenge, with multimodal imaging providing complementary information that is essential for accurate ophthalmic diagnosis. This comprehensive survey systematically reviews the latest advances in multimodal deep learning methods in ophthalmology up to the year 2025. The review focuses on two main categories: task-specific multimodal approaches and large-scale multimodal foundation models. Task-specific approaches are designed for particular clinical applications such as lesion detection, disease diagnosis, and image synthesis. These methods utilize a variety of imaging modalities including color fundus photography, optical coherence tomography, and angiography. On the other hand, foundation models combine sophisticated vision-language architectures and large language models pretrained on diverse ophthalmic datasets. These models enable robust cross-modal understanding, automated clinical report generation, and decision support. The survey critically examines important datasets, evaluation metrics, and methodological innovations including self-supervised learning, attention-based fusion, and contrastive alignment. It also discusses ongoing challenges such as variability in data, limited annotations, lack of interpretability, and issues with generalizability across different patient populations. Finally, the survey outlines promising future directions that emphasize the use of ultra-widefield imaging and reinforcement learning-based reasoning frameworks to create intelligent, interpretable, and clinically applicable AI systems for ophthalmology.

[549] Improve Retinal Artery/Vein Classification via Channel Coupling

Shuang Zeng, Chee Hong Lee, Kaiwen Li, Boxu Xie, Ourui Fu, Hangzhou He, Lei Zhu, Yanye Lu, Fangxiao Cheng

Main category: eess.IV

TL;DR: The paper introduces a novel loss function and regularization term for retinal artery/vein classification, improving consistency and accuracy over existing methods.

DetailsMotivation: Manual segmentation and classification of retinal vessels are time-consuming and inconsistent, while existing automated methods treat artery, vein, and vessel segmentation as separate tasks, ignoring their intrinsic relationships.

Method: Proposes a Channel-Coupled Vessel Consistency Loss to enforce coherence between vessel, artery, and vein predictions, and an intra-image pixel-level contrastive loss for finer feature representation.

Result: Achieves state-of-the-art results on three public datasets (RITE, LES-AV, HRF).

Conclusion: The proposed method enhances consistency and accuracy in retinal A/V classification, addressing limitations of prior approaches.

Abstract: Retinal vessel segmentation plays a vital role in analyzing fundus images for the diagnosis of systemic and ocular diseases. Building on this, classifying segmented vessels into arteries and veins (A/V) further enables the extraction of clinically relevant features such as vessel width, diameter and tortuosity, which are essential for detecting conditions like diabetic and hypertensive retinopathy. However, manual segmentation and classification are time-consuming, costly and inconsistent. With the advancement of Convolutional Neural Networks, several automated methods have been proposed to address this challenge, but there are still some issues. For example, the existing methods all treat artery, vein and overall vessel segmentation as three separate binary tasks, neglecting the intrinsic coupling relationships between these anatomical structures. Considering artery and vein structures are subsets of the overall retinal vessel map and should naturally exhibit prediction consistency with it, we design a novel loss named Channel-Coupled Vessel Consistency Loss to enforce the coherence and consistency between vessel, artery and vein predictions, avoiding biasing the network toward three simple binary segmentation tasks. Moreover, we also introduce a regularization term named intra-image pixel-level contrastive loss to extract more discriminative feature-level fine-grained representations for accurate retinal A/V classification. SOTA results have been achieved across three public A/V classification datasets including RITE, LES-AV and HRF. Our code will be available upon acceptance.
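The coupling constraint can be written down compactly: since arteries and veins are subsets of the vessel map, the overall vessel probability should agree with the pixelwise union of the two branch predictions. Below is a PyTorch sketch of one plausible form of such a loss; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def vessel_consistency_loss(p_artery, p_vein, p_vessel):
    """Encourage coherence between artery, vein, and overall vessel predictions.

    p_*: (batch, 1, H, W) sigmoid probabilities. The probabilistic union
    a + v - a*v plays the role of the vessel map implied by the two branches;
    the loss penalizes disagreement with the dedicated vessel prediction.
    """
    union = torch.clamp(p_artery + p_vein - p_artery * p_vein, 0.0, 1.0)
    return F.mse_loss(p_vessel, union)
```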

[550] A Modified VGG19-Based Framework for Accurate and Interpretable Real-Time Bone Fracture Detection

Md. Ehsanul Haque, Abrar Fahim, Shamik Dey, Syoda Anamika Jahan, S. M. Jahidul Islam, Sakib Rokoni, Md Sakib Morshed

Main category: eess.IV

TL;DR: An automated framework using a modified VGG-19 model for accurate and interpretable bone fracture detection in X-ray images, achieving high accuracy and real-time performance.

DetailsMotivation: Early and accurate fracture detection is crucial for timely treatment, but current methods are time-consuming, error-prone, and lack interpretability for clinical use.

Method: The framework uses a modified VGG-19 model with preprocessing techniques (CLAHE, Otsu’s thresholding, Canny edge detection) and Grad-CAM for interpretability, deployed in a real-time web application.

Result: Achieves 99.78% classification accuracy and AUC score of 1.00, providing fast and reliable diagnostic feedback.

Conclusion: The framework offers a reliable, fast, and interpretable solution for bone fracture detection, improving diagnoses and patient care.

Abstract: Early and accurate detection of bone fractures is paramount to initiating treatment as early as possible and avoiding delays in patient treatment and outcomes. Interpretation of X-ray images is a time-consuming and error-prone task, especially when resources for such interpretation are limited by a lack of radiology expertise. Additionally, deep learning approaches used currently typically suffer from misclassifications and lack interpretable explanations for clinical use. In order to overcome these challenges, we propose an automated framework for bone fracture detection using a VGG-19 model modified to our needs. It incorporates sophisticated preprocessing techniques that include Contrast Limited Adaptive Histogram Equalization (CLAHE), Otsu’s thresholding, and Canny edge detection, among others, to enhance image clarity as well as to facilitate feature extraction. For model interpretability, we use Grad-CAM, an Explainable AI method that can generate visual heatmaps of the model’s decision-making process, so that clinicians can understand the model’s decisions; this encourages trust and helps in further clinical validation. The framework is deployed in a real-time web application, where healthcare professionals can upload X-ray images and get diagnostic feedback within 0.5 seconds. Our modified VGG-19 model attains 99.78% classification accuracy and an AUC score of 1.00. The framework provides a reliable, fast, and interpretable solution for bone fracture detection, supporting more efficient diagnoses and better patient care.
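The named preprocessing stages map directly onto standard OpenCV calls. A sketch follows; how the resulting maps are combined and fed to the modified VGG-19 is not specified here, and the parameter values are illustrative.

```python
import cv2

def preprocess_xray(path):
    """Preprocessing stages named in the paper, sketched with OpenCV.

    Returns the contrast-enhanced image plus Otsu and Canny maps.
    """
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img)                           # CLAHE contrast enhancement
    _, otsu = cv2.threshold(enhanced, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu's thresholding
    edges = cv2.Canny(enhanced, 50, 150)                  # Canny edge map
    return enhanced, otsu, edges
```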

[551] Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang

Main category: eess.IV

TL;DR: The paper proposes a method to enhance vision-language pre-training (VLP) for medical diagnostics by boosting visual semantic density, addressing the gap between low-SNR medical images and high-SNR reports. It uses disease-level contrastive learning and anatomical normality modeling to improve alignment and achieves state-of-the-art zero-shot performance.

DetailsMotivation: Aligning medical images with low signal-to-noise ratio (SNR) to high-SNR reports creates a semantic density gap, leading to visual alignment bias. The paper aims to bridge this gap for better diagnostic capabilities.

Method: The approach includes disease-level vision contrastive learning to differentiate normal/abnormal samples and anatomical normality modeling using VQ-VAE to reconstruct normal vision embeddings, amplifying abnormal signals.

Result: Experiments on chest and abdominal CT datasets show state-of-the-art zero-shot performance, with an average AUC of 84.9% across 54 diseases in 15 organs, outperforming existing methods.

Conclusion: The proposed method effectively enhances visual representation for medical diagnostics, improving alignment with reports and demonstrating superior transfer learning capabilities.

Abstract: Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model’s ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model’s perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost.
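The normality-modeling intuition can be sketched in a few lines: quantize each vision embedding against a codebook trained only on normal anatomy, and read the reconstruction residual as an abnormality signal. This is a sketch of the intuition with a hypothetical codebook, not the paper's full VQ-VAE pipeline.

```python
import torch

def abnormality_signal(vision_emb, codebook):
    """Quantize embeddings against a codebook learned on normal anatomy only.

    vision_emb: (N, D) patch/organ embeddings; codebook: (K, D) VQ codes.
    Normal embeddings land near some code (small residual); abnormal ones
    land far from every code, so the residual acts as an amplified
    abnormality signal.
    """
    d = torch.cdist(vision_emb, codebook)        # (N, K) distances to all codes
    nearest = codebook[d.argmin(dim=1)]          # nearest-code reconstruction
    return (vision_emb - nearest).norm(dim=1)    # per-embedding residual magnitude
```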

[552] Do We Need Pre-Processing for Deep Learning Based Ultrasound Shear Wave Elastography?

Sarah Grube, Sören Grünhagen, Sarah Latus, Michael Meyling, Alexander Schlaefer

Main category: eess.IV

TL;DR: Deep learning can reliably predict shear wave velocities in ultrasound elastography without extensive pre-processing, reducing bias and enabling faster clinical assessments.

DetailsMotivation: To investigate the necessity of ultrasound pre-processing steps for deep learning-based shear wave elastography and assess the impact on elasticity analysis.

Method: A 3D convolutional neural network was used to predict shear wave velocities from spatio-temporal ultrasound images, comparing different pre-processing levels (from raw radiofrequency data to fully processed images). Performance was compared to a conventional time-of-flight method across gelatin phantoms with varying elasticity.

Result: The deep learning approach reliably differentiated elasticity groups, even with raw data. Pre-processing slightly improved metrics but was not essential.

Conclusion: Deep learning reduces reliance on traditional pre-processing, offering faster and more reliable clinical elasticity assessments.

Abstract: Estimating the elasticity of soft tissue can provide useful information for various diagnostic applications. Ultrasound shear wave elastography offers a non-invasive approach. However, its generalizability and standardization across different systems and processing pipelines remain limited. Considering the influence of image processing on ultrasound based diagnostics, recent literature has discussed the impact of different image processing steps on reliable and reproducible elasticity analysis. In this work, we investigate the need of ultrasound pre-processing steps for deep learning-based ultrasound shear wave elastography. We evaluate the performance of a 3D convolutional neural network in predicting shear wave velocities from spatio-temporal ultrasound images, studying different degrees of pre-processing on the input images, ranging from fully beamformed and filtered ultrasound images to raw radiofrequency data. We compare the predictions from our deep learning approach to a conventional time-of-flight method across four gelatin phantoms with different elasticity levels. Our results demonstrate statistically significant differences in the predicted shear wave velocity among all elasticity groups, regardless of the degree of pre-processing. Although pre-processing slightly improves performance metrics, our results show that the deep learning approach can reliably differentiate between elasticity groups using raw, unprocessed radiofrequency data. These results show that deep learning-based approaches could reduce the need for and the bias of traditional ultrasound pre-processing steps in ultrasound shear wave elastography, enabling faster and more reliable clinical elasticity assessments.
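The prediction model is a 3D CNN regressing a scalar shear wave velocity from spatio-temporal input. Below is a minimal PyTorch sketch with illustrative layer sizes; the paper's architecture details may differ.

```python
import torch
import torch.nn as nn

class ShearWaveNet(nn.Module):
    """Minimal 3D CNN regressing shear wave velocity from spatio-temporal
    ultrasound volumes (time, height, width). Sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),       # global pooling over (T, H, W)
        )
        self.head = nn.Linear(32, 1)       # scalar velocity, e.g., in m/s

    def forward(self, x):                  # x: (batch, 1, T, H, W)
        return self.head(self.features(x).flatten(1))

vel = ShearWaveNet()(torch.randn(2, 1, 32, 64, 64))   # (2, 1) predicted velocities
```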

[553] M$^3$HL: Mutual Mask Mix with High-Low Level Feature Consistency for Semi-Supervised Medical Image Segmentation

Yajun Liu, Zenghui Zhang, Jiang Yue, Weiwei Guo, Dongying Li

Main category: eess.IV

TL;DR: Proposes M$^3$HL, a method combining dynamic masking (M$^3$) and hierarchical feature consistency (HL) for semi-supervised medical image segmentation, outperforming benchmarks.

DetailsMotivation: Existing CutMix-based methods lack flexibility and feature-level consistency in semi-supervised medical image segmentation.

Method: M$^3$HL introduces dynamic masking (M$^3$) for better data augmentation and hierarchical consistency (HL) for feature alignment.

Result: Achieves state-of-the-art performance on ACDC and LA datasets.

Conclusion: M$^3$HL effectively improves segmentation by enhancing data augmentation and feature consistency.

Abstract: Data augmentation methods inspired by CutMix have demonstrated significant potential in recent semi-supervised medical image segmentation tasks. However, these approaches often apply CutMix operations in a rigid and inflexible manner, while paying insufficient attention to feature-level consistency constraints. In this paper, we propose a novel method called Mutual Mask Mix with High-Low level feature consistency (M$^3$HL) to address the aforementioned challenges, which consists of two key components: 1) M$^3$: An enhanced data augmentation operation inspired by the masking strategy from Masked Image Modeling (MIM), which advances conventional CutMix through dynamically adjustable masks to generate spatially complementary image pairs for collaborative training, thereby enabling effective information fusion between labeled and unlabeled images. 2) HL: A hierarchical consistency regularization framework that enforces high-level and low-level feature consistency between unlabeled and mixed images, enabling the model to better capture discriminative feature representations. Our method achieves state-of-the-art performance on widely adopted medical image segmentation benchmarks including the ACDC and LA datasets. Source code is available at https://github.com/PHPJava666/M3HL
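The M$^3$ operation can be sketched as a single random block mask applied in complementary ways to a labeled/unlabeled image pair. The block size and mask ratio below are illustrative stand-ins for the paper's dynamically adjusted masks.

```python
import torch

def mutual_mask_mix(x_labeled, x_unlabeled, mask_ratio=0.5, block=32):
    """Generate a spatially complementary image pair from one random mask.

    x_*: (B, C, H, W). A coarse random block mask M swaps content between the
    labeled and unlabeled images; the two outputs cover complementary regions,
    so information from both sources is fused during collaborative training.
    """
    B, C, H, W = x_labeled.shape
    coarse = (torch.rand(B, 1, H // block, W // block) < mask_ratio).float()
    m = torch.nn.functional.interpolate(coarse, size=(H, W), mode="nearest")
    mix_a = m * x_labeled + (1 - m) * x_unlabeled
    mix_b = (1 - m) * x_labeled + m * x_unlabeled
    return mix_a, mix_b, m
```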

[554] Unsupervised Classification of Coded Hyperspectral Acquisitions: What Ground Truths?

Trung-tin Dinh, Hervé Carfantan, Antoine Monmayrant, Simon Lacroix

Main category: eess.IV

TL;DR: An unsupervised classification method for DD-CASSI hyperspectral data is proposed, addressing limitations of ground truths and demonstrating spectral coherence in Pavia University scenes.

DetailsMotivation: To improve unsupervised classification in hyperspectral imaging by addressing issues like unclear class definitions and high intra-class variability in ground truths.

Method: Uses a model of intra-class spectral variability to identify classes and estimate reference spectra from compressed data.

Result: Shows detection of spectrally coherent regions in Pavia University scene, questioning current evaluation methods.

Conclusion: Highlights the need to rethink evaluation of unsupervised classification methods due to limitations in ground truths.

Abstract: We propose an unsupervised classification method using a limited number of coded acquisitions from a DD-CASSI hyperspectral imager. Based on a simple model of intra-class spectral variability, this approach allows us to identify classes and estimate reference spectra, despite data compression by a factor of ten. Here, we highlight the limitations of the ground truths commonly used to evaluate this type of method: the lack of a clear definition of the notion of class, high intra-class variability, and even classification errors. Using the Pavia University scene, we show that with simple assumptions, it is possible to detect regions that are spectrally more coherent, highlighting the need to rethink the evaluation of classification methods, particularly in unsupervised scenarios.

[555] FUTransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation

Akwasi Asare, Mary Sagoe, Justice Williams Asare

Main category: eess.IV

TL;DR: FUTransUNet, a hybrid of Vision Transformers and U-Net, improves diabetic foot ulcer segmentation by combining global attention with local feature extraction, achieving high accuracy and interpretability.

DetailsMotivation: Diabetic foot ulcer segmentation is challenging due to heterogeneous appearance and complex backgrounds. Traditional CNNs like U-Net lack long-range spatial dependency modeling.

Method: Proposed FUTransUNet integrates Vision Transformers’ global attention into U-Net, maintaining spatial resolution via skip connections and decoding. Trained on the FUSeg dataset.

Result: Achieved Dice Coefficient of 0.8751 and IoU of 0.7780 on validation set. Grad-CAM visualizations ensured clinical transparency.

Conclusion: FUTransUNet offers a robust, accurate, and interpretable solution for DFU segmentation, enhancing clinical wound assessment and patient care.

Abstract: Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we propose FUTransUNet, a hybrid architecture that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net framework. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution through skip connections and an effective decoding pathway. We trained and validated FUTransUNet on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset. FUTransUNet achieved a training Dice Coefficient of 0.8679, an IoU of 0.7672, and a training loss of 0.0053. On the validation set, the model achieved a Dice Coefficient of 0.8751, an IoU of 0.7780, and a validation loss of 0.009045. To ensure clinical transparency, we employed Grad-CAM visualizations, which highlighted model focus areas during prediction. These quantitative outcomes clearly demonstrate that our hybrid approach successfully integrates global and local feature extraction paradigms, thereby offering a robust, accurate, explainable, and clinically translatable solution for automated foot ulcer analysis. The approach offers a reliable, high-fidelity method for DFU segmentation, with implications for improving real-world wound assessment and patient care.
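Grad-CAM itself is a generic recipe: pool the gradients of the target class score over the chosen layer's spatial dimensions and use them to weight that layer's activations. A compact PyTorch sketch follows; the layer choice and class index are up to the caller, and which layer is most informative for this model is a judgment call.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, class_idx):
    """Grad-CAM: weight a layer's activations by the spatially pooled
    gradients of the target class score, then ReLU and upsample."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)[0, class_idx]        # x: (1, C, H, W) input image
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP over gradient maps
    cam = F.relu((weights * acts["a"]).sum(dim=1))        # (1, h, w) raw heatmap
    return F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear",
                         align_corners=False)[0, 0]       # heatmap at input size
```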

[556] Assessing the Impact of Image Super Resolution on White Blood Cell Classification Accuracy

Tatwadarshi P. Nagarhalli, Shruti S. Pawar, Soham A. Dahanukar, Uday Aswalekar, Ashwini M. Save, Sanket D. Patil

Main category: eess.IV

TL;DR: The paper explores how image super-resolution techniques improve white blood cell classification accuracy in low-resolution microscopic images by enhancing resolution and integrating augmented data into deep learning models.

DetailsMotivation: Accurate white blood cell classification is crucial for medical diagnostics, but low-resolution images hinder performance. The study aims to address this by investigating the impact of image enhancement on classification.

Method: The study uses large image dimension upscaling and integrates enhanced images into the training process of deep learning models to capture subtle morphological changes.

Result: Extensive testing with a well-known model shows improved classification accuracy by leveraging both standard and enhanced images.

Conclusion: The research highlights the benefits of image enhancement for classification and aims to develop more efficient algorithms tailored to white blood cell datasets.

Abstract: Accurately classifying white blood cells from microscopic images is essential for identifying several illnesses and conditions in medical diagnostics. Many deep learning technologies are employed to classify such images quickly and automatically. However, the resolution of these microscopic images is often quite low, which can make correct classification difficult. Image enhancement techniques, such as image super-resolution, are utilized to improve resolution and work around this issue. The proposed study uses large image dimension upscaling to investigate how image-enhancement approaches affect classification performance. Specifically, it examines how deep learning models may understand more complex visual information by capturing subtler morphological changes when image resolution is increased using state-of-the-art techniques. Because the enhanced images are incorporated into the training process, the model can learn from both standard and augmented data. This dual approach seeks to quantify the impact of image resolution on model performance and to enhance classification accuracy. A well-known image classification model is used to conduct extensive testing and thoroughly evaluate the effectiveness of this approach. By understanding the trade-offs between ordinary and enhanced images, this research aims to create more efficient image recognition algorithms customized to a particular white blood cell dataset.

[557] Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Population-Based Screening and Primary Diagnosis in a Global, Multiethnic Population (Study Protocol)

Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B. C. D. Ng, Aqua Asif, Kirti Magudia, Peder Larson, Qinglin Xie, Xiaodong Zhang, Chi Pham Minh, Samuel N. Gitau, Ivo G. Schoots, Martijn F. Boomsma, Renato Cuocolo, Nikolaos Papanikolaou, Daniele Regge, Derya Yakar, Mattijs Elschot, Jeroen Veltman, Baris Turkbey, Nancy A. Obuchowski, Jurgen J. Fütterer, Anwar R. Padhani, Hashim U. Ahmed, Tobias Nordström, Martin Eklund, Veeru Kasivisvanathan, Maarten de Rooij, Henkjan Huisman

Main category: eess.IV

TL;DR: The study validates the PI-CAI-2B AI model for detecting Gleason grade group ≥2 prostate cancer on MRI, using a large, diverse dataset from multiple countries and settings.

DetailsMotivation: To improve prostate cancer detection by developing and validating an efficient AI system (PI-CAI-2B) against standard clinical assessments.

Method: Retrospective cohort of 22,481 MRI exams from 22 countries, split into training/internal testing (20,471 cases) and external testing (2,010 cases). Primary endpoint is agreement with standard diagnoses.

Result: Hypothesis of diagnostic interchangeability with standard care at PI-RADS cut-offs, with secondary measures like AUROC to assess biases.

Conclusion: The PI-CAI-2B model shows promise for accurate prostate cancer detection, with potential for clinical use pending further validation.

Abstract: In this intercontinental, confirmatory study, we include a retrospective cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries) to train and externally validate the PI-CAI-2B model, i.e., an efficient, next-generation iteration of the state-of-the-art AI system that was developed for detecting Gleason grade group $\geq$2 prostate cancer on MRI during the PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12 independent centers based in Europe, North America, Asia and Africa, are used for training and internal testing. Additionally, 2010 cases (2010 patients; 20 external cities in 12 countries) from population-based screening (STHLM3-MRI, IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in Europe, North and South Americas, Asia and Australia, are used for external testing. Primary endpoint is the proportion of AI-based assessments in agreement with the standard of care diagnoses (i.e., clinical assessments made by expert uropathologists on histopathology, if available, or at least two expert urogenital radiologists in consensus; with access to patient history and peer consultation) in the detection of Gleason grade group $\geq$2 prostate cancer within the external testing cohorts. Our statistical analysis plan is prespecified with a hypothesis of diagnostic interchangeability to the standard of care at the PI-RADS $\geq$3 (primary diagnosis) or $\geq$4 (screening) cut-off, considering an absolute margin of 0.05 and reader estimates derived from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary measures comprise the area under the receiver operating characteristic curve (AUROC) of the AI system stratified by imaging quality, patient age and patient ethnicity to identify underlying biases (if any).

[558] When Deep Learning Fails: Limitations of Recurrent Models on Stroke-Based Handwriting for Alzheimer’s Disease Detection

Emanuele Nardone, Tiziana D’Alessandro, Francesco Fontanella, Claudio De Stefano

Main category: eess.IV

TL;DR: Deep learning models (LSTM, GRU, RNN) underperform traditional methods in non-invasive Alzheimer’s detection via handwriting analysis due to data-model mismatch.

DetailsMotivation: To enable non-invasive Alzheimer's detection using handwriting analysis, avoiding costly or invasive procedures.

Method: Evaluated three recurrent neural architectures (LSTM, GRU, RNN) and traditional machine learning models on a dataset of 34 handwriting tasks from healthy and Alzheimer’s patients.

Result: Traditional ensemble methods outperformed deep learning models, which showed poor specificity and high variance due to data-model incompatibility.

Conclusion: Recurrent architectures fail for discrete stroke-based handwriting data; future research should address data representation and model compatibility.

Abstract: Alzheimer’s disease detection requires expensive neuroimaging or invasive procedures, limiting accessibility. This study explores whether deep learning can enable non-invasive Alzheimer’s disease detection through handwriting analysis. Using a dataset of 34 distinct handwriting tasks collected from healthy controls and Alzheimer’s disease patients, we evaluate and compare three recurrent neural architectures (LSTM, GRU, RNN) against traditional machine learning models. A crucial distinction of our approach is that the recurrent models process pre-extracted features from discrete strokes, not raw temporal signals. This violates the assumption of a continuous temporal flow that recurrent networks are designed to capture. Results reveal that they exhibit poor specificity and high variance. Traditional ensemble methods significantly outperform all deep architectures, achieving higher accuracy with balanced metrics. This demonstrates that recurrent architectures, designed for continuous temporal sequences, fail when applied to feature vectors extracted from ambiguously segmented strokes. Despite their complexity, deep learning models cannot overcome the fundamental disconnect between their architectural assumptions and the discrete, feature-based nature of stroke-level handwriting data. Although performance is limited, the study highlights several critical issues in data representation and model compatibility, pointing to valuable directions for future research.
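To make the architectural mismatch concrete, here is a minimal sketch (with hypothetical feature dimensions) of a recurrent classifier consuming per-stroke feature vectors, with the winning ensemble baseline indicated in comments; this illustrates the setup, not the authors' code.

```python
import torch
import torch.nn as nn

class StrokeLSTM(nn.Module):
    """Recurrent classifier over sequences of per-stroke feature vectors,
    the setup the paper argues is mismatched to discrete stroke data."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # healthy vs Alzheimer's

    def forward(self, x):                         # x: (batch, n_strokes, n_features)
        _, (h, _) = self.lstm(x)                  # final hidden state summarizes sequence
        return self.head(h[-1])

x = torch.randn(8, 40, 16)                        # 8 subjects, 40 strokes, 16 features
logits = StrokeLSTM(n_features=16)(x)             # (8, 2)

# The ensemble baselines that won out would instead see one aggregated
# feature vector per subject, e.g.:
#   from sklearn.ensemble import RandomForestClassifier
#   RandomForestClassifier().fit(x.mean(dim=1).numpy(), y)
```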

[559] UNISELF: A Unified Network with Instance Normalization and Self-Ensembled Lesion Fusion for Multiple Sclerosis Lesion Segmentation

Jinwei Zhang, Lianrui Zuo, Blake E. Dewey, Samuel W. Remedios, Yihao Liu, Savannah P. Hays, Dzung L. Pham, Ellen M. Mowry, Scott D. Newsome, Peter A. Calabresi, Aaron Carass, Jerry L. Prince

Main category: eess.IV

TL;DR: UNISELF is a deep learning method for MS lesion segmentation that optimizes in-domain accuracy and out-of-domain generalization using test-time self-ensembled lesion fusion and test-time instance normalization (TTIN).

DetailsMotivation: Current DL methods for MS lesion segmentation lack simultaneous optimization of in-domain accuracy and out-of-domain generalization when trained on limited single-source data.

Method: UNISELF uses test-time self-ensembled lesion fusion and TTIN to handle domain shifts and missing contrasts.

Result: UNISELF ranks among the best on ISBI 2015 test data and outperforms benchmarks on diverse out-of-domain datasets (MICCAI 2016, UMCL, and private multisite data).

Conclusion: UNISELF effectively addresses domain shifts and missing contrasts, achieving high accuracy and generalization in MS lesion segmentation.

Abstract: Automated segmentation of multiple sclerosis (MS) lesions using multicontrast magnetic resonance (MR) images improves efficiency and reproducibility compared to manual delineation, with deep learning (DL) methods achieving state-of-the-art performance. However, these DL-based methods have yet to simultaneously optimize in-domain accuracy and out-of-domain generalization when trained on a single source with limited data, or their performance has been unsatisfactory. To fill this gap, we propose a method called UNISELF, which achieves high accuracy within a single training domain while demonstrating strong generalizability across multiple out-of-domain test datasets. UNISELF employs a novel test-time self-ensembled lesion fusion to improve segmentation accuracy, and leverages test-time instance normalization (TTIN) of latent features to address domain shifts and missing input contrasts. Trained on the ISBI 2015 longitudinal MS segmentation challenge training dataset, UNISELF ranks among the best-performing methods on the challenge test dataset. Additionally, UNISELF outperforms all benchmark methods trained on the same ISBI training data across diverse out-of-domain test datasets with domain shifts and missing contrasts, including the public MICCAI 2016 and UMCL datasets, as well as a private multisite dataset. These test datasets exhibit domain shifts and/or missing contrasts caused by variations in acquisition protocols, scanner types, and imaging artifacts arising from imperfect acquisition. Our code is available at https://github.com/uponacceptance.
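The abstract names TTIN but not its exact form; below is a plausible minimal sketch of test-time instance normalization of latent features, with hypothetical tensor shapes.

```python
import torch

def test_time_instance_norm(feats: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Re-normalize latent features with the current test instance's own
    per-channel statistics (over spatial dims), one way to absorb domain
    shift at inference."""
    dims = tuple(range(2, feats.ndim))            # spatial dimensions
    mu = feats.mean(dim=dims, keepdim=True)
    sigma = feats.std(dim=dims, keepdim=True)
    return (feats - mu) / (sigma + eps)

latent = torch.randn(1, 32, 24, 24, 24)           # hypothetical 3D feature map
normed = test_time_instance_norm(latent)
```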

[560] PET2Rep: Towards Vision-Language Model-Driven Automated Radiology Report Generation for Positron Emission Tomography

Yichi Zhang, Wenbo Zhang, Zehui Ling, Gang Feng, Sisi Peng, Deshu Chen, Yuchen Liu, Hongwei Zhang, Shuqi Wang, Lanlan Li, Limei Han, Yuan Cheng, Zixin Hu, Yuan Qi, Le Xue

Main category: eess.IV

TL;DR: PET2Rep is a new benchmark for evaluating vision-language models (VLMs) in generating radiology reports for PET images, addressing a gap in existing datasets. Current VLMs perform poorly on this task.

DetailsMotivation: Manual radiology report generation is labor-intensive, and existing VLMs focus on structural imaging, neglecting PET's metabolic insights.

Method: Introduces PET2Rep, a dataset with whole-body PET image-report pairs, and evaluates 30 VLMs using clinical and natural language metrics.

Result: State-of-the-art VLMs underperform in PET report generation, highlighting key insufficiencies.

Conclusion: PET2Rep fills a critical gap, but VLMs need significant improvement for practical PET report generation.

Abstract: Positron emission tomography (PET) is a cornerstone of modern oncologic and neurologic imaging, distinguished by its unique ability to illuminate dynamic metabolic processes that transcend the anatomical focus of traditional imaging technologies. Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advancements of vision-language models (VLMs) have shown strong potential in medical applications, presenting a promising avenue for automating report generation. However, existing applications of VLMs in the medical domain have predominantly focused on structural imaging modalities, while the unique characteristics of molecular PET imaging have largely been overlooked. To bridge the gap, we introduce PET2Rep, a large-scale comprehensive benchmark for evaluation of general and medical VLMs for radiology report generation for PET images. PET2Rep stands out as the first dedicated dataset for PET report generation with metabolic information, uniquely capturing whole-body image-report pairs that cover dozens of organs to fill the critical gap in existing benchmarks and mirror real-world clinical comprehensiveness. In addition to widely recognized natural language generation metrics, we introduce a series of clinical efficiency metrics to evaluate the quality of radiotracer uptake pattern description in key organs in generated reports. We conduct a head-to-head comparison of 30 cutting-edge general-purpose and medical-specialized VLMs. The results show that the current state-of-the-art VLMs perform poorly on the PET report generation task, falling considerably short of fulfilling practical needs. Moreover, we identify several key insufficiencies that need to be addressed to advance development in medical applications.
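The clinical efficiency metrics are not spelled out in the abstract; the toy sketch below illustrates one plausible organ-level uptake-agreement check of the kind described (organ list, uptake levels, and phrasing are all invented).

```python
# Toy organ-level uptake agreement, in the spirit of the clinical metrics
# described; every name and phrase here is illustrative only.
ORGANS = ["liver", "spleen", "brain"]
LEVELS = ["increased", "normal", "decreased"]

def organ_uptake(report: str) -> dict:
    found = {}
    for organ in ORGANS:
        for level in LEVELS:
            if f"{level} uptake in the {organ}" in report.lower():
                found[organ] = level
    return found

reference = "Increased uptake in the liver. Normal uptake in the spleen."
generated = "Increased uptake in the liver. Decreased uptake in the spleen."
ref, gen = organ_uptake(reference), organ_uptake(generated)
matches = sum(ref.get(o) == gen.get(o) for o in ref)
print(f"organ-level agreement: {matches}/{len(ref)}")   # 1/2
```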

[561] Edge2Prompt: Modality-Agnostic Model for Out-of-Distribution Liver Segmentation

Nathan Hollet, Oumeymah Cherkaoui, Philippe C. Cattin, Sidaty El hadramy

Main category: eess.IV

TL;DR: Edge2Prompt is a modality-agnostic liver segmentation pipeline combining edge detection and foundation models, outperforming traditional methods in data-scarce and out-of-distribution scenarios.

DetailsMotivation: Liver segmentation is crucial for clinical workflows but faces challenges due to modality-specific tools and limited data.

Method: Integrates edge detection with a U-Net and SAM-2: modality-agnostic edge maps are turned into logit-based prompts for 2D segmentation, which is then reconstructed into 3D volumes.

Result: Achieves 86.4% mean Dice Score on OOD tasks, surpassing U-Net by 27.4% and other methods by 9.1%.

Conclusion: Edge2Prompt bridges classical and foundation models for adaptable, data-efficient liver segmentation.

Abstract: Liver segmentation is essential for preoperative planning in interventions like tumor resection or transplantation, but implementation in clinical workflows faces challenges due to modality-specific tools and data scarcity. We propose Edge2Prompt, a novel pipeline for modality-agnostic liver segmentation that generalizes to out-of-distribution (OOD) data. Our method integrates classical edge detection with foundation models. Modality-agnostic edge maps are first extracted from input images, then processed by a U-Net to generate logit-based prompts. These prompts condition the Segment Anything Model 2 (SAM-2) to generate 2D liver segmentations, which can then be reconstructed into 3D volumes. Evaluated on the multi-modal CHAOS dataset, Edge2Prompt achieves competitive results compared to classical segmentation methods when trained and tested in-distribution (ID), and outperforms them in data-scarce scenarios due to the SAM-2 module. Furthermore, it achieves a mean Dice Score of 86.4% on OOD tasks, outperforming U-Net baselines by 27.4% and other self-prompting methods by 9.1%, demonstrating its effectiveness. This work bridges classical and foundation models for clinically adaptable, data-efficient segmentation.
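A minimal sketch of the pipeline's first half, assuming Sobel filtering as the edge detector (the abstract does not name one); the U-Net and the SAM-2 call are indicated as placeholders, not the real API.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img: torch.Tensor) -> torch.Tensor:
    """Modality-agnostic edge map via Sobel filtering (one simple choice;
    the abstract does not commit to a specific detector)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                       # Sobel y kernel
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

slice_2d = torch.rand(1, 1, 256, 256)             # one CT or MR slice, any modality
edges = sobel_edges(slice_2d)

# A U-Net (omitted) would map `edges` to liver logits; those logits then
# condition SAM-2 as a mask prompt, roughly (pseudocode, not the real API):
#   mask_2d = sam2_predict(image=slice_2d, mask_input=unet(edges))
# and the per-slice masks are finally stacked into a 3D volume.
```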

[562] Discriminating Distal Ischemic Stroke from Seizure-Induced Stroke Mimics Using Dynamic Susceptibility Contrast MRI

Marijn Borghouts, Richard McKinley, Josien Pluim, Manuel Köstner, Roland Wiest, Ruisheng Su

Main category: eess.IV

TL;DR: The study investigates magnetic resonance perfusion (MRP) imaging to differentiate distal acute ischemic stroke (AIS) from seizure-induced stroke mimics, achieving high diagnostic accuracy with perfusion map descriptors (PMDs).

DetailsMotivation: Distinguishing AIS from stroke mimics (SMs), especially for distal occlusions, is challenging due to CT's limited sensitivity.

Method: Retrospective analysis of 162 patients using DSC-MRP images to extract PMDs, followed by statistical and logistic regression modeling.

Result: The model achieved AUROC of 0.90, AUPRC of 0.74, 92% specificity, and 73% sensitivity, identifying discriminative brain regions.

Conclusion: MRP-based PMDs show promise for distinguishing AIS from mimics, warranting further exploration.

Abstract: Distinguishing acute ischemic strokes (AIS) from stroke mimics (SMs), particularly in cases involving medium and small vessel occlusions, remains a significant diagnostic challenge. While computed tomography (CT) based protocols are commonly used in emergency settings, their sensitivity for detecting distal occlusions is limited. This study explores the potential of magnetic resonance perfusion (MRP) imaging as a tool for differentiating distal AIS from epileptic seizures, a prevalent SM. Using a retrospective dataset of 162 patients (129 AIS, 33 seizures), we extracted region-wise perfusion map descriptors (PMDs) from dynamic susceptibility contrast (DSC) images. Statistical analyses identified several brain regions, located mainly in the temporal and occipital lobe, exhibiting significant group differences in certain PMDs. Hemispheric asymmetry analyses further highlighted these regions as discriminative. A logistic regression model trained on PMDs achieved an area under the receiver operating characteristic (AUROC) curve of 0.90, and an area under the precision recall curve (AUPRC) of 0.74, with a specificity of 92% and a sensitivity of 73%, suggesting strong performance in distinguishing distal AIS from seizures. These findings support further exploration of MRP-based PMDs as interpretable features for distinguishing true strokes from various mimics. The code is openly available on GitHub: https://github.com/Marijn311/PMD_extraction_and_analysis
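The modeling step is standard enough to sketch; below, synthetic stand-in features match the cohort sizes, while the actual PMDs come from the linked repository.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(162, 20))                        # 162 patients x 20 PMDs (synthetic)
y = np.r_[np.ones(129), np.zeros(33)].astype(int)     # 129 AIS, 33 seizures

# Cross-validated probabilities from a logistic regression on the PMDs.
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)[:, 1]
print("AUROC:", roc_auc_score(y, probs))              # paper reports 0.90 on real PMDs
print("AUPRC:", average_precision_score(y, probs))    # paper reports 0.74
```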

[563] Unmasking Interstitial Lung Diseases: Leveraging Masked Autoencoders for Diagnosis

Ethan Dack, Lorenzo Brigato, Vasilis Dedousis, Janine Gote-Schniering, Cheryl, Hanno Hoppe, Aristomenis Exadaktylos, Manuela Funke-Chambour, Thomas Geiser, Andreas Christe, Lukas Ebner, Stavroula Mougiakakou

Main category: eess.IV

TL;DR: MAEs pretrained on unlabelled chest CT scans improve diagnostic performance for diffused lung diseases, even with limited labelled data.

DetailsMotivation: Annotated imaging datasets for diffused lung diseases are scarce, making MAEs a promising solution for learning robust features from unlabelled data.

Method: Train an MAE on 5,000+ chest CT scans (mix of in-house and public data), then fine-tune for diffused lung disease classification.

Result: MAEs extract clinically meaningful features and enhance diagnostic accuracy without large labelled datasets.

Conclusion: MAEs are effective for diffused lung disease research, offering a viable solution for limited labelled data scenarios.

Abstract: Masked autoencoders (MAEs) have emerged as a powerful approach for pre-training on unlabelled data, capable of learning robust and informative feature representations. This is particularly advantageous in diffused lung disease research, where annotated imaging datasets are scarce. To leverage this, we train an MAE on a curated collection of over 5,000 chest computed tomography (CT) scans, combining in-house data with publicly available scans from related conditions that exhibit similar radiological patterns, such as COVID-19 and bacterial pneumonia. The pretrained MAE is then fine-tuned on a downstream classification task for diffused lung disease diagnosis. Our findings demonstrate that MAEs can effectively extract clinically meaningful features and improve diagnostic performance, even in the absence of large-scale labelled datasets. The code and the models are available here: https://github.com/eedack01/lung_masked_autoencoder.
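The core MAE pretext step is standard; a minimal sketch of random patch masking follows, with hypothetical patch-embedding shapes.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Core MAE pretext step: keep a random subset of patch tokens; the
    encoder sees only the visible ones and a decoder reconstructs the rest."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]               # random subset per image
    visible = torch.gather(patches, 1, keep_idx[..., None].expand(-1, -1, D))
    return visible, keep_idx

tokens = torch.randn(2, 196, 768)   # 2 CT slices as 14x14 grids of patch embeddings
visible, keep_idx = random_masking(tokens)
print(visible.shape)                # torch.Size([2, 49, 768]) at 75% masking
```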

[564] TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration

Xuan Loc Pham, Gwendolyn Vuurberg, Marjan Doppen, Joey Roosen, Tip Stille, Thi Quynh Ha, Thuy Duong Quach, Quoc Vu Dang, Manh Ha Luu, Ewoud J. Smit, Hong Son Mai, Mattias Heinrich, Bram van Ginneken, Mathias Prokop, Alessa Hering

Main category: eess.IV

TL;DR: TotalRegistrator is a lightweight image registration framework using UNet and field decomposition to align multiple anatomical regions simultaneously, showing strong generalizability across datasets.

DetailsMotivation: Existing methods are limited to single-organ applications, lacking generalizability for multi-organ registration in clinical CT image analysis.

Method: Proposes TotalRegistrator, leveraging a UNet architecture and novel field decomposition strategy, trained on a large-scale longitudinal whole-body CT dataset.

Result: Outperforms baselines in multi-organ abdominal registration, with a slight drop in lung alignment. Achieves competitive results on external datasets without fine-tuning.

Conclusion: TotalRegistrator demonstrates robust generalizability and efficiency, making it suitable for clinical multi-organ image registration.

Abstract: Image registration is a fundamental technique in the analysis of longitudinal and multi-phase CT images within clinical practice. However, most existing methods are tailored for single-organ applications, limiting their generalizability to other anatomical regions. This work presents TotalRegistrator, an image registration framework capable of aligning multiple anatomical regions simultaneously using a standard UNet architecture and a novel field decomposition strategy. The model is lightweight, requiring only 11GB of GPU memory for training. To train and evaluate our method, we constructed a large-scale longitudinal dataset comprising 695 whole-body (thorax-abdomen-pelvic) paired CT scans from individual patients acquired at different time points. We benchmarked TotalRegistrator against a generic classical iterative algorithm and a recent foundation model for image registration. To further assess robustness and generalizability, we evaluated our model on three external datasets: the public thoracic and abdominal datasets from the Learn2Reg challenge, and a private multiphase abdominal dataset from a collaborating hospital. Experimental results on the in-house dataset show that the proposed approach generally surpasses baseline methods in multi-organ abdominal registration, with a slight drop in lung alignment performance. On out-of-distribution datasets, it achieved competitive results compared to leading single-organ models, despite not being fine-tuned for those tasks, demonstrating strong generalizability. The source code will be publicly available at: https://github.com/DIAGNijmegen/oncology_image_registration.git.
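The field decomposition is not detailed in the abstract; the sketch below shows one plausible reading, a coarse global displacement plus a finer residual field summed before warping, in 2D for brevity (the paper's exact scheme may differ).

```python
import torch
import torch.nn.functional as F

def warp_2d(img: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Warp an image with a dense displacement field in normalized coords."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base = torch.stack([xs, ys], dim=-1).expand(B, -1, -1, -1)
    grid = base + disp.permute(0, 2, 3, 1)        # displacement in (x, y) order
    return F.grid_sample(img, grid, align_corners=True)

# Illustrative decomposition: a coarse global field plus a finer residual
# field, combined by summation before warping.
moving = torch.rand(1, 1, 128, 128)
coarse = F.interpolate(torch.zeros(1, 2, 16, 16), size=(128, 128), mode="bilinear")
residual = 0.01 * torch.randn(1, 2, 128, 128)
warped = warp_2d(moving, coarse + residual)
```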

[565] OpenDCVCs: A PyTorch Open Source Implementation and Performance Evaluation of the DCVC series Video Codecs

Yichi Zhang, Fengqing Zhu

Main category: eess.IV

TL;DR: OpenDCVCs is an open-source PyTorch implementation for reproducible research in learned video compression, offering unified training-ready implementations of four DCVC models.

DetailsMotivation: To address the lack of public training codes for DCVC models, hindering reproducibility and further development.

Method: Provides a comprehensive framework for end-to-end training and evaluation of four DCVC models, with detailed documentation and benchmarking.

Result: Enables transparent comparison and extension, with all code and tools publicly available.

Conclusion: OpenDCVCs fosters collaboration and accelerates research in learned video compression.

Abstract: We present OpenDCVCs, an open-source PyTorch implementation designed to advance reproducible research in learned video compression. OpenDCVCs provides unified and training-ready implementations of four representative Deep Contextual Video Compression (DCVC) models–DCVC, DCVC with Temporal Context Modeling (DCVC-TCM), DCVC with Hybrid Entropy Modeling (DCVC-HEM), and DCVC with Diverse Contexts (DCVC-DC). While the DCVC series achieves substantial bitrate reductions over both classical codecs and advanced learned models, previous public code releases have been limited to evaluation codes, presenting significant barriers to reproducibility, benchmarking, and further development. OpenDCVCs bridges this gap by offering a comprehensive, self-contained framework that supports both end-to-end training and evaluation for all included algorithms. The implementation includes detailed documentation, evaluation protocols, and extensive benchmarking results across diverse datasets, providing a transparent and consistent foundation for comparison and extension. All code and experimental tools are publicly available at https://gitlab.com/viper-purdue/opendcvcs, empowering the community to accelerate research and foster collaboration.
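Learned codecs in this family train against a rate-distortion objective; a generic sketch follows (not OpenDCVCs' exact loss, and the bit estimates here are stand-ins for an entropy model).

```python
import torch

def rate_distortion_loss(x_hat: torch.Tensor, x: torch.Tensor,
                         est_bits: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    """Generic R-D objective used to train learned codecs: estimated rate
    (bits per pixel, from the entropy model) plus lambda-weighted distortion."""
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    bpp = est_bits.sum() / num_pixels             # rate term
    mse = torch.mean((x_hat - x) ** 2)            # distortion term
    return bpp + lam * mse

x = torch.rand(4, 3, 64, 64)                      # a batch of frames
x_hat = x + 0.05 * torch.randn_like(x)            # stand-in reconstruction
est_bits = torch.full((4,), 2000.0)               # stand-in entropy-model estimates
print(rate_distortion_loss(x_hat, x, est_bits))
```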

[566] Conditional Fetal Brain Atlas Learning for Automatic Tissue Segmentation

Johannes Tischer, Patric Kienast, Marlene Stümpflen, Gregor Kasprian, Georg Langs, Roxane Licandro

Main category: eess.IV

TL;DR: A deep-learning framework for generating continuous, age-specific fetal brain atlases from MRI data, achieving high accuracy in segmentation and registration, with real-time performance for clinical and research use.

DetailsMotivation: Addressing challenges in fetal brain MRI assessment due to variability in maturation, imaging protocols, and gestational age estimates by providing a standardized reference framework.

Method: Combines a direct registration model with a conditional discriminator, trained on 219 neurotypical fetal MRIs (21-37 weeks gestation).

Result: High registration accuracy, sharp structural detail, robust segmentation (average DSC of 86.3%), and detailed neurotypical growth trajectories.

Conclusion: The framework enables individualized developmental assessment with minimal pre-processing and real-time performance, benefiting research and clinical applications.

Abstract: Magnetic Resonance Imaging (MRI) of the fetal brain has become a key tool for studying brain development in vivo. Yet, its assessment remains challenging due to variability in brain maturation, imaging protocols, and uncertain estimates of Gestational Age (GA). To overcome these, brain atlases provide a standardized reference framework that facilitates objective evaluation and comparison across subjects by aligning the atlas and subjects in a common coordinate system. In this work, we introduce a novel deep-learning framework for generating continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. The framework combines a direct registration model with a conditional discriminator, trained on a curated dataset of 219 neurotypical fetal MRIs spanning from 21 to 37 weeks of gestation. The method achieves high registration accuracy, captures dynamic anatomical changes with sharp structural detail, and delivers robust segmentation performance with an average Dice Similarity Coefficient (DSC) of 86.3% across six brain tissues. Furthermore, volumetric analysis of the generated atlases reveals detailed neurotypical growth trajectories, providing valuable insights into the maturation of the fetal brain. This approach enables individualized developmental assessment with minimal pre-processing and real-time performance, supporting both research and clinical applications. The model code is available at https://github.com/cirmuw/fetal-brain-atlas
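A toy sketch of age conditioning, assuming GA is normalized and broadcast as an extra input channel (one common choice; the paper's mechanism may differ), without the conditional-discriminator training.

```python
import torch
import torch.nn as nn

class AgeConditionedDecoder(nn.Module):
    """Toy GA conditioning: normalize gestational age to [0, 1] and broadcast
    it as an extra input channel alongside the template volume."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, 3, padding=1),
        )

    def forward(self, template: torch.Tensor, ga_weeks: torch.Tensor) -> torch.Tensor:
        ga = (ga_weeks - 21.0) / (37.0 - 21.0)    # dataset spans 21-37 weeks
        ga_map = ga.view(-1, 1, 1, 1, 1).expand_as(template)
        return self.net(torch.cat([template, ga_map], dim=1))

atlas_28w = AgeConditionedDecoder()(torch.rand(1, 1, 32, 32, 32), torch.tensor([28.0]))
```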

[567] LA-CaRe-CNN: Cascading Refinement CNN for Left Atrial Scar Segmentation

Franz Thaler, Darko Stern, Gernot Plank, Martin Urschler

Main category: eess.IV

TL;DR: LA-CaRe-CNN, a 2-stage CNN cascade, accurately segments left atrium and scar tissue from LGE MR scans for personalized AF ablation therapy.

DetailsMotivation: Improving atrial fibrillation treatment by enabling precise ablation therapy through accurate segmentation of cardiac tissues.

Method: A 2-stage CNN cascade trained end-to-end in 3D, using strong augmentation to handle domain shifts.

Result: Achieves 89.21% DSC for left atrium and 64.59% DSC for scar tissue, showing high accuracy.

Conclusion: LA-CaRe-CNN is promising for creating digital twin models and personalized AF ablation therapy.

Abstract: Atrial fibrillation (AF) represents the most prevalent type of cardiac arrhythmia for which treatment may require patients to undergo ablation therapy. In this surgery cardiac tissues are locally scarred on purpose to prevent electrical signals from causing arrhythmia. Patient-specific cardiac digital twin models show great potential for personalized ablation therapy, however, they demand accurate semantic segmentation of healthy and scarred tissue typically obtained from late gadolinium enhanced (LGE) magnetic resonance (MR) scans. In this work we propose the Left Atrial Cascading Refinement CNN (LA-CaRe-CNN), which aims to accurately segment the left atrium as well as left atrial scar tissue from LGE MR scans. LA-CaRe-CNN is a 2-stage CNN cascade that is trained end-to-end in 3D, where Stage 1 generates a prediction for the left atrium, which is then refined in Stage 2 in conjunction with the original image information to obtain a prediction for the left atrial scar tissue. To account for domain shift towards domains unknown during training, we employ strong intensity and spatial augmentation to increase the diversity of the training dataset. Our proposed method based on a 5-fold ensemble achieves great segmentation results, namely, 89.21% DSC and 1.6969 mm ASSD for the left atrium, as well as 64.59% DSC and 91.80% G-DSC for the more challenging left atrial scar tissue. Thus, segmentations obtained through LA-CaRe-CNN show great potential for the generation of patient-specific cardiac digital twin models and downstream tasks like personalized targeted ablation therapy to treat AF.
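A minimal sketch of the two-stage cascade structure with stand-in networks; depths, widths, and the refinement details are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

def tiny_unet3d(in_ch: int) -> nn.Module:
    # Stand-in for a real 3D U-Net; depth and width are placeholders.
    return nn.Sequential(
        nn.Conv3d(in_ch, 16, 3, padding=1), nn.ReLU(),
        nn.Conv3d(16, 1, 3, padding=1),
    )

class CascadeSketch(nn.Module):
    """Stage 1 predicts the left atrium; Stage 2 sees the LGE image together
    with that prediction and outputs the scar segmentation. End-to-end."""
    def __init__(self):
        super().__init__()
        self.stage1 = tiny_unet3d(1)
        self.stage2 = tiny_unet3d(2)

    def forward(self, lge: torch.Tensor):
        la_logits = self.stage1(lge)
        scar_logits = self.stage2(torch.cat([lge, torch.sigmoid(la_logits)], dim=1))
        return la_logits, scar_logits

la, scar = CascadeSketch()(torch.rand(1, 1, 32, 32, 32))
```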

[568] A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI

Nicola Casali, Alessandro Brusaferri, Giuseppe Baselli, Stefano Fumagalli, Edoardo Micotti, Gianluigi Forloni, Riaz Hussein, Giovanna Rizzo, Alfonso Mastropietro

Main category: eess.IV

TL;DR: A probabilistic deep learning framework using Deep Ensembles of Mixture Density Networks is proposed for accurate IVIM parameter estimation with uncertainty quantification, outperforming non-probabilistic and single Gaussian methods.

DetailsMotivation: Accurate IVIM parameter estimation is challenging due to noise sensitivity and ill-posed inverse problems, necessitating robust uncertainty-aware methods.

Method: The framework combines Deep Ensembles and Mixture Density Networks to estimate predictive uncertainty, decomposing it into aleatoric and epistemic components; it was benchmarked against alternative fitting approaches on synthetic and in vivo data.

Result: MDNs provided calibrated and sharper predictions for D and f parameters, though D* showed slight overconfidence. Elevated epistemic uncertainty in vivo indicated data-acquisition mismatch.

Conclusion: The framework enables reliable IVIM fitting with uncertainty quantification, adaptable for other physical models, highlighting the importance of epistemic uncertainty in real-world applications.

Abstract: Accurate estimation of intravoxel incoherent motion (IVIM) parameters from diffusion-weighted MRI remains challenging due to the ill-posed nature of the inverse problem and high sensitivity to noise, particularly in the perfusion compartment. In this work, we propose a probabilistic deep learning framework based on Deep Ensembles (DE) of Mixture Density Networks (MDNs), enabling estimation of total predictive uncertainty and decomposition into aleatoric (AU) and epistemic (EU) components. The method was benchmarked against non-probabilistic neural networks, a Bayesian fitting approach, and a probabilistic network with a single-Gaussian parametrization. Supervised training was performed on synthetic data, and evaluation was conducted on both simulated and two in vivo datasets. The reliability of the quantified uncertainties was assessed using calibration curves, output distribution sharpness, and the Continuous Ranked Probability Score (CRPS). MDNs produced more calibrated and sharper predictive distributions for the D and f parameters, although slight overconfidence was observed in D*. The Robust Coefficient of Variation (RCV) indicated smoother in vivo estimates for D* with MDNs compared to the single-Gaussian model. Despite the training data covering the expected physiological range, elevated EU in vivo suggests a mismatch with real acquisition conditions, highlighting the importance of incorporating EU, which DE enables. Overall, we present a comprehensive framework for IVIM fitting with uncertainty quantification, which enables the identification and interpretation of unreliable estimates. The proposed approach can also be adopted for fitting other physical models through appropriate architectural and simulation adjustments.
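The aleatoric/epistemic split from a deep ensemble follows the law of total variance; a worked numeric sketch with made-up member outputs:

```python
import numpy as np

# Each ensemble member m yields a predictive mean mu_m and variance var_m
# for a voxel's parameter (for an MDN these are the mixture's overall
# moments). By the law of total variance:
#   aleatoric = mean_m(var_m)   (average data noise)
#   epistemic = var_m(mu_m)     (disagreement between members)
mu = np.array([1.02, 0.98, 1.10, 0.95])    # member means (made-up numbers)
var = np.array([0.04, 0.05, 0.03, 0.06])   # member variances

aleatoric = var.mean()                     # 0.045
epistemic = mu.var()                       # ~0.0032
total = aleatoric + epistemic
print(aleatoric, epistemic, total)
```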

[569] BlurryScope enables compact, cost-effective scanning microscopy for HER2 scoring using deep learning on blurry images

Michael John Fanous, Christopher Michael Seybold, Hanlong Chen, Nir Pillar, Aydogan Ozcan

Main category: eess.IV

TL;DR: BlurryScope is a cost-effective, compact optical microscope using continuous image acquisition and deep learning for automated tissue analysis, matching commercial scanners in speed but at lower cost and size. It successfully classifies HER2 scores in breast tissue sections.

DetailsMotivation: To provide an affordable, compact, and automated solution for digital pathology, addressing the high cost and size of commercial scanners.

Method: Leverages continuous image acquisition and deep learning for automated inspection, including image scanning, stitching, cropping, and HER2 score classification.

Result: Achieved 79.3% accuracy for 4-class and 89.7% for 2-class HER2 classification on 284 patient cores, matching high-end scanner results.

Conclusion: BlurryScope offers a viable, cost-effective alternative to commercial digital pathology scanners with comparable performance.

Abstract: We developed a rapid scanning optical microscope, termed “BlurryScope”, that leverages continuous image acquisition and deep learning to provide a cost-effective and compact solution for automated inspection and analysis of tissue sections. This device offers comparable speed to commercial digital pathology scanners, but at a significantly lower price point and smaller size/weight. Using BlurryScope, we implemented automated classification of human epidermal growth factor receptor 2 (HER2) scores on motion-blurred images of immunohistochemically (IHC) stained breast tissue sections, achieving concordant results with those obtained from a high-end digital scanning microscope. Using a test set of 284 unique patient cores, we achieved testing accuracies of 79.3% and 89.7% for 4-class (0, 1+, 2+, 3+) and 2-class (0/1+, 2+/3+) HER2 classification, respectively. BlurryScope automates the entire workflow, from image scanning to stitching and cropping, as well as HER2 score classification.

[570] CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

Zhe Zhang, Mingxiu Cai, Hanxiao Wang, Gaochang Wu, Tianyou Chai, Xiatian Zhu

Main category: eess.IV

TL;DR: CostFilter-AD improves unsupervised anomaly detection by refining matching costs between input and normal samples, enhancing accuracy in anomaly localization.

DetailsMotivation: Existing methods rely on inaccurate image- or feature-level matching, leading to sub-optimal anomaly detection.

Method: Introduces cost filtering via a cost volume and a filtering network, refining matches while preserving edge structures and subtle anomalies.

Result: Validated on MVTec-AD and VisA benchmarks, showing benefits for single- and multi-class UAD tasks.

Conclusion: CostFilter-AD is a versatile plug-in that enhances existing UAD methods by improving matching accuracy.

Abstract: Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
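A minimal sketch of a cosine-distance cost volume between test-image features and a bank of normal features, with the filtering network omitted; shapes and the distance choice are hypothetical simplifications of the paper's construction.

```python
import torch
import torch.nn.functional as F

def matching_cost_volume(test_feats: torch.Tensor,
                         normal_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance cost volume between each test-feature location and a
    bank of normal features: two spatial dims plus one matching dim."""
    C, H, W = test_feats.shape
    t = F.normalize(test_feats.reshape(C, -1), dim=0)   # (C, H*W)
    n = F.normalize(normal_feats, dim=1)                # (N, C)
    cost = 1.0 - n @ t                                  # (N, H*W), 0 = perfect match
    return cost.reshape(-1, H, W)                       # matching dim first

vol = matching_cost_volume(torch.randn(64, 28, 28), torch.randn(100, 64))
anomaly_map = vol.min(dim=0).values   # poorly matched locations score high
```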

[571] UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields

Fabian Perez, Sara Rojas, Carlos Hinojosa, Hoover Rueda-Chacón, Bernard Ghanem

Main category: eess.IV

TL;DR: UnMix-NeRF integrates spectral unmixing into NeRF for hyperspectral view synthesis and unsupervised material segmentation, outperforming existing methods.

DetailsMotivation: Existing NeRF-based segmentation lacks material properties, limiting applications like robotics and AR. UnMix-NeRF addresses this by incorporating spectral data.

Method: UnMix-NeRF models spectral reflectance with diffuse and specular components, using a learned dictionary of endmembers and per-point abundances for material segmentation.

Result: The method achieves superior spectral reconstruction and material segmentation compared to existing approaches.

Conclusion: UnMix-NeRF enables flexible material-based scene editing and advances material perception in NeRF frameworks.

Abstract: Neural Radiance Field (NeRF)-based segmentation methods focus on object semantics and rely solely on RGB data, lacking intrinsic material properties. This limitation restricts accurate material perception, which is crucial for robotics, augmented reality, simulation, and other applications. We introduce UnMix-NeRF, a framework that integrates spectral unmixing into NeRF, enabling joint hyperspectral novel view synthesis and unsupervised material segmentation. Our method models spectral reflectance via diffuse and specular components, where a learned dictionary of global endmembers represents pure material signatures, and per-point abundances capture their distribution. For material segmentation, we use spectral signature predictions along learned endmembers, allowing unsupervised material clustering. Additionally, UnMix-NeRF enables scene editing by modifying learned endmember dictionaries for flexible material-based appearance manipulation. Extensive experiments validate our approach, demonstrating superior spectral reconstruction and material segmentation to existing methods. Project page: https://www.factral.co/UnMix-NeRF.
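The linear mixing model underlying spectral unmixing is compact enough to sketch directly; the endmember dictionary and abundances below are random stand-ins for the learned quantities.

```python
import torch
import torch.nn.functional as F

# Linear mixing model at the heart of spectral unmixing: each point's
# reflectance is a non-negative, sum-to-one combination of global
# endmember signatures (shown here for the diffuse component only).
n_points, n_endmembers, n_bands = 1024, 5, 31
endmembers = torch.rand(n_endmembers, n_bands)        # learned dictionary (stand-in)
abundance_logits = torch.randn(n_points, n_endmembers)
abundances = F.softmax(abundance_logits, dim=-1)      # simplex constraint
reflectance = abundances @ endmembers                 # (n_points, n_bands)

# Unsupervised material segmentation then reads off the dominant endmember:
material_id = abundances.argmax(dim=-1)
```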

[572] GR-Gaussian: Graph-Based Radiative Gaussian Splatting for Sparse-View CT Reconstruction

Yikuang Yuluo, Yue Ma, Kuan Shen, Tongtong Jin, Wang Liao, Yangpu Ma, Fuquan Wang

Main category: eess.IV

TL;DR: GR-Gaussian improves CT reconstruction under sparse-view conditions by reducing needle-like artifacts and enhancing accuracy with denoised point cloud initialization and pixel-graph-aware gradients.

DetailsMotivation: Existing 3D Gaussian Splatting methods for CT reconstruction suffer from needle-like artifacts in sparse-view conditions due to reliance on average gradient magnitudes.

Method: GR-Gaussian introduces a denoised point cloud initialization strategy and a pixel-graph-aware gradient strategy to refine gradient computation and density representation.

Result: Experiments show PSNR improvements of 0.67 dB and 0.92 dB, and SSIM gains of 0.011 and 0.021 on X-3D and real-world datasets.

Conclusion: GR-Gaussian is effective for accurate CT reconstruction in sparse-view conditions, outperforming existing methods.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising approach for CT reconstruction. However, existing methods rely on the average gradient magnitude of points within the view, often leading to severe needle-like artifacts under sparse-view conditions. To address this challenge, we propose GR-Gaussian, a graph-based 3D Gaussian Splatting framework that suppresses needle-like artifacts and improves reconstruction accuracy under sparse-view conditions. Our framework introduces two key innovations: (1) a Denoised Point Cloud Initialization Strategy that reduces initialization errors and accelerates convergence; and (2) a Pixel-Graph-Aware Gradient Strategy that refines gradient computation using graph-based density differences, improving splitting accuracy and density representation. Experiments on X-3D and real-world datasets validate the effectiveness of GR-Gaussian, achieving PSNR improvements of 0.67 dB and 0.92 dB, and SSIM gains of 0.011 and 0.021. These results highlight the applicability of GR-Gaussian for accurate CT reconstruction under challenging sparse-view conditions.
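The abstract does not detail the denoising; the sketch below uses statistical outlier removal, a common point-cloud denoising choice, purely as an illustration of the initialization idea.

```python
import numpy as np

def statistical_outlier_filter(points: np.ndarray, k: int = 8,
                               std_ratio: float = 2.0) -> np.ndarray:
    """Drop points whose mean distance to their k nearest neighbors is far
    above the global average; a common point-cloud denoising step."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # skip self-distance 0
    keep = knn_mean < knn_mean.mean() + std_ratio * knn_mean.std()
    return points[keep]

pts = np.random.rand(500, 3)                          # synthetic dense cloud
pts = np.vstack([pts, [[5.0, 5.0, 5.0]]])             # one far-away outlier
print(statistical_outlier_filter(pts).shape)          # the far-away point is dropped
```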

Last updated: 2025-08-22