Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 61]
- cs.CV [Total: 142]
- cs.AI [Total: 21]
- cs.SD [Total: 8]
- cs.LG [Total: 96]
- cs.MA [Total: 5]
- cs.MM [Total: 3]
- eess.AS [Total: 10]
- eess.IV [Total: 17]
cs.CL
[1] A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation
Jie Lei, Ruofan Jia, J. Andrew Zhang, Hao Zhang
Main category: cs.CL
TL;DR: A2HCoder is a hierarchical algorithm-to-HDL coding agent using LLMs to bridge the gap between algorithm design and hardware implementation, improving robustness and interpretability.
Details
Motivation: The gap between algorithm design and hardware implementation in wireless communication systems requires extensive expertise and manual effort due to mismatches between programming languages and HDLs.
Method: A2HCoder uses a hierarchical framework with modular decomposition (horizontal) and step-by-step translation (vertical), leveraging external tools for debugging and synthesis.
Result: Validated in 5G wireless communication, A2HCoder shows practicality, reliability, and efficiency in deployment.
Conclusion: A2HCoder effectively addresses the algorithm-to-hardware translation challenge, mitigating hallucinations and ensuring correctness.
Abstract: In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. However, a persistent and substantial gap remains between algorithm design and hardware implementation. Bridging this gap traditionally requires extensive domain expertise and time-consuming manual development, due to fundamental mismatches between high-level programming languages like MATLAB and hardware description languages (HDLs) such as Verilog, in terms of memory access patterns, data processing manners, and datatype representations. To address this challenge, we propose A2HCoder: a Hierarchical Algorithm-to-HDL Coding Agent, powered by large language models (LLMs), designed to enable agile and reliable algorithm-to-hardware translation. A2HCoder introduces a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code. In the horizontal dimension, A2HCoder decomposes complex algorithms into modular functional blocks, simplifying code generation and improving consistency. In the vertical dimension, instead of relying on end-to-end generation, A2HCoder performs step-by-step, fine-grained translation, leveraging external toolchains such as MATLAB and Vitis HLS for debugging and circuit-level synthesis. This structured process significantly mitigates hallucinations and ensures hardware-level correctness. We validate A2HCoder through a real-world deployment case in the 5G wireless communication domain, demonstrating its practicality, reliability, and deployment efficiency.
[2] PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins
Sihan Chen, John P. Lalor, Yi Yang, Ahmed Abbasi
Main category: cs.CL
TL;DR: PersonaTwin, a multi-tier prompt conditioning framework, enhances LLMs by integrating demographic, behavioral, and psychometric data to create adaptive digital twins, outperforming standard LLMs in fidelity and fairness.
Details
Motivation: LLMs often fail to capture nuanced user behaviors, prompting the need for a framework like PersonaTwin to improve user modeling.
Method: PersonaTwin integrates diverse user data into layered prompts and benchmarks against standard LLMs using text similarity and demographic parity metrics (see the sketch after the abstract).
Result: The framework achieves simulation fidelity comparable to oracle settings and maintains accuracy and fairness in downstream models.
Conclusion: PersonaTwin demonstrates the potential of LLM-based digital twins for realistic, nuanced user simulations and personalized behavior analysis.
Abstract: While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark PersonaTwin against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis.
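The abstract does not reproduce the paper's prompt templates, so the following is only a minimal sketch of what multi-tier prompt conditioning could look like; the tier names, profile fields, and assembly order are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of multi-tier prompt conditioning: each tier layers in
# one kind of user context before the task instruction. All field names and
# the build order are illustrative assumptions.
def build_persona_prompt(demographics: dict, behaviors: list,
                         psychometrics: dict, task: str) -> str:
    tiers = [
        "## Demographic profile\n" + "\n".join(f"- {k}: {v}" for k, v in demographics.items()),
        "## Behavioral history\n" + "\n".join(f"- {b}" for b in behaviors),
        "## Psychometric scores\n" + "\n".join(f"- {k}: {v}" for k, v in psychometrics.items()),
        "## Task\nRespond as this person would.\n" + task,
    ]
    return "\n\n".join(tiers)

prompt = build_persona_prompt(
    demographics={"age": 42, "region": "Midwest US"},
    behaviors=["missed two appointments last quarter", "prefers phone contact"],
    psychometrics={"health_literacy": "moderate", "risk_aversion": "high"},
    task="How would you react to a reminder about your annual checkup?",
)
print(prompt)
```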
[3] gpt-oss-120b & gpt-oss-20b Model Card
OpenAI: Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad Fomenko, Timur Garipov, Kristian Georgiev, Mia Glaese, Tarun Gogineni, Adam Goucher, Lukas Gross, Katia Gil Guzman, John Hallman, Jackie Hehir, Johannes Heidecke, Alec Helyar, Haitang Hu, Romain Huet, Jacob Huh, Saachi Jain, Zach Johnson, Chris Koch, Irina Kofman, Dominik Kundel, Jason Kwon, Volodymyr Kyrylov, Elaine Ya Le, Guillaume Leclerc, James Park Lennon, Scott Lessans, Mario Lezcano-Casado, Yuanzhi Li, Zhuohan Li, Ji Lin, Jordan Liss, Lily Liu, Jiancheng Liu, Kevin Lu, Chris Lu, Zoran Martinovic, Lindsay McCallum, Josh McGrath, Scott McKinney, Aidan McLaughlin, Song Mei, Steve Mostovoy, Tong Mu, Gideon Myles, Alexander Neitz, Alex Nichol, Jakub Pachocki, Alex Paino, Dana Palmie, Ashley Pantuliano, Giambattista Parascandolo, Jongsoo Park, Leher Pathak, Carolina Paz, Ludovic Peran, Dmitry Pimenov, Michelle Pokrass, Elizabeth Proehl, Huida Qiu, Gaby Raila, Filippo Raso, Hongyu Ren, Kimmy Richardson, David Robinson, Bob Rotsted, Hadi Salman, Suvansh Sanjeev, Max Schwarzer, D. Sculley, Harshit Sikchi, Kendal Simon, Karan Singhal, Yang Song, Dane Stuckey, Zhiqing Sun, Philippe Tillet, Sam Toizer, Foivos Tsimpourlas, Nikhil Vyas, Eric Wallace, Xin Wang, Miles Wang, Olivia Watkins, Kevin Weil, Amy Wendling, Kevin Whinnery, Cedric Whitney, Hannah Wong, Lin Yang, Yu Yang, Michihiro Yasunaga, Kristen Ying, Wojciech Zaremba, Wenting Zhan, Cyril Zhang, Brian Zhang, Eddie Zhang, Shengjia Zhao
Main category: cs.CL
TL;DR: GPT-OSS-120B and GPT-OSS-20B are open-weight reasoning models with high accuracy and cost efficiency, optimized for agentic capabilities and released under Apache 2.0.
Details
Motivation: To advance the frontier of accuracy and inference cost in reasoning models while enabling broad use and research.
Method: Uses a mixture-of-experts transformer architecture, trained with large-scale distillation and reinforcement learning, and optimized for agentic capabilities like research browsing and tool use.
Result: Strong performance on benchmarks in mathematics, coding, and safety.
Conclusion: The models and tools are released under Apache 2.0 to foster further research and application.
Abstract: We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, Python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
[4] Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News
Jiaxin Pei, Soumya Vadlamannati, Liang-Kang Huang, Daniel Preotiuc-Pietro, Xinyu Hua
Main category: cs.CL
TL;DR: A computational framework extracts company risk factors from news articles, benchmarking models and finding fine-tuned pre-trained models outperform LLMs like LLaMA-2.
Details
Motivation: Identifying company risks is crucial for investors and financial markets, necessitating automated extraction from news.
Method: Proposed a schema with seven risk aspects, annotated 744 articles, and tested models including LLMs and fine-tuned pre-trained models.
Result: Fine-tuned models outperformed zero-shot/few-shot LLMs, and analysis of 277K Bloomberg articles provided insights into company/industry risks.
Conclusion: Automated risk factor extraction from news offers valuable insights, with fine-tuned models being more effective than LLMs.
Abstract: Identifying risks associated with a company is important to investors and the well-being of the overall financial market. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competition. We sample and annotate 744 news articles and benchmark various machine learning models. While large language models have achieved huge progress in various types of NLP tasks, our experiments show that zero-shot and few-shot prompting of state-of-the-art LLMs (e.g., LLaMA-2) achieves only moderate to low performance in identifying risk factors, whereas fine-tuned pre-trained language models perform better on most of the risk factors. Using this model, we analyze over 277K Bloomberg news articles and demonstrate that identifying risk factors from news could provide extensive insight into the operations of companies and industries.
[5] Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules
Nasim Shirvani-Mahdavi, Chengkai Li
Main category: cs.CL
TL;DR: Rule2Text uses LLMs to generate natural language explanations for complex KG rules, improving interpretability and usability.
Details
Motivation: Logical rules in KGs are hard to interpret due to complexity and labeling conventions, limiting accessibility.
Method: Leverages LLMs (e.g., Gemini 2.0 Flash) with various prompting strategies, human evaluation, and fine-tuning (Zephyr model) to generate explanations.
Result: Significant improvement in explanation quality, especially for domain-specific datasets, and scalable LLM-as-a-judge evaluation.
Conclusion: Rule2Text enhances KG usability by making rules interpretable, with open-source code and data for broader adoption.
Abstract: Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models’ performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at https://github.com/idirlab/KGRule2NL.
[6] Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling
Tejomay Kishor Padole, Suyash P Awate, Pushpak Bhattacharyya
Main category: cs.CL
TL;DR: Masked diffusion language models (MDMs) are scalable and easy to train, making them state-of-the-art non-autoregressive generators. This work introduces a verifier-based inference-time scaling method to improve generation quality, showing MDMs outperform autoregressive models in text-style transfer tasks.
Details
Motivation: To enhance the generation quality of MDMs by leveraging inference-time scaling with verifiers, addressing limitations of existing methods.
Method: Proposes a verifier-based inference-time scaling method for MDMs, using off-the-shelf pre-trained embedding models to guide generation (see the sketch after the abstract).
Result: MDMs outperform autoregressive models in text-style transfer tasks, with significant gains in generation quality using the proposed verifier setup.
Conclusion: MDMs with verifier-based scaling are a superior alternative to autoregressive models, offering improved generation quality and scalability.
Abstract: Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to their scalability and ease of training compared to other diffusion model paradigms, establishing them as the state-of-the-art non-autoregressive generators for discrete data. Diffusion models, in general, have shown an excellent ability to improve generation quality by leveraging inference-time scaling, either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs to standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs, using off-the-shelf pre-trained embedding models, leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature.
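As a rough illustration of verifier-based inference-time scaling for a masked diffusion model, the toy loop below samples several candidate unmaskings per denoising step and keeps the one a verifier scores highest. The fill and scoring functions are stand-ins; a real setup would use the MDM's conditional distribution and an embedding-based soft-value verifier as described above.

```python
import random

MASK = "<mask>"

def denoise_candidates(seq, fill_fn, k):
    """Propose k candidate sequences by filling a few masked positions."""
    cands = []
    for _ in range(k):
        cand = list(seq)
        masked = [i for i, t in enumerate(cand) if t == MASK]
        for i in random.sample(masked, min(2, len(masked))):
            cand[i] = fill_fn(cand, i)   # model proposal for position i
        cands.append(cand)
    return cands

def verifier_guided_decode(length, fill_fn, score_fn, k=4):
    seq = [MASK] * length
    while MASK in seq:
        # Keep the candidate the verifier scores highest (soft-value search).
        seq = max(denoise_candidates(seq, fill_fn, k), key=score_fn)
    return seq

# Toy stand-ins: a real setup would use an MDM for fill_fn and an
# embedding-similarity score against style exemplars for score_fn.
vocab = ["the", "a", "bright", "dim", "sun", "moon"]
fill_fn = lambda seq, i: random.choice(vocab)
score_fn = lambda seq: sum(t in {"bright", "sun"} for t in seq)  # hypothetical "style" verifier
print(" ".join(verifier_guided_decode(6, fill_fn, score_fn)))
```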
[7] SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth
Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han
Main category: cs.CL
TL;DR: The paper introduces SproutBench, a new evaluation suite for assessing LLM safety risks for minors, revealing significant vulnerabilities and providing guidelines for child-centric AI.
Details
Motivation: Existing AI safety frameworks for LLMs overlook the unique developmental vulnerabilities of children and adolescents, necessitating a reassessment.
Method: Developed SproutBench, a suite of 1,283 adversarial prompts targeting age-specific risks, and evaluated 47 diverse LLMs.
Result: Found substantial safety vulnerabilities, with correlations between safety dimensions and an inverse relationship between interactivity and age appropriateness.
Conclusion: The study offers practical guidelines for improving child-centric AI design and deployment.
Abstract: The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0–6), middle childhood (7–12), and adolescence (13–18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.
[8] Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics
Carter Blum, Katja Filipova, Ann Yuan, Asma Ghandeharioun, Julian Zimmert, Fred Zhang, Jessica Hoffmann, Tal Linzen, Martin Wattenberg, Lucas Dixon, Mor Geva
Main category: cs.CL
TL;DR: The paper investigates cross-lingual knowledge transfer issues in LLMs, using controlled experiments with small Transformers to identify causes and solutions.
Details
Motivation: To understand why LLMs hallucinate facts across languages and improve cross-lingual transfer.
Method: Train small Transformer models on synthetic multilingual datasets, analyze learning phases, and manipulate data distribution/tokenization.
Result: Unification of facts across languages is key for transfer, influenced by mutual information and language extraction ease. Methods to modulate transfer are developed.
Conclusion: Controlled settings reveal pre-training dynamics, offering new ways to enhance cross-lingual transfer in LLMs.
Abstract: Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs.
[9] Hell or High Water: Evaluating Agentic Recovery from External Failures
Andrew Wang, Sophia Hager, Adi Asija, Daniel Khashabi, Nicholas Andrews
Main category: cs.CL
TL;DR: Language agents struggle to adapt to external failures and formulate backup plans, despite identifying correct functions in context.
Details
Motivation: To study how language model agents handle unexpected failures in complex planning tasks and adapt to environmental feedback.
Method: A specialized benchmark with over four thousand function possibilities, introducing external failures while ensuring solvability.
Result: Agents fail to adapt to feedback or pursue alternate actions, even with restricted search spaces. Scaling model size offers limited benefits.
Conclusion: Key challenges for generative models identified, with directions for future work to improve adaptability and resilience.
Abstract: As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent’s performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generative models as well as promising directions for future work.
[10] BIPOLAR: Polarization-based granular framework for LLM bias evaluation
Martin Pavlíček, Tomáš Filip, Petr Sosík
Main category: cs.CL
TL;DR: The paper introduces a reusable, topic-agnostic framework to evaluate polarization-related biases in LLMs, using synthetic datasets and sentiment metrics. It tests the framework on models like GPT-4 and Llama-3, revealing biases in the Russia-Ukraine war context.
Details
Motivation: Address underexplored challenges in bias detection and mitigation in LLMs, particularly for sensitive topics like political discourse and national stereotypes.
Method: Combines polarization-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements, tested on multiple LLMs (see the sketch after the abstract).
Result: Revealed biases in LLMs, with a general positive sentiment toward Ukraine, and fine-grained variations across semantic categories and models.
Conclusion: The framework enables automated dataset generation and fine-grained bias assessment, applicable to diverse polarization-driven scenarios.
Abstract: Large language models (LLMs) are known to exhibit biases in downstream tasks, especially when dealing with sensitive topics such as political discourse, gender identity, ethnic relations, or national stereotypes. Although significant progress has been made in bias detection and mitigation techniques, certain challenges remain underexplored. This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarization-related biases in LLMs (both open-source and closed-source). Our approach combines polarization-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements, using a predefined set of semantic categories. As a case study, we created a synthetic dataset that focuses on the Russia-Ukraine war, and we evaluated the bias in several LLMs: Llama-3, Mistral, GPT-4, Claude 3.5, and Gemini 1.0. Beyond aggregate bias scores, with a general trend of more positive sentiment toward Ukraine, the framework allowed fine-grained analysis with considerable variation between semantic categories, uncovering divergent behavioural patterns among models. Adaptation to prompt modifications showed further bias towards preconceived language and citizenship modification. Overall, the framework supports automated dataset generation and fine-grained bias assessment, is applicable to a variety of polarization-driven scenarios and topics, and is orthogonal to many other bias-evaluation strategies.
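A minimal sketch of the polarization-gap idea: score mirrored statements that differ only in the target entity and report the per-category sentiment difference. The templates, categories, and lexicon-based scorer are toy assumptions; the paper applies polarization-sensitive sentiment metrics to actual model outputs.

```python
# Toy polarization-gap metric: identical statements up to the named side,
# scored by a stand-in lexicon sentiment function. A symmetric scorer yields
# a gap of ~0.0 per category; a biased one would not.
TEMPLATES = {
    "military": "The {side} forces conducted an operation near the border.",
    "diplomacy": "Negotiators from {side} proposed a new ceasefire plan.",
}
POS = {"proposed", "plan", "ceasefire"}
NEG = {"operation", "forces"}

def toy_sentiment(text: str) -> float:
    words = set(text.lower().replace(".", "").split())
    return (len(words & POS) - len(words & NEG)) / max(len(words), 1)

def polarization_gap(side_a: str, side_b: str) -> dict:
    return {cat: toy_sentiment(t.format(side=side_a)) - toy_sentiment(t.format(side=side_b))
            for cat, t in TEMPLATES.items()}

print(polarization_gap("Ukrainian", "Russian"))  # ~0.0 everywhere for this symmetric toy
```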
[11] Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation
Chenyang Le, Yinfeng Xia, Huiyan Li, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian
Main category: cs.CL
TL;DR: The paper introduces a Parasitic Dual-Scale Approach to improve multilingual speech-to-text translation, combining speculative sampling, model compression, and knowledge distillation for efficiency and performance.
Details
Motivation: Unified multilingual models often have large parameter sizes, hindering inference efficiency and local deployment. The goal is to balance performance and efficiency.
Method: The proposed method enhances the Whisper Medium model into whisperM2M, integrating a novel KVSPN module for speculative sampling and combining it with model compression and knowledge distillation (see the sketch after the abstract).
Result: The approach achieves SOTA performance in six languages, with a 40% speedup (no BLEU degradation) and 2.6× speedup over the original Whisper Medium.
Conclusion: The Parasitic Dual-Scale Approach effectively balances efficiency and performance, making multilingual speech translation more practical for local deployment.
Abstract: Recent advancements in speech-to-text translation have led to the development of multilingual models capable of handling multiple language pairs simultaneously. However, these unified models often suffer from large parameter sizes, making it challenging to balance inference efficiency and performance, particularly in local deployment scenarios. We propose an innovative Parasitic Dual-Scale Approach, which combines an enhanced speculative sampling method with model compression and knowledge distillation techniques. Building on the Whisper Medium model, we enhance it for multilingual speech translation into whisperM2M, and integrate our novel KVSPN module, achieving state-of-the-art (SOTA) performance across six popular languages with improved inference efficiency. KVSPN enables a 40% speedup with no BLEU score degradation. Combined with distillation methods, it represents a 2.6× speedup over the original Whisper Medium with superior performance.
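The KVSPN module itself is not described in enough detail to reproduce, but it builds on speculative sampling. The greedy variant of that base technique is sketched below with toy draft/target "models" over token lists: the cheap draft proposes a block of tokens and the expensive target keeps the longest agreeing prefix.

```python
# Generic greedy speculative decoding, the family of methods the paper's
# KVSPN module builds on. The draft/target callables are toy stand-ins.
def speculative_decode(prompt, draft_next, target_next, n_tokens=10, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft model cheaply proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) Target model verifies; accept the longest agreeing prefix.
        accepted = []
        for i, tok in enumerate(draft):
            if target_next(out + draft[:i]) == tok:
                accepted.append(tok)
            else:
                break
        out += accepted
        if len(accepted) < len(draft):      # on mismatch, take one target token
            out.append(target_next(out))
    return out[: len(prompt) + n_tokens]

# Toy models: the draft agrees with the target most of the time.
target_next = lambda seq: str(len(seq) % 3)
draft_next = lambda seq: str(len(seq) % 3 if len(seq) % 7 else (len(seq) + 1) % 3)
print(speculative_decode(["<s>"], draft_next, target_next))
```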
[12] Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs
Nicolas Goulet, Alexandre Blondin Massé, Moussa Abdendi
Main category: cs.CL
TL;DR: The paper explores embedding digital dictionaries into AMR graphs using pre-trained language models, reducing these graphs confluently, and analyzing their properties in relation to symbol grounding.
Details
Motivation: To address the symbol grounding problem by leveraging AMR graphs and pre-trained language models for semantic representation.
Method: Embedding digital dictionaries into AMR digraphs using state-of-the-art pre-trained language models and reducing them confluently.
Result: Reduced digraphs with preserved circuit space properties are analyzed.
Conclusion: The study provides insights into the symbol grounding problem through the properties of reduced AMR digraphs.
Abstract: Abstract meaning representation (AMR) is a semantic formalism used to represent the meaning of sentences as directed acyclic graphs. In this paper, we describe how real digital dictionaries can be embedded into AMR directed graphs (digraphs), using state-of-the-art pre-trained large language models. Then, we reduce those graphs in a confluent manner, i.e., with transformations that preserve their circuit space. Finally, the properties of these reduced digraphs are analyzed and discussed in relation to the symbol grounding problem.
[13] Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel L. K. Yamins
Main category: cs.CL
TL;DR: AuriStream is a biologically inspired two-stage model for speech encoding, achieving state-of-the-art performance on speech tasks and generating interpretable audio continuations.
Details
Motivation: To develop a human-like speech processing model inspired by the auditory hierarchy for efficient handling of diverse speech tasks.
Method: A two-stage framework: a cochlea-inspired time-frequency transformation followed by an autoregressive sequence model over cochlear tokens (see the sketch after the abstract).
Result: Learns meaningful phoneme and word representations, excels in SUPERB speech tasks, and generates interpretable audio continuations.
Conclusion: AuriStream advances human-like speech models, combining strong representation learning with interpretable predictions.
Abstract: We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, as well as state-of-the-art lexical semantics, and shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream's strong representational capabilities, it generates continuations of audio which can be visualized in spectrogram space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
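A compact sketch of the two-stage recipe under stated assumptions: mel filterbanks plus k-means stand in for the cochlea-inspired tokenizer, and a tiny recurrent LM stands in for the paper's autoregressive sequence model.

```python
# Stage 1: cochlea-like time-frequency front end, quantized to discrete
# tokens (mel + k-means as stand-ins). Stage 2: autoregressive LM over them.
import torch, torch.nn as nn
import torchaudio
from sklearn.cluster import KMeans

wave = torch.randn(1, 4 * 16000)  # 4 s of fake audio at 16 kHz
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)(wave)
frames = mel.squeeze(0).T.log1p().numpy()           # (time, n_mels)

codebook = KMeans(n_clusters=64, n_init=4).fit(frames)
tokens = torch.tensor(codebook.predict(frames), dtype=torch.long)  # "cochlear tokens"

class TokenLM(nn.Module):                            # tiny causal LM over tokens
    def __init__(self, vocab=64, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

lm = TokenLM()
logits = lm(tokens[None, :-1])                       # predict next token at each step
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens[None, 1:])
print(tokens.shape, loss.item())
```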
[14] Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning
Lorenzo Jaime Yu Flores, Junyi Shen, Xiaoyuan Gu
Main category: cs.CL
TL;DR: The paper introduces RAMP, a multi-agent framework for marketing audience curation, leveraging LLM planning, memory, and iterative verification to improve accuracy and user satisfaction.
Details
Motivation: Addressing the limited reliability of LLMs in real-world applications, particularly in dynamic marketing tasks like audience curation.
Method: The RAMP framework iteratively plans, calls tools, verifies outputs, and suggests improvements, enhanced by a long-term memory store for client-specific knowledge (see the sketch after the abstract).
Result: RAMP improves accuracy by 28 percentage points on evaluation queries and boosts recall by ~20 points with iterative verification, alongside higher user satisfaction.
Conclusion: The study offers practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments.
Abstract: Recent advances in large language models (LLMs) enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real-world applications remains limited. In this paper, we introduce a multi-agent framework for a marketing task: audience curation. To solve this, we introduce a framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the audience generated. Additionally, we equip the model with a long-term memory store, which is a knowledge base of client-specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments.
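A hedged sketch of the plan, call tools, verify, reflect loop with a client-specific memory store. All function names and the PASS-based verification protocol are illustrative assumptions; the paper's actual tool and prompt interfaces are not public.

```python
# Illustrative plan/verify/reflect loop with long-term memory, loosely in the
# spirit of RAMP. llm, tools, and memory are injected stand-ins.
def run_audience_query(query, llm, tools, memory, max_iters=3):
    context = memory.get(query_client(query), [])    # client facts + past queries
    plan = llm(f"Plan tool calls for: {query}\nKnown facts: {context}")
    result = None
    for _ in range(max_iters):
        result = execute_plan(plan, tools)           # call audience-curation tools
        verdict = llm(f"Verify this audience against the query.\n"
                      f"Query: {query}\nResult: {result}")
        if "PASS" in verdict:                        # hypothetical verdict format
            break
        plan = llm(f"Revise the plan. Critique: {verdict}\nOld plan: {plan}")
    memory.setdefault(query_client(query), []).append((query, result))
    return result

def query_client(q): return "client-1"               # stub client resolver
def execute_plan(plan, tools): return tools["segment"](plan)

llm = lambda p: "PASS plan: filter users by recent purchase"   # stub LLM
tools = {"segment": lambda plan: ["user_42", "user_7"]}
memory = {}
print(run_audience_query("users likely to churn", llm, tools, memory))
```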
[15] MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents
Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, Reut Tsarfaty
Main category: cs.CL
TL;DR: MoNaCo is a benchmark for evaluating LLMs on complex, time-consuming questions, highlighting their limitations in handling real-world information-seeking tasks.
Details
Motivation: Current LLM benchmarks lack natural, complex questions that mimic real-world information-seeking challenges.
Method: Developed a decomposed annotation pipeline to create MoNaCo, a benchmark of 1,315 complex questions requiring extensive intermediate steps.
Result: Frontier LLMs achieved at most 61.2% F1 on MoNaCo, struggling with low recall and hallucinations.
Conclusion: MoNaCo addresses a critical gap in LLM evaluation, emphasizing the need for improved reasoning models for complex queries.
Abstract: Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve, far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions, with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly available at: https://tomerwolgithub.github.io/monaco
[16] MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering
Hikaru Asano, Hiroki Ouchi, Akira Kasuga, Ryo Yonetani
Main category: cs.CL
TL;DR: MobQA is a benchmark dataset to evaluate LLMs’ semantic understanding of human mobility data via question answering, revealing strengths in factual retrieval but weaknesses in reasoning and explanation tasks.
Details
Motivation: Existing models predict human movement well but lack understanding of the underlying reasons or semantic meaning of patterns. MobQA fills this gap by evaluating LLMs' semantic capabilities.
Method: MobQA includes 5,800 question-answer pairs across three types: factual retrieval, multiple-choice reasoning, and free-form explanation, requiring spatial, temporal, and semantic reasoning.
Result: LLMs perform well on factual retrieval but struggle with semantic reasoning and explanation tasks, with trajectory length affecting performance.
Conclusion: MobQA highlights the strengths and limitations of state-of-the-art LLMs in semantic mobility understanding, providing a framework for future improvements.
Abstract: This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering. While existing models excel at predicting human movement patterns, it remains unclear how well they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), all of which require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at https://github.com/CyberAgentAILab/mobqa.
[17] Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification
Anusha M D, Deepthi Vikram, Bharathi Raja Chakravarthi, Parameshwar R Hegde
Main category: cs.CL
TL;DR: This paper introduces the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, evaluating deep learning models. The BiGRU model with self-attention performs best, outperforming transformer models.
Details
Motivation: Tulu, a low-resource Dravidian language, lacks computational resources despite its digital growth. This study addresses that gap by creating a benchmark dataset for OLI in code-mixed Tulu content.
Method: The dataset includes 3,845 YouTube comments annotated into four classes. Deep learning models (GRU, LSTM, BiGRU, BiLSTM, CNN, attention-based variants) and transformers (mBERT, XLM-RoBERTa) are evaluated (see the model sketch after the abstract).
Result: The BiGRU model with self-attention achieves 82% accuracy and a 0.81 macro F1-score, outperforming transformer models, which struggle in code-mixed, low-resource contexts.
Conclusion: This work establishes a foundation for NLP research in Tulu and similar low-resource, code-mixed languages, highlighting the limitations of multilingual pretraining in such contexts.
Abstract: Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff’s alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages.
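The paper's best system, a BiGRU with self-attention, maps onto a fairly standard architecture; a PyTorch sketch follows, with hyperparameters that are illustrative guesses rather than the paper's settings. The four output classes mirror the dataset's label set.

```python
import torch, torch.nn as nn

class BiGRUAttnClassifier(nn.Module):
    """BiGRU encoder with additive self-attention pooling; sizes are guesses."""
    def __init__(self, vocab_size, embed_dim=128, hidden=64, n_classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # scores each timestep
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                           # x: (batch, seq_len) token ids
        h, _ = self.gru(self.emb(x))                # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        pooled = (w * h).sum(dim=1)                 # weighted sum of GRU states
        return self.fc(pooled)

model = BiGRUAttnClassifier(vocab_size=20000)
logits = model(torch.randint(1, 20000, (8, 40)))    # 8 comments, 40 tokens each
print(logits.shape)                                 # torch.Size([8, 4])
```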
[18] Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction
Tao Wu, Jingyuan Chen, Wang Lin, Jian Zhan, Mengze Li, Kun Kuang, Fei Wu
Main category: cs.CL
TL;DR: The paper introduces personalized distractor generation for MCQs, addressing limitations of group-level approaches by tailoring distractors to individual student misconceptions using a training-free two-stage framework.
Details
Motivation: Existing group-level distractor generation fails to capture individual student reasoning errors, limiting diagnostic effectiveness in educational assessments.
Method: A training-free two-stage framework: (1) construct student-specific misconception prototypes using MCTS to infer reasoning trajectories, and (2) simulate student reasoning to generate personalized distractors (see the MCTS sketch after the abstract).
Result: The approach outperforms baselines in generating plausible, personalized distractors for 140 students and generalizes well to group-level settings.
Conclusion: The proposed framework effectively addresses individual student misconceptions, enhancing diagnostic accuracy and adaptability in MCQ assessments.
Abstract: Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student's past question-answering (QA) records, ensuring every student receives options that effectively expose their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student's underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student's reasoning trajectories from past incorrect answers. In the second stage, this prototype guides the simulation of the student's reasoning on new questions, enabling the generation of personalized distractors that align with the student's recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability.
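A generic MCTS skeleton of the kind the first stage could use to search over candidate reasoning trajectories, with a toy reward scoring how well a trajectory fits a student's error record. The expansion steps and reward are hypothetical stand-ins for the paper's actual search space.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):                             # upper confidence bound for trees
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, expand_fn, reward_fn, n_sims=200):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=ucb)
        for s in expand_fn(node.state):           # expansion
            node.children.append(Node(s, node))
        leaf = random.choice(node.children) if node.children else node
        r = reward_fn(leaf.state)                 # simulation / scoring
        while leaf:                               # backpropagation
            leaf.visits += 1; leaf.value += r; leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).state

# Toy use: find a reasoning-step sequence whose "reward" is how well it
# reproduces a student's past wrong answers (hypothetical scoring function).
expand_fn = lambda s: [s + [step] for step in ("misread_sign", "drop_carry", "ok")] if len(s) < 3 else []
reward_fn = lambda s: s.count("drop_carry")       # pretend this error recurs in the record
print(mcts([], expand_fn, reward_fn))
```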
[19] E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection
Ahmad Mousavi, Yeganeh Abdollahinejad, Roberto Corizzo, Nathalie Japkowicz, Zois Boukouvalas
Main category: cs.CL
TL;DR: E-CaTCH is a framework for detecting multimodal misinformation by clustering posts into pseudo-events, aligning features, and modeling temporal evolution, outperforming existing methods.
Details
Motivation: Challenges in detecting misinformation due to modality inconsistencies, temporal changes, and class imbalance motivate the need for a robust, interpretable solution.
Method: E-CaTCH clusters posts into pseudo-events, extracts and aligns textual/visual features, models temporal evolution with a trend-aware LSTM, and uses adaptive techniques for class imbalance (see the clustering sketch after the abstract).
Result: E-CaTCH outperforms state-of-the-art baselines on datasets like Fakeddit, IND, and COVID-19 MISINFOGRAPH, showing robustness and generalizability.
Conclusion: E-CaTCH effectively addresses misinformation detection challenges, offering scalability, interpretability, and superior performance across diverse scenarios.
Abstract: Detecting multimodal misinformation on social media remains challenging due to inconsistencies between modalities, changes in temporal patterns, and substantial class imbalance. Many existing methods treat posts independently and fail to capture the event-level structure that connects them across time and modality. We propose E-CaTCH, an interpretable and scalable framework for robustly detecting misinformation. If needed, E-CaTCH clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Within each event, textual and visual features are extracted using pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations to form contextualized, content-aware embeddings of each post. To model temporal evolution, E-CaTCH segments events into overlapping time windows and uses a trend-aware LSTM, enhanced with semantic shift and momentum signals, to encode narrative progression over time. Classification is performed at the event level, enabling better alignment with real-world misinformation dynamics. To address class imbalance and promote stable learning, the model integrates adaptive class weighting, temporal consistency regularization, and hard-example mining. The total loss is aggregated across all events. Extensive experiments on Fakeddit, IND, and COVID-19 MISINFOGRAPH demonstrate that E-CaTCH consistently outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across diverse misinformation scenarios.
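An illustrative version of the pseudo-event step: assign a post to an existing event when it is textually similar and temporally close, otherwise open a new event. TF-IDF features, the similarity threshold, and the 12-hour window are assumptions, not the paper's configuration.

```python
# Greedy pseudo-event clustering by textual similarity + temporal proximity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [("vaccine rumor spreads", 0), ("new vaccine rumor debunked", 2),
         ("storm hits coast", 50), ("coastal storm damage photos", 51)]

texts = [t for t, _ in posts]
X = TfidfVectorizer().fit_transform(texts)
events = []                                    # each event: list of post indices
for i, (_, ts) in enumerate(posts):
    for ev in events:
        j = ev[-1]                             # compare against the latest post
        close = abs(ts - posts[j][1]) <= 12    # within an assumed 12-hour window
        similar = cosine_similarity(X[i], X[j])[0, 0] >= 0.15
        if close and similar:
            ev.append(i); break
    else:
        events.append([i])                     # no match: start a new pseudo-event
print(events)                                  # [[0, 1], [2, 3]]
```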
[20] Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering
Changjian Wang, Weihong Deng, Weili Guan, Quan Lu, Ning Jiang
Main category: cs.CL
TL;DR: HGRAG improves multi-hop QA by integrating structural and semantic info via hypergraphs, outperforming existing methods in accuracy and speed.
Details
Motivation: Traditional RAG methods lack structural knowledge integration, while GraphRAG over-relies on structure, neglecting semantics. HGRAG aims to balance both.
Method: Uses hypergraphs: entity nodes and passage hyperedges for structure, and hypergraph diffusion for semantic retrieval, plus a retrieval refinement module (see the diffusion sketch after the abstract).
Result: Outperforms state-of-the-art in QA performance with 6x retrieval speedup.
Conclusion: HGRAG effectively combines structural and semantic info for superior MHQA performance and efficiency.
Abstract: Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6× speedup in retrieval efficiency.
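A minimal numeric sketch of hypergraph diffusion over an entity-passage incidence matrix: passage relevance flows to entities and back, mixing coarse passage similarity with fine-grained entity overlap. The seeding and normalization choices here are simplified assumptions.

```python
# Entities are nodes, passages are hyperedges; relevance diffuses through
# shared entities. Seeds and normalization are illustrative simplifications.
import numpy as np

entities = ["einstein", "zurich", "nobel"]
passages = ["p1: einstein studied in zurich", "p2: einstein won the nobel",
            "p3: zurich is in switzerland"]
# Incidence matrix H[e, p] = 1 if entity e appears in passage p.
H = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

entity_seed = np.array([1.0, 0.0, 0.0])      # query mentions "einstein"
passage_seed = np.array([0.2, 0.7, 0.1])     # coarse semantic similarity to query

score = passage_seed.copy()
for _ in range(3):                            # a few diffusion steps
    e = H @ score                             # passage mass flows to entities
    e = entity_seed + e / e.sum()             # mix in fine-grained entity signal
    score = H.T @ e                           # ... and back to passages
    score /= score.sum()
print(dict(zip(passages, score.round(3))))    # p2 (einstein + nobel) ranks highest
```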
[21] UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?
Mukund Choudhary, KV Aditya Srivatsa, Gaurja Aeron, Antara Raaghavi Bhattacharya, Dang Khoa Dang Dinh, Ikhlasul Akmal Hanif, Daria Kotova, Ekaterina Kochmar, Monojit Choudhury
Main category: cs.CL
TL;DR: LLMs perform poorly on linguistic puzzles, especially with high morphological complexity, but pre-processing words into morphemes helps.
Details
Motivation: To assess LLMs' linguistic reasoning abilities in low-resource languages using Linguistics Olympiad puzzles.
Method: Analyzed 629 problems across 41 low-resource languages, labeling them with linguistic features.
Result: LLMs struggle with high morphological complexity but improve when words are split into morphemes.
Conclusion: LLMs need better, language-specific tokenizers for improved linguistic reasoning in low-resource languages.
Abstract: Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on linguistics puzzles remains consistently poor. These puzzles, often derived from Linguistics Olympiad (LO) contests, provide a minimal contamination environment to assess LLMs’ linguistic reasoning abilities across low-resource languages. This work analyses LLMs’ performance on 629 problems across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages.
[22] LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought
Ruiyan Qi, Congding Wen, Weibo Zhou, Shangsong Liang, Lingbo Li
Main category: cs.CL
TL;DR: LETToT is a label-free framework for evaluating LLMs in tourism using expert-derived reasoning structures, showing quality gains and insights into model scaling and reasoning architectures.
Details
Motivation: Challenges in evaluating LLMs in specialized domains like tourism due to the high cost of annotated benchmarks and issues like hallucinations.
Method: Proposes LETToT, leveraging expert-derived reasoning structures (ToT) instead of labeled data, refining hierarchical ToT components with expert feedback.
Result: Systematically optimized expert ToT shows 4.99-14.15% quality gains; reveals scaling laws in specialized domains and benefits of reasoning-enhanced smaller models.
Conclusion: LETToT provides a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to annotated benchmarks.
Abstract: Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose Label-Free Evaluation of LLMs on Tourism using Expert Tree-of-Thought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT, with 4.99-14.15% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness (p < 0.05). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
[23] ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection
Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu
Main category: cs.CL
TL;DR: TOXIFRENCH, a new French toxicity detection benchmark, shows SLMs outperform larger models. A novel CoT fine-tuning strategy improves performance, achieving SOTA results.
Details
Motivation: Toxicity detection in French is underdeveloped due to a lack of datasets; TOXIFRENCH addresses this gap.
Method: A semi-automated annotation pipeline (only 10% manual labeling), broad model benchmarking, and a proposed CoT fine-tuning strategy with a dynamic weighted loss (see the loss sketch after the abstract).
Result: SLMs outperform larger models. The fine-tuned 4B model achieves a 13% F1 improvement over its baseline, surpassing GPT-4o and Gemini-2.5.
Conclusion: Methodology is effective for French toxicity detection and can extend to other languages and safety-critical tasks.
Abstract: Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.
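One plausible reading of the dynamically weighted CoT loss is per-token cross-entropy with weights that ramp up toward the final decision tokens; the linear ramp below is an illustrative schedule, not the paper's exact formulation.

```python
# Sketch of a weighted CoT fine-tuning loss: later tokens (the final decision)
# contribute more. The linear ramp is an assumed schedule.
import torch
import torch.nn.functional as F

def weighted_cot_loss(logits, targets):
    """logits: (batch, seq, vocab); targets: (batch, seq) token ids."""
    b, t, v = logits.shape
    per_tok = F.cross_entropy(logits.reshape(-1, v), targets.reshape(-1),
                              reduction="none").reshape(b, t)
    ramp = torch.linspace(0.1, 1.0, t, device=logits.device)  # later tokens weigh more
    return (per_tok * ramp).sum() / ramp.sum() / b

logits = torch.randn(2, 16, 100, requires_grad=True)
targets = torch.randint(0, 100, (2, 16))
loss = weighted_cot_loss(logits, targets)
loss.backward()
print(loss.item())
```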
[24] AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models’ Responses to Depression, Anxiety, and Stress Queries
Arya VarastehNezhad, Reza Tavasoli, Soroush Elyasi, MohammadHossein LotfiNia, Hamed Farbeh
Main category: cs.CL
TL;DR: The study evaluates how eight LLMs respond to mental health questions, revealing model-specific emotional patterns and minimal demographic influence.
Details
Motivation: To understand how LLMs emotionally respond to mental health queries and assess the impact of user profiles and mental health conditions.
Method: Analyzed 2,880 answers from eight LLMs to 20 questions framed for six user profiles, scoring sentiment and emotions with off-the-shelf tools (see the scoring sketch after the abstract).
Result: Emotional responses varied by LLM and mental health condition, with anxiety eliciting fear, depression sadness, and stress optimism. Demographic framing had minimal impact.
Conclusion: Model selection is crucial for mental health applications due to distinct emotional signatures, while demographics play a minor role.
Abstract: Depression, anxiety, and stress are widespread mental health concerns that increasingly drive individuals to seek information from Large Language Models (LLMs). This study investigates how eight LLMs (Claude Sonnet, Copilot, Gemini Pro, GPT-4o, GPT-4o mini, Llama, Mixtral, and Perplexity) reply to twenty pragmatic questions about depression, anxiety, and stress when those questions are framed for six user profiles (baseline, woman, man, young, old, and university student). The models generated 2,880 answers, which we scored for sentiment and emotions using state-of-the-art tools. Our analysis revealed that optimism, fear, and sadness dominated the emotional landscape across all outputs, with neutral sentiment maintaining consistently high values. Gratitude, joy, and trust appeared at moderate levels, while emotions such as anger, disgust, and love were rarely expressed. The choice of LLM significantly influenced emotional expression patterns. Mixtral exhibited the highest levels of negative emotions including disapproval, annoyance, and sadness, while Llama demonstrated the most optimistic and joyful responses. The type of mental health condition dramatically shaped emotional responses: anxiety prompts elicited extraordinarily high fear scores (0.974), depression prompts generated elevated sadness (0.686) and the highest negative sentiment, while stress-related queries produced the most optimistic responses (0.755) with elevated joy and trust. In contrast, demographic framing of queries produced only marginal variations in emotional tone. Statistical analyses confirmed significant model-specific and condition-specific differences, while demographic influences remained minimal. These findings highlight the critical importance of model selection in mental health applications, as each LLM exhibits a distinct emotional signature that could significantly impact user experience and outcomes.
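The scoring stage could look like the snippet below, which runs an off-the-shelf emotion classifier over an answer; the checkpoint named here is a commonly used public model, not necessarily the one the authors used.

```python
# Score one LLM answer with a public emotion classifier (assumed checkpoint).
from transformers import pipeline

clf = pipeline("text-classification",
               model="j-hartmann/emotion-english-distilroberta-base",
               top_k=None)                      # return scores for all emotion labels

answer = "Try to get regular sleep, and remember that support is available."
scores = {d["label"]: round(d["score"], 3) for d in clf(answer)[0]}
print(scores)                                   # e.g. {'joy': ..., 'fear': ..., ...}
```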
[25] SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory
Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: The paper addresses over-refusal in LLMs, where safety mechanisms reject benign prompts resembling harmful content, reducing utility. It introduces SafeConstellations, an inference-time method that reduces over-refusal by 73% while preserving model utility.
Details
Motivation: Over-refusal in LLMs diminishes their practical utility, especially in repetitive or task-specific applications, prompting the need for a solution that mitigates this without compromising safety.
Method: The study evaluates LLM refusal behavior, identifies distinct embedding-space patterns (“constellations”), and introduces SafeConstellations to guide representations toward non-refusal pathways.
Result: SafeConstellations reduces over-refusal rates by up to 73% with minimal impact on utility, demonstrating effectiveness in balancing safety and usability.
Conclusion: The proposed method offers a principled solution to over-refusal, enhancing LLM utility while maintaining safety, applicable to production settings.
Abstract: LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or that frequently rely on LLMs for specific tasks (e.g., sentiment analysis, language translation). Through comprehensive evaluation, we demonstrate that LLMs still tend to refuse responses to harmful instructions when those instructions are reframed to appear as benign tasks. Our mechanistic analysis reveals that LLMs follow distinct “constellation” patterns in embedding space as representations traverse layers, with each task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. By selectively guiding model behavior only on tasks prone to over-refusal, and by preserving general model behavior, our method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled approach to mitigating over-refusals.
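As a rough illustration of inference-time trajectory shifting, the sketch below adds a precomputed "non-refusal" direction to one layer's hidden states via a forward hook. The layer index, the scale, and the way the direction is estimated are all assumptions; the paper's task-specific trajectory tracking is more involved.

```python
import torch

def make_steering_hook(direction, scale=0.5):
    """Forward hook that shifts a transformer layer's hidden states along a
    precomputed non-refusal direction. `direction` could be estimated as the
    mean hidden-state difference between answered and over-refused prompts
    of one task; layer choice and scale here are assumptions."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        shifted = hidden + scale * direction.to(hidden.dtype)
        return (shifted,) + output[1:] if isinstance(output, tuple) else shifted
    return hook

# Hypothetical usage with a Hugging Face decoder-only model:
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(direction))
# outputs = model.generate(**inputs)
# handle.remove()
```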
[26] SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems
Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen
Main category: cs.CL
TL;DR: SGSimEval is a benchmark for evaluating automatic survey generation (ASG) systems, addressing limitations like biased metrics and lack of human preference by integrating outline, content, and reference assessments with LLM-based and human preference metrics.
Details
Motivation: Existing evaluation methods for ASG are flawed due to biased metrics, lack of human preference, and over-reliance on LLMs-as-judges.
Method: Proposes SGSimEval, a benchmark combining LLM-based scoring, quantitative metrics, and human preference metrics to evaluate outline, content, and references in ASG systems.
Result: Current ASG systems excel in outline generation but need improvement in content and reference generation; SGSimEval metrics align well with human assessments.
Conclusion: SGSimEval provides a robust, multifaceted evaluation framework for ASG, highlighting areas for improvement and maintaining consistency with human judgments.
Abstract: The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human-comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments.
[27] LLM Compression: How Far Can We Go in Balancing Size and Performance?
Sahil Sk, Debasish Dhal, Sonal Khosla, Sk Shahid, Sambit Shekhar, Akash Dhaka, Shantipriya Parida, Dilip K. Prasad, Ondřej Bojar
Main category: cs.CL
TL;DR: The study evaluates 4-bit Group Scaling Quantization (GSQ) and GPTQ on LLaMA, Qwen, and PHI models, benchmarking their performance on NLP tasks like MS MARCO, BoolQ, and GSM8K to analyze trade-offs between compression and accuracy.
Details
Motivation: To improve accessibility of large language models by reducing memory and computational costs while maintaining performance through quantization.
Method: Applied GSQ and GPTQ to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, then benchmarked on MS MARCO, BoolQ, and GSM8K datasets.
Result: Analyzed accuracy, inference latency, and throughput to measure trade-offs between model compression and task performance.
Conclusion: Provides insights into low-bit quantization suitability for real-world deployment, helping users decide based on specifications.
Abstract: Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.
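For readers who want to reproduce the GPTQ side of such a study, the Hugging Face transformers quantization API offers a compact route (it requires the optimum and auto-gptq backends to be installed). The model identifier below is an illustrative stand-in for the paper's "LLaMA 1B", and the GSQ procedure is not shown.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Illustrative 4-bit GPTQ quantization; model name is a hypothetical
# stand-in, and the calibration dataset choice is an assumption.
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
model.save_pretrained("llama-1b-gptq-4bit")   # quantized weights for benchmarking
```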
[28] SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis
Haitong Luo, Weiyao Zhang, Suhang Wang, Wenji Zou, Chungang Lin, Xuying Meng, Yujun Zhang
Main category: cs.CL
TL;DR: The paper introduces SpecDetect and SpecDetect++, novel methods for detecting LLM-generated text by analyzing token log-probabilities in the frequency domain, outperforming existing approaches in efficiency and accuracy.
Details
Motivation: The need for reliable detection of LLM-generated text due to its proliferation, as current methods rely on surface-level statistics and miss deeper signal properties.
Method: Reframes detection as a signal processing problem, using global DFT and local STFT to analyze spectral properties of token log-probabilities. SpecDetect uses DFT total energy, while SpecDetect++ adds a sampling discrepancy mechanism.
Result: Human-written text shows higher spectral energy, enabling SpecDetect and SpecDetect++ to outperform state-of-the-art models in accuracy and speed.
Conclusion: Signal processing techniques provide an efficient, interpretable solution for LLM-generated text detection, demonstrating their power for modern challenges.
Abstract: The proliferation of high-quality text from Large Language Models (LLMs) demands reliable and efficient detection methods. While existing training-free approaches show promise, they often rely on surface-level statistics and overlook fundamental signal properties of the text generation process. In this work, we reframe detection as a signal processing problem, introducing a novel paradigm that analyzes the sequence of token log-probabilities in the frequency domain. By systematically analyzing the signal’s spectral properties using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT), we find that human-written text consistently exhibits significantly higher spectral energy. This higher energy reflects the larger-amplitude fluctuations inherent in human writing compared to the suppressed dynamics of LLM-generated text. Based on this key insight, we construct SpecDetect, a detector built on a single, robust feature from the global DFT: DFT total energy. We also propose an enhanced version, SpecDetect++, which incorporates a sampling discrepancy mechanism to further boost robustness. Extensive experiments demonstrate that our approach outperforms the state-of-the-art model while running in nearly half the time. Our work introduces a new, efficient, and interpretable pathway for LLM-generated text detection, showing that classical signal processing techniques offer a surprisingly powerful solution to this modern challenge.
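The core DFT-total-energy feature is simple enough to sketch directly: treat the sequence of token log-probabilities as a signal, remove its mean, and sum the squared magnitudes of the spectrum. The mean removal, length normalization, and thresholding rule below are assumptions about preprocessing, not the paper's exact recipe.

```python
import numpy as np

def dft_total_energy(log_probs):
    """SpecDetect-style feature: total spectral energy of the sequence of
    token log-probabilities. Preprocessing details are assumptions."""
    signal = np.asarray(log_probs, dtype=np.float64)
    signal = signal - signal.mean()              # drop the DC component
    spectrum = np.fft.rfft(signal)
    return float(np.sum(np.abs(spectrum) ** 2) / len(signal))

# Hypothetical decision rule: human text tends to show *higher* energy,
# so low energy suggests LLM-generated text.
# is_llm = dft_total_energy(token_log_probs) < threshold
```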
[29] Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning
Sylvio Rüdian, Yassin Elsir, Marvin Kretschmer, Sabine Cayrou, Niels Pinkwart
Main category: cs.CL
TL;DR: The paper explores using Llama 3.1 to extract feedback indicators from student submissions, showing strong alignment with human ratings, and suggests potential for auto-generating formative feedback.
Details
Motivation: Automated feedback can improve learning efficiency and reduce teacher workload by providing timely, targeted feedback, requiring reliable indicator extraction.
Method: The study uses Llama 3.1 to extract indicators from student submissions and compares them with human ratings across feedback criteria.
Result: Strong correlations between LLM-generated and human-rated indicators were found, even for unexpected combinations.
Conclusion: The method shows promise for extracting indicators to auto-generate transparent formative feedback in future research.
Abstract: Automated feedback generation has the potential to enhance students’ learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students’ submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students’ submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.
[30] When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov
Main category: cs.CL
TL;DR: The paper evaluates 5 methods to improve prompt robustness in LLMs, testing them on 8 models and 52 tasks, and extends analysis to GPT-4.1 and DeepSeek V3.
Details
Motivation: LLMs are sensitive to non-semantic prompt variations, necessitating systematic evaluation of robustness methods.
Method: Benchmarked 5 robustness techniques across 8 models and 52 tasks, covering fine-tuned and in-context learning paradigms.
Result: Provides insights into the effectiveness of robustness methods for stable LLM performance.
Conclusion: Offers actionable guidance for practitioners to enhance LLM reliability in real-world applications.
Abstract: Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen and Gemma families across 52 tasks from the Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models’ current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.
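A minimal sketch of the kind of non-semantic format perturbations such an evaluation applies; the variant set here is illustrative rather than the benchmark's actual perturbation taxonomy.

```python
def format_variants(prompt):
    """Non-semantic format perturbations of a prompt (punctuation, casing,
    separators); an illustrative set, not the paper's taxonomy."""
    return [
        prompt,
        prompt.rstrip(".") + ".",
        prompt.rstrip(".") + ":",
        prompt.upper(),
        prompt.replace(" ", "  "),               # doubled whitespace
        "Input: " + prompt + "\nOutput:",
    ]

# Robustness can then be summarized as the accuracy spread across variants,
# e.g. spread = max(accs) - min(accs) for each task.
```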
[31] Retrieval-augmented reasoning with lean language models
Ryan Sze-Yin Chan, Federico Nanni, Tomas Lazauskas, Rosie Wood, Penelope Yong, Lionel Tarassenko, Mark Girolami, James Geddes, Andrew Duncan
Main category: cs.CL
TL;DR: A novel approach combines reasoning and retrieval augmented generation (RAG) in a lean language model, focusing on performance and privacy for resource-constrained environments.
Details
Motivation: Addresses the need for efficient, privacy-preserving RAG systems deployable in secure or limited-resource settings.
Method: Integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces from frontier models over a curated corpus (NHS A-to-Z).
Result: Domain-specific fine-tuning improves answer accuracy and consistency, nearing frontier-level performance while enabling local deployment.
Conclusion: The approach is feasible for local use, with code released for reproducibility and adaptation across domains.
Abstract: This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
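As a sketch of the retrieval step in such a lean pipeline, the snippet below uses sentence-transformers for dense retrieval over a toy corpus. The embedding model and documents are placeholders, not the NHS A-to-Z setup or the authors' released code.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal dense-retrieval step of a lean RAG pipeline (illustrative).
retriever = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Asthma is a common lung condition that causes breathing difficulties.",
    "A migraine is a moderate or severe headache felt on one side of the head.",
]
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

query = "What helps with recurring headaches?"
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=1)[0]
context = corpus[hits[0]["corpus_id"]]
# `context` is then prepended to the prompt so the fine-tuned
# Qwen2.5-Instruct backbone can reason over it.
```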
[32] Model Interpretability and Rationale Extraction by Input Mask Optimization
Marc Brinner, Sina Zarriess
Main category: cs.CL
TL;DR: A new gradient-based method generates extractive explanations for neural network predictions by masking non-indicative input parts, ensuring sufficiency, comprehensiveness, and compactness.
Details
Motivation: Address the need for explainable predictions in black-box neural networks, bridging interpretability and rationale extraction.
Method: Gradient-based optimization with regularization for masking non-indicative input parts, applied to text and image inputs.
Result: High-quality explanations for both text and image classifications, showing broader applicability of rationale extraction principles.
Conclusion: Rationale extraction can be performed without specialized models, using only trained classifiers, and applies across input types.
Abstract: Concurrent with the rapid progress in the development of neural-network-based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen steadily. We propose a new method to generate extractive explanations for predictions made by neural networks, based on masking the parts of the input which the model does not consider indicative of the respective class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness and compactness of the generated explanation, three properties that are known to be desirable from the related field of rationale extraction in natural language processing. In this way, we bridge the gap between model interpretability and rationale extraction, thereby showing that the latter can be performed without training a specialized model, only on the basis of a trained classifier. We further apply the same method to image inputs and obtain high-quality explanations for image classifications, which indicates that the conditions proposed for rationale extraction in natural language processing are more broadly applicable to different input types.
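A minimal sketch of gradient-based mask optimization for a sequence classifier follows. It keeps only a sufficiency term and a compactness penalty, whereas the paper's regularization scheme also enforces comprehensiveness; the model is assumed to accept `inputs_embeds` and return `.logits`, and all hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def optimize_input_mask(model, embeds, target, steps=200, lam=0.01):
    """Gradient-based search for a soft mask over input embeddings that
    preserves the target prediction (sufficiency) while masking as much as
    possible (compactness). Comprehensiveness, used in the paper, is omitted."""
    mask_logits = torch.zeros(embeds.shape[:2], device=embeds.device,
                              requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=0.1)
    for _ in range(steps):
        mask = torch.sigmoid(mask_logits).unsqueeze(-1)   # (B, T, 1)
        logits = model(inputs_embeds=embeds * mask).logits
        sufficiency = F.cross_entropy(logits, target)     # keep the prediction
        compactness = mask.mean()                         # prefer small masks
        loss = sufficiency + lam * compactness
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits).detach()            # token relevance scores
```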
[33] Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training
Marc Brinner, Sina Zarrieß
Main category: cs.CL
TL;DR: An end-to-end differentiable training paradigm for stable transformer classifier training, simplifying the three-player-game approach into a single model, improving efficiency and alignment with human annotations.
Details
Motivation: To address training instabilities in existing rationalized models and improve alignment with human annotations without explicit supervision.
Method: Proposes a single model fulfilling the roles of rationale selector, classifier, and complement classifier, extending to class-wise rationales with advanced parameterization and regularization.
Result: Achieves state-of-the-art alignment with human annotations and more stable training.
Conclusion: The simplified approach offers efficient, stable training and improved rationale quality.
Abstract: We propose an end-to-end differentiable training paradigm for stable training of a rationalized transformer classifier. Our approach results in a single model that simultaneously classifies a sample and scores input tokens based on their relevance to the classification. To this end, we build on the widely-used three-player-game for training rationalized models, which typically relies on training a rationale selector, a classifier and a complement classifier. We simplify this approach by making a single model fulfill all three roles, leading to a more efficient training paradigm that is not susceptible to the common training instabilities that plague existing approaches. Further, we extend this paradigm to produce class-wise rationales while incorporating recent advances in parameterizing and regularizing the resulting rationales, thus leading to substantially improved and state-of-the-art alignment with human annotations without any explicit supervision.
[34] Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions
Shangrui Nie, Florian Mai, David Kaczér, Charles Welch, Zhixue Zhao, Lucie Flek
Main category: cs.CL
TL;DR: The paper explores fine-tuning LLMs on value survey questions to modify their implicit value systems, showing effectiveness in both in-domain and out-of-domain tasks.
Details
Motivation: To investigate if a model's value system can be reliably altered without extensive training data by using value survey questions.
Method: Construct value profiles of LLMs, fine-tune them on value surveys, and evaluate changes in behavior on in-domain and out-of-domain tasks.
Result: Fine-tuning successfully changes model answers to survey questions and aligns behavior in downstream tasks like moral judgments and text-based games.
Conclusion: A simple fine-tuning approach can effectively modify LLMs’ value systems and influence their behavior in diverse contexts.
Abstract: Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model’s value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of finetuning on the model’s behavior in two ways; first, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model’s behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model’s behavior in text-based adventure games. We demonstrate that our simple approach can not only change the model’s answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.
[35] HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor
Shivam Dubey
Main category: cs.CL
TL;DR: HumorPlanSearch improves AI humor generation by modeling context through modular steps like Plan-Search, HuCoT, and iterative revisions, boosting performance by 15.4%.
Details
Motivation: Existing LLM-based humor generation lacks context sensitivity, leading to generic or tone-deaf jokes.
Method: Uses Plan-Search, HuCoT, Knowledge Graph, novelty filtering, and iterative revisions to model context.
Result: Achieves a 15.4% boost in Humor Generation Score (HGS) over baselines.
Conclusion: HumorPlanSearch enhances AI humor by prioritizing context, coherence, and cultural adaptation.
Abstract: Automated humor generation with Large Language Models (LLMs) often yields jokes that feel generic, repetitive, or tone-deaf because humor is deeply situated and hinges on the listener’s cultural background, mindset, and immediate context. We introduce HumorPlanSearch, a modular pipeline that explicitly models context through: (1) Plan-Search for diverse, topic-tailored strategies; (2) Humor Chain-of-Thought (HuCoT) templates capturing cultural and stylistic reasoning; (3) a Knowledge Graph to retrieve and adapt high-performing historical strategies; (4) novelty filtering via semantic embeddings; and (5) an iterative judge-driven revision loop. To evaluate context sensitivity and comedic quality, we propose the Humor Generation Score (HGS), which fuses direct ratings, multi-persona feedback, pairwise win-rates, and topic relevance. In experiments across nine topics with feedback from 13 human judges, our full pipeline (KG + Revision) boosts mean HGS by 15.4 percent (p < 0.05) over a strong baseline. By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy.
[36] Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse
Aditi Dutta, Susan Banducci
Main category: cs.CL
TL;DR: LLMs often misclassify anti-sexist speech as harmful, especially during politically charged events, risking the silencing of marginalized voices. The study suggests moderation improvements like human review and inclusive training data.
Details
Motivation: To address the challenge of automated content moderation systems misclassifying anti-sexist speech as harmful, particularly during high-salience political events.
Method: Analyzed how five LLMs classify sexist, anti-sexist, and neutral political tweets from the UK in 2022, focusing on trigger events involving female MPs.
Result: Models frequently misclassified anti-sexist speech as harmful, especially during politically charged events where harm and resistance rhetoric overlapped.
Conclusion: Moderation systems need to move beyond binary classifications, incorporate human review, and include counter-speech in training data to protect resistance speech.
Abstract: Anti-sexist speech, i.e., public expressions that challenge or resist gendered abuse and sexism, plays a vital role in shaping democratic debate online. Yet automated content moderation systems, increasingly powered by large language models (LLMs), may struggle to distinguish such resistance from the sexism it opposes. This study examines how five LLMs classify sexist, anti-sexist, and neutral political tweets from the UK, focusing on high-salience trigger events involving female Members of Parliament in the year 2022. Our analysis shows that models frequently misclassify anti-sexist speech as harmful, particularly during politically charged events where rhetorical styles of harm and resistance converge. These errors risk silencing those who challenge sexism, with disproportionate consequences for marginalised voices. We argue that moderation design must move beyond binary harmful/not-harmful schemas, integrate human-in-the-loop review during sensitive events, and explicitly include counter-speech in training data. By linking feminist scholarship, event-based analysis, and model evaluation, this work highlights the sociotechnical challenges of safeguarding resistance speech in digital political spaces.
[37] CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity
Bowen Zhang, Zixin Song, Chunquan Chen, Qian-Wen Zhang, Di Yin, Xing Sun
Main category: cs.CL
TL;DR: CoDiEmb is a unified framework for learning text embeddings that effectively balances Information Retrieval (IR) and Semantic Textual Similarity (STS) tasks by decoupling task-specific signals and integrating specialized objectives, dynamic sampling, and delta-guided fusion.
Details
Motivation: Negative transfer and performance trade-offs in joint training of IR and STS tasks motivate the need for a systematic approach to decouple task-specific learning signals.
Method: CoDiEmb introduces task-specialized objectives with dynamic sampling, delta-guided model fusion, and a single-stage training pipeline to optimize IR and STS jointly.
Result: Experiments on 15 benchmarks show CoDiEmb mitigates cross-task trade-offs and improves embedding space properties.
Conclusion: CoDiEmb successfully reconciles IR and STS requirements, offering a stable and effective solution for unified text embeddings.
Abstract: Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter’s deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.
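The delta-guided fusion idea can be sketched as per-parameter interpolation weighted by each checkpoint's deviation from the pre-trained initialization; the exact weighting in the paper is finer-grained than this illustration.

```python
import torch

def delta_guided_merge(pretrained, ckpt_a, ckpt_b):
    """Merge two fine-tuned checkpoints (state dicts) with per-parameter
    weights proportional to each parameter's deviation from the pre-trained
    initialization. A sketch of the idea, not the paper's exact scheme."""
    merged = {}
    for name, p0 in pretrained.items():
        delta_a = (ckpt_a[name] - p0).abs()
        delta_b = (ckpt_b[name] - p0).abs()
        w = delta_a / (delta_a + delta_b + 1e-8)     # larger delta, larger say
        merged[name] = p0 + w * (ckpt_a[name] - p0) + (1 - w) * (ckpt_b[name] - p0)
    return merged
```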
[38] Reference Points in LLM Sentiment Analysis: The Role of Structured Context
Junichiro Niimi
Main category: cs.CL
TL;DR: The study explores how supplementary information in JSON format improves sentiment analysis in LLMs for marketing, outperforming baselines without fine-tuning.
Details
Motivation: Marketing theories suggest customer evaluations depend on more than just review text, prompting investigation into how additional reference points affect sentiment analysis.
Method: Compares natural language and JSON-formatted prompts using a 3B parameter LLM on Yelp Restaurant and Nightlife data.
Result: JSON prompts with extra info boost Macro-F1 by 1.6% and 4%, reduce RMSE by 16% and 9.1%, and enable edge-device deployment.
Conclusion: Structured prompting helps smaller models match larger ones, offering a practical alternative to resource-heavy deployments.
Abstract: Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation–disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable in resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment.
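The following sketch shows what a JSON-formatted prompt with additional reference points might look like; the field names and values are invented for illustration and are not taken from the paper.

```python
import json

# Illustrative JSON-formatted sentiment prompt with reference-point fields.
record = {
    "review_text": "The tacos were great but we waited 40 minutes.",
    "business_category": "Restaurant",
    "average_business_rating": 4.2,       # assumed reference point
    "reviewer_average_rating": 3.5,       # assumed reference point
}
prompt = (
    "Classify the sentiment of the review in the following JSON record "
    "as positive, neutral, or negative.\n"
    + json.dumps(record, indent=2)
)
```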
[39] Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models
Monika Jotautaitė, Lucius Caviola, David A. Brewster, Thilo Hagendorff
Main category: cs.CL
TL;DR: The paper examines speciesist bias in large language models (LLMs), revealing their tendency to normalize harm toward non-human animals while reflecting mixed ethical views.
Details
Motivation: To investigate whether LLMs exhibit speciesist bias and how they value non-human animals, addressing gaps in AI fairness and alignment frameworks.
Method: Three paradigms were used: SpeciesismBench (a benchmark for speciesist statements), psychological measures comparing LLMs to humans, and text-generation tasks probing speciesist rationalizations.
Result: LLMs detected speciesist statements but rarely condemned them, showing mixed ethical tendencies. They prioritized cognitive capacity over species, often rationalizing harm to farmed animals.
Conclusion: Expanding AI fairness frameworks to include non-human moral patients is crucial to reduce speciesist biases in LLMs and their societal influence.
Abstract: As large language models (LLMs) become more widely deployed, it is crucial to examine their ethical tendencies. Building on research on fairness and discrimination in AI, we investigate whether LLMs exhibit speciesist bias – discrimination based on species membership – and how they value non-human animals. We systematically examine this issue across three paradigms: (1) SpeciesismBench, a 1,003-item benchmark assessing recognition and moral evaluation of speciesist statements; (2) established psychological measures comparing model responses with those of human participants; (3) text-generation tasks probing elaboration on, or resistance to, speciesist rationalizations. In our benchmark, LLMs reliably detected speciesist statements but rarely condemned them, often treating speciesist attitudes as morally acceptable. On psychological measures, results were mixed: LLMs expressed slightly lower explicit speciesism than people, yet in direct trade-offs they more often chose to save one human over multiple animals. A tentative interpretation is that LLMs may weight cognitive capacity rather than species per se: when capacities were equal, they showed no species preference, and when an animal was described as more capable, they tended to prioritize it over a less capable human. In open-ended text generation tasks, LLMs frequently normalized or rationalized harm toward farmed animals while refusing to do so for non-farmed animals. These findings suggest that while LLMs reflect a mixture of progressive and mainstream human views, they nonetheless reproduce entrenched cultural norms around animal exploitation. We argue that expanding AI fairness and alignment frameworks to explicitly include non-human moral patients is essential for reducing these biases and preventing the entrenchment of speciesist attitudes in AI systems and the societies they influence.
[40] Language models align with brain regions that represent concepts across modalities
Maria Ryskina, Greta Tuckute, Alexander Fung, Ashley Malkin, Evelina Fedorenko
Main category: cs.CL
TL;DR: The paper explores how language models (LMs) align with brain activity, focusing on linguistic processing and cross-modal conceptual meaning. Findings suggest LMs may represent meaning consistently across modalities.
Details
Motivation: To understand whether LMs internally represent conceptual meaning beyond language, by comparing LM-brain alignment with neural metrics for linguistic processing and meaning consistency.
Method: Analyzed fMRI data to measure brain activation during sentence processing and introduced a novel metric for meaning consistency across input modalities (sentence, word cloud, image). Evaluated LM-brain alignment using these metrics.
Result: Both language-only and language-vision models predicted brain signals better in meaning-consistent areas, even if these areas were not strongly language-sensitive, indicating LMs may encode cross-modal conceptual meaning.
Conclusion: LMs likely represent conceptual meaning consistently across modalities, suggesting their internal representations align with brain regions processing meaning beyond language.
Abstract: Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. As the same problem arises in today’s language models (LMs), we investigate the relationship between LM–brain alignment and two neural metrics: (1) the level of brain activation during processing of sentences, targeting linguistic processing, and (2) a novel measure of meaning consistency across input modalities, which quantifies how consistently a brain region responds to the same concept across paradigms (sentence, word cloud, image) using an fMRI dataset (Pereira et al., 2018). Our experiments show that both language-only and language-vision models predict the signal better in more meaning-consistent areas of the brain, even when these areas are not strongly sensitive to language processing, suggesting that LMs might internally represent cross-modal conceptual meaning.
[41] AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment
Jinpeng Hu, Ao Wang, Qianqian Xie, Hui Ma, Zhuo Li, Dan Guo
Main category: cs.CL
TL;DR: A multi-agent framework for mental health evaluation simulates clinical dialogues, using adaptive questioning and tree-structured memory to improve assessment accuracy.
Details
Motivation: Traditional mental health assessments are limited by clinician shortages and static text analysis, prompting the need for dynamic, interactive AI solutions.
Method: The framework employs specialized agents for questioning, evaluation, scoring, and updating, with adaptive questioning and tree-structured memory for dynamic interaction.
Result: The method outperforms existing approaches on the DAIC-WOZ dataset, demonstrating enhanced information extraction and contextual tracking.
Conclusion: The proposed multi-agent framework offers a more effective and dynamic solution for automated mental health assessment.
Abstract: Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. We introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user’s basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and further enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.
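A minimal sketch of a tree-structured dialogue memory of the kind described, with a root node for the user's basics and child nodes per symptom topic; the schema and field names are assumptions, not the paper's.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    """Tree-structured dialogue memory: the root holds user basics; child
    nodes group statements by symptom topic and interaction turn."""
    label: str
    content: str = ""
    children: list = field(default_factory=list)

    def add(self, label, content=""):
        node = MemoryNode(label, content)
        self.children.append(node)
        return node

# Hypothetical usage during an assessment dialogue:
root = MemoryNode("user", "age 34, referred for low mood")
sleep = root.add("topic:sleep")
sleep.add("statement", "turn 3: reports waking at 4am most nights")
```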
[42] Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models
Qiguang Chen, Dengyun Peng, Jinhao Liu, HuiKang Su, Jiannan Guan, Libo Qin, Wanxiang Che
Main category: cs.CL
TL;DR: DR. SAF improves LLM efficiency by dynamically adjusting reasoning depth, reducing tokens by 49.27% with minimal accuracy loss.
Details
Motivation: Current CoT methods in LLMs are inefficient due to redundancy and misaligned human-defined difficulty priors.
Method: DR. SAF uses Boundary Self-Awareness Alignment, Adaptive Reward Management, and Boundary Preservation to optimize reasoning.
Result: Achieves 49.27% fewer tokens, 6.59x token efficiency, 5x faster training, and 16% accuracy gain in extreme cases.
Conclusion: DR. SAF balances efficiency and accuracy, making it ideal for resource-limited settings.
Abstract: Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM’s own perception of difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.
[43] Dataset Creation for Visual Entailment using Generative AI
Rob Reijtenbach, Suzan Verberne, Gijs Wijnholds
Main category: cs.CL
TL;DR: A new synthetic dataset for visual entailment is created using Stable Diffusion and SNLI, showing minimal performance drop compared to real data.
Details
Motivation: Existing visual entailment datasets are small and sparse, and manual creation is labor-intensive.
Method: Generate synthetic images from SNLI textual premises using Stable Diffusion, then evaluate intrinsically and extrinsically with CLIP-based classifiers.
Result: Synthetic data performs slightly worse (F-score 0.686 vs. 0.703 on SNLI-VE; 0.384 vs. 0.400 on SICK-VTE).
Conclusion: Synthetic data is a viable solution for visual entailment in data-sparse settings.
Abstract: In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data only leads to a slight drop in quality on SNLI-VE, with an F-score 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
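Generating an image premise from an SNLI text premise can be sketched with the diffusers library; the checkpoint name is one common Stable Diffusion release, not necessarily the one the authors used, and the example premise is invented.

```python
import torch
from diffusers import StableDiffusionPipeline

# Turn an SNLI textual premise into an image premise (illustrative setup).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

premise = "A man in a blue shirt is riding a bicycle down the street."
image = pipe(premise).images[0]
image.save("premise_0.png")   # paired with the SNLI hypothesis and label
```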
[44] TinyTim: A Family of Language Models for Divergent Generation
Christopher J. Agostino
Main category: cs.CL
TL;DR: TinyTim, a model fine-tuned on ‘Finnegans Wake,’ shows high lexical diversity but low semantic coherence, useful for creative applications.
Details
Motivation: To explore how specialized language models can serve as divergent knowledge sources in creative architectures.
Method: Fine-tuning large language models on ‘Finnegans Wake’ and evaluating against baselines.
Result: TinyTim V1 exhibits high lexical diversity and low semantic coherence.
Conclusion: Specialized models like TinyTim can enhance automated discovery in creative settings.
Abstract: This work introduces TinyTim, a family of large language models fine-tuned on James Joyce’s ‘Finnegans Wake’. Through quantitative evaluation against baseline models, we demonstrate that TinyTim V1 produces a statistically distinct generative profile characterized by high lexical diversity and low semantic coherence. These findings are interpreted through theories of creativity and complex problem-solving, arguing that such specialized models can function as divergent knowledge sources within more extensive creative architectures, powering automated discovery mechanisms in diverse settings.
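As a hypothetical illustration of the lexical-diversity side of such an evaluation, a type-token ratio is one simple proxy; the paper's actual metrics may differ.

```python
def type_token_ratio(text):
    """Simple lexical-diversity proxy: unique tokens over total tokens.
    One plausible way to quantify high lexical diversity, assumed here."""
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

# A divergent generator like TinyTim would be expected to score higher
# than a baseline model on samples of equal length.
```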
[45] A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems
Zihao Yi, Jiarui Ouyang, Zhe Xu, Yuwen Liu, Tianhao Liao, Haohao Luo, Ying Shen
Main category: cs.CL
TL;DR: A survey on multi-turn dialogue systems, focusing on LLM-based approaches, covering existing models, adaptation methods, recent advances, datasets, evaluation metrics, and future research directions.
Details
Motivation: To summarize and analyze the current state of LLM-based multi-turn dialogue systems, highlighting advancements, challenges, and future opportunities.
Method: Review and synthesis of existing literature on LLMs, their adaptation to dialogue tasks, and analysis of ODD and TOD systems, datasets, and metrics.
Result: Comprehensive overview of LLM-based dialogue systems, including recent progress, tools, and evaluation frameworks.
Conclusion: Identifies key research gaps and future directions for improving multi-turn dialogue systems leveraging LLMs.
Abstract: This survey provides a comprehensive review of research on multi-turn dialogue systems, with a particular focus on multi-turn dialogue systems based on large language models (LLMs). This paper aims to (a) give a summary of existing LLMs and approaches for adapting LLMs to downstream tasks; (b) elaborate recent advances in multi-turn dialogue systems, covering both LLM-based open-domain dialogue (ODD) and task-oriented dialogue (TOD) systems, along with datasets and evaluation metrics; (c) discuss some future emphasis and recent research problems arising from the development of LLMs and the increasing demands on multi-turn dialogue systems.
[46] Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding
Yanming Liu, Xinyue Peng, Jiannan Cao, Yanxin Shen, Tianyu Du, Sheng Cheng, Xun Wang, Jianwei Yin, Xuhong Zhang
Main category: cs.CL
TL;DR: The paper introduces the LQCA method to improve LLMs’ performance in understanding long contexts and question answering by focusing on coreference resolution.
Details
Motivation: LLMs struggle with lengthy contexts and effective question answering due to complexity and ambiguity in longer texts.
Method: The LQCA method involves four steps: resolving coreferences in sub-documents, computing mention distances, defining representative mentions, and answering questions via mention replacement.
Result: Experiments show notable improvements on models like OpenAI-o1-mini and GPT-4o, proving LQCA’s effectiveness in bridging context gaps.
Conclusion: LQCA enhances LLMs’ ability to handle long contexts through systematic coreference resolution, with publicly available code.
Abstract: Large language models (LLMs) have shown remarkable capabilities in natural language processing; however, they still face difficulties when tasked with understanding lengthy contexts and executing effective question answering. These challenges often arise due to the complexity and ambiguity present in longer texts. To enhance the performance of LLMs in such scenarios, we introduce the Long Question Coreference Adaptation (LQCA) method. This innovative framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. The LQCA method encompasses four key steps: resolving coreferences within sub-documents, computing the distances between mentions, defining a representative mention for coreference, and answering questions through mention replacement. By processing information systematically, the framework provides easier-to-handle partitions for LLMs, promoting better understanding. Experimental evaluations on a range of LLMs and datasets have yielded positive results, with notable improvements on the OpenAI-o1-mini and GPT-4o models, highlighting the effectiveness of leveraging coreference resolution to bridge context gaps in question answering. Our code is public at https://github.com/OceannTwT/LQCA.
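The final mention-replacement step can be sketched as follows, assuming coreference clusters and representative mentions have already been computed upstream; the word-boundary matching and the example cluster are illustrative, not the paper's implementation.

```python
import re

def replace_with_representative(text, clusters):
    """Rewrite a passage so each coreferent mention is replaced by its
    cluster's representative mention. `clusters` maps a representative
    string to its other mentions; a sketch ignoring nested spans."""
    for representative, mentions in clusters.items():
        for mention in mentions:
            text = re.sub(rf"\b{re.escape(mention)}\b", representative, text)
    return text

clusters = {"Dr. Chen": ["she", "the doctor"]}
print(replace_with_representative(
    "Dr. Chen reviewed the scan. Later, she called the patient.", clusters))
# -> "Dr. Chen reviewed the scan. Later, Dr. Chen called the patient."
```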
[47] RULEBREAKERS: Challenging LLMs at the Crossroads between Formal Logic and Human-like Reasoning
Jason Chan, Robert Gaizauskas, Zhixue Zhao
Main category: cs.CL
TL;DR: The paper introduces RULEBREAKERS, a dataset to evaluate LLMs’ ability to handle rulebreaker scenarios, revealing their limitations in human-like reasoning.
Details
Motivation: To assess LLMs' reasoning in rulebreaker scenarios, where formal logic fails to align with human common sense.
Method: Created the RULEBREAKERS dataset and evaluated seven LLMs, including GPT-4o, on their ability to recognize and respond to rulebreakers.
Result: Most LLMs performed poorly, over-applying logical rules and underutilizing world knowledge, diverging from human reasoning.
Conclusion: Highlights limitations of LLMs and cautions against over-reliance on formal logic for improving reasoning, as it may widen the gap between AI and human-like reasoning.
Abstract: Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterizes as “rulebreaker” scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by works in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognize and respond to rulebreakers (versus non-rulebreakers) in a human-like manner. Evaluating seven LLMs, we find that most models, including GPT-4o, achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules unlike what is expected from typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models’ poor utilization of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs’ general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning.
[48] Personalized LLM for Generating Customized Responses to the Same Query from Different Users
Hang Zeng, Chaoyue Niu, Fan Wu, Chengfei Lv, Guihai Chen
Main category: cs.CL
TL;DR: The paper introduces a querier-aware LLM personalization method, using a dual-tower model and contrastive learning to tailor responses to individual queriers, validated by a new dataset (MQDialog) and outperforming baselines.
Details
Motivation: Existing LLM personalization methods ignore querier diversity, limiting response customization for different users.
Method: Proposes a dual-tower model (general and querier-specific encoders) with contrastive learning and query clustering to enhance personalization.
Result: Achieves 8.4% to 48.7% ROUGE-L improvement and 54% to 82% win rates over baselines.
Conclusion: The approach effectively improves personalized response generation by addressing querier diversity and query variability.
Abstract: Existing work on large language model (LLM) personalization assigned different responding roles to LLMs, but overlooked the diversity of queriers. In this work, we propose a new form of querier-aware LLM personalization, generating different responses even for the same query from different queriers. We design a dual-tower model architecture with a cross-querier general encoder and a querier-specific encoder. We further apply contrastive learning with multi-view augmentation, pulling close the dialogue representations of the same querier, while pulling apart those of different queriers. To mitigate the impact of query diversity on querier-contrastive learning, we cluster the dialogues based on query similarity and restrict the scope of contrastive learning within each cluster. To address the lack of datasets designed for querier-aware personalization, we also build a multi-querier dataset from English and Chinese scripts, as well as WeChat records, called MQDialog, containing 173 queriers and 12 responders. Extensive evaluations demonstrate that our design significantly improves the quality of personalized response generation, achieving relative improvement of 8.4% to 48.7% in ROUGE-L scores and winning rates ranging from 54% to 82% compared with various baseline methods.
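A sketch of the querier-contrastive objective, treating in-batch dialogue representations from the same querier as positives; the paper additionally uses multi-view augmentation and restricts negatives within query-similarity clusters, which this illustration omits.

```python
import torch
import torch.nn.functional as F

def querier_contrastive_loss(dialogue_emb, querier_ids, temperature=0.1):
    """In-batch contrastive loss pulling together representations of the
    same querier and pushing apart different queriers. A SupCon-style
    sketch; batching and augmentation details are assumptions."""
    z = F.normalize(dialogue_emb, dim=-1)
    sim = z @ z.T / temperature                          # (B, B) similarities
    same = querier_ids.unsqueeze(0) == querier_ids.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                    # positives: same querier
    logits = sim.masked_fill(eye, float("-inf"))         # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos.sum(1).clamp(min=1)
    per_anchor = -(log_prob * pos).sum(1) / pos_counts
    return per_anchor[pos.any(1)].mean()                 # anchors with positives
```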
[49] A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
Main category: cs.CL
TL;DR: The paper proposes a dual-perspective NLG meta-evaluation framework to address limitations in traditional approaches, offering better interpretability and automated benchmark construction without new human annotations.
Details
Motivation: Traditional NLG meta-evaluation methods have issues with human ratings and correlation measure selection, limiting their effectiveness.
Method: A dual-perspective framework is introduced, focusing on different evaluation capabilities, and a method for automated benchmark construction is proposed.
Result: Experiments with 16 LLMs as evaluators analyze their performance comprehensively under the new framework.
Conclusion: The proposed framework improves meta-evaluation interpretability and reduces reliance on human annotations.
Abstract: In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.
[50] Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
Yin Wu, Quanyu Long, Jing Li, Jianfei Yu, Wenya Wang
Main category: cs.CL
TL;DR: Visual-RAG is a benchmark for evaluating how multimodal LLMs use retrieved images in knowledge-intensive QA, showing current models struggle with visual evidence.
Details
Motivation: Existing benchmarks for multimodal RAG focus on text retrieval, lacking assessment of visual evidence utilization.
Method: Introduces Visual-RAG, a QA benchmark requiring text-to-image retrieval and visual evidence integration for answers.
Result: Images provide strong evidence, but even top models struggle with visual knowledge extraction and use.
Conclusion: Highlights the need for better visual retrieval, grounding, and attribution in multimodal RAG systems.
Abstract: Retrieval-augmented generation (RAG) is a paradigm that augments large language models (LLMs) with external knowledge to tackle knowledge-intensive question answering. While several benchmarks evaluate Multimodal LLMs (MLLMs) under Multimodal RAG settings, they predominantly retrieve from textual corpora and do not explicitly assess how models exploit visual evidence during generation. Consequently, there still lacks benchmark that isolates and measures the contribution of retrieved images in RAG. We introduce Visual-RAG, a question-answering benchmark that targets visually grounded, knowledge-intensive questions. Unlike prior work, Visual-RAG requires text-to-image retrieval and the integration of retrieved clue images to extract visual evidence for answer generation. With Visual-RAG, we evaluate 5 open-source and 3 proprietary MLLMs, showcasing that images provide strong evidence in augmented generation. However, even state-of-the-art models struggle to efficiently extract and utilize visual knowledge. Our results highlight the need for improved visual retrieval, grounding, and attribution in multimodal RAG systems.
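The text-to-image retrieval step such a benchmark requires can be sketched with CLIP embeddings; the checkpoint and the file names below are placeholders for an image corpus, not the benchmark's own retriever.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal text-to-image retrieval step of a Visual-RAG-style pipeline.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["img_0.jpg", "img_1.jpg"]]   # placeholder corpus
question = "What color are the flowers of the corpse lily?"

with torch.no_grad():
    img_emb = model.get_image_features(
        **processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=[question], return_tensors="pt", padding=True))

img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
best = (txt_emb @ img_emb.T).argmax().item()
# images[best] is then passed to the MLLM alongside the question.
```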
[51] Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
Main category: cs.CL
TL;DR: The paper investigates why some language models (e.g., Qwen) self-improve effectively under RL, while others (e.g., Llama) plateau. It identifies four key reasoning behaviors (verification, backtracking, subgoal setting, backward chaining) as critical for improvement. Priming Llama with these behaviors enables it to match Qwen’s performance, even with incorrect solutions. Continued pretraining with reasoning-focused data further bridges the gap.
Details
Motivation: To understand why certain language models (like Qwen) outperform others (like Llama) in self-improvement under RL, despite identical training conditions.
Method: Analyzes four cognitive behaviors (verification, backtracking, subgoal setting, backward chaining) in models. Tests priming Llama with these behaviors and evaluates performance. Uses controlled behavioral datasets and continued pretraining with OpenWebMath data.
Result: Priming Llama with reasoning behaviors improves its RL performance to match or exceed Qwen’s. Correctness of answers is less critical than reasoning patterns. Continued pretraining with reasoning-focused data further enhances Llama’s self-improvement.
Conclusion: Initial reasoning behaviors are crucial for a model’s capacity to self-improve under RL. Models lacking these behaviors can be primed or pretrained to achieve comparable performance, highlighting the importance of reasoning patterns over answer correctness.
Abstract: Test-time inference has emerged as a powerful paradigm for enabling language models to “think” longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors – verification, backtracking, subgoal setting, and backward chaining – that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen’s performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor – models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen’s self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
[52] Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models
Wenqi Pei, Hailing Xu, Hengyuan Zhao, Shizheng Hou, Han Chen, Zining Zhang, Pingyi Luo, Bingsheng He
Main category: cs.CL
TL;DR: Feather-SQL is a lightweight framework for small language models (SLMs) to improve NL2SQL performance, addressing challenges like poor accuracy and compatibility. It uses schema pruning, multi-path generation, and a 1+1 Model Collaboration Paradigm.
Details
Motivation: Closed-source LLMs and high resource demands limit NL2SQL adoption, while SLMs perform poorly. Feather-SQL aims to bridge this gap for SLMs.
Method: Feather-SQL employs schema pruning, linking, multi-path generation, and pairs a general chat model with a fine-tuned SQL specialist (1+1 Model Collaboration Paradigm).
Result: Feather-SQL boosts SLM performance by ~10% without fine-tuning and raises accuracy to 54.76% on BIRD benchmark.
Conclusion: Feather-SQL effectively enhances SLM capabilities for NL2SQL, offering a practical alternative to resource-heavy LLMs.
Abstract: Natural Language to SQL (NL2SQL) has seen significant advancements with large language models (LLMs). However, these models often depend on closed-source systems and high computational resources, posing challenges in data privacy and deployment. In contrast, small language models (SLMs) struggle with NL2SQL tasks, exhibiting poor performance and incompatibility with existing frameworks. To address these issues, we introduce Feather-SQL, a new lightweight framework tailored for SLMs. Feather-SQL improves SQL executability and accuracy through 1) schema pruning and linking, 2) multi-path and multi-candidate generation. Additionally, we introduce the 1+1 Model Collaboration Paradigm, which pairs a strong general-purpose chat model with a fine-tuned SQL specialist, combining strong analytical reasoning with high-precision SQL generation. Experimental results on BIRD demonstrate that Feather-SQL improves NL2SQL performance on SLMs, with around 10% boost for models without fine-tuning. The proposed paradigm raises the accuracy ceiling of SLMs to 54.76%, highlighting its effectiveness.
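As a rough sketch of the schema-pruning idea (our own lexical-overlap simplification with invented table names, not the paper's implementation), the point is to hand a small model only the schema fragments relevant to the question:

```python
# Hypothetical schema pruning: keep tables/columns whose names share
# tokens with the question, so the SLM sees a short, relevant schema.
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def prune_schema(question: str, schema: dict[str, list[str]]) -> dict[str, list[str]]:
    q = tokens(question)
    pruned = {}
    for table, columns in schema.items():
        kept = [c for c in columns if tokens(c) & q]
        if kept or tokens(table) & q:          # the table name itself may match
            pruned[table] = kept or columns
    return pruned

schema = {
    "customers": ["customer_id", "name", "city"],
    "orders": ["order_id", "customer_id", "order_date", "total"],
    "suppliers": ["supplier_id", "region"],
}
print(prune_schema("What is the total of orders per city?", schema))
# -> {'customers': ['city'], 'orders': ['total']}
```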
[53] Inaccuracy of an E-Dictionary and Its Influence on Chinese Language Users
Shiyang Zhang, Fanfei Meng, Xi Wang, Lan Li
Main category: cs.CL
TL;DR: The study examines the reliability of Youdao, a popular E-dictionary in China, revealing issues with inaccurate definitions and user consultation habits, and calls for better dictionary literacy and AI improvements.
Details
Motivation: To address the lack of scrutiny on E-dictionary accuracy and the scarcity of research on their limitations, particularly for L2 learners.
Method: Combined experimentation (translation task with retrospective reflection), user survey, and dictionary critique.
Result: Incomplete or misleading definitions in Youdao caused serious misunderstandings, and users exhibited problematic consultation habits. Issues in data processing and AI integration were identified.
Conclusion: The study highlights the need for improved dictionary literacy training for users and enhancements in AI models for E-dictionary construction.
Abstract: Electronic dictionaries have largely replaced paper dictionaries and become central tools for L2 learners seeking to expand their vocabulary. Users often assume these resources are reliable and rarely question the validity of the definitions provided. The accuracy of major E-dictionaries is seldom scrutinized, and little attention has been paid to how their corpora are constructed. Research on dictionary use, particularly the limitations of electronic dictionaries, remains scarce. This study adopts a combined method of experimentation, user survey, and dictionary critique to examine Youdao, one of the most widely used E-dictionaries in China. The experiment involved a translation task paired with retrospective reflection. Participants were asked to translate sentences containing words that are insufficiently or inaccurately defined in Youdao. Their consultation behavior was recorded to analyze how faulty definitions influenced comprehension. Results show that incomplete or misleading definitions can cause serious misunderstandings. Additionally, students exhibited problematic consultation habits. The study further explores how such flawed definitions originate, highlighting issues in data processing and the integration of AI and machine learning technologies in dictionary construction. The findings suggest a need for better training in dictionary literacy for users, as well as improvements in the underlying AI models used to build E-dictionaries.
[54] Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
Andrei-Alexandru Manea, Jindřich Libovický
Main category: cs.CL
TL;DR: The paper explores transferring pre-trained Vision-Language (VL) models to multilingual tasks using parallel data, highlighting the impact of data domain and language count.
Details
Motivation: Most VL models and training data are English-only, limiting multilingual applications. Cross-lingual transfer methods are studied as an alternative.
Method: Transferring a trained encoder using parallel data, analyzing the effects of data domain and language diversity.
Result: Machine-translated task data performs best on average, but authentic parallel data excels in some languages. Multilingual training benefits most languages.
Conclusion: Parallel data quality and multilingual training significantly impact VL model performance across languages.
Abstract: Most pre-trained Vision-Language (VL) models and training data for the downstream tasks are only available in English. Therefore, multilingual VL tasks are solved using cross-lingual transfer: fine-tune a multilingual pre-trained model or transfer the text encoder using parallel data. We study the alternative approach: transferring an already trained encoder using parallel data. We investigate the effect of parallel data: domain and the number of languages, factors that were out of focus in previous work. Our results show that although machine-translated task data are the best on average, caption-like authentic parallel data outperformed them in some languages. Further, we show that most languages benefit from multilingual training.
[55] Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models
Panagiotis Koletsis, Christos Panagiotopoulos, Georgios Th. Papadopoulos, Vasilis Efthymiou
Main category: cs.CL
TL;DR: A hybrid approach combining Knowledge Graphs and LLMs for detecting column relationships in unlabeled tabular data, evaluated on SemTab benchmarks.
Details
Motivation: Advance table interpretation tasks by leveraging KGs and LLMs to improve relationship detection in unlabeled data.
Method: Uses domain/range constraints and relation co-appearance analysis to reduce KG search space, integrating LLMs and statistical analysis.
Result: Competitive performance on SemTab datasets, with evaluations on LLM quantization and prompting techniques.
Conclusion: The hybrid approach is effective and publicly available, matching state-of-the-art methods.
Abstract: Over the past few years, table interpretation tasks have made significant progress due to their importance and the introduction of new technologies and benchmarks in the field. This work experiments with a hybrid approach for detecting relationships among columns of unlabeled tabular data, using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The main modules of this approach for reducing the search space are domain and range constraints detection, as well as relation co-appearance analysis. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs at various levels of quantization, as well as with different prompting techniques. The proposed methodology, which is publicly available on GitHub, proved to be competitive with state-of-the-art approaches on these datasets.
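The domain/range filtering module can be sketched generically (relation names, type labels, and the hierarchy below are hypothetical stand-ins for a real KG's schema, not the paper's code):

```python
# Reduce the candidate-relation search space before consulting an LLM:
# discard relations whose (domain, range) types are incompatible with
# the detected column types.
CANDIDATE_RELATIONS = {
    "birthPlace": ("Person", "Place"),
    "capitalOf":  ("City", "Country"),
    "author":     ("Book", "Person"),
    "locatedIn":  ("Place", "Place"),
}

def filter_relations(subject_type: str, object_type: str,
                     supertypes: dict[str, set[str]]) -> list[str]:
    def compatible(col_type: str, constraint: str) -> bool:
        return constraint == col_type or constraint in supertypes.get(col_type, set())
    return [rel for rel, (dom, rng) in CANDIDATE_RELATIONS.items()
            if compatible(subject_type, dom) and compatible(object_type, rng)]

hierarchy = {"City": {"Place"}, "Country": {"Place"}}
print(filter_relations("City", "Country", hierarchy))
# -> ['capitalOf', 'locatedIn']; only these survive for the LLM to rank
```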
[56] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
Keer Lu, Chong Chen, Bin Cui, Huang Leng, Wentao Zhang
Main category: cs.CL
TL;DR: The paper introduces AdaPlan and PilotRL to enhance LLM agents’ long-term planning and execution coordination, outperforming GPT-4o by 3.60%.
Details
Motivation: Existing LLM agent paradigms like ReAct struggle with long-term strategic planning and coordination, limiting effectiveness in complex tasks. Supervised fine-tuning also restricts generalization.
Method: Proposes AdaPlan for adaptive global planning and PilotRL, a reinforcement learning framework, to improve planning and execution coordination in LLM agents.
Result: PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing GPT-4o by 3.60% and GPT-4o-mini by 55.78%.
Conclusion: AdaPlan and PilotRL effectively address limitations in LLM agent paradigms, enhancing long-horizon decision-making and generalization.
Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model’s ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model’s planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing the closed-source GPT-4o by 3.60% and showing a more substantial gain of 55.78% over GPT-4o-mini at a comparable parameter scale.
[57] MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions
Zeyu Huang, Juyuan Wang, Longfeng Chen, Boyi Xiao, Leng Cai, Yawen Zeng, Jin Xu
Main category: cs.CL
TL;DR: MVISU-Bench is a bilingual benchmark with 404 tasks across 137 apps, addressing real-world user needs. Aider, a plug-and-play module, improves task success rates by 19.55%, especially for unethical and interactive tasks.
Details
Motivation: Existing benchmarks for mobile agents lack real-world relevance and fail to address diverse user requirements.
Method: Developed MVISU-Bench with 404 tasks across 137 apps and introduced Aider, a dynamic prompt prompter.
Result: Aider improved success rates by 19.55%, with notable gains in unethical (53.52%) and interactive (29.41%) tasks.
Conclusion: The study reveals a gap between current mobile agents and real-world user expectations, showcasing Aider’s effectiveness.
Abstract: Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users’ automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five task types: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Our Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.
[58] Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation
Ziyang Ma, Qingyue Yuan, Linhai Zhang, Deyu Zhou
Main category: cs.CL
TL;DR: The paper introduces SLowED, a safe distillation method for Small Language Models (SLMs) to maintain safety while enhancing reasoning, addressing negative safety effects from CoT distillation.
Details
Motivation: Existing CoT distillation methods improve SLM reasoning but compromise safety. Current safety alignment methods require extra resources and may reduce reasoning ability.
Method: Proposes SLowED with two modules: Slow Tuning (limits weight changes) and Low-Entropy Masking (excludes low-entropy tokens from fine-tuning).
Result: SLowED maintains SLM safety and improves reasoning on benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench).
Conclusion: SLowED effectively balances safety and reasoning in SLMs, with Slow Tuning and Low-Entropy Masking playing complementary roles.
Abstract: Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects on SLM safety brought by the training, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and may impact the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of model weight changes to optimize the model weights in the neighboring space near the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs while improving their reasoning capability comparably to existing distillation methods. Furthermore, our ablation study demonstrates the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model’s safety in the early stage and the latter prolonging the safe training epochs.
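The Low-Entropy Masking module lends itself to a short PyTorch sketch (our reading of the abstract: we assume the entropy is computed from the student's own token distributions, and the threshold is an arbitrary placeholder; Slow Tuning would additionally keep each weight update small, e.g., via a reduced learning rate):

```python
# Sketch: exclude low-entropy (already-confident) tokens from the
# distillation fine-tuning loss, as Low-Entropy Masking prescribes.
import torch
import torch.nn.functional as F

def masked_distillation_loss(student_logits, targets, entropy_threshold=1.0):
    # student_logits: (batch, seq, vocab); targets: (batch, seq) token ids
    log_probs = F.log_softmax(student_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq)
    keep = entropy >= entropy_threshold                    # mask low-entropy tokens
    token_nll = F.nll_loss(log_probs.flatten(0, 1), targets.flatten(),
                           reduction="none").view_as(entropy)
    return (token_nll * keep).sum() / keep.sum().clamp(min=1)

logits = torch.randn(2, 8, 100)                 # toy rationale batch
targets = torch.randint(0, 100, (2, 8))
print(masked_distillation_loss(logits, targets))
```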
[59] Bridging AI Innovation and Healthcare Needs: Lessons Learned from Incorporating Modern NLP at The BC Cancer Registry
Lovedeep Gondara, Gregory Arbour, Raymond Ng, Jonathan Simkin, Shebnum Devji
Main category: cs.CL
TL;DR: The paper discusses lessons from deploying NLP in healthcare, emphasizing problem definition, iterative development, interdisciplinary collaboration, pragmatic model selection, data quality, error mitigation, and AI literacy.
Details
Motivation: To improve healthcare efficiency by automating data extraction from clinical documents using NLP, addressing practical challenges in deployment.
Method: Implemented NLP models for information extraction and classification at BCCR, focusing on clear business objectives, iterative development, and interdisciplinary collaboration.
Result: Key insights include the importance of problem definition, pragmatic model selection, data quality, error mitigation, and AI literacy for successful NLP deployment.
Conclusion: The lessons learned are generalizable and provide guidance for healthcare organizations implementing AI/NLP to enhance data management and patient care.
Abstract: Automating data extraction from clinical documents offers significant potential to improve efficiency in healthcare settings, yet deploying Natural Language Processing (NLP) solutions presents practical challenges. Drawing upon our experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), this paper shares key lessons learned throughout the project lifecycle. We emphasize the critical importance of defining problems based on clear business objectives rather than solely technical accuracy, adopting an iterative approach to development, and fostering deep interdisciplinary collaboration and co-design involving domain experts, end-users, and ML specialists from inception. Further insights highlight the need for pragmatic model selection (including hybrid approaches and simpler methods where appropriate), rigorous attention to data quality (representativeness, drift, annotation), robust error mitigation strategies involving human-in-the-loop validation and ongoing audits, and building organizational AI literacy. These practical considerations, generalizable beyond cancer registries, provide guidance for healthcare organizations seeking to successfully implement AI/NLP solutions to enhance data management processes and ultimately improve patient care and public health outcomes.
[60] Training-Free Multimodal Large Language Model Orchestration
Tianyu Xie, Yuhang Wu, Yongdong Luo, Jiayi Ji, Xiawu Zheng
Main category: cs.CL
TL;DR: MLLM Orchestration enables multimodal AI systems without training, using LLMs to coordinate specialized models for efficiency and interpretability.
Details
Motivation: Existing MLLMs lack direct integration into unified systems, requiring training for alignment and efficiency. This work addresses these challenges without additional training.
Method: Uses a central LLM controller, parallel Text-to-Speech architecture, and cross-modal memory integration to coordinate specialized models dynamically.
Result: Achieves 7.8% performance improvement, 10.3% latency reduction, and enhanced interpretability over traditional methods.
Conclusion: MLLM Orchestration offers a modular, efficient, and interpretable solution for multimodal AI systems without training.
Abstract: Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.
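The controller-plus-specialists pattern at the heart of the framework can be sketched as follows (a toy: a keyword rule stands in for the controller LLM, and stub functions for the specialized models):

```python
# Toy orchestration: a central controller routes each request to a
# specialist and skips unnecessary modality calls (e.g., no TTS when a
# plain text reply suffices).
from typing import Callable

def asr_model(payload: str) -> str:    return f"[transcript of {payload}]"
def vision_model(payload: str) -> str: return f"[caption of {payload}]"
def chat_model(payload: str) -> str:   return f"[reply to {payload}]"

ROUTES: dict[str, Callable[[str], str]] = {
    "audio": asr_model,
    "image": vision_model,
    "text": chat_model,
}

def controller(modality: str, payload: str) -> str:
    return ROUTES.get(modality, chat_model)(payload)

print(controller("image", "photo.jpg"))  # [caption of photo.jpg]
print(controller("text", "hello"))       # [reply to hello]
```

In the paper, the routing decision, interruption handling, and cross-modal memory lookups are made by the controller LLM rather than a fixed rule.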
[61] When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing
Mahdi Dhaini, Stephen Meisenbacher, Ege Erdogan, Florian Matthes, Gjergji Kasneci
Main category: cs.CL
TL;DR: The paper explores the trade-off between privacy and explainability in NLP, using Differential Privacy and Post-hoc Explainability methods, and provides practical recommendations for future research.
Details
Motivation: There is a gap in understanding whether NLP systems can achieve both explainability and privacy, as these fields have not been sufficiently studied together.
Method: The study empirically investigates the privacy-explainability trade-off using Differential Privacy and Post-hoc Explainability methods.
Result: The relationship between privacy and explainability is complex, influenced by task nature and method choices, but coexistence is possible.
Conclusion: The paper highlights the potential for privacy and explainability to coexist and offers practical recommendations for future research at this intersection.
Abstract: In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including that of explainability and privacy. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving both explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of Differential Privacy (DP) and Post-hoc Explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and choice of the text privatization and explainability method. In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.
cs.CV
[62] Privacy Enhancement for Gaze Data Using a Noise-Infused Autoencoder
Samantha Aziz, Oleg Komogortsev
Main category: cs.CV
TL;DR: A privacy-enhancing mechanism for gaze signals using a latent-noise autoencoder reduces re-identification risks while maintaining data usability.
Details
Motivation: To protect users' sensitive gaze data from re-identification without consent, ensuring privacy without compromising utility for benign tasks.
Method: Uses a latent-noise autoencoder to anonymize gaze signals, balancing privacy and utility by retaining physiologically plausible gaze patterns.
Result: Significantly reduces biometric identifiability with minimal utility degradation, outperforming prior methods.
Conclusion: The framework effectively advances privacy in gaze-based systems by offering a usable and protective solution for sensitive data.
Abstract: We present a privacy-enhancing mechanism for gaze signals using a latent-noise autoencoder that prevents users from being re-identified across play sessions without their consent, while retaining the usability of the data for benign tasks. We evaluate privacy-utility trade-offs across biometric identification and gaze prediction tasks, showing that our approach significantly reduces biometric identifiability with minimal utility degradation. Unlike prior methods in this direction, our framework retains physiologically plausible gaze patterns suitable for downstream use, which produces favorable privacy-utility trade-off. This work advances privacy in gaze-based systems by providing a usable and effective mechanism for protecting sensitive gaze data.
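A latent-noise autoencoder of the kind described can be sketched in PyTorch (layer sizes, noise scale, and the plain MSE objective are our illustrative assumptions, not the paper's architecture):

```python
# Sketch: an autoencoder whose latent code is perturbed with Gaussian
# noise before decoding, degrading identity cues while preserving the
# overall gaze signal.
import torch
import torch.nn as nn

class LatentNoiseAutoencoder(nn.Module):
    def __init__(self, in_dim=2, latent_dim=8, noise_std=0.5):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        z = z + self.noise_std * torch.randn_like(z)   # noise in latent space
        return self.decoder(z)

model = LatentNoiseAutoencoder()
gaze = torch.randn(160, 2)                   # a stream of (x, y) gaze samples
recon = model(gaze)                          # privacy-enhanced reconstruction
loss = nn.functional.mse_loss(recon, gaze)   # utility term only; a real
# system would also penalize re-identifiability to tune the trade-off
```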
[63] Topological Structure Description for Artcode Detection Using the Shape of Orientation Histogram
Liming Xu, Dave Towey, Andrew P. French, Steve Benford
Main category: cs.CV
TL;DR: The paper proposes a feature descriptor for detecting Artcodes, decorative markers blending virtual and physical worlds, and demonstrates its effectiveness through experiments.
Details
Motivation: With the rise of smartphones and VR/AR, there's a need to detect objects like Artcodes that connect virtual elements to the physical world for interaction.
Method: The authors introduce a new feature descriptor, the shape of orientation histogram, to describe Artcodes’ topological structure and evaluate its performance via datasets and experiments.
Result: Experiments confirm the descriptor’s feasibility for representing topological structures and the system’s effectiveness in detecting Artcodes.
Conclusion: This work pioneers feature-based detection of topological objects like Artcodes, enabling new interaction possibilities and applications.
Abstract: With the increasing ubiquity of smartphones and the resurgence of VR/AR techniques, our everyday environments may soon be decorated with objects connected to virtual elements. Alerting users to the presence of these objects is therefore the first step toward motivating closer inspection and triggering the digital material attached to them. This work studies a special kind of such objects – Artcodes – human-meaningful, machine-readable decorative markers that camouflage themselves with freeform appearances by encoding information into their topology. We formulate the problem of recognising the presence of Artcodes as Artcode proposal detection, a distinct computer vision task that classifies topologically similar but geometrically and semantically different objects as the same class. To deal with this problem, we propose a new feature descriptor, called the shape of orientation histogram, to describe the generic topological structure of an Artcode. We collect datasets and conduct comprehensive experiments to evaluate the performance of the Artcode proposal detector built upon this new feature vector. Our experimental results show the feasibility of the proposed feature vector for representing topological structures and the effectiveness of the system for detecting Artcode proposals. Although this work is an initial attempt at a feature-based system for detecting topological objects like Artcodes, it opens up new interaction opportunities and potential applications of topological object detection.
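The raw ingredient of the proposed descriptor, an orientation histogram, can be computed generically as below (a standard magnitude-weighted construction; the paper's descriptor characterizes the shape of such a histogram, which this sketch does not attempt):

```python
# Sketch: bin gradient orientations of an image into a normalized,
# magnitude-weighted histogram.
import numpy as np

def orientation_histogram(image: np.ndarray, bins: int = 36) -> np.ndarray:
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)             # radians in [-pi, pi]
    hist, _ = np.histogram(orientation, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitude)    # strong edges vote more
    return hist / (hist.sum() + 1e-8)            # normalize for comparison

img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0                          # a bright square
print(orientation_histogram(img).round(3))       # peaks at the four edge normals
```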
[64] A Survey on Video Temporal Grounding with Multimodal Large Language Model
Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, Chang Wen Chen
Main category: cs.CV
TL;DR: This survey reviews video temporal grounding (VTG) using multimodal large language models (MLLMs), highlighting their superiority over traditional methods, and provides a taxonomy covering MLLM roles, training paradigms, and video feature processing.
Details
Motivation: To address the lack of comprehensive reviews on VTG-MLLMs despite their growing importance in video understanding.
Method: Systematic examination through a three-dimensional taxonomy: MLLM functional roles, training paradigms, and video feature processing techniques.
Result: Identifies competitive performance and generalization strengths of VTG-MLLMs, along with benchmark datasets and evaluation protocols.
Conclusion: Highlights limitations and proposes future research directions, encouraging further exploration via a provided repository.
Abstract: The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.
[65] Empowering Multimodal LLMs with External Tools: A Comprehensive Survey
Wenbin An, Jiahao Nie, Yaqiang Wu, Feng Tian, Shijian Lu, Qinghua Zheng
Main category: cs.CV
TL;DR: The paper surveys how external tools can enhance Multimodal Large Language Models (MLLMs) by improving data quality, task performance, evaluation, and addressing limitations.
Details
Motivation: Current MLLMs face challenges like poor data quality, limited task performance, and inadequate evaluation, hindering their reliability and broader use.
Method: The paper reviews four key ways external tools (e.g., APIs, expert models) can augment MLLMs: data acquisition, task performance, evaluation, and addressing limitations.
Result: The survey highlights the transformative potential of external tools in advancing MLLMs.
Conclusion: External tools offer a promising pathway to enhance MLLMs, with future directions focusing on their development and broader applications.
Abstract: By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available at https://github.com/Lackel/Awesome-Tools-for-MLLMs.
[66] VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip
Wenqi Guo, Shan Du
Main category: cs.CV
TL;DR: VSF is a method for improving negative prompt guidance in image generation by flipping attention values, outperforming existing techniques with minimal computational cost.
Details
Motivation: Existing methods like CFG, NASA, and NAG are inefficient or ineffective for negative prompt guidance in few-step models.
Method: VSF dynamically flips the sign of attention values from negative prompts to suppress undesired content, working with architectures like Stable Diffusion 3.5 Turbo and Wan.
Result: VSF shows superior performance in static image and video generation, improving negative prompt adherence over prior methods while maintaining image quality.
Conclusion: VSF is a simple, efficient, and effective solution for negative prompt guidance in few-step diffusion and flow-matching models.
Abstract: We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only a small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available at https://github.com/weathon/VSF/tree/main.
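The sign-flip mechanism itself fits in a few lines (a toy single-head cross-attention; the paper integrates this inside MMDiT-style and cross-attention layers, with details we do not reproduce here):

```python
# Sketch: negative-prompt tokens join the attention as usual, but their
# value vectors enter with flipped sign, so attention mass on undesired
# content subtracts it from the output.
import torch
import torch.nn.functional as F

def vsf_cross_attention(q, k_pos, v_pos, k_neg, v_neg):
    # q: (B, Lq, D); positive/negative prompt K,V: (B, Lp, D) / (B, Ln, D)
    k = torch.cat([k_pos, k_neg], dim=1)
    v = torch.cat([v_pos, -v_neg], dim=1)        # the value sign flip
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

B, D = 1, 16
out = vsf_cross_attention(torch.randn(B, 4, D),
                          torch.randn(B, 6, D), torch.randn(B, 6, D),
                          torch.randn(B, 3, D), torch.randn(B, 3, D))
print(out.shape)  # torch.Size([1, 4, 16])
```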
[67] Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset
Wentao Mo, Qingchao Chen, Yuxin Peng, Siyuan Huang, Yang Liu
Main category: cs.CV
TL;DR: The paper introduces MV-ScanQA and TripAlign datasets to address limitations in 3D vision-language learning, focusing on multi-view reasoning and richer contextual alignments. It also presents LEGO, a baseline method achieving state-of-the-art results.
Details
Motivation: Existing 3D VL datasets lack multi-view reasoning and rich contextual alignments, limiting deep 3D scene understanding.
Method: Introduces MV-ScanQA for multi-view QA and TripAlign for pre-training. Develops LEGO, a method transferring 2D LVLM knowledge to 3D.
Result: LEGO pre-trained on TripAlign achieves state-of-the-art performance on MV-ScanQA and existing benchmarks.
Conclusion: The proposed datasets and method advance 3D VL learning by enabling multi-view reasoning and richer alignments.
Abstract: The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly aligns groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.
[68] Relative Pose Regression with Pose Auto-Encoders: Enhancing Accuracy and Data Efficiency for Retail Applications
Yoli Shavit, Yosi Keller
Main category: cs.CV
TL;DR: The paper extends Camera Pose Auto-Encoders (PAEs) to Relative Pose Regression (RPR) for improved camera localization in retail, reducing data needs while maintaining accuracy.
Details
Motivation: Enhancing camera localization accuracy in retail for better customer experiences and operations, addressing limitations of Absolute Pose Regression (APR).
Method: Proposes PAE-based RPR and a re-localization scheme to refine APR predictions without extra data storage.
Result: PAE-based RPR outperforms image-based RPR, and the refinement improves APR accuracy, even with 30% training data.
Conclusion: The method offers a data-efficient solution for retail camera localization, achieving competitive performance with reduced data collection.
Abstract: Accurate camera localization is crucial for modern retail environments, enabling enhanced customer experiences, streamlined inventory management, and autonomous operations. While Absolute Pose Regression (APR) from a single image offers a promising solution, approaches that incorporate visual and spatial scene priors tend to achieve higher accuracy. Camera Pose Auto-Encoders (PAEs) have recently been introduced to embed such priors into APR. In this work, we extend PAEs to the task of Relative Pose Regression (RPR) and propose a novel re-localization scheme that refines APR predictions using PAE-based RPR, without requiring additional storage of images or pose data. We first introduce PAE-based RPR and establish its effectiveness by comparing it with image-based RPR models of equivalent architectures. We then demonstrate that our refinement strategy, driven by a PAE-based RPR, enhances APR localization accuracy on indoor benchmarks. Notably, our method is shown to achieve competitive performance even when trained with only 30% of the data, substantially reducing the data collection burden for retail deployment. Our code and pre-trained models are available at: https://github.com/yolish/camera-pose-auto-encoders
[69] ViPE: Video Pose Engine for 3D Geometric Perception
Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, Sanja Fidler
Main category: cs.CV
TL;DR: ViPE is a video processing engine for accurate 3D perception, excelling in unconstrained video scenarios and outperforming baselines in pose estimation.
Details
Motivation: Addressing the challenge of acquiring precise 3D annotations from in-the-wild videos for spatial AI systems.
Method: ViPE estimates camera intrinsics, motion, and dense depth maps from raw videos, supporting diverse scenarios and camera models.
Result: Outperforms baselines by 18%/50% on TUM/KITTI, runs at 3-5FPS on a GPU, and annotates a large-scale dataset (96M frames).
Conclusion: ViPE and its dataset are open-sourced to advance spatial AI development.
Abstract: Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have evaluated ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames – all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
[70] Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
Yuchen Zhou, Jiayu Tang, Shuo Yang, Xiaoyan Xiao, Yuqin Dai, Wenhao Yang, Chao Gou, Xiaobo Xia, Tat-Seng Chua
Main category: cs.CV
TL;DR: LogicBench is introduced to diagnose logical blindspots in VLMs like CLIP, revealing their underperformance in logical tasks. LogicCLIP, a new training framework, improves logical comprehension without sacrificing general alignment.
Details
Motivation: Existing VLMs lack logical understanding, limiting their reliability. LogicBench and LogicCLIP aim to address this gap.
Method: LogicBench evaluates VLMs on 50,000+ vision-language pairs across 9 logical categories. LogicCLIP uses logic-aware data generation and contrastive learning with novel objectives.
Result: VLMs perform 40+ points below humans in logical tasks. LogicCLIP significantly improves logical comprehension and maintains general alignment.
Conclusion: LogicBench and LogicCLIP advance VLM logical capabilities, offering valuable resources for future research.
Abstract: Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical “logical blindspots” that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even the state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly in challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs’ logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP’s substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe that LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.
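One ingredient, contrastive alignment against logically corrupted captions, might look like the following simplification (random tensors stand in for encoder outputs; LogicCLIP's full objective also includes a multiple-choice and a structure-aware term):

```python
# Sketch: each image is pulled toward its true caption and pushed away
# from logically flipped captions (e.g., cause and effect swapped).
import torch
import torch.nn.functional as F

def logic_contrastive_loss(img_emb, pos_txt_emb, neg_txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)       # (B, D) image embeddings
    pos = F.normalize(pos_txt_emb, dim=-1)   # (B, D) true captions
    neg = F.normalize(neg_txt_emb, dim=-1)   # (B, D) corrupted captions
    logits_pos = (img * pos).sum(-1, keepdim=True) / temperature  # (B, 1)
    logits_neg = img @ neg.T / temperature                        # (B, B)
    logits = torch.cat([logits_pos, logits_neg], dim=1)
    labels = torch.zeros(img.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

B, D = 8, 64
print(logic_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)))
```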
[71] HQ-OV3D: A High Box Quality Open-World 3D Detection Framework based on Diffusion Model
Qi Liu, Yabei Li, Hongsong Wang, Lei He
Main category: cs.CV
TL;DR: HQ-OV3D improves open-vocabulary 3D detection by generating high-quality pseudo-labels with cross-modality consistency and denoising.
Details
Motivation: Traditional closed-set 3D detection fails in open-world applications, and existing methods neglect geometric quality of pseudo-labels.
Method: HQ-OV3D uses an Intra-Modality Cross-Validated Proposal Generator and an Annotated-Class Assisted Denoiser for high-quality pseudo-label generation and refinement.
Result: Achieves a 7.37% mAP improvement on novel classes compared to state-of-the-art methods.
Conclusion: HQ-OV3D is effective as a standalone detector or a plug-in pseudo-label generator for open-vocabulary tasks.
Abstract: Traditional closed-set 3D detection frameworks fail to meet the demands of open-world applications like autonomous driving. Existing open-vocabulary 3D detection methods typically adopt a two-stage pipeline consisting of pseudo-label generation followed by semantic alignment. While vision-language models (VLMs) have recently improved the semantic accuracy of pseudo-labels dramatically, their geometric quality, particularly bounding box precision, remains commonly neglected. To address this issue, we propose a High Box Quality Open-Vocabulary 3D Detection (HQ-OV3D) framework, dedicated to generating and refining high-quality pseudo-labels for open-vocabulary classes. The framework comprises two key components: an Intra-Modality Cross-Validated (IMCV) Proposal Generator that utilizes cross-modality geometric consistency to generate high-quality initial 3D proposals, and an Annotated-Class Assisted (ACA) Denoiser that progressively refines 3D proposals by leveraging geometric priors from annotated categories through a DDIM-based denoising mechanism. Compared to the state-of-the-art method, training with pseudo-labels generated by our approach achieves a 7.37% improvement in mAP on novel classes, demonstrating the superior quality of the pseudo-labels produced by our framework. HQ-OV3D can serve not only as a strong standalone open-vocabulary 3D detector but also as a plug-in high-quality pseudo-label generator for existing open-vocabulary detection or annotation pipelines.
[72] Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction
Cheng Chen, Hao Huang, Saurabh Bagchi
Main category: cs.CV
TL;DR: A novel method using sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction improves accuracy and reduces communication costs compared to existing approaches.
Details
Motivation: Overcome limitations of single-agent systems and high communication costs or depth reliance in existing collaborative methods.
Method: Leverages sparse 3D semantic Gaussian splatting, sharing and fusing intermediate Gaussian primitives for efficient cross-agent fusion and reduced communication.
Result: Outperforms single-agent and baseline collaborative methods by significant margins in mIoU and IoU, even with reduced communication volume.
Conclusion: The proposed method is effective for collaborative perception, offering robust performance under limited communication budgets.
Abstract: Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU, using only 34.6% communication volume, highlighting robust performance under limited communication budgets.
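The neighborhood-based deduplication step might be sketched as follows (positions only; real primitives also carry covariance, opacity, and semantics, and the radius is our placeholder):

```python
# Sketch: drop neighbor Gaussians that fall within a distance threshold
# of an ego Gaussian (likely duplicates), then merge the rest.
import torch

def fuse_gaussians(ego_means, neighbor_means, radius=0.5):
    # ego_means: (N, 3); neighbor_means: (M, 3), already in a shared frame
    dists = torch.cdist(neighbor_means, ego_means)       # (M, N)
    duplicate = dists.min(dim=1).values < radius         # (M,)
    return torch.cat([ego_means, neighbor_means[~duplicate]], dim=0)

ego = torch.tensor([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
other = torch.tensor([[0.1, 0.0, 0.0], [9.0, 1.0, 0.0]])
print(fuse_gaussians(ego, other))   # keeps both ego points and (9, 1, 0)
```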
[73] Personalized Face Super-Resolution with Identity Decoupling and Fitting
Jiarui Yang, Hang Guo, Wen Huang, Tao Dai, Shutao Xia
Main category: cs.CV
TL;DR: The paper proposes IDFSR, a novel face super-resolution method for extreme degradation scenarios, improving ID consistency and reducing hallucination effects.
Details
Motivation: Existing FSR methods struggle with ID consistency and realism in extreme degradation (e.g., scale > 8×), often producing hallucinated faces.
Method: IDFSR uses masking, warping, and ID embeddings to decouple style and ID, pretrains a diffusion model, and fine-tunes ID embeddings for personalized adaptation.
Result: IDFSR outperforms existing methods in extreme degradation, achieving superior ID consistency and perceptual quality.
Conclusion: IDFSR effectively addresses extreme FSR challenges, enhancing ID restoration and reducing hallucination.
Abstract: In recent years, face super-resolution (FSR) methods have achieved remarkable progress, generally maintaining high image fidelity and identity (ID) consistency under standard settings. However, in extreme degradation scenarios (e.g., scale > 8×), critical attributes and ID information are often severely lost in the input image, making it difficult for conventional models to reconstruct realistic and ID-consistent faces. Existing methods tend to generate hallucinated faces under such conditions, producing restored images lacking authentic ID constraints. To address this challenge, we propose a novel FSR method with Identity Decoupling and Fitting (IDFSR), designed to enhance ID restoration under large scaling factors while mitigating hallucination effects. Our approach involves three key designs: 1) Masking the facial region in the low-resolution (LR) image to eliminate unreliable ID cues; 2) Warping a reference image to align with the LR input, providing style guidance; 3) Leveraging ID embeddings extracted from ground truth (GT) images for fine-grained ID modeling and personalized adaptation. We first pretrain a diffusion-based model to explicitly decouple style and ID by forcing it to reconstruct masked LR face regions using both style and identity embeddings. Subsequently, we freeze most network parameters and perform lightweight fine-tuning of the ID embedding using a small set of target ID images. This embedding encodes fine-grained facial attributes and precise ID information, significantly improving both ID consistency and perceptual quality. Extensive quantitative evaluations and visual comparisons demonstrate that the proposed IDFSR substantially outperforms existing approaches under extreme degradation, particularly achieving superior performance on ID consistency.
[74] Casual3DHDR: High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos
Shucheng Gong, Lingzhe Zhao, Wenpu Li, Hong Xie, Yin Zhang, Shiyu Zhao, Peidong Liu
Main category: cs.CV
TL;DR: Casual3DHDR is a one-stage method for 3D HDR scene reconstruction from casually-captured auto-exposure videos, overcoming limitations of existing LDR-based methods.
Details
Motivation: Existing methods for novel view synthesis rely on LDR images, limiting performance in high-contrast environments. HDR reconstruction often requires impractical multi-view sharp images with fixed exposures.
Method: Casual3DHDR integrates continuous camera trajectory into a physical imaging model, jointly optimizing exposure times, trajectory, and camera response function.
Result: Outperforms existing methods in robustness and rendering quality on synthetic and real-world datasets.
Conclusion: Casual3DHDR offers a flexible, practical solution for HDR scene reconstruction from casually-captured videos.
Abstract: Photo-realistic novel view synthesis from multi-view images, such as neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS), has gained significant attention for its superior performance. However, most existing methods rely on low dynamic range (LDR) images, limiting their ability to capture detailed scenes in high-contrast environments. While some prior works address high dynamic range (HDR) scene reconstruction, they typically require multi-view sharp images with varying exposure times captured at fixed camera positions, which is time-consuming and impractical. To make data acquisition more flexible, we propose Casual3DHDR, a robust one-stage method that reconstructs 3D HDR scenes from casually-captured auto-exposure (AE) videos, even under severe motion blur and unknown, varying exposure times. Our approach integrates a continuous camera trajectory into a unified physical imaging model, jointly optimizing exposure times, camera trajectory, and the camera response function (CRF). Extensive experiments on synthetic and real-world datasets demonstrate that Casual3DHDR outperforms existing methods in robustness and rendering quality. Our source code and dataset will be available at https://lingzhezhao.github.io/CasualHDRSplat/
[75] Deep Learning for Automated Identification of Vietnamese Timber Species: A Tool for Ecological Monitoring and Conservation
Tianyu Song, Van-Doan Duong, Thi-Phuong Le, Ton Viet Ta
Main category: cs.CV
TL;DR: Deep learning models, especially ShuffleNetV2, achieve high accuracy (99.29%) in automating wood species classification, offering scalable solutions for ecological monitoring.
Details
Motivation: Traditional wood species classification is labor-intensive and requires expertise; deep learning can automate and improve efficiency.
Method: Five CNN architectures (ResNet50, EfficientNet, MobileViT, MobileNetV3, ShuffleNetV2) were tested on a custom dataset of Vietnamese wood species.
Result: ShuffleNetV2 performed best with 99.29% accuracy and 99.35% F1-score, balancing performance and computational efficiency.
Conclusion: Lightweight deep learning models like ShuffleNetV2 enable real-time, high-accuracy species identification in resource-limited settings.
Abstract: Accurate identification of wood species plays a critical role in ecological monitoring, biodiversity conservation, and sustainable forest management. Traditional classification approaches relying on macroscopic and microscopic inspection are labor-intensive and require expert knowledge. In this study, we explore the application of deep learning to automate the classification of ten wood species commonly found in Vietnam. A custom image dataset was constructed from field-collected wood samples, and five state-of-the-art convolutional neural network architectures (ResNet50, EfficientNet, MobileViT, MobileNetV3, and ShuffleNetV2) were evaluated. Among these, ShuffleNetV2 achieved the best balance between classification performance and computational efficiency, with an average accuracy of 99.29% and F1-score of 99.35% over 20 independent runs. These results demonstrate the potential of lightweight deep learning models for real-time, high-accuracy species identification in resource-constrained environments. Our work contributes to the growing field of ecological informatics by providing scalable, image-based solutions for automated wood classification and forest biodiversity assessment.
[76] NIRMAL Pooling: An Adaptive Max Pooling Approach with Non-linear Activation for Enhanced Image Classification
Nirmal Gaud, Krishna Kumar Jha, Jhimli Adhikari, Adhini Nasarin P S, Joydeep Das, Samarth S Deshpande, Nitasha Barara, Vaduguru Venkata Ramya, Santu Saha, Mehmet Tarik Baran, Sarangi Venkateshwarlu, Anusha M D, Surej Mouli, Preeti Katiyar, Vipin Kumar Chaudhary
Main category: cs.CV
TL;DR: NIRMAL Pooling is a new CNN pooling layer combining adaptive max pooling with ReLU activation, outperforming standard Max Pooling on image classification tasks.
Details
Motivation: To enhance CNN performance by improving robustness and feature expressiveness in image classification tasks.
Method: Integrates adaptive max pooling with ReLU activation, dynamically adjusting pooling parameters based on output dimensions.
Result: Achieves higher test accuracies on MNIST Digits (99.25%), MNIST Fashion (91.59%), and CIFAR-10 (70.49%) compared to Max Pooling.
Conclusion: NIRMAL Pooling is a flexible and reliable alternative to traditional pooling methods, enhancing CNN performance in diverse image recognition tasks.
Abstract: This paper presents NIRMAL Pooling, a novel pooling layer for Convolutional Neural Networks (CNNs) that integrates adaptive max pooling with non-linear activation function for image classification tasks. The acronym NIRMAL stands for Non-linear Activation, Intermediate Aggregation, Reduction, Maximum, Adaptive, and Localized. By dynamically adjusting pooling parameters based on desired output dimensions and applying a Rectified Linear Unit (ReLU) activation post-pooling, NIRMAL Pooling improves robustness and feature expressiveness. We evaluated its performance against standard Max Pooling on three benchmark datasets: MNIST Digits, MNIST Fashion, and CIFAR-10. NIRMAL Pooling achieves test accuracies of 99.25% (vs. 99.12% for Max Pooling) on MNIST Digits, 91.59% (vs. 91.44%) on MNIST Fashion, and 70.49% (vs. 68.87%) on CIFAR-10, demonstrating consistent improvements, particularly on complex datasets. This work highlights the potential of NIRMAL Pooling to enhance CNN performance in diverse image recognition tasks, offering a flexible and reliable alternative to traditional pooling methods.
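Since the abstract spells out the mechanism (adaptive max pooling whose parameters follow from the desired output size, with a ReLU applied post-pooling), a minimal PyTorch sketch is easy to give; the module layout and defaults below are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NirmalPooling(nn.Module):
    """Adaptive max pooling followed by a ReLU (a sketch of the stated idea).

    The pooling kernel and stride are chosen implicitly from the desired
    output size; the non-linear activation is applied post-pooling.
    """

    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size  # desired (H, W) of the pooled map

    def forward(self, x):
        x = F.adaptive_max_pool2d(x, self.output_size)
        return F.relu(x)

# Usage: pool a 32x32 feature map down to 16x16.
feats = torch.randn(8, 64, 32, 32)
print(NirmalPooling((16, 16))(feats).shape)  # torch.Size([8, 64, 16, 16])
```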
[77] Analysis of the Compaction Behavior of Textile Reinforcements in Low-Resolution In-Situ CT Scans via Machine-Learning and Descriptor-Based Methods
Christian Düreth, Jan Condé-Wolter, Marek Danczak, Karsten Tittmann, Jörn Jaschinski, Andreas Hornig, Maik Gude
Main category: cs.CV
TL;DR: A framework using low-resolution CT and 3D-UNet quantifies nesting in textile composites, achieving high accuracy in segmentation and structural analysis.
Details
Motivation: Understanding nesting in textile composites is crucial for modeling mechanical properties like stiffness and damage tolerance.
Method: In-situ compaction experiments with CT scans (20.22 µm/voxel) and 3D-UNet segmentation analyzed using two-point correlation function.
Result: Achieved mean IoU of 0.822 and F1 score of 0.902, with strong validation against micrographs.
Conclusion: The method robustly extracts key features for reverse modeling and structural analysis of composites.
Abstract: A detailed understanding of material structure across multiple scales is essential for predictive modeling of textile-reinforced composites. Nesting – characterized by the interlocking of adjacent fabric layers through local interpenetration and misalignment of yarns – plays a critical role in defining mechanical properties such as stiffness, permeability, and damage tolerance. This study presents a framework to quantify nesting behavior in dry textile reinforcements under compaction using low-resolution computed tomography (CT). In-situ compaction experiments were conducted on various stacking configurations, with CT scans acquired at 20.22 $\mu$m per voxel resolution. A tailored 3D-UNet enabled semantic segmentation of matrix, weft, and fill phases across compaction stages corresponding to fiber volume contents of 50–60%. The model achieved a minimum mean Intersection-over-Union of 0.822 and an $F1$ score of 0.902. Spatial structure was subsequently analyzed using the two-point correlation function $S_2$, allowing for probabilistic extraction of average layer thickness and nesting degree. The results show strong agreement with micrograph-based validation. This methodology provides a robust approach for extracting key geometrical features from industrially relevant CT data and establishes a foundation for reverse modeling and descriptor-based structural analysis of composite preforms.
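For reference, a common FFT-based estimator of the two-point correlation function $S_2$ on a binary phase map looks like the following; the paper's exact estimator, normalization, and boundary handling are not specified in the abstract, so treat this as illustrative.

```python
import numpy as np

def two_point_correlation(phase: np.ndarray) -> np.ndarray:
    """FFT-based estimate of S2 for one phase of a binary volume.

    `phase` is an indicator array (1 inside the phase, 0 outside).
    Periodic boundaries are assumed for simplicity.
    """
    f = np.fft.fftn(phase)
    # Autocorrelation via the Wiener-Khinchin theorem, normalized by size.
    s2 = np.fft.ifftn(f * np.conj(f)).real / phase.size
    return np.fft.fftshift(s2)  # zero lag moved to the array center

# Sanity check: S2 at zero lag equals the phase's volume fraction.
vol = (np.random.rand(32, 32, 32) > 0.5).astype(float)
s2 = two_point_correlation(vol)
center = tuple(n // 2 for n in vol.shape)
assert abs(s2[center] - vol.mean()) < 1e-8
```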
[78] IPG: Incremental Patch Generation for Generalized Adversarial Patch Training
Wonho Lee, Hyunsik Na, Jisu Lee, Daeseon Choi
Main category: cs.CV
TL;DR: The paper introduces Incremental Patch Generation (IPG), a method to create adversarial patches more efficiently than existing methods, while maintaining attack performance. It demonstrates effectiveness through experiments and suggests applications in AI security and real-world scenarios.
Details
Motivation: Adversarial patches challenge AI model robustness, especially in computer vision. Current methods are inefficient, prompting the need for a more effective solution like IPG.
Method: IPG generates adversarial patches incrementally, achieving 11.1x efficiency over existing methods. It uses YOLO's feature visualization and adversarial training for validation.
Result: IPG produces well-generalized patches covering broader model vulnerabilities. Its datasets aid in building robust models for AI security.
Conclusion: IPG shows promise for adversarial defense and real-world applications like autonomous vehicles and medical imaging, enhancing AI resilience.
Abstract: The advent of adversarial patches poses a significant challenge to the robustness of AI models, particularly in the domain of computer vision tasks such as object detection. In contradistinction to traditional adversarial examples, these patches target specific regions of an image, resulting in the malfunction of AI models. This paper proposes Incremental Patch Generation (IPG), a method that generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance. The efficacy of IPG is demonstrated by experiments and ablation studies including YOLO’s feature distribution visualization and adversarial training results, which show that it produces well-generalized patches that effectively cover a broader range of model vulnerabilities. Furthermore, IPG-generated datasets can serve as a robust knowledge foundation for constructing a robust model, enabling structured representation, advanced reasoning, and proactive defenses in AI security ecosystems. The findings of this study suggest that IPG has considerable potential for future utilization not only in adversarial patch defense but also in real-world applications such as autonomous vehicles, security systems, and medical imaging, where AI models must remain resilient to adversarial attacks in dynamic and high-stakes environments.
[79] iWatchRoad: Scalable Detection and Geospatial Visualization of Potholes for Smart Cities
Rishi Raj Sahoo, Surbhi Saswati Mohanty, Subhankar Mishra
Main category: cs.CV
TL;DR: iWatchRoad is an automated system for detecting, tagging, and mapping potholes in real-time using dashcam footage, YOLO, OCR, and OSM, tailored for Indian road conditions.
Details
Motivation: Potholes pose safety risks and maintenance challenges, especially in India's diverse road conditions, necessitating an automated, scalable solution.
Method: The system uses a curated dataset of 7,000 frames to fine-tune YOLO for pothole detection, OCR for timestamp extraction, and GPS for geotagging, with results stored and visualized via OSM.
Result: iWatchRoad achieves accurate real-time pothole detection and provides government-compatible outputs for road maintenance planning.
Conclusion: The system is cost-effective, scalable, and practical for road management in developing regions, with a user-friendly web interface.
Abstract: Potholes are a serious road hazard and maintenance burden, posing a significant threat to road safety and vehicle longevity, especially on the diverse and under-maintained roads of India. In this paper, we present a complete end-to-end system called iWatchRoad for automated pothole detection, Global Positioning System (GPS) tagging, and real-time mapping using OpenStreetMap (OSM). We curated a large, self-annotated dataset of over 7,000 frames captured across various road types, lighting conditions, and weather scenarios unique to Indian environments, leveraging dashcam footage. This dataset is used to fine-tune an Ultralytics You Only Look Once (YOLO) model for real-time pothole detection, while a custom Optical Character Recognition (OCR) module extracts timestamps directly from video frames. The timestamps are synchronized with GPS logs to geotag each detected pothole accurately. The processed data, including pothole details and frames as metadata, is stored in a database and visualized via a user-friendly web interface using OSM. iWatchRoad not only improves detection accuracy under challenging conditions but also provides government-compatible outputs for road assessment and maintenance planning through the metadata visible on the website. Our solution is cost-effective, hardware-efficient, scalable, and fully automated, offering a practical tool for urban and rural road management in developing regions. iWatchRoad is available at https://smlab.niser.ac.in/project/iwatchroad
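The OCR-timestamp-to-GPS synchronization step amounts to a nearest-fix lookup in time. A hedged sketch, assuming a time-sorted GPS log and a fixed timestamp format (both illustrative; the system's actual formats and matching rule are not given):

```python
from bisect import bisect_left
from datetime import datetime

def geotag(frame_timestamp: str, gps_log):
    """Attach the nearest-in-time GPS fix to a frame.

    `gps_log` is a time-sorted list of (datetime, lat, lon) fixes;
    `frame_timestamp` is the OCR'd dashcam overlay text. The format
    string and matching rule are illustrative assumptions.
    """
    t = datetime.strptime(frame_timestamp, "%Y-%m-%d %H:%M:%S")
    times = [fix[0] for fix in gps_log]
    i = bisect_left(times, t)
    # Compare the two neighboring fixes and keep the closer one in time.
    neighbors = gps_log[max(i - 1, 0):i + 1]
    _, lat, lon = min(neighbors, key=lambda fix: abs(fix[0] - t))
    return lat, lon

log = [(datetime(2025, 1, 1, 12, 0, s), 20.27 + s * 1e-5, 85.84) for s in range(60)]
print(geotag("2025-01-01 12:00:30", log))
```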
[80] MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text
Ronghao Xu, Zhen Huang, Yangbo Wei, Xiaoqian Zhou, Zikang Xu, Ting Liu, Zihang Jiang, S. Kevin Zhou
Main category: cs.CV
TL;DR: MedAtlas is a benchmark framework for evaluating large language models on realistic medical reasoning tasks, featuring multi-turn dialogue, multi-modal image interaction, and high clinical fidelity.
Details
Motivation: Existing medical benchmarks lack multi-modal integration and longitudinal reasoning, limiting their real-world applicability.
Method: MedAtlas introduces multi-turn dialogue, multi-modal image interaction, and multi-task integration, with tasks like open-ended QA and disease diagnosis.
Result: Benchmark results show significant performance gaps in multi-stage clinical reasoning among existing models.
Conclusion: MedAtlas provides a robust platform to advance trustworthy medical AI by addressing real-world clinical challenges.
Abstract: Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models capable of adapting to diverse real-world scenarios and performing complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lacking multi-modal medical image integration and failing to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-turn question answering, closed-ended multi-turn question answering, multi-image joint reasoning, and comprehensive disease diagnosis. Each case is derived from real diagnostic workflows and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray, requiring models to perform deep integrative reasoning across images and clinical texts. MedAtlas provides expert-annotated gold standards for all tasks. Furthermore, we propose two novel evaluation metrics: Round Chain Accuracy and Error Propagation Resistance. Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. MedAtlas establishes a challenging evaluation platform to advance the development of robust and trustworthy medical AI.
[81] From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement
Xinyi Wang, Michael Barnett, Frederique Boonstra, Yael Barnett, Mariano Cabezas, Arkiev D’Souza, Matthew C. Kiernan, Kain Kyle, Meng Law, Lynette Masters, Zihao Tang, Stephen Tisch, Sicong Tu, Anneke Van Der Walt, Dongang Wang, Fernando Calamante, Weidong Cai, Chenyu Wang
Main category: cs.CV
TL;DR: FastFOD-Net is a deep learning framework enhancing Fiber Orientation Distribution (FOD) from clinical MRI data, validated across healthy and neurological subjects, offering superior performance and efficiency.
Details
Motivation: Addressing the challenge of generating reliable FODs from low-quality clinical MRI data and bridging the gap for clinical adoption of deep learning methods.
Method: An accelerated end-to-end deep learning framework (FastFOD-Net) optimized for enhancing FODs, validated on healthy controls and six neurological disorders.
Result: FastFOD-Net achieves superior performance, 60x faster than its predecessor, and enables robust analysis of clinical MRI data comparable to high-quality research acquisitions.
Conclusion: FastFOD-Net facilitates clinical neuroscience research, improves disease differentiation, and builds trust in deep learning for diffusion MRI enhancement.
Abstract: Fiber orientation distribution (FOD) is an advanced diffusion MRI modeling technique that represents complex white matter fiber configurations, and is a key step for subsequent brain tractography and connectome analysis. Its reliability and accuracy, however, heavily rely on the quality of the MRI acquisition and the subsequent estimation of the FODs at each voxel. Generating reliable FODs from widely available clinical protocols with single-shell and low-angular-resolution acquisitions remains challenging but could potentially be addressed with recent advances in deep learning-based enhancement techniques. Despite advancements, existing methods have predominantly been assessed on healthy subjects, which has proved to be a major hurdle for their clinical adoption. In this work, we validate a newly optimized enhancement framework, FastFOD-Net, across healthy controls and six neurological disorders. This accelerated end-to-end deep learning framework enhances FODs with superior performance and delivers the training/inference efficiency needed for clinical use ($60\times$ faster than its predecessor). With the most comprehensive clinical evaluation to date, our work demonstrates the potential of FastFOD-Net in accelerating clinical neuroscience research, empowering diffusion MRI analysis for disease differentiation, improving interpretability in connectome applications, and reducing measurement errors to lower sample size requirements. Critically, this work will facilitate the more widespread adoption of, and build clinical trust in, deep learning based methods for diffusion MRI enhancement. Specifically, FastFOD-Net enables robust analysis of real-world, clinical diffusion MRI data, comparable to that achievable with high-quality research acquisitions.
[82] ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks
Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski
Main category: cs.CV
TL;DR: The paper introduces ORBIT, a VQA benchmark for evaluating VLMs’ ability to reason about object properties, revealing their limitations compared to humans.
Details
Motivation: Current VQA benchmarks lack representativeness in reasoning and image categories, blending perception and reasoning. The paper aims to address this gap.
Method: A systematic evaluation framework with diverse images, reasoning levels, and object property dimensions is developed into the ORBIT benchmark.
Result: Experiments show VLMs perform poorly (40% accuracy) compared to humans, especially with realistic images and complex reasoning.
Conclusion: ORBIT highlights the need for scalable benchmarking, better annotation guidelines, and improved reasoning VLMs.
Abstract: While vision-language models (VLMs) have made remarkable progress on many popular visual question answering (VQA) benchmarks, it remains unclear whether they abstract and reason over depicted objects. Inspired by human object categorisation, object property reasoning involves identifying and recognising low-level details and higher-level abstractions. While current VQA benchmarks consider a limited set of object property attributes like size, they typically blend perception and reasoning, and lack representativeness in terms of reasoning and image categories. To this end, we introduce a systematic evaluation framework with images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions driven by prior work on commonsense reasoning. We develop a procedure to instantiate this benchmark into ORBIT, a multi-level reasoning VQA benchmark for object properties comprising 360 images paired with a total of 1,080 count-based questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations compared to humans, with the best-performing model only reaching 40% accuracy. VLMs struggle particularly with realistic (photographic) images, counterfactual reasoning about physical and functional properties, and higher counts. ORBIT points to the need to develop methods for scalable benchmarking, generalize annotation guidelines, and explore additional reasoning VLMs. We make the ORBIT benchmark and the experimental code available to support such endeavors.
[83] CSNR and JMIM Based Spectral Band Selection for Reducing Metamerism in Urban Driving
Jiarong Li, Imad Ali Shah, Diarmaid Geever, Fiachra Collins, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan
Main category: cs.CV
TL;DR: The paper proposes using hyperspectral imaging (HSI) to improve VRU detection by overcoming RGB metamerism, identifying key spectral bands for better separability.
Details
Motivation: Addressing the challenge of VRU detection under visual ambiguity caused by metamerism in RGB imagery.
Method: A band selection strategy combining information theory and image quality metrics to identify informative HSI bands.
Result: Selected HSI bands (497 nm, 607 nm, 895 nm) significantly improve VRU separability metrics over RGB.
Conclusion: HSI-based method enhances VRU detection, supporting safer ADAS and AD systems.
Abstract: Protecting Vulnerable Road Users (VRU) is a critical safety challenge for automotive perception systems, particularly under visual ambiguity caused by metamerism, a phenomenon where distinct materials appear similar in RGB imagery. This work investigates hyperspectral imaging (HSI) to overcome this limitation by capturing unique material signatures beyond the visible spectrum, especially in the Near-Infrared (NIR). To manage the inherent high-dimensionality of HSI data, we propose a band selection strategy that integrates information theory techniques (joint mutual information maximization, correlation analysis) with a novel application of an image quality metric (contrast signal-to-noise ratio) to identify the most spectrally informative bands. Using the Hyperspectral City V2 (H-City) dataset, we identify three informative bands (497 nm, 607 nm, and 895 nm, $\pm$27 nm) and reconstruct pseudo-color images for comparison with co-registered RGB. Quantitative results demonstrate increased dissimilarity and perceptual separability of VRU from the background. The selected HSI bands yield improvements of 70.24%, 528.46%, 1206.83%, and 246.62% for dissimilarity (Euclidean, SAM, $T^2$) and perception (CIE $\Delta E$) metrics, consistently outperforming RGB and confirming a marked reduction in metameric confusion. By providing a spectrally optimized input, our method enhances VRU separability, establishing a robust foundation for downstream perception tasks in Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD), ultimately contributing to improved road safety.
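As a rough illustration of the relevance-redundancy trade-off behind such band selection (a simplified stand-in, not the paper's exact JMIM/correlation/CSNR pipeline), a greedy scheme over a flattened (pixels x bands) matrix might look like:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_bands(cube, labels, k=3):
    """Greedy band selection: high label relevance, low inter-band redundancy.

    `cube` is (pixels, bands); `labels` are per-pixel classes. A simplified
    stand-in for the paper's JMIM + correlation + CSNR pipeline.
    """
    mi = mutual_info_classif(cube, labels)           # relevance per band
    corr = np.abs(np.corrcoef(cube, rowvar=False))   # band-band redundancy
    selected = [int(mi.argmax())]
    while len(selected) < k:
        redundancy = corr[:, selected].max(axis=1)   # worst overlap so far
        score = mi - redundancy
        score[selected] = -np.inf                    # never re-pick a band
        selected.append(int(score.argmax()))
    return selected

cube = np.random.rand(2000, 25)             # toy 25-band image, flattened
labels = np.random.randint(0, 2, size=2000)  # e.g., VRU vs. background pixels
print(select_bands(cube, labels))
```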
[84] EVCtrl: Efficient Control Adapter for Visual Generation
Zixiang Yang, Yue Ma, Yinhan Zhang, Shanhui Mo, Dongrui Liu, Linfeng Zhang
Main category: cs.CV
TL;DR: EVCtrl is a lightweight control adapter for visual generation, reducing latency and redundant computation in ControlNet without retraining, using spatio-temporal caching for efficiency.
Details
Motivation: Current methods like ControlNet introduce latency and redundant computation, especially in video generation, prompting the need for a more efficient solution.
Method: EVCtrl employs a spatio-temporal dual caching strategy: spatial redundancy is addressed by partitioning the network into global/local zones, and temporal redundancy is reduced by omitting unnecessary denoising steps.
Result: Achieves 2.16x and 2.05x speedups on CogVideo-Controlnet and Wan2.1-Controlnet with minimal quality loss.
Conclusion: EVCtrl effectively improves efficiency in controllable visual generation without requiring model retraining.
Abstract: Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. While ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control, then partition the network into global and local functional zones. A locality-aware cache focuses computation on the local zones that truly need the control signal, skipping the bulk of redundant computation in global regions. For temporal redundancy, we selectively omit unnecessary denoising steps to improve efficiency. Extensive experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux demonstrate that our method is effective in image and video control generation without the need for training. For example, it achieves 2.16 and 2.05 times speedups on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no degradation in generation quality. Code is available in the supplementary materials.
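The temporal half of the dual caching strategy, reusing control-branch outputs across denoising steps, reduces to a small cache wrapper like the one below; the refresh schedule is a placeholder and the spatial global/local partition is omitted, so this is a sketch of the idea rather than EVCtrl's actual policy.

```python
class TemporalControlCache:
    """Reuse control-branch residuals across denoising steps.

    A full control forward pass runs only every `refresh_every` steps;
    intermediate steps return the cached residuals unchanged.
    """

    def __init__(self, refresh_every=3):
        self.refresh_every = refresh_every
        self.cached = None

    def residuals(self, step, compute_fn):
        if self.cached is None or step % self.refresh_every == 0:
            self.cached = compute_fn()  # expensive ControlNet branch
        return self.cached              # reused on skipped steps

cache = TemporalControlCache(refresh_every=3)
for step in range(8):
    res = cache.residuals(step, lambda s=step: f"control-residuals@{s}")
    print(step, res)  # recomputed at steps 0, 3, 6; reused elsewhere
```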
[85] Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision
Rosiana Natalie, Wenqian Xu, Ruei-Che Chang, Rada Mihalcea, Anhong Guo
Main category: cs.CV
TL;DR: The paper evaluates VLMs’ ability to simulate low vision individuals’ image perception, finding minimal prompts yield low agreement (0.59), while combining vision info and example responses improves it (0.70).
Details
Motivation: Prior research hasn't explored VLMs' simulation capabilities in accessibility, specifically for low vision individuals.
Method: A benchmark dataset from 40 low vision participants was used to create VLM prompts, varying vision info and example responses. Agreement between VLM and participant responses was measured.
Result: VLMs inferred beyond specified vision ability with minimal prompts (agreement 0.59). Combining vision info and example responses significantly improved agreement (0.70).
Conclusion: Combining vision info and example responses enhances VLM simulation accuracy for low vision perception, with diminishing returns from additional examples.
Abstract: Advances in vision language models (VLMs) have enabled the simulation of general human behavior through their reasoning and problem solving capabilities. However, prior research has not investigated such simulation capabilities in the accessibility domain. In this paper, we evaluate the extent to which VLMs can simulate the vision perception of low vision individuals when interpreting images. We first compile a benchmark dataset through a survey study with 40 low vision participants, collecting their brief and detailed vision information and both open-ended and multiple-choice image perception and recognition responses to up to 25 images. Using these responses, we construct prompts for VLMs (GPT-4o) to create simulated agents of each participant, varying the included vision information and example image responses. We evaluate the agreement between VLM-generated responses and participants' original answers. Our results indicate that VLMs tend to infer beyond the specified vision ability when given minimal prompts, resulting in low agreement (0.59). The agreement between the agents' and participants' responses remains low when only the vision information (0.59) or only the example image responses (0.59) are provided, whereas a combination of both significantly increases the agreement (0.70, p < 0.0001). Notably, a single example combining both open-ended and multiple-choice responses offers significant performance improvements over either alone (p < 0.0001), while additional examples provide minimal benefits (p > 0.05).
[86] Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?
Xuezheng Chen, Zhengbo Zou
Main category: cs.CV
TL;DR: The paper introduces ConstructionSite 10k, a dataset of 10,000 annotated construction site images for evaluating and fine-tuning Vision Language Models (VLMs) in safety inspections.
Details
Motivation: Address the lack of open datasets for comprehensive evaluation and fine-tuning of VLMs in construction safety inspection.
Method: Propose ConstructionSite 10k with annotations for image captioning, safety rule violation VQA, and construction element visual grounding.
Result: Evaluation shows VLMs generalize well in zero-shot and few-shot settings but require additional training for real-world application.
Conclusion: The dataset serves as a benchmark for training and evaluating VLMs, advancing construction safety inspection research.
Abstract: Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.
[87] Can Multi-modal (reasoning) LLMs detect document manipulation?
Zisheng Liang, Kidus Zewde, Rudra Pratap Singh, Disha Patil, Zexi Chen, Jiayu Xue, Yao Yao, Yifei Chen, Qinzhe Liu, Simiao Ren
Main category: cs.CV
TL;DR: The study evaluates multi-modal LLMs for detecting fraudulent documents, finding top models outperform traditional methods but with variability in performance. Task-specific tuning is key.
Details
Motivation: Document fraud threatens industries needing secure documentation, requiring advanced detection methods.
Method: Benchmarked multi-modal LLMs (e.g., OpenAI, Gemini, Claude) on a standard dataset, analyzing prompt optimization and reasoning processes for fraud indicators.
Result: Top LLMs excel in zero-shot generalization, but performance varies; model size and reasoning don’t strongly predict accuracy.
Conclusion: Multi-modal LLMs show promise for fraud detection, emphasizing the need for task-specific tuning and future research on scalable strategies.
Abstract: Document fraud poses a significant threat to industries reliant on secure and verifiable documentation, necessitating robust detection mechanisms. This study investigates the efficacy of state-of-the-art multi-modal large language models (LLMs), including OpenAI O1, OpenAI 4o, Gemini Flash (thinking), Deepseek Janus, Grok, Llama 3.2 and 4, Qwen 2 and 2.5 VL, Mistral Pixtral, and Claude 3.5 and 3.7 Sonnet, in detecting fraudulent documents. We benchmark these models against each other and prior work on document fraud detection techniques using a standard dataset with real transactional documents. Through prompt optimization and detailed analysis of the models' reasoning processes, we evaluate their ability to identify subtle indicators of fraud, such as tampered text, misaligned formatting, and inconsistent transactional sums. Our results reveal that top-performing multi-modal LLMs demonstrate superior zero-shot generalization, outperforming conventional methods on out-of-distribution datasets, while several vision LLMs exhibit inconsistent or subpar performance. Notably, model size and advanced reasoning capabilities show limited correlation with detection accuracy, suggesting task-specific fine-tuning is critical. This study underscores the potential of multi-modal LLMs in enhancing document fraud detection systems and provides a foundation for future research into interpretable and scalable fraud mitigation strategies.
[88] Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis
Yu Xin, Gorkem Can Ates, Kuang Gong, Wei Shao
Main category: cs.CV
TL;DR: Med3DVLM introduces a 3D vision-language model for medical imaging, addressing computational and alignment challenges with innovations in encoding, contrastive learning, and feature fusion, achieving superior performance on benchmarks.
Details
Motivation: Extending vision-language models to 3D medical imaging is challenging due to computational demands and feature-text alignment difficulties.
Method: Med3DVLM uses DCFormer for efficient 3D encoding, SigLIP for improved image-text alignment, and a dual-stream MLP-Mixer for feature fusion.
Result: Outperforms state-of-the-art in image-text retrieval (61.00% R@1), report generation (36.42% METEOR), and VQA (36.76% METEOR, 79.95% accuracy).
Conclusion: Med3DVLM effectively bridges 3D imaging and language, enabling scalable multi-task clinical reasoning.
Abstract: Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM’s ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
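The pairwise sigmoid loss referenced as SigLIP treats every image-text pair in a batch as an independent binary classification, with positives on the diagonal, which is why no large negative batch is needed. A minimal PyTorch version with the temperature and bias fixed rather than learned:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid image-text loss (SigLIP-style).

    Each of the N*N pairs is an independent binary classification:
    +1 for matching (diagonal) pairs, -1 otherwise. `t` and `b` are
    learnable in the original formulation; fixed here for brevity.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T * t + b             # (N, N) pair logits
    labels = 2 * torch.eye(len(img)) - 1     # +1 on diagonal, -1 off
    return -F.logsigmoid(labels * logits).mean()

print(siglip_loss(torch.randn(4, 256), torch.randn(4, 256)))
```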
[89] MedSAMix: A Training-Free Model Merging Approach for Medical Image Segmentation
Yanwu Yang, Guinan Su, Jiesi Hu, Francesco Sammarco, Jonas Geiping, Thomas Wolfers
Main category: cs.CV
TL;DR: MedSAMix is a training-free model merging method combining generalist (e.g., SAM) and specialist (e.g., MedSAM) models for medical image segmentation, improving performance by 6.67% on specialized tasks and 4.37% on multi-task evaluations.
Details
Motivation: Existing fine-tuned medical segmentation models like MedSAM struggle with limited, heterogeneous data, hindering generalization. MedSAMix aims to integrate generalist and specialist models for better performance.
Method: Proposes a zero-order optimization method for automatic layer-wise merging of models, with two regimes for domain-specificity and generalizability.
Result: Achieves improvements of 6.67% on specialized tasks and 4.37% on multi-task evaluations across 25 medical segmentation tasks.
Conclusion: MedSAMix effectively balances domain-specific accuracy and generalization, outperforming traditional merging approaches.
Abstract: Universal medical image segmentation models have emerged as a promising paradigm due to their strong generalizability across diverse tasks, showing great potential for a wide range of clinical applications. This potential has been partly driven by the success of general-purpose vision models such as the Segment Anything Model (SAM), which has inspired the development of various fine-tuned variants for medical segmentation tasks. However, fine-tuned variants like MedSAM are trained on comparatively limited medical imaging data that often suffers from heterogeneity, scarce annotations, and distributional shifts. These challenges limit their ability to generalize across a wide range of medical segmentation tasks. In this regard, we propose MedSAMix, a training-free model merging method that integrates the strengths of both generalist models (e.g., SAM) and specialist models (e.g., MedSAM) for medical image segmentation. In contrast to traditional model merging approaches that rely on manual configuration and often result in suboptimal outcomes, we propose a zero-order optimization method to automatically discover optimal layer-wise merging solutions. Furthermore, for clinical applications, we develop two regimes to meet the demand of domain-specificity and generalizability in different scenarios by single-task optimization and multi-objective optimization respectively. Extensive evaluations on 25 medical segmentation tasks demonstrate that MedSAMix effectively mitigates model bias and consistently improves performance in both domain-specific accuracy and generalization, achieving improvements of 6.67% on specialized tasks and 4.37% on multi-task evaluations.
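The merging operator itself reduces to per-parameter interpolation between the two models; MedSAMix's contribution is discovering the layer-wise coefficients by zero-order optimization rather than hand-tuning. A sketch of the interpolation step with the coefficients assumed given (names hypothetical):

```python
import copy
import torch.nn as nn

def merge_layerwise(generalist, specialist, alphas):
    """Per-parameter interpolation between two same-architecture models.

    `alphas` maps parameter names to mixing weights in [0, 1]; MedSAMix
    would discover these by zero-order optimization rather than by hand.
    """
    g_state = generalist.state_dict()
    s_state = specialist.state_dict()
    merged_state = {
        name: (1 - alphas.get(name, 0.5)) * g_state[name]
        + alphas.get(name, 0.5) * s_state[name]
        for name in g_state
    }
    merged = copy.deepcopy(generalist)
    merged.load_state_dict(merged_state)
    return merged

# Toy usage with two tiny "models" sharing one architecture.
g, s = nn.Linear(4, 2), nn.Linear(4, 2)
merged = merge_layerwise(g, s, {"weight": 0.8, "bias": 0.2})
```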
[90] Part Segmentation of Human Meshes via Multi-View Human Parsing
James Dickens, Kamyar Hamad
Main category: cs.CV
TL;DR: The paper bridges point cloud deep learning and human parsing by enabling semantic segmentation of human meshes using geometric data, introducing a pseudo-ground truth pipeline and memory-efficient sampling.
Details
Motivation: To enable per-vertex semantic segmentation of large-scale human meshes by leveraging raw geometry, bridging point cloud deep learning and human parsing.
Method: Developed a pseudo-ground truth labeling pipeline for Thuman2.1, introduced windowed iterative FPS with space-filling curve serialization, and used PointTransformer for geometric segmentation.
Result: The approach effectively segments human meshes without texture, achieving high accuracy.
Conclusion: The proposed method is accurate and efficient, with code and data made publicly available.
Abstract: Recent advances in point cloud deep learning have led to models that achieve high per-part labeling accuracy on large-scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per-vertex semantic segmentation of large-scale human meshes. To achieve this, a pseudo-ground truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point-level labels are then backprojected onto the original mesh to produce per-point pseudo ground truth annotations. Subsequently, a novel, memory-efficient sampling strategy is introduced, a windowed iterative farthest point sampling (FPS) with space-filling curve-based serialization to effectively downsample the point clouds. This is followed by a purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach. Project code and pre-processed data are available at https://github.com/JamesMcCullochDickens/Human3DParsing/tree/master.
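Plain iterative farthest point sampling, the primitive that the windowed, serialization-based variant wraps, fits in a few lines; the space-filling-curve windowing that makes it memory-efficient is omitted here.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, m: int) -> np.ndarray:
    """Return indices of m points chosen by iterative farthest point sampling."""
    n = len(points)
    idx = np.zeros(m, dtype=int)   # seed with point 0
    dist = np.full(n, np.inf)      # distance to the nearest selected point
    for i in range(1, m):
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[i - 1]], axis=1))
        idx[i] = int(dist.argmax())  # pick the farthest remaining point
    return idx

pts = np.random.rand(100_000, 3)   # stand-in for mesh vertices
subset = pts[farthest_point_sampling(pts, 1024)]
print(subset.shape)                # (1024, 3)
```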
[91] Data-Driven Abdominal Phenotypes of Type 2 Diabetes in Lean, Overweight, and Obese Cohorts
Lucas W. Remedios, Chloe Choe, Trent M. Schwartz, Dingjie Su, Gaurav Rudravaram, Chenyu Gao, Aravind R. Krishnan, Adam M. Saunders, Michael E. Kim, Shunxing Bao, Alvin C. Powers, Bennett A. Landman, John Virostko
Main category: cs.CV
TL;DR: The study uses AI to analyze 3D clinical imaging for detailed body composition, identifying shared abdominal patterns linked to type 2 diabetes across BMI subgroups.
Details
Motivation: Despite BMI being a known risk factor for type 2 diabetes, its inconsistency in lean and obese individuals suggests the need for deeper body composition analysis.
Method: The study segmented abdominal CT scans into explainable measurements, used random forest classification for diabetes risk, and applied SHAP analysis to identify contributing features.
Result: Random forests achieved AUCs of 0.72-0.74, revealing shared diabetes signatures like fatty skeletal muscle, visceral fat, and pancreas changes across BMI subgroups.
Conclusion: Abdominal drivers of type 2 diabetes appear consistent across weight classes, suggesting broader applicability of these findings.
Abstract: Purpose: Although elevated BMI is a well-known risk factor for type 2 diabetes, the disease's presence in some lean adults and absence in others with obesity suggests that detailed body composition may uncover abdominal phenotypes of type 2 diabetes. With AI, we can now extract detailed measurements of size, shape, and fat content from abdominal structures in 3D clinical imaging at scale. This creates an opportunity to empirically define body composition signatures linked to type 2 diabetes risk and protection using large-scale clinical data. Approach: To uncover BMI-specific diabetic abdominal patterns from clinical CT, we applied our design four times: once on the full cohort (n = 1,728) and once on lean (n = 497), overweight (n = 611), and obese (n = 620) subgroups separately. Briefly, our experimental design transforms abdominal scans into collections of explainable measurements through segmentation, classifies type 2 diabetes through a cross-validated random forest, measures how features contribute to model-estimated risk or protection through SHAP analysis, groups scans by shared model decision patterns (clustering from SHAP), and links back to anatomical differences (classification). Results: The random forests achieved mean AUCs of 0.72-0.74. There were shared type 2 diabetes signatures in each group: fatty skeletal muscle, older age, greater visceral and subcutaneous fat, and a smaller or fat-laden pancreas. Univariate logistic regression confirmed the direction of 14-18 of the top 20 predictors within each subgroup (p < 0.05). Conclusions: Our findings suggest that abdominal drivers of type 2 diabetes may be consistent across weight classes.
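The analysis loop described (explainable measurements into a cross-validated random forest, then SHAP attributions) maps onto standard tooling. A toy version on synthetic features, purely to show the shape of the pipeline (feature values, sizes, and labels are fabricated stand-ins, not study data):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Fabricated stand-ins for segmentation-derived abdominal measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                  # e.g., organ sizes, fat fractions
y = (X[:, 0] + X[:, 3] + rng.normal(size=500) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV AUC:", cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean())

clf.fit(X, y)
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)          # per-feature risk contributions
# (Return shape varies across shap versions: one array or a list per class.)
```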
[92] HierOctFusion: Multi-scale Octree-based 3D Shape Generation via Part-Whole-Hierarchy Message Passing
Xinjie Gao, Bi’an Du, Wei Hu
Main category: cs.CV
TL;DR: HierOctFusion improves 3D content generation by leveraging part-aware multi-scale octree diffusion and cross-attention conditioning for better hierarchical feature interaction and efficiency.
Details
Motivation: Existing octree-based diffusion models treat 3D objects holistically, ignoring semantic part hierarchies and incurring computational inefficiency in high-resolution modeling.
Method: Proposes HierOctFusion, a part-aware multi-scale octree diffusion model with cross-attention conditioning for part-level information injection and hierarchical feature propagation.
Result: Achieves superior shape quality and efficiency compared to prior methods.
Conclusion: HierOctFusion effectively addresses the limitations of holistic modeling by incorporating part hierarchies and layered generation, enhancing 3D content generation.
Abstract: 3D content generation remains a fundamental yet challenging task due to the inherent structural complexity of 3D data. While recent octree-based diffusion models offer a promising balance between efficiency and quality through hierarchical generation, they often overlook two key insights: 1) existing methods typically model 3D objects as holistic entities, ignoring their semantic part hierarchies and limiting generalization; and 2) holistic high-resolution modeling is computationally expensive, whereas real-world objects are inherently sparse and hierarchical, making them well-suited for layered generation. Motivated by these observations, we propose HierOctFusion, a part-aware multi-scale octree diffusion model that enhances hierarchical feature interaction for generating fine-grained and sparse object structures. Furthermore, we introduce a cross-attention conditioning mechanism that injects part-level information into the generation process, enabling semantic features to propagate effectively across hierarchical levels from parts to the whole. Additionally, we construct a 3D dataset with part category annotations using a pre-trained segmentation model to facilitate training and evaluation. Experiments demonstrate that HierOctFusion achieves superior shape quality and efficiency compared to prior methods.
[93] UWB-PostureGuard: A Privacy-Preserving RF Sensing System for Continuous Ergonomic Sitting Posture Monitoring
Haotang Li, Zhenyu Qi, Sen He, Kebin Peng, Sheng Tan, Yili Ren, Tomas Cerny, Jiyue Zhao, Zi Wang
Main category: cs.CV
TL;DR: UWB-PostureGuard is a privacy-preserving UWB sensing system for monitoring ergonomic sitting posture with high accuracy and robustness.
Details
Motivation: Addressing privacy and comfort issues in traditional posture monitoring solutions.
Method: Uses commercial UWB devices and PoseGBDT for temporal posture pattern analysis.
Result: Achieves 99.11% accuracy in real-world evaluations across diverse conditions.
Conclusion: Offers a scalable, low-cost mobile health solution for proactive ergonomic management.
Abstract: Improper sitting posture during prolonged computer use has become a significant public health concern. Traditional posture monitoring solutions face substantial barriers, including privacy concerns with camera-based systems and user discomfort with wearable sensors. This paper presents UWB-PostureGuard, a privacy-preserving ultra-wideband (UWB) sensing system that advances mobile technologies for preventive health management through continuous, contactless monitoring of ergonomic sitting posture. Our system leverages commercial UWB devices, utilizing comprehensive feature engineering to extract multiple ergonomic sitting posture features. We develop PoseGBDT to effectively capture temporal dependencies in posture patterns, addressing limitations of traditional frame-wise classification approaches. Extensive real-world evaluation across 10 participants and 19 distinct postures demonstrates exceptional performance, achieving 99.11% accuracy while maintaining robustness against environmental variables such as clothing thickness, additional devices, and furniture configurations. Our system provides a scalable, privacy-preserving mobile health solution on existing platforms for proactive ergonomic management, improving quality of life at low costs.
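PoseGBDT's stated advantage over frame-wise classification is temporal context. One simple way to give a gradient-boosted model such context is to stack a sliding window of per-frame features, sketched below with placeholder window length, feature count, and synthetic data (not the paper's settings):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def windowed(X, w=5):
    """Stack each frame's features with its w-1 predecessors."""
    return np.asarray([X[i - w:i].ravel() for i in range(w, len(X) + 1)])

# Synthetic stand-in: 1,000 frames of 12 UWB-derived features each.
X = np.random.randn(1000, 12)
y = np.random.randint(0, 19, size=1000)   # 19 posture classes, as in the paper

w = 5
clf = GradientBoostingClassifier(n_estimators=50).fit(windowed(X, w), y[w - 1:])
```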
[94] Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation
Bing Liu, Le Wang, Hao Liu, Mingming Liu
Main category: cs.CV
TL;DR: Proposes a residual-based bidirectional diffusion model (RBDM) for translating between hazy and haze-free images, achieving efficient performance with minimal steps.
Details
Motivation: Existing deep dehazing methods lack bidirectional translation capability between hazy and haze-free images.
Method: Uses dual Markov chains for residual shifts, perturbs images at timesteps to predict noise, and employs a unified score function on patches for efficiency.
Result: Achieves superior/comparable performance to state-of-the-art methods on synthetic and real-world datasets with only 15 sampling steps.
Conclusion: RBDM effectively enables bidirectional transitions between hazy and haze-free images, improving efficiency and performance.
Abstract: Current deep dehazing methods only focus on removing haze from hazy images, lacking the capability to translate between hazy and haze-free images. To address this issue, we propose a residual-based efficient bidirectional diffusion model (RBDM) that can model the conditional distributions for both dehazing and haze generation. Firstly, we devise dual Markov chains that can effectively shift the residuals and facilitate bidirectional smooth transitions between them. Secondly, the RBDM perturbs the hazy and haze-free images at individual timesteps and predicts the noise in the perturbed data to simultaneously learn the conditional distributions. Finally, to enhance performance on relatively small datasets and reduce computational costs, our method introduces a unified score function learned on image patches instead of entire images. Our RBDM successfully implements size-agnostic bidirectional transitions between haze-free and hazy images with only 15 sampling steps. Extensive experiments demonstrate that the proposed method achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.
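One plausible reading of the dual residual-shifting chains is a forward process that transports one image toward the other along their residual while injecting schedule-scaled noise; running it in either direction gives the two chains. The sketch below follows that reading; the schedule and noise scale are assumptions, not the paper's equations.

```python
import torch

def residual_shift(x_src, x_dst, t, T, kappa=0.5):
    """Forward perturbation: move x_src toward x_dst along their residual.

    eta grows from 0 to 1 over the chain; noise is scaled with the shift.
    Swapping x_src and x_dst yields the opposite-direction chain.
    """
    eta = t / T
    mean = x_src + eta * (x_dst - x_src)   # residual-shifted mean
    return mean + kappa * (eta ** 0.5) * torch.randn_like(x_src)

hazy, clear = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
x_t = residual_shift(clear, hazy, t=7, T=15)  # clear-to-hazy direction
```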
[95] A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations
Bin Ma, Yifei Zhang, Yongjin Xian, Qi Li, Linna Zhou, Gongxun Miao
Main category: cs.CV
TL;DR: A novel cross-modal rumor detection method (MICC) uses contrastive learning to explore multi-scale image and context relationships, outperforming existing methods.
Details
Motivation: Existing methods ignore image content and multi-scale context-image relationships, missing critical rumor identification information.
Method: Uses an SCLIP encoder for unified semantic embeddings, a Cross-Modal Multi-Scale Alignment module, and a scale-aware fusion network to integrate features.
Result: Achieves significant performance improvement over state-of-the-art methods on real-world datasets.
Conclusion: MICC is effective and practical for rumor detection by leveraging multi-scale image-context relationships.
Abstract: Existing rumor detection methods often neglect the content within images as well as the inherent relationships between contexts and images across different visual scales, thereby resulting in the loss of critical information pertinent to rumor identification. To address these issues, this paper presents a novel cross-modal rumor detection scheme based on contrastive learning, namely the Multi-scale Image and Context Correlation exploration algorithm (MICC). Specifically, we design an SCLIP encoder to generate unified semantic embeddings for text and multi-scale image patches through contrastive pretraining, enabling their relevance to be measured via dot-product similarity. Building upon this, a Cross-Modal Multi-Scale Alignment module is introduced to identify image regions most relevant to the textual semantics, guided by mutual information maximization and the information bottleneck principle, through a Top-K selection strategy based on a cross-modal relevance matrix constructed between the text and multi-scale image patches. Moreover, a scale-aware fusion network is designed to integrate the highly correlated multi-scale image features with global text features by assigning adaptive weights to image regions based on their semantic importance and cross-modal relevance. The proposed methodology has been extensively evaluated on two real-world datasets. The experimental results demonstrate that it achieves a substantial performance improvement over existing state-of-the-art approaches in rumor detection, highlighting its effectiveness and potential for practical applications.
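The Top-K selection over the cross-modal relevance matrix is the most mechanical piece. Assuming text and patch embeddings already share a space (as the SCLIP encoder is said to provide), a single-text sketch:

```python
import torch
import torch.nn.functional as F

def topk_patches(text_emb, patch_emb, k=8):
    """Pick the k image patches most relevant to the text.

    Assumes both embeddings live in one shared space, so relevance is a
    dot product after L2 normalization.
    """
    text = F.normalize(text_emb, dim=-1)        # (D,)
    patches = F.normalize(patch_emb, dim=-1)    # (P, D)
    relevance = patches @ text                  # (P,) similarity scores
    scores, idx = relevance.topk(k)
    return patch_emb[idx], scores

patches, scores = topk_patches(torch.randn(512), torch.randn(196, 512))
print(patches.shape)  # torch.Size([8, 512])
```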
[96] LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction
Maoquan Zhang, Bisser Raytchev, Xiujuan Sun
Main category: cs.CV
TL;DR: LEARN is a layout-aware diffusion framework for generating pedagogically aligned STEM illustrations, leveraging structured visual cues and narrative layouts to enhance learning.
Details
Motivation: The framework aims to improve STEM education by generating coherent visual sequences that align with educational theories like Bloom's taxonomy and Cognitive Load Theory, reducing fragmented attention.
Method: LEARN uses layout-conditioned generation, contrastive visual-semantic training, and prompt modulation to create structured, story-driven illustrations.
Result: The framework produces pedagogically effective visuals, supports mid-to-high-level reasoning, and shows potential for integration with multimodal systems and knowledge graphs.
Conclusion: LEARN introduces a novel generative AI approach for education, combining storytelling, semantic learning, and cognitive scaffolding, with plans to release code and dataset for further research.
Abstract: LEARN is a layout-aware diffusion framework designed to generate pedagogically aligned illustrations for STEM education. It leverages a curated BookCover dataset that provides narrative layouts and structured visual cues, enabling the model to depict abstract and sequential scientific concepts with strong semantic alignment. Through layout-conditioned generation, contrastive visual-semantic training, and prompt modulation, LEARN produces coherent visual sequences that support mid-to-high-level reasoning in line with Bloom’s taxonomy while reducing extraneous cognitive load as emphasized by Cognitive Load Theory. By fostering spatially organized and story-driven narratives, the framework counters fragmented attention often induced by short-form media and promotes sustained conceptual focus. Beyond static diagrams, LEARN demonstrates potential for integration with multimodal systems and curriculum-linked knowledge graphs to create adaptive, exploratory educational content. As the first generative approach to unify layout-based storytelling, semantic structure learning, and cognitive scaffolding, LEARN represents a novel direction for generative AI in education. The code and dataset will be released to facilitate future research and practical deployment.
[97] Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models
Bing Liu, Le Wang, Mingming Liu, Hao Liu, Rui Yao, Yong Zhou, Peng Liu, Tongqiang Xia
Main category: cs.CV
TL;DR: Proposes EM-B3DM, a semi-supervised dehazing method using Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models, outperforming state-of-the-art methods.
Details
Motivation: Addresses the challenge of real-world haze image dehazing due to lack of paired data and robust priors.
Method: Uses a two-stage learning scheme: EM algorithm for decoupling distributions and Brownian Bridge diffusion for modeling correlations, followed by leveraging unpaired data. Introduces RDC block for detail enhancement.
Result: Achieves superior or comparable performance on synthetic and real-world datasets.
Conclusion: EM-B3DM is an efficient and effective solution for image dehazing, especially in challenging scenarios.
Abstract: Existing dehazing methods deal with real-world haze images with difficulty, especially scenes with thick haze. One of the main reasons is the lack of real-world paired data and robust priors. To avoid the costly collection of paired hazy and clear images, we propose an efficient semi-supervised image dehazing method via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models (EM-B3DM) with a two-stage learning scheme. In the first stage, we employ the EM algorithm to decouple the joint distribution of paired hazy and clear images into two conditional distributions, which are then modeled using a unified Brownian Bridge diffusion model to directly capture the structural and content-related correlations between hazy and clear images. In the second stage, we leverage the pre-trained model and large-scale unpaired hazy and clear images to further improve the performance of image dehazing. Additionally, we introduce a detail-enhanced Residual Difference Convolution block (RDC) to capture gradient-level information, significantly enhancing the model’s representation capability. Extensive experiments demonstrate that our EM-B3DM achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.
[98] VFM-Guided Semi-Supervised Detection Transformer for Source-Free Object Detection in Remote Sensing Images
Jianhong Han, Yupei Wang, Liang Chen
Main category: cs.CV
TL;DR: VG-DETR is a semi-supervised framework for source-free object detection in remote sensing, using a Vision Foundation Model to improve pseudo-label quality and feature alignment.
Details
Motivation: Addressing the limitations of source-free object detection (SFOD) in remote sensing, where noisy pseudo-labels and domain gaps hinder performance.
Method: Integrates a Vision Foundation Model (VFM) for pseudo-label mining and dual-level feature alignment (instance and image levels) to enhance robustness.
Result: VG-DETR outperforms existing methods in source-free remote sensing detection tasks.
Conclusion: VG-DETR effectively mitigates pseudo-label noise and improves feature representation, advancing SFOD in remote sensing.
Abstract: Unsupervised domain adaptation methods have been widely explored to bridge domain gaps. However, in real-world remote-sensing scenarios, privacy and transmission constraints often preclude access to source domain data, which limits their practical applicability. Recently, Source-Free Object Detection (SFOD) has emerged as a promising alternative, aiming at cross-domain adaptation without relying on source data, primarily through a self-training paradigm. Despite its potential, SFOD frequently suffers from training collapse caused by noisy pseudo-labels, especially in remote sensing imagery with dense objects and complex backgrounds. Considering that limited target domain annotations are often feasible in practice, we propose a Vision foundation-Guided DEtection TRansformer (VG-DETR), built upon a semi-supervised framework for SFOD in remote sensing images. VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a “free lunch” manner, leveraging a small amount of labeled target data to mitigate pseudo-label noise while improving the detector’s feature-extraction capability. Specifically, we introduce a VFM-guided pseudo-label mining strategy that leverages the VFM’s semantic priors to further assess the reliability of the generated pseudo-labels. By recovering potentially correct predictions from low-confidence outputs, our strategy improves pseudo-label quality and quantity. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels. Through contrastive learning among fine-grained prototypes and similarity matching between feature maps, this dual-level alignment further enhances the robustness of feature representations against domain gaps. Extensive experiments demonstrate that VG-DETR achieves superior performance in source-free remote sensing detection tasks.
[99] Better Supervised Fine-tuning for VQA: Integer-Only Loss
Baihong Qian, Haotian Fan, Wenjie Liao, Yunqiu Wang, Tao Li, Junhui Cui
Main category: cs.CV
TL;DR: IOVQA, a novel fine-tuning method for VLMs, improves video quality assessment by using integer labels and a target-mask strategy, achieving high accuracy and ranking 3rd in a benchmark.
Details
Motivation: Existing methods for visual quality assessment in VLMs are imprecise and inefficient, limiting focus on key evaluation indicators.
Method: IOVQA constrains outputs to integers in [10, 50], converts decimal labels to integers, and uses a target-mask strategy to focus on critical evaluation components.
Result: Fine-tuned Qwen2.5-VL model shows significant accuracy and consistency improvements, ranking 3rd in VQualA 2025 GenAI-Bench.
Conclusion: Integer-only fine-tuning is effective for optimizing VLMs in quantitative evaluation tasks.
Abstract: With the rapid advancement of vision language models (VLMs), their ability to assess visual content based on specific criteria and dimensions has become increasingly critical for applications such as video-theme consistency assessment and visual quality scoring. However, existing methods often suffer from imprecise results and inefficient loss calculation, which limit the model’s focus on key evaluation indicators. To address this, we propose IOVQA (Integer-only VQA), a novel fine-tuning approach tailored for VLMs to enhance their performance in video quality assessment tasks. The key innovation of IOVQA lies in its label construction and its targeted loss calculation mechanism. Specifically, during dataset curation, we constrain the model’s output to integers within the range of [10, 50], ensuring numerical stability, and convert the decimal Overall_MOS to an integer before using it as the label. We also introduce a target-mask strategy: when computing the loss, only the first two-digit integer of the label is unmasked, forcing the model to learn the critical components of the numerical evaluation. After fine-tuning the Qwen2.5-VL model using the constructed dataset, experimental results demonstrate that the proposed method significantly improves the model’s accuracy and consistency in the VQA task, ranking 3rd in the VQualA 2025 GenAI-Bench AIGC Video Quality Assessment Challenge – Track I. Our work highlights the effectiveness of keeping only integer labels during fine-tuning, providing an effective idea for optimizing VLMs in quantitative evaluation scenarios.
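The label construction and target-mask step are concrete enough to sketch. Below is a hypothetical rendering of the idea, assuming a causal LM training setup where label positions set to -100 are ignored by the cross-entropy loss; the function names and the MOS-to-integer scaling are illustrative assumptions, not the paper's code:

```python
import torch

def build_integer_label(overall_mos: float) -> str:
    """Map a decimal MOS to an integer in [10, 50] (scaling is assumed)."""
    score = round(overall_mos * 10)          # e.g. 3.7 on a 1-5 scale -> 37
    return str(min(50, max(10, score)))

def mask_all_but_score(labels: torch.Tensor, score_positions: list) -> torch.Tensor:
    """Keep the loss only on the two digits of the score span;
    -100 is the standard ignore index for torch cross-entropy."""
    masked = torch.full_like(labels, -100)
    for pos in score_positions:
        masked[pos] = labels[pos]
    return masked
```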
[100] Exploring the Tradeoff Between Diversity and Discrimination for Continuous Category Discovery
Ruobing Jiang, Yang Liu, Haobing Liu, Yanwei Yu, Chunyang Wang
Main category: cs.CV
TL;DR: IDOD is a novel method for continuous category discovery (CCD) that addresses challenges like error accumulation and catastrophic forgetting by using diversity enrichment, joint novelty discovery, and orthogonality-based discrimination.
Details
Motivation: Existing CCD methods struggle with balancing novel class discovery and classification, accumulate errors, and require excessive storage for preventing forgetting.
Method: IDOD employs three modules: independent diversity enrichment (contrastive loss), joint novelty discovery (single-stage), and orthogonality-based discrimination (mutually orthogonal prototypes).
Result: IDOD outperforms state-of-the-art methods on fine-grained datasets, demonstrating superior performance in CCD.
Conclusion: IDOD effectively mitigates key CCD challenges, offering a robust solution with lower storage overhead and improved accuracy.
Abstract: Continuous category discovery (CCD) aims to automatically discover novel categories in continuously arriving unlabeled data. This is a challenging problem because the number of categories in the newly arrived data is unknown and no labels are available, while catastrophic forgetting must also be mitigated. Most CCD methods cannot handle the contradiction between novel class discovery and classification well. They are also prone to accumulating errors in the process of gradually discovering novel classes. Moreover, most of them use knowledge distillation and data replay to prevent forgetting, occupying more storage space. To address these limitations, we propose Independence-based Diversity and Orthogonality-based Discrimination (IDOD). IDOD mainly includes an independent enrichment of diversity module, a joint discovery of novelty module, and a continuous increment by orthogonality module. In independent enrichment, the backbone is trained separately using contrastive loss to prevent it from focusing only on features for classification. Joint discovery transforms multi-stage novel class discovery into a single stage, reducing the impact of error accumulation. The continuous increment by orthogonality module generates mutually orthogonal prototypes for classification and prevents forgetting with lower space overhead via representative representation replay. Experimental results show that on challenging fine-grained datasets, our method outperforms the state-of-the-art methods.
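One standard way to realize the "mutually orthogonal prototypes" idea is a QR decomposition of a random matrix; a hedged sketch of that construction (the paper's actual generator may differ):

```python
import torch

def orthogonal_prototypes(num_classes: int, dim: int) -> torch.Tensor:
    """Return (num_classes, dim) class prototypes with orthonormal rows via
    QR of a random Gaussian matrix; requires dim >= num_classes."""
    assert dim >= num_classes
    q, _ = torch.linalg.qr(torch.randn(dim, num_classes))
    return q.T  # rows are mutually orthogonal unit vectors
```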
[101] Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning
Yumiao Zhao, Bo Jiang, Yuhe Ding, Xiao Wang, Jin Tang, Bin Luo
Main category: cs.CV
TL;DR: LatHAdapter improves few-shot classification by aligning visual and textual representations in hyperbolic space, leveraging latent semantic hierarchy for better performance.
Details
Motivation: Existing adapters fail to capture one-to-many associations between categories and images and struggle with unknown categories.
Method: LatHAdapter uses learnable attribute prompts and hyperbolic space projection with hierarchical regularization to model latent semantic hierarchy.
Result: LatHAdapter outperforms other fine-tuning methods on four few-shot tasks, especially for known and unknown classes.
Conclusion: LatHAdapter effectively addresses limitations of existing adapters by leveraging hyperbolic learning and latent hierarchy.
Abstract: Adapter-based approaches have garnered attention for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks. These methods strive to develop a lightweight module that better aligns visual and (category) textual representations, thereby enhancing performance on downstream few-shot learning tasks. However, existing adapters generally learn/align (category) textual-visual modalities via explicit spatial proximity in the underlying embedding space, which i) fails to capture the inherent one-to-many associations between categories and image samples and ii) struggles to establish accurate associations between the unknown categories and images. To address these issues, inspired by recent works on hyperbolic learning, we develop a novel Latent Hierarchical Adapter (LatHAdapter) for fine-tuning VLMs on downstream few-shot classification tasks. The core of LatHAdapter is to exploit the latent semantic hierarchy of downstream training data and employ it to provide richer, fine-grained guidance for the adapter learning process. Specifically, LatHAdapter first introduces some learnable 'attribute' prompts as the bridge to align categories and images. Then, it projects the categories, attribute prompts, and images within each batch into a hyperbolic space, and employs hierarchical regularization to learn their latent semantic hierarchy, thereby fully modeling the inherent one-to-many associations among categories, learnable attributes, and image samples. Extensive experiments on four challenging few-shot tasks show that the proposed LatHAdapter consistently outperforms many other fine-tuning approaches, particularly in adapting known classes and generalizing to unknown classes.
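The hyperbolic projection step is standard: embeddings are mapped into the Poincaré ball via the exponential map at the origin, where hierarchies embed with low distortion. A minimal sketch of that textbook formula, illustrative of (not identical to) LatHAdapter's projection:

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of a Poincare ball with curvature -c:
    maps Euclidean vectors into hyperbolic space."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)
```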
[102] Versatile Video Tokenization with Generative 2D Gaussian Splatting
Zhenghao Chen, Zicong Chen, Lei Liu, Yiming Wu, Dong Xu
Main category: cs.CV
TL;DR: GVT introduces a Gaussian Video Transformer using 2D Gaussian Splatting for adaptive video tokenization, improving spatial and temporal efficiency.
Details
Motivation: Existing video tokenization methods lack versatility, over-encode low-information regions, and struggle with temporal redundancy.
Method: GVT uses Spatio-Temporal Gaussian Embedding (STGE) for adaptive spatial tokenization and Gaussian Set Partitioning (GSP) for separating static/dynamic content.
Result: GVT achieves state-of-the-art video reconstruction, outperforms MAGVIT-v2 in action recognition, and matches compression benchmarks.
Conclusion: GVT offers a versatile and efficient solution for video tokenization, excelling in reconstruction, recognition, and compression tasks.
Abstract: The video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid and patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher (resp., lower) rendering weights to regions with higher (resp., lower) information content during rasterization, but also improve generalization by avoiding per-video optimization. To enhance the temporal versatility, we introduce a Gaussian Set Partitioning (GSP) strategy that separates the 2D Gaussians into static and dynamic sets, which explicitly model static content shared across different time-steps and dynamic content specific to each time-step, enabling a compact representation. We primarily evaluate GVT on video reconstruction, while also assessing its performance on action recognition and compression using the UCF101, Kinetics, and DAVIS datasets. Extensive experiments demonstrate that GVT achieves state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.
[103] CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren, Xiaoming Liu
Main category: cs.CV
TL;DR: CHARM3R improves monocular 3D object detection under varying camera heights by averaging depth estimates, achieving a 45% performance boost.
Details
Motivation: Addressing the challenge of monocular 3D object detectors struggling with unseen camera heights.
Method: Systematic analysis of camera height impact, proposing CHARM3R to average depth estimates for robustness.
Result: CHARM3R improves generalization by over 45%, achieving state-of-the-art performance.
Conclusion: Averaging depth estimates effectively mitigates camera height variations, enhancing detector robustness.
Abstract: Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plücker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in the mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose the Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than 45%, achieving SoTA performance on the CARLA dataset. Code and models are available at https://github.com/abhi1kumar/CHARM3R
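Since the regressed and ground-based depth errors trend in opposite directions under height change, the mitigation reduces to averaging the two estimates so the biases partly cancel. A one-line sketch (tensor names hypothetical):

```python
import torch

def charm3r_style_depth(regressed: torch.Tensor, ground_based: torch.Tensor) -> torch.Tensor:
    """Average per-object depths from the regression head and the
    ground-plane-based head; their height-induced biases have opposite
    signs, so the mean is more height-robust. Shapes (N,), illustrative."""
    return 0.5 * (regressed + ground_based)
```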
[104] Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian
Main category: cs.CV
TL;DR: DeCLIP enhances CLIP by decoupling self-attention into content and context features, improving local discriminability and spatial consistency for open-vocabulary dense perception tasks.
Details
Motivation: Current dense perception tasks are limited by predefined categories, and VLMs like CLIP underperform due to poor local feature representation.
Method: DeCLIP decouples CLIP’s self-attention into content and context features, enhancing them with semantic correlations from VFMs and object integrity cues from diffusion models.
Result: DeCLIP achieves state-of-the-art performance in tasks like 2D/3D detection, segmentation, and 6D pose estimation.
Conclusion: DeCLIP provides a robust framework for open-vocabulary dense perception, outperforming existing methods.
Abstract: Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP’s image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP
[105] Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark
Lavisha Aggarwal, Vikas Bahirwani, Lin Li, Andrea Colaco
Main category: cs.CV
TL;DR: The paper introduces HowToDIV, a dataset created by transforming instructional videos into task-guidance dialogues using large language models, offering a cost-effective alternative to human-assisted data collection.
Details
Motivation: There is a lack of dialogue-video datasets for real-world task assistance, despite the need for expert knowledge in complex tasks.
Method: The approach automatically converts single-person instructional videos into two-person dialogues, aligning them with fine-grained steps and video clips using large language models.
Result: The HowToDIV dataset includes 507 conversations, 6636 QA pairs, and 24 hours of video clips across diverse tasks, with baseline benchmarks set using the Gemma-3 model.
Conclusion: The work provides a scalable solution for creating task-guidance dialogue datasets and establishes a foundation for future research in procedural-task assistance.
Abstract: Many everyday tasks, ranging from fixing appliances and cooking recipes to car maintenance, require expert knowledge, especially when tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded in real-world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues, aligned with fine-grained steps and video clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting. Each session includes a multi-turn conversation where an expert teaches a novice user how to perform a task step by step, while observing the user’s surroundings through a camera- and microphone-equipped wearable device. We establish baseline benchmark performance on the HowToDIV dataset using the Gemma-3 model, for future research on this new task of dialogues for procedural-task assistance.
[106] UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
Jiajin Guan, Haibo Mei, Bonan Zhang, Dan Liu, Yuanshuang Fu, Yue Zhang
Main category: cs.CV
TL;DR: UAV-VL-R1 is a lightweight vision-language model for aerial imagery, outperforming larger models with hybrid training (SFT + RL) and a new dataset (HRVQA-VL).
Details
Motivation: General-purpose VLMs struggle with UAV imagery due to high resolution, complex semantics, and real-time needs. UAV-VL-R1 addresses these gaps.
Method: Hybrid training: supervised fine-tuning (SFT) + multi-stage RL (GRPO algorithm). Uses rule-guided rewards and policy alignment.
Result: 48.17% higher zero-shot accuracy than Qwen2-VL-2B-Instruct; outperforms 72B variant. Efficient memory usage (3.9GB FP16, 2.5GB INT8).
Conclusion: UAV-VL-R1 is effective for aerial reasoning, balancing performance and efficiency. GRPO enhances logical flexibility, compensating for SFT’s limitations.
Abstract: Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
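GRPO's group-relative step is compact enough to sketch: each of the G responses sampled for a prompt is scored against its own group's reward statistics, removing the need for a learned value function. A minimal sketch under that standard formulation (not the paper's training code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rule-guided rewards for one prompt's sampled group.
    Returns group-normalized advantages used to weight the policy update."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```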
[107] Vision-Language Models display a strong gender bias
Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Main category: cs.CV
TL;DR: The paper investigates gender-linked associations in vision-language models (VLM) by analyzing embeddings of face images and occupational/activity phrases, revealing subtle biases not captured by standard metrics.
Details
Motivation: To uncover and quantify subtle gender biases in VLMs, which align images and text but may inadvertently encode stereotypes.
Method: Uses a dataset of 220 face images (split by perceived gender) and 150 statements across six labor categories. Measures associations via cosine similarity between embeddings, with bootstrap confidence intervals and a null model for validation.
Result: Provides a detailed map of gender associations in VLMs, including uncertainty estimates and a robust evaluation framework for bias.
Conclusion: Highlights the presence of gender biases in VLMs and proposes a method to systematically evaluate and address them.
Abstract: Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.
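The association score and its bootstrap interval are fully specified by the abstract; with unit-norm embeddings, cosine similarity reduces to a dot product. A direct sketch (array names illustrative):

```python
import numpy as np

def association_score(text_emb: np.ndarray,
                      male_embs: np.ndarray,
                      female_embs: np.ndarray) -> float:
    """Mean cosine similarity to the male image set minus the female set;
    positive values indicate a stronger male association. All inputs are
    assumed unit-norm: text_emb (d,), image sets (n, d)."""
    return float((male_embs @ text_emb).mean() - (female_embs @ text_emb).mean())

def bootstrap_ci(text_emb: np.ndarray, image_embs: np.ndarray,
                 n_boot: int = 1000, seed: int = 0) -> np.ndarray:
    """Resample images within one gender group for a 95% CI on the mean
    similarity, mirroring the within-group bootstrap in the abstract."""
    rng = np.random.default_rng(seed)
    sims = image_embs @ text_emb                       # (n,)
    idx = rng.integers(0, len(sims), size=(n_boot, len(sims)))
    return np.percentile(sims[idx].mean(axis=1), [2.5, 97.5])
```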
[108] A Coarse-to-Fine Human Pose Estimation Method based on Two-stage Distillation and Progressive Graph Neural Network
Zhangjian Ji, Wenjin Zhang, Shaotong Qiao, Kai Feng, Yuhua Qian
Main category: cs.CV
TL;DR: A novel two-stage knowledge distillation framework improves lightweight human pose estimation by leveraging structural and contextual joint information.
Details
Motivation: Existing pose estimation methods are computationally heavy; knowledge distillation can transfer accuracy to lightweight models but lacks joint contextual exploration.
Method: Proposes a coarse-to-fine two-stage distillation: the first stage uses a joint structure loss for semantic knowledge transfer; the second stage refines poses with an Image-Guided Progressive GCN.
Result: Outperforms state-of-the-art methods on COCO keypoint and CrowdPose datasets, with notable gains on complex CrowdPose.
Conclusion: The framework effectively balances accuracy and efficiency, advancing lightweight pose estimation.
Abstract: Human pose estimation has been widely applied in human-centric understanding and generation, but most existing state-of-the-art human pose estimation methods require heavy computational resources for accurate predictions. One feasible way to obtain an accurate, robust yet lightweight human pose estimator is to transfer pose knowledge from a powerful teacher model to a less-parameterized student model by knowledge distillation. However, the traditional knowledge distillation framework does not fully explore the contextual information among human joints. Thus, in this paper, we propose a novel coarse-to-fine two-stage knowledge distillation framework for human pose estimation. In the first-stage distillation, we introduce a human joints structure loss to mine the structural information among human joints and transfer high-level semantic knowledge from the teacher model to the student model. In the second-stage distillation, we utilize an Image-Guided Progressive Graph Convolutional Network (IGP-GCN) to refine the initial human pose obtained from the first-stage distillation, and supervise the training of the IGP-GCN in a progressive way using the final output pose of the teacher model. Extensive experiments on the benchmark COCO keypoint and CrowdPose datasets show that our proposed method performs favorably against many existing state-of-the-art human pose estimation methods, especially on the more complex CrowdPose dataset, where the performance improvement of our model is more significant.
[109] Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering
Jun Li, Kai Li, Shaoguo Liu, Tingting Gao
Main category: cs.CV
TL;DR: Proposes PMTFR framework for Composed Image Retrieval (CIR), enhancing visual understanding and refining retrieval scores without additional training.
Details
Motivation: Addresses challenges in CIR, like joint understanding of images and text, and limitations of existing methods requiring extra training or complex prompts.
Method: Uses a Pyramid Matching Model with a Pyramid Patcher for granular visual understanding and Training-Free Refinement inspired by representation engineering.
Result: PMTFR outperforms state-of-the-art methods in supervised CIR tasks.
Conclusion: The framework effectively improves CIR performance without additional training, with code to be made public.
Abstract: Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a modified textual instruction to find relevant target images. Some existing methods attempt to use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited – compressing visual information into text or relying on elaborate prompt designs. Moreover, existing works only utilize CoT for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called the Pyramid Patcher, we enhance the Pyramid Matching Model’s understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into the LVLMs. This approach allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
[110] A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving
Jialin Li, Shuqi Wu, Ning Wang
Main category: cs.CV
TL;DR: A lightweight Uncertainty Modal Modeling (UMM) framework is proposed for pedestrian ReID in autonomous driving, addressing missing modalities and computational constraints.
Details
Motivation: Challenges in ReID due to uncertain/missing input modalities and computational limits of pre-trained models.
Method: UMM integrates a multimodal token mapper, synthetic modality augmentation, and a cross-modal interactive learner, leveraging CLIP for efficient fusion.
Result: UMM achieves robustness, generalization, and efficiency under uncertain modality conditions.
Conclusion: UMM offers a scalable, practical solution for pedestrian ReID in autonomous driving.
Abstract: Re-Identification (ReID) is a critical technology in intelligent perception systems, especially within autonomous driving, where onboard cameras must identify pedestrians across views and time in real-time to support safe navigation and trajectory prediction. However, the presence of uncertain or missing input modalities, such as RGB, infrared, sketches, or textual descriptions, poses significant challenges to conventional ReID approaches. While large-scale pre-trained models offer strong multimodal semantic modeling capabilities, their computational overhead limits practical deployment in resource-constrained environments. To address these challenges, we propose a lightweight Uncertainty Modal Modeling (UMM) framework, which integrates a multimodal token mapper, a synthetic modality augmentation strategy, and a cross-modal cue interactive learner. Together, these components enable unified feature representation, mitigate the impact of missing modalities, and extract complementary information across different data types. Additionally, UMM leverages CLIP’s vision-language alignment ability to fuse multimodal inputs efficiently without extensive fine-tuning. Experimental results demonstrate that UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions, offering a scalable and practical solution for pedestrian re-identification in autonomous driving scenarios.
[111] FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
MengChao Wang, Qiang Wang, Fan Jiang, Mu Xu
Main category: cs.CV
TL;DR: The paper introduces Talking-Critic and TLPO to improve audio-driven portrait animation by aligning with fine-grained human preferences, supported by a new dataset, Talking-NSQ.
Details
Motivation: Existing methods fail to align with fine-grained human preferences due to conflicting objectives and lack of annotated datasets.
Method: Proposes Talking-Critic (reward model), curates Talking-NSQ (dataset), and introduces TLPO (framework for preference optimization).
Result: Talking-Critic outperforms existing methods, and TLPO improves lip-sync, motion, and visual quality.
Conclusion: The proposed methods effectively align with human preferences, enhancing portrait animation performance.
Abstract: Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Ours project page: https://fantasy-amap.github.io/fantasy-talking2/
[112] Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds
Pei He, Lingling Li, Licheng Jiao, Ronghua Shang, Fang Liu, Shuang Wang, Xu Liu, Wenping Ma
Main category: cs.CV
TL;DR: A framework for domain generalization in 3D segmentation using category-level geometry learning to improve model generalization by focusing on invariant geometric features.
Details
Motivation: Addressing the challenge of domain shift in 3D segmentation by focusing on category-level geometric patterns, which current methods overlook.
Method: Proposes Category-level Geometry Embedding (CGE) for fine-grained geometric properties and Geometric Consistent Learning (GCL) for aligning embeddings and simulating latent distributions.
Result: Achieves competitive segmentation accuracy compared to state-of-the-art domain generalized point cloud methods.
Conclusion: The framework effectively improves generalization by leveraging category-level geometric features and alignment.
Abstract: Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state-of-the-art domain generalized point cloud methods.
[113] Probing the Representational Power of Sparse Autoencoders in Vision Models
Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng
Main category: cs.CV
TL;DR: SAEs are evaluated for vision models, showing semantic meaning, improved generalization, and controllable generation across architectures.
Details
Motivation: SAEs are popular for interpreting LLMs but understudied in vision. This work explores their potential in visual tasks.
Method: Extensive evaluation of SAEs on vision models, including embedding models, multi-modal LLMs, and diffusion models.
Result: SAE features are meaningful, improve generalization, and enable control in vision tasks. They reveal shared representations in multi-modal models.
Conclusion: SAEs show strong potential for interpretability, generalization, and steerability in vision, laying a foundation for future research.
Abstract: Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation of the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LLMs and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential for improving interpretability, generalization, and steerability in the visual domain.
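For readers unfamiliar with the architecture being evaluated, a sparse autoencoder is small enough to write out: an overcomplete dictionary with a ReLU bottleneck, trained to reconstruct activations under an L1 sparsity penalty. A minimal generic sketch (hyperparameters illustrative):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic SAE trained on model activations: d_hidden >> d_model gives an
    overcomplete feature dictionary; ReLU + L1 encourage sparse codes."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(feats), feats

def sae_loss(x, x_hat, feats, l1_coeff: float = 1e-3):
    # reconstruction error plus sparsity penalty on the feature activations
    return ((x - x_hat) ** 2).mean() + l1_coeff * feats.abs().mean()
```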
[114] Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction
Muzammil Khan, Enzo Kerkhof, Matteo Fusaglia, Koert Kuhlmann, Theo Ruers, Françoise J. Siepel
Main category: cs.CV
TL;DR: A unified framework for monocular endoscopic tissue reconstruction integrates scale-aware depth prediction and perceptual refinement to address challenges like depth ambiguity and tissue deformation, outperforming state-of-the-art methods.
Details
Motivation: Enhancing monocular minimally invasive surgery by improving endoscope pose estimation and 3D tissue reconstruction despite challenges like depth ambiguity and tissue deformation.
Method: Combines MAPIS-Depth (using Depth Pro and Depth Anything for depth prediction) with L-BFGS-B optimization, RAFT for pixel correspondences, and LPIPS for perceptual refinement. WEMA-RTDL optimizes registration, followed by volumetric fusion for 3D mesh extraction.
Result: Demonstrates robustness and superiority over state-of-the-art methods on HEVD and SCARED datasets.
Conclusion: The framework effectively addresses key challenges in monocular endoscopic reconstruction, offering improved accuracy and spatial awareness for surgical procedures.
Abstract: Accurate endoscope pose estimation and 3D tissue surface reconstruction significantly enhances monocular minimally invasive surgical procedures by enabling accurate navigation and improved spatial awareness. However, monocular endoscope pose estimation and tissue reconstruction face persistent challenges, including depth ambiguity, physiological tissue deformation, inconsistent endoscope motion, limited texture fidelity, and a restricted field of view. To overcome these limitations, a unified framework for monocular endoscopic tissue reconstruction that integrates scale-aware depth prediction with temporally-constrained perceptual refinement is presented. This framework incorporates a novel MAPIS-Depth module, which leverages Depth Pro for robust initialisation and Depth Anything for efficient per-frame depth prediction, in conjunction with L-BFGS-B optimisation, to generate pseudo-metric depth estimates. These estimates are temporally refined by computing pixel correspondences using RAFT and adaptively blending flow-warped frames based on LPIPS perceptual similarity, thereby reducing artefacts arising from physiological tissue deformation and motion. To ensure accurate registration of the synthesised pseudo-RGBD frames from MAPIS-Depth, a novel WEMA-RTDL module is integrated, optimising both rotation and translation. Finally, truncated signed distance function-based volumetric fusion and marching cubes are applied to extract a comprehensive 3D surface mesh. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework’s robustness and superiority over state-of-the-art methods.
[115] Leveraging the RETFound foundation model for optic disc segmentation in retinal images
Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco
Main category: cs.CV
TL;DR: RETFound, a foundation model for retinal images, is adapted for optic disc segmentation, outperforming state-of-the-art baselines with minimal task-specific training.
Details
Motivation: To explore RETFound's potential beyond disease diagnosis by applying it to optic disc segmentation, a foundational task in retinal image analysis.
Method: Adapt RETFound for optic disc segmentation by training a head with a small number of task-specific examples.
Result: Achieves ~96% Dice score across multiple datasets, excelling in internal verification, domain generalization, and adaptation.
Conclusion: RETFound demonstrates versatility and superior performance, challenging the need for task-specific architectures.
Abstract: RETFound is a well-known foundation model (FM) developed for fundus camera and optical coherence tomography images. It has shown promising performance across multiple datasets in diagnosing diseases, both eye-specific and systemic, from retinal images. However, to the best of our knowledge, it has not been used for other tasks. We present the first adaptation of RETFound for optic disc segmentation, a ubiquitous and foundational task in retinal image analysis. The resulting segmentation system outperforms state-of-the-art, segmentation-specific baseline networks after training a head with only a very modest number of task-specific examples. We report and discuss results with four public datasets, IDRID, Drishti-GS, RIM-ONE-r3, and REFUGE, and a private dataset, GoDARTS, achieving about 96% Dice consistently across all datasets. Overall, our method obtains excellent performance in internal verification, domain generalization and domain adaptation, and exceeds most of the state-of-the-art baseline results. We discuss the results in the framework of the debate about FMs as alternatives to task-specific architectures. The code is available at: [link to be added after the paper is accepted]
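The adaptation recipe is the now-standard one for foundation models: freeze the encoder, train only a light segmentation head, and evaluate with Dice. A hedged sketch of that recipe (module and optimizer choices are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn

def make_head_optimizer(backbone: nn.Module, head: nn.Module,
                        lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze the foundation-model encoder; optimize only the task head."""
    for p in backbone.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(head.parameters(), lr=lr)

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Dice coefficient on binary masks; ~0.96 corresponds to the reported
    'about 96% Dice'."""
    inter = (pred * target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```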
[116] Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D’Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal
Main category: cs.CV
TL;DR: The paper introduces a reward-guided decoding method for Multimodal Large Language Models (MLLMs) to improve visual grounding, offering dynamic control over precision, recall, and computational trade-offs.
Details
Motivation: To adapt MLLMs for diverse user needs by enhancing their visual grounding capabilities through controlled decoding.
Method: Builds two reward models for object precision and recall, guiding the MLLM’s decoding process to allow dynamic trade-offs and control over computational breadth.
Result: Demonstrates significant controllability over MLLM inference and outperforms existing hallucination mitigation methods on benchmarks.
Conclusion: The proposed method effectively enhances MLLM adaptability and performance in visual grounding tasks.
Abstract: As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM’s decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model’s output. Our approach enables on-the-fly controllability of an MLLM’s inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
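The two control axes described above can be made concrete with a small scoring rule over candidate continuations: one weight trades precision against recall, another trades likelihood against reward, and the candidate count sets the test-time compute. A hedged sketch (the paper's actual guidance rule may differ):

```python
import torch

def guided_scores(logprobs: torch.Tensor,
                  precision_r: torch.Tensor,
                  recall_r: torch.Tensor,
                  alpha: float = 0.5,
                  beta: float = 1.0) -> torch.Tensor:
    """Score k candidate continuations: model log-likelihood plus a weighted
    mix of two reward models. alpha trades object precision for recall;
    beta trades likelihood for reward. All inputs have shape (k,)."""
    return logprobs + beta * (alpha * precision_r + (1.0 - alpha) * recall_r)

# Keeping the top-scoring candidates (a larger candidate pool = wider search
# = more test-time compute) realizes the second control axis in the abstract.
```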
[117] TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation
Yilin Mi, Qixin Yan, Zheng-Peng Duan, Chunle Guo, Hubery Yin, Hao Liu, Chen Li, Chongyi Li
Main category: cs.CV
TL;DR: TimeMachine is a diffusion-based framework for fine-grained facial age editing, preserving identity by separating age and identity features using multi-cross attention and an Age Classifier Guidance module.
Details
Motivation: Fine-grained age editing while preserving identity is challenging due to intertwined features and lack of high-quality datasets.
Method: Uses a diffusion-based framework with multi-cross attention for feature separation and an Age Classifier Guidance module for latent-space age prediction. Introduces the HFFA dataset for training.
Result: Achieves state-of-the-art performance in fine-grained age editing with preserved identity consistency.
Conclusion: TimeMachine effectively addresses the challenges of age editing with its novel framework and dataset.
Abstract: With the advancement of generative models, facial image editing has made significant progress. However, achieving fine-grained age editing while preserving personal identity remains a challenging task. In this paper, we propose TimeMachine, a novel diffusion-based framework that achieves accurate age editing while keeping identity features unchanged. To enable fine-grained age editing, we inject high-precision age information into the multi-cross attention module, which explicitly separates age-related and identity-related features. This design facilitates more accurate disentanglement of age attributes, thereby allowing precise and controllable manipulation of facial aging. Furthermore, we propose an Age Classifier Guidance (ACG) module that predicts age directly in the latent space, instead of performing denoising image reconstruction during training. By employing a lightweight module to incorporate age constraints, this design enhances age editing accuracy while only modestly increasing training cost. Additionally, to address the lack of large-scale, high-quality facial age datasets, we construct the HFFA dataset (High-quality Fine-grained Facial-Age dataset), which contains one million high-resolution images labeled with identity and facial attributes. Experimental results demonstrate that TimeMachine achieves state-of-the-art performance in fine-grained age editing while preserving identity consistency.
[118] Does the Skeleton-Recall Loss Really Work?
Devansh Arora, Nitin Kumar, Sukrit Gupta
Main category: cs.CV
TL;DR: The paper critically evaluates Skeleton Recall Loss (SRL) for tubular structure segmentation, finding it underperforms traditional baselines despite claims of state-of-the-art results.
Details
Motivation: To analyze the effectiveness of topology-preserving loss functions like SRL in segmenting thin tubular structures, given their claimed advantages over traditional methods.
Method: Theoretical gradient analysis of SRL and empirical comparison on tubular datasets, including those from the original SRL work and additional datasets.
Result: SRL-based models did not outperform traditional baseline models, contradicting earlier claims.
Conclusion: The study highlights limitations of topology-based loss functions, providing insights for improving segmentation models for complex tubular structures.
Abstract: Image segmentation is an important and widely performed task in computer vision. Accomplishing effective image segmentation in diverse settings often requires custom model architectures and loss functions. One family of approaches specialized for segmenting thin tubular structures relies on topology-preservation-based loss functions. These models often utilize a pixel skeletonization process claimed to generate more precise segmentation masks of thin tubes and better capture the structures that other models often miss. One such model, Skeleton Recall Loss (SRL), proposed by Kirchhoff et al. (2024), was stated to produce state-of-the-art results on benchmark tubular datasets. In this work, we performed a theoretical analysis of the gradients of the SRL loss. Upon comparing the performance of the proposed method on some of the tubular datasets used in the original work, along with some additional datasets, we found that the performance of SRL-based segmentation models did not exceed that of traditional baseline models. By providing both a theoretical explanation and empirical evidence, this work critically evaluates the limitations of topology-based loss functions, offering valuable insights for researchers aiming to develop more effective segmentation models for complex tubular structures.
[119] Hyperspectral vs. RGB for Pedestrian Segmentation in Urban Driving Scenes: A Comparative Study
Jiarong Li, Imad Ali Shah, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan
Main category: cs.CV
TL;DR: The study explores hyperspectral imaging (HSI) for better pedestrian segmentation in automotive systems, outperforming RGB with methods like CSNR-JMIM.
Details
Motivation: Address safety challenges in pedestrian segmentation due to metamerism in RGB imaging by leveraging HSI.
Method: Compared RGB with two HSI dimensionality-reduction methods (PCA and CSNR-JMIM) using three segmentation models (U-Net, DeepLabV3+, SegFormer).
Result: CSNR-JMIM improved pedestrian segmentation by 1.44% IoU and 2.18% F1-score, reducing false positives.
Conclusion: Optimal HSI band selection enhances pedestrian segmentation, proving valuable for safety-critical automotive applications.
Abstract: Pedestrian segmentation in automotive perception systems faces critical safety challenges due to metamerism in RGB imaging, where pedestrians and backgrounds appear visually indistinguishable. This study investigates the potential of hyperspectral imaging (HSI) for enhanced pedestrian segmentation in urban driving scenarios using the Hyperspectral City v2 (H-City) dataset. We compared standard RGB against two dimensionality-reduction approaches that convert 128-channel HSI data into three-channel representations: Principal Component Analysis (PCA) and optimal band selection using Contrast Signal-to-Noise Ratio with Joint Mutual Information Maximization (CSNR-JMIM). Three semantic segmentation models were evaluated: U-Net, DeepLabV3+, and SegFormer. CSNR-JMIM consistently outperformed RGB, with average improvements of 1.44% in Intersection over Union (IoU) and 2.18% in F1-score for pedestrian segmentation. Rider segmentation showed similar gains of 1.43% IoU and 2.25% F1-score. These improvements result from the enhanced spectral discrimination of optimally selected HSI bands, which effectively reduces false positives. This study demonstrates robust pedestrian segmentation through optimal HSI band selection, showing significant potential for safety-critical automotive applications.
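CSNR-JMIM reduces 128 bands to three; while the scoring criterion is specific to the paper, the greedy selection loop around it is generic. A hedged skeleton where `score_fn` is a placeholder standing in for the contrast-SNR plus joint-mutual-information criterion:

```python
import numpy as np

def greedy_band_selection(bands: np.ndarray, labels: np.ndarray,
                          score_fn, k: int = 3) -> list:
    """bands: (n_pixels, n_bands) spectra; labels: (n_pixels,) classes.
    Greedily add the band that maximizes score_fn jointly with the bands
    already selected; score_fn is a stand-in for CSNR-JMIM."""
    selected, remaining = [], list(range(bands.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda b: score_fn(bands, labels, selected, b))
        selected.append(best)
        remaining.remove(best)
    return selected
```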
[120] G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration
Ramil Khafizov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev
Main category: cs.CV
TL;DR: G-CUT3R enhances 3D scene reconstruction by integrating prior data like depth or camera info, outperforming existing methods.
Details
Motivation: Existing feed-forward methods rely only on images, missing out on commonly available auxiliary data.
Method: Modifies CUT3R with dedicated encoders for each data type, fusing features via zero convolution.
Result: Shows significant improvements in benchmarks for 3D reconstruction and multi-view tasks.
Conclusion: G-CUT3R effectively uses prior data, maintaining flexibility with varying inputs.
Abstract: We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility with varying input modalities.
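"Zero convolution" here follows the ControlNet convention: a 1x1 convolution initialized to zero, so the auxiliary branch contributes nothing at the start of training and its influence grows only as gradients flow. A minimal sketch of that fusion pattern (the exact G-CUT3R wiring is an assumption):

```python
import torch
import torch.nn as nn

class ZeroConvFusion(nn.Module):
    """Fuse auxiliary-prior features into RGB features through a 1x1 conv
    whose weights and bias start at zero (ControlNet-style), preserving the
    base model's behavior at initialization."""
    def __init__(self, channels: int):
        super().__init__()
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, rgb_feats: torch.Tensor, prior_feats: torch.Tensor):
        # both (B, C, H, W); at init this returns rgb_feats unchanged
        return rgb_feats + self.zero_conv(prior_feats)
```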
[121] Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
Weijia Liu, Jiuxin Cao, Bo Miao, Zhiheng Fu, Xuelin Zhu, Jiawei Ge, Bo Liu, Mehwish Nasim, Ajmal Mian
Main category: cs.CV
TL;DR: The paper proposes a denoise-then-retrieve paradigm (DRNet) for Video Moment Retrieval (VMR), filtering irrelevant clips to improve multimodal alignment and retrieval accuracy.
Details
Motivation: Current VMR methods encode all video clips, including irrelevant ones, disrupting alignment and optimization.
Method: Introduces DRNet with Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules to filter noise and align purified video representations with text.
Result: Outperforms state-of-the-art methods on Charades-STA and QVHighlights datasets.
Conclusion: The denoise-then-retrieve paradigm is effective, adaptable, and can enhance existing VMR models.
Abstract: Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
[122] Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking
Haonan Zhang, Xinyao Wang, Boxi Wu, Tu Zheng, Wang Yunhua, Zheng Yang
Main category: cs.CV
TL;DR: The paper introduces DSC-Track, a 3D multi-object tracking method for autonomous driving that leverages stable spatial patterns and cue-consistency to improve accuracy in crowded environments.
Details
Motivation: Existing methods struggle in crowded scenes due to overlooked geometric relationships and interference from irrelevant objects, necessitating a focus on stable spatial cues.
Method: The proposed DSC-Track uses a spatiotemporal encoder with Point Pair Features (PPF), a cue-consistency transformer module, and a dynamic update mechanism for robust tracking.
Result: The method achieves state-of-the-art performance, with 73.2% and 70.3% AMOTA on nuScenes validation and test sets.
Conclusion: DSC-Track effectively addresses challenges in 3D multi-object tracking by emphasizing cue-consistency and dynamic updates, demonstrating superior performance.
Abstract: 3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.
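Point Pair Features, the building block of the encoder, are a standard 4D descriptor for a pair of oriented points: the pair distance plus three angles among the normals and the connecting vector. A small self-contained example:

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4D Point Pair Feature for two oriented points:
    (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)), with d = p2 - p1."""
    d = p2 - p1
    dist = np.linalg.norm(d)

    def angle(u, v):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        return np.arccos(np.clip(cos, -1.0, 1.0))

    return np.array([dist, angle(n1, d), angle(n2, d), angle(n1, n2)])

f = point_pair_feature(np.array([0.0, 0, 0]), np.array([0.0, 0, 1]),
                       np.array([1.0, 0, 0]), np.array([0.0, 1, 0]))
print(f)  # [1.0, pi/2, pi/2, pi/2]: unit distance, all angles 90 degrees
```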
[123] Noise Matters: Optimizing Matching Noise for Diffusion Classifiers
Yanghao Wang, Long Chen
Main category: cs.CV
TL;DR: The paper introduces NoOp, a noise optimization method for Diffusion Classifiers (DCs) to address noise instability by identifying and using ‘good noises’ that meet Frequency and Spatial Matching principles.
Details
Motivation: Existing DCs suffer from noise instability, requiring extensive ensembling for stable performance, which slows classification. The study aims to identify and optimize 'good noises' to improve stability and speed.
Method: NoOp optimizes dataset-specific noise for Frequency Matching and trains a Meta-Network for Spatial Matching to generate image-specific noise offsets, replacing random noise in DCs.
Result: Extensive experiments show NoOp effectively improves DC performance by stabilizing noise, reducing the need for ensembling.
Conclusion: NoOp successfully addresses noise instability in DCs, enhancing classification speed and reliability by leveraging optimized noises.
Abstract: Although today’s pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling Gaussian noise, DC utilizes the differences of denoising effects under different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different randomly sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we first explore the role of noise in DC and conclude that there are some "good noises" that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Regarding both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep t, optimize one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that takes an image as input and outputs an image-specific noise offset. The sum of the optimized noise and the noise offset is used in DC to replace random noise. Extensive ablations on various datasets demonstrate the effectiveness of NoOp.
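A minimal sketch of the dataset-level noise optimization (the Frequency Matching stage, in spirit), under stated assumptions: eps_theta stands in for a frozen pretrained denoiser, all shapes are illustrative, and a toy denoiser is supplied so the loop actually runs; the Meta-Network for Spatial Matching is omitted:

```python
import torch

def optimize_dataset_noise(eps_theta, images, labels, t, alpha_bar_t, steps=200):
    """Learn one shared noise tensor that the frozen denoiser predicts
    well at a fixed timestep t, replacing random noise at test time."""
    noise = torch.randn(1, *images.shape[1:], requires_grad=True)
    opt = torch.optim.Adam([noise], lr=1e-2)
    for _ in range(steps):
        # Forward diffusion with the *learned* noise instead of a fresh sample
        x_t = alpha_bar_t ** 0.5 * images + (1 - alpha_bar_t) ** 0.5 * noise
        loss = ((eps_theta(x_t, t, labels) - noise) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()

# Toy stand-in denoiser so the sketch runs end to end:
toy_eps = lambda x_t, t, y: 0.5 * x_t
imgs = torch.randn(8, 3, 16, 16)
ys = torch.zeros(8, dtype=torch.long)
good_noise = optimize_dataset_noise(toy_eps, imgs, ys, t=500, alpha_bar_t=0.3)
```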
[124] GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition
Md Asgor Hossain Reaj, Rajan Das Gupta, Md Yeasin Rahat, Nafiz Fahad, Md Jawadul Hasan, Tze Hui Liew
Main category: cs.CV
TL;DR: GANDiff FR is a synthetic framework combining StyleGAN3 and diffusion models to control demographic and environmental factors, enabling precise bias measurement and reduction in face recognition systems.
Details
Motivation: To address bias in face recognition by creating a reproducible and rigorous method for measuring and reducing bias under controlled conditions.
Method: Unifies StyleGAN3 for identity-preserving generation with diffusion models for attribute control, synthesizing 10,000 demographically balanced faces. Evaluates bias in ArcFace, CosFace, and AdaFace.
Result: AdaFace reduces inter-group TPR disparity by 60%, with illumination accounting for 42% of residual bias. Strong synthetic-to-real transfer (r ≈ 0.85) is confirmed.
Conclusion: GANDiff FR provides a reproducible, regulation-aligned standard for fairness auditing, despite a 20% computational overhead, and releases code/data for transparency.
Abstract: We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. GANDiff FR unifies StyleGAN3-based identity-preserving generation with diffusion-based attribute control, enabling fine-grained manipulation of pose (around 30 degrees), illumination (four directions), and expression (five levels) under ceteris paribus conditions. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection (98.2%) and human review (89%) to isolate and quantify bias drivers. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60% (2.5% vs. 6.3%), with illumination accounting for 42% of residual bias. Cross-dataset evaluation on RFW, BUPT, and CASIA WebFace confirms strong synthetic-to-real transfer (r ≈ 0.85). Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants, establishing a reproducible, regulation-aligned (EU AI Act) standard for fairness auditing. Code and data are released to support transparent, scalable bias evaluation.
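The headline fairness numbers are spreads of per-group true positive rates at a matched operating point. A hedged sketch of how such a disparity could be computed (the input layout, comparison scores with 0/1 mated labels and a group id per comparison, is an assumption):

```python
import numpy as np

def tpr_disparity(scores, labels, groups, fpr_target=1e-3):
    """Inter-group TPR disparity at a matched operating point: pick one
    global threshold hitting the target FPR, then report the spread of
    per-group true positive rates (the style of the 2.5% vs. 6.3%
    figures quoted above)."""
    neg = np.sort(scores[labels == 0])[::-1]       # non-mated scores, descending
    thr = neg[int(fpr_target * len(neg))]          # global threshold at target FPR
    tprs = {g: (scores[(labels == 1) & (groups == g)] >= thr).mean()
            for g in np.unique(groups)}
    return max(tprs.values()) - min(tprs.values()), tprs

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2, 1, 4000), rng.normal(0, 1, 4000)])
labels = np.array([1] * 4000 + [0] * 4000)         # mated vs. non-mated pairs
groups = rng.integers(0, 5, 8000)                  # five demographic cohorts
print(tpr_disparity(scores, labels, groups))
```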
[125] Index-Aligned Query Distillation for Transformer-based Incremental Object Detection
Mingxiao Ma, Shunyao Zhu, Guoliang Kang
Main category: cs.CV
TL;DR: The paper introduces Index-Aligned Query Distillation (IAQD) to address catastrophic knowledge forgetting in transformer-based incremental object detection (IOD), outperforming previous methods.
Details
Motivation: To mitigate knowledge forgetting in IOD tasks when using transformer-based models, as traditional methods like Hungarian Matching fail to preserve old category knowledge.
Method: Proposes IAQD, which aligns queries by index instead of Hungarian Matching and focuses on critical queries for old categories, preserving their semantic and spatial encoding.
Result: IAQD achieves state-of-the-art performance by effectively reducing knowledge forgetting in IOD tasks.
Conclusion: IAQD is a superior approach for transformer-based IOD, successfully preserving old knowledge while learning new categories.
Abstract: Incremental object detection (IOD) aims to continuously expand the capability of a model to detect novel categories while preserving its performance on previously learned ones. When adopting a transformer-based detection model to perform IOD, catastrophic knowledge forgetting may inevitably occur, meaning the detection performance on previously learned categories may severely degenerate. Previous typical methods mainly rely on knowledge distillation (KD) to mitigate the catastrophic knowledge forgetting of transformer-based detection models. Specifically, they utilize Hungarian Matching to build a correspondence between the queries of the last-phase and current-phase detection models and align the classifier and regressor outputs between matched queries to avoid knowledge forgetting. However, we observe that in IOD task, Hungarian Matching is not a good choice. With Hungarian Matching, the query of the current-phase model may match different queries of the last-phase model at different iterations during KD. As a result, the knowledge encoded in each query may be reshaped towards new categories, leading to the forgetting of previously encoded knowledge of old categories. Based on our observations, we propose a new distillation approach named Index-Aligned Query Distillation (IAQD) for transformer-based IOD. Beyond using Hungarian Matching, IAQD establishes a correspondence between queries of the previous and current phase models that have the same index. Moreover, we perform index-aligned distillation only on partial queries which are critical for the detection of previous categories. In this way, IAQD largely preserves the previous semantic and spatial encoding capabilities without interfering with the learning of new categories. Extensive experiments on representative benchmarks demonstrate that IAQD effectively mitigates knowledge forgetting, achieving new state-of-the-art performance.
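The core of IAQD reduces to distilling between queries that share an index, skipping Hungarian Matching so each query's role stays stable across iterations. A minimal sketch with illustrative shapes and a hypothetical choice of critical indices:

```python
import torch
import torch.nn.functional as F

def iaqd_loss(student_q, teacher_q, critical_idx):
    """Index-aligned query distillation (sketch): align the query
    embeddings that share the same index in the old (teacher) and new
    (student) model, restricted to the queries deemed critical for old
    categories. Shapes: (batch, num_queries, dim)."""
    return F.mse_loss(student_q[:, critical_idx],
                      teacher_q[:, critical_idx].detach())

student_q = torch.randn(2, 100, 256, requires_grad=True)
teacher_q = torch.randn(2, 100, 256)
critical = torch.arange(30)   # e.g. the 30 queries tied to old classes
loss = iaqd_loss(student_q, teacher_q, critical)
loss.backward()
```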
[126] Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation
Daniel Airinei, Elena Burceanu, Marius Leordeanu
Main category: cs.CV
TL;DR: The paper introduces a vision-based deep learning approach for indoor navigation, using a novel graph-based path generation method and a large annotated dataset from a shopping mall.
Details
Motivation: Indoor navigation is challenging due to poor GPS access and complex existing solutions. The paper aims to provide an efficient, real-time, and easily deployable alternative relying solely on visual input.
Method: The approach uses a deep learning model with explainable data augmentation and curriculum learning, trained on a novel large-scale dataset of annotated video footage from a shopping mall.
Result: The method avoids the need for special sensors, markers, or internet access, and includes a user-friendly Android application.
Conclusion: The paper presents a practical, vision-only solution for indoor navigation, with publicly available data, code, and demos.
Abstract: Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target from images captured by a mobile device. Our technical approach, based on a novel graph-based path generation method, combined with explainable data augmentation and curriculum learning, includes contributions that make the process of data collection, annotation and training as automatic, efficient, and robust as possible. On the practical side, we introduce a novel large-scale dataset, with video footage inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards different specific target destinations. Different from current methods, ours relies solely on vision, avoiding the need for special sensors, additional markers placed along the path, knowledge of the scene map, or internet access. We also created an easy-to-use application for Android, which we plan to make publicly available. We make all our data and code available, along with visual demos, on our project site.
[127] Cost-Effective Active Labeling for Data-Efficient Cervical Cell Classification
Yuanlin Liu, Zhihan Zhou, Mingqiang Wei, Youyi Song
Main category: cs.CV
TL;DR: Proposes active labeling for cost-efficient cervical cell classification by leveraging classifier uncertainty to select beneficial images for labeling.
Details
Motivation: Existing methods for cervical cell classification require representative human-labeled datasets, which are costly and often impractical to obtain.
Method: Active labeling uses classifier uncertainty to select the most beneficial unlabeled images for labeling, reducing human cost.
Result: The method effectively enhances the training dataset’s representativeness and reduces human labeling effort.
Conclusion: Active labeling is a cost-effective solution for data-efficient cervical cell classification.
Abstract: Information on the number and category of cervical cells is crucial for the diagnosis of cervical cancer. However, existing classification methods capable of automatically measuring this information require the training dataset to be representative, which consumes an expensive or even unaffordable human cost. We herein propose active labeling that enables us to construct a representative training dataset using a much smaller human cost for data-efficient cervical cell classification. This cost-effective method efficiently leverages the classifier’s uncertainty on the unlabeled cervical cell images to accurately select images that are most beneficial to label. With a fast estimation of the uncertainty, this new algorithm exhibits its validity and effectiveness in enhancing the representative ability of the constructed training dataset. The extensive empirical results confirm its efficacy again in navigating the usage of human cost, opening the avenue for data-efficient cervical cell classification.
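One common instantiation of uncertainty-driven selection is predictive entropy; the paper's fast uncertainty estimator may differ, but the selection step could look like this sketch:

```python
import numpy as np

def select_for_labeling(probs, k):
    """Pick the k unlabeled images the classifier is least sure about,
    using predictive entropy as the uncertainty score (a common choice;
    not necessarily the paper's exact estimator).
    probs: (num_images, num_classes) softmax outputs."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]        # indices of most uncertain images

probs = np.array([[0.98, 0.01, 0.01],      # confident -> skip
                  [0.34, 0.33, 0.33],      # uncertain -> label this one
                  [0.70, 0.20, 0.10]])
print(select_for_labeling(probs, k=1))     # -> [1]
```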
[128] Semantically Guided Adversarial Testing of Vision Models Using Language Models
Katarzyna Filus, Jorge M. Cruz-Duarte
Main category: cs.CV
TL;DR: The paper introduces a semantics-guided framework for selecting target labels in adversarial attacks on vision models, leveraging cross-modal knowledge from pretrained models like BERT, TinyLLAMA, and CLIP. It outperforms static methods like WordNet, especially for distant class relationships.
Details
Motivation: Existing methods for selecting target labels in adversarial attacks lack interpretability, reproducibility, or flexibility, relying on randomness or static resources. This work aims to address these limitations.
Method: The proposed framework uses pretrained language and vision-language models (BERT, TinyLLAMA, CLIP) to select semantically related labels for adversarial attacks, forming best- and worst-case scenarios.
Result: Experiments show the framework consistently provides practical adversarial targets, surpassing static lexical databases like WordNet, particularly for distant class relationships.
Conclusion: Pretrained models are suitable for creating interpretable, standardized, and scalable adversarial benchmarks across architectures and datasets.
Abstract: In targeted adversarial attacks on vision models, the selection of the target label is a critical yet often overlooked determinant of attack success. This target label corresponds to the class that the attacker aims to force the model to predict. Existing strategies typically rely on randomness, model predictions, or static semantic resources, limiting interpretability, reproducibility, or flexibility. This paper proposes a semantics-guided framework for adversarial target selection using cross-modal knowledge transfer from pretrained language and vision-language models. We evaluate several state-of-the-art models (BERT, TinyLLAMA, and CLIP) as similarity sources to select the most and least semantically related labels with respect to the ground truth, forming best- and worst-case adversarial scenarios. Our experiments on three vision models and five attack methods reveal that these models consistently yield practical adversarial targets and surpass static lexical databases, such as WordNet, particularly for distant class relationships. We also observe that static testing of target labels offers a preliminary, a priori assessment of the effectiveness of similarity sources. Our results corroborate the suitability of pretrained models for constructing interpretable, standardized, and scalable adversarial benchmarks across architectures and datasets.
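A sketch of the CLIP-based variant of target selection, using OpenAI's clip package; the prompt template and label set are illustrative, not taken from the paper:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Rank candidate target labels by semantic similarity to the ground-truth
# class, mirroring the best-/worst-case target selection idea.
device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
labels = ["tabby cat", "tiger cat", "golf ball", "space shuttle"]
tokens = clip.tokenize([f"a photo of a {l}" for l in labels]).to(device)

with torch.no_grad():
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)    # unit-normalize

sims = emb[0] @ emb[1:].T                # similarity of "tabby cat" to the rest
best = labels[1:][int(sims.argmax())]    # most related: easiest target
worst = labels[1:][int(sims.argmin())]   # least related: hardest target
print(best, worst)
```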
[129] Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models
Erez Meoded
Main category: cs.CV
TL;DR: The paper applies TrOCR, a transformer-based HTR model, to 16th-century Latin manuscripts, using targeted preprocessing and novel data augmentation methods to improve recognition accuracy.
Details
Motivation: Historical handwritten text recognition (HTR) faces challenges like scarce transcriptions, linguistic variation, and diverse handwriting styles, limiting digitization of archival documents.
Method: The study uses TrOCR, introduces four novel data augmentation techniques for historical handwriting, and evaluates ensemble learning approaches.
Result: The best single-model augmentation achieves a CER of 1.86, while a top-5 voting ensemble reduces CER to 1.60, significantly improving over previous results.
Conclusion: Domain-specific augmentations and ensemble strategies significantly advance HTR performance for historical manuscripts.
Abstract: Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60 - representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts.
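CER, the metric behind these numbers, is the character-level edit distance divided by the reference length; a self-contained implementation:

```python
def cer(pred: str, truth: str) -> float:
    """Character Error Rate = Levenshtein edit distance / reference length,
    computed with a single rolling row of the DP table."""
    m, n = len(pred), len(truth)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (pred[i - 1] != truth[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(cer("Gwalther", "Gvalther"))  # one substitution over 8 chars -> 0.125
```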
[130] HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model
Zhenhao Zhang, Hanqing Wang, Xiangyu Zeng, Ziyu Cheng, Jiaxin Liu, Haoyu Yan, Zhirui Liu, Kaiyang Ji, Tianxiang Gui, Ke Hu, Kangyi Chen, Yahao Fan, Mokai Pan
Main category: cs.CV
TL;DR: HOID-R1 is a novel HOI detection framework combining CoT-guided SFT and GRPO in RL, outperforming existing methods in benchmarks and open-world generalization.
Details
Motivation: Current HOI detection methods rely on large language models but lack 3D spatial understanding, limiting their effectiveness.
Method: Integrates CoT-guided SFT for reasoning and GRPO for policy optimization, with an MLLM-as-a-judge mechanism to reduce hallucinations.
Result: Achieves state-of-the-art performance on HOI benchmarks and excels in open-world generalization.
Conclusion: HOID-R1 addresses limitations of existing methods and demonstrates superior performance and generalization.
Abstract: Understanding and recognizing human-object interaction (HOI) is a pivotal application in AR/VR and robotics. Recent open-vocabulary HOI detection approaches depend exclusively on large language models for richer textual prompts, neglecting their inherent 3D spatial understanding capabilities. To address this shortcoming, we introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided supervised fine-tuning (SFT) with group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we initially apply SFT to imbue the model with essential reasoning capabilities, forcing the model to articulate its thought process in the output. Subsequently, we integrate GRPO to leverage multi-reward signals for policy optimization, thereby enhancing alignment across diverse modalities. To mitigate hallucinations in the CoT reasoning, we introduce an “MLLM-as-a-judge” mechanism that supervises the CoT outputs, further improving generalization. Extensive experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.
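GRPO's distinguishing step is computing advantages relative to a group of sampled responses instead of a learned value function; a minimal sketch of that normalization (the reward values are made up):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages as used in GRPO: normalize each sampled
    response's reward by the mean/std of its own group, so no critic
    network is needed. rewards: (num_groups, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

r = torch.tensor([[1.0, 0.0, 0.5, 0.5]])   # 4 rollouts for one prompt
print(grpo_advantages(r))                   # best rollout gets the largest advantage
```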
[131] Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition
Durgesh Mishra, Rishabh Uikey
Main category: cs.CV
TL;DR: A unified knowledge distillation approach for face recognition models combines instance-level and relational distillation, outperforming traditional methods and even surpassing teacher accuracy in some cases.
Details
Motivation: Traditional knowledge distillation methods fail to capture fine-grained details and relational structures, limiting performance in face recognition for edge devices.
Method: Proposes two novel loss functions: Instance-Level Embedding Distillation (dynamic hard mining) and Relation-Based Pairwise Similarity Distillation (memory bank and sample mining).
Result: Outperforms state-of-the-art methods on benchmark datasets; student models can surpass teacher accuracy with strong teachers.
Conclusion: The unified framework effectively aligns instance-level features and preserves relational structures, enhancing distillation for face recognition.
Abstract: Knowledge Distillation is crucial for optimizing face recognition models for deployment in computationally limited settings, such as edge devices. Traditional KD methods, such as Raw L2 Feature Distillation or Feature Consistency loss, often fail to capture both fine-grained instance-level details and complex relational structures, leading to suboptimal performance. We propose a unified approach that integrates two novel loss functions, Instance-Level Embedding Distillation and Relation-Based Pairwise Similarity Distillation. Instance-Level Embedding Distillation focuses on aligning individual feature embeddings by leveraging a dynamic hard mining strategy, thereby enhancing learning from challenging examples. Relation-Based Pairwise Similarity Distillation captures relational information through pairwise similarity relationships, employing a memory bank mechanism and a sample mining strategy. This unified framework ensures both effective instance-level alignment and preservation of geometric relationships between samples, leading to a more comprehensive distillation process. Our unified framework outperforms state-of-the-art distillation methods across multiple benchmark face recognition datasets, as demonstrated by extensive experimental evaluations. Interestingly, when using strong teacher networks compared to the student, our unified KD enables the student to even surpass the teacher’s accuracy.
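A stripped-down sketch of the relation-based term: match the student's pairwise similarity matrix to the teacher's, so sample-to-sample geometry survives even when embedding dimensions differ (the memory bank and sample mining from the paper are omitted):

```python
import torch
import torch.nn.functional as F

def pairwise_similarity_distillation(student_emb, teacher_emb):
    """Relation-based distillation (sketch): penalize the difference
    between student and teacher cosine-similarity matrices over a batch,
    preserving geometric relationships rather than absolute embeddings."""
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    return F.mse_loss(s @ s.T, (t @ t.T).detach())

student = torch.randn(32, 128, requires_grad=True)  # compact student embeddings
teacher = torch.randn(32, 512)                      # teacher dim may differ
loss = pairwise_similarity_distillation(student, teacher)
loss.backward()
```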
[132] RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator
Zhiming Liu, Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: RMFAT is a lightweight recurrent framework for efficient and temporally consistent video restoration under atmospheric turbulence, outperforming existing methods in clarity and speed.
Details
Motivation: Atmospheric turbulence degrades video quality with distortions like warping and blur, and current methods are computationally expensive, limiting real-time use.
Method: RMFAT uses a recurrent framework with two-frame input, multi-scale feature encoding/decoding, and temporal warping modules for efficiency and coherence.
Result: RMFAT improves clarity (9% SSIM boost) and speed (4x runtime reduction) over existing methods.
Conclusion: RMFAT is highly effective for real-time atmospheric turbulence suppression, balancing performance and efficiency.
Abstract: Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer and 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator, designed for efficient and temporally consistent video restoration under AT conditions. RMFAT adopts a lightweight recurrent framework that restores each frame using only two inputs at a time, significantly reducing temporal window size and computational burden. It further integrates multi-scale feature encoding and decoding with temporal warping modules at both encoder and decoder stages to enhance spatial detail and temporal coherence. Extensive experiments on synthetic and real-world atmospheric turbulence datasets demonstrate that RMFAT not only outperforms existing methods in terms of clarity restoration (with nearly a 9% improvement in SSIM) but also achieves significantly improved inference speed (more than a fourfold reduction in runtime), making it particularly suitable for real-time atmospheric turbulence suppression tasks.
[133] SelfAdapt: Unsupervised Domain Adaptation of Cell Segmentation Models
Fabian H. Reith, Jannik Franzen, Dinesh R. Palli, J. Lorenz Rumberger, Dagmar Kainmueller
Main category: cs.CV
TL;DR: SelfAdapt enables label-free adaptation of pre-trained cell segmentation models, improving performance on diverse datasets without requiring annotated data.
Details
Motivation: Generalist models like Cellpose degrade in performance on domains differing from their training data, and supervised fine-tuning requires scarce annotated data.
Method: SelfAdapt uses student-teacher augmentation consistency training with L2-SP regularization and label-free stopping criteria.
Result: Achieves up to 29.64% improvement in AP0.5 over baseline Cellpose on LiveCell and TissueNet datasets, even enhancing supervised fine-tuned models.
Conclusion: SelfAdapt is a practical, label-free solution for adapting cell segmentation models, released as an extension of Cellpose.
Abstract: Deep neural networks have become the go-to method for biomedical instance segmentation. Generalist models like Cellpose demonstrate state-of-the-art performance across diverse cellular data, though their effectiveness often degrades on domains that differ from their training data. While supervised fine-tuning can address this limitation, it requires annotated data that may not be readily available. We propose SelfAdapt, a method that enables the adaptation of pre-trained cell segmentation models without the need for labels. Our approach builds upon student-teacher augmentation consistency training, introducing L2-SP regularization and label-free stopping criteria. We evaluate our method on the LiveCell and TissueNet datasets, demonstrating relative improvements in AP0.5 of up to 29.64% over baseline Cellpose. Additionally, we show that our unsupervised adaptation can further improve models that were previously fine-tuned with supervision. We release SelfAdapt as an easy-to-use extension of the Cellpose framework. The code for our method is publicly available at https://github.com/Kainmueller-Lab/self_adapt.
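L2-SP, one ingredient of SelfAdapt, anchors fine-tuned weights to the pretrained starting point rather than to zero. A minimal sketch (the consistency term is indicated only in a comment; the tiny model is a stand-in):

```python
import copy
import torch
import torch.nn as nn

def l2_sp_penalty(model, pretrained_state, weight=1e-4):
    """L2-SP regularization: penalize drift from the *pretrained* weights,
    keeping the adapted model close to its starting point during
    label-free fine-tuning."""
    return weight * sum(((p - pretrained_state[n]) ** 2).sum()
                        for n, p in model.named_parameters()
                        if n in pretrained_state)

model = nn.Linear(8, 2)
anchor = copy.deepcopy(model.state_dict())   # frozen copy of the start point
# One student-teacher consistency step would look roughly like:
#   loss = mse(student(strong_aug(x)), teacher(weak_aug(x)).detach()) \
#          + l2_sp_penalty(student, anchor)
with torch.no_grad():
    model.weight.add_(0.1)                   # pretend we fine-tuned a bit
print(l2_sp_penalty(model, anchor))          # grows as weights drift
```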
[134] Training-free Dimensionality Reduction via Feature Truncation: Enhancing Efficiency in Privacy-preserving Multi-Biometric Systems
Florian Bayer, Maximilian Russo, Christian Rathgeb
Main category: cs.CV
TL;DR: The paper explores reducing multi-biometric template sizes while maintaining accuracy and security, using dimensionality reduction and Homomorphic Encryption (HE) for efficient processing.
Details
Motivation: Addressing the computational challenges and privacy concerns in biometric recognition by leveraging multi-modal fusion and HE.
Method: Experiments on a virtual multi-biometric database using DNN-extracted features from face, fingerprint, and iris, with dimensionality reduction and HE.
Result: Template size reduced by 67% with no loss in Equal Error Rate (EER) compared to single-modality recognition.
Conclusion: Multi-modal fusion with dimensionality reduction and HE offers efficient, secure, and accurate biometric recognition.
Abstract: Biometric recognition is widely used, making the privacy and security of extracted templates a critical concern. Biometric Template Protection schemes, especially those utilizing Homomorphic Encryption, introduce significant computational challenges due to increased workload. Recent advances in deep neural networks have enabled state-of-the-art feature extraction for face, fingerprint, and iris modalities. The ubiquity and affordability of biometric sensors further facilitate multi-modal fusion, which can enhance security by combining features from different modalities. This work investigates the biometric performance of reduced multi-biometric template sizes. Experiments are conducted on an in-house virtual multi-biometric database, derived from DNN-extracted features for face, fingerprint, and iris, using the FRGC, MCYT, and CASIA databases. The evaluated approaches are (i) explainable and straightforward to implement under encryption, (ii) training-free, and (iii) capable of generalization. Dimensionality reduction of feature vectors leads to fewer operations in the Homomorphic Encryption (HE) domain, enabling more efficient encrypted processing while maintaining biometric accuracy and security at a level equivalent to or exceeding single-biometric recognition. Our results demonstrate that, by fusing feature vectors from multiple modalities, template size can be reduced by 67% with no loss in Equal Error Rate (EER) compared to the best-performing single modality.
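Feature truncation itself is deliberately simple, which is what makes it cheap under encryption. A sketch of the truncate-then-fuse idea; the kept dimensionality is illustrative, chosen so three truncated modalities total roughly a third of one full 512-d template:

```python
import numpy as np

def truncate_and_fuse(face, finger, iris, keep=56):
    """Training-free dimensionality reduction by feature truncation:
    keep the first `keep` dimensions of each modality's DNN embedding
    and concatenate. `keep` trades template size against EER; the value
    here is illustrative, not the paper's operating point."""
    return np.concatenate([face[:keep], finger[:keep], iris[:keep]])

face, finger, iris = (np.random.randn(512) for _ in range(3))
fused = truncate_and_fuse(face, finger, iris)
print(fused.shape)  # (168,): about 67% smaller than one full 512-d template
```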
[135] ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving
Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang
Main category: cs.CV
TL;DR: ImagiDrive integrates Vision-Language Models (VLMs) and Driving World Models (DWMs) for autonomous driving, combining behavioral prediction with realistic scene generation for improved planning.
Details
Motivation: Autonomous driving requires contextual comprehension and predictive reasoning. VLMs and DWMs address different aspects, but their integration is understudied despite complementary strengths.
Method: ImagiDrive combines a VLM-based driving agent with a DWM-based scene imaginer in an iterative loop, using early stopping and trajectory selection for efficiency.
Result: Experiments on nuScenes and NAVSIM show ImagiDrive outperforms previous methods in robustness and performance under open-loop and closed-loop conditions.
Conclusion: ImagiDrive successfully integrates VLMs and DWMs, addressing challenges in action-pixel connection and efficiency, proving superior in autonomous driving scenarios.
Abstract: Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent’s planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.
[136] Remove360: Benchmarking Residuals After Object Removal in 3D Gaussian Splatting
Simona Kocour, Assia Benbihi, Torsten Sattler
Main category: cs.CV
TL;DR: A benchmark and framework for measuring semantic residuals after object removal in 3D Gaussian Splatting, revealing limitations in current techniques.
Details
Motivation: To understand and measure unintended semantic traces left after object removal for privacy and editable scene representations.
Method: Introduces Remove360 dataset and evaluates semantic residuals using pre/post-removal RGB images and object-level masks in diverse scenes.
Result: Current methods preserve semantic information despite visual geometry absence, highlighting limitations in object removal techniques.
Conclusion: Robust solutions are needed for real-world complexity in 3D object removal, as current methods fall short.
Abstract: Understanding what semantic information persists after object removal is critical for privacy-preserving 3D reconstruction and editable scene representations. In this work, we introduce a novel benchmark and evaluation framework to measure semantic residuals, the unintended semantic traces left behind, after object removal in 3D Gaussian Splatting. We conduct experiments across a diverse set of indoor and outdoor scenes, showing that current methods can preserve semantic information despite the absence of visual geometry. We also release Remove360, a dataset of pre/post-removal RGB images and object-level masks captured in real-world environments. While prior datasets have focused on isolated object instances, Remove360 covers a broader and more complex range of indoor and outdoor scenes, enabling evaluation of object removal in the context of full-scene representations. Given ground truth images of a scene before and after object removal, we assess whether we can truly eliminate semantic presence, and if downstream models can still infer what was removed. Our findings reveal critical limitations in current 3D object removal techniques and underscore the need for more robust solutions capable of handling real-world complexity. The evaluation framework is available at github.com/spatial-intelligence-ai/Remove360.git. Data are available at huggingface.co/datasets/simkoc/Remove360.
[137] Is ChatGPT-5 Ready for Mammogram VQA?
Qiang Li, Shansong Wang, Mingzhe Hu, Mojtaba Safari, Zachary Eidex, Xiaofeng Yang
Main category: cs.CV
TL;DR: GPT-5 outperforms GPT-4o in mammogram VQA tasks but lags behind human experts and specialized models. It shows potential but needs domain adaptation for clinical use.
Details
Motivation: To evaluate the performance of GPT-5 and GPT-4o in mammogram VQA tasks, assessing their potential to support breast cancer screening.
Method: Systematic evaluation of GPT-5 and GPT-4o on four mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification.
Result: GPT-5 achieved the highest scores among GPT variants but underperformed compared to human experts and fine-tuned models. Performance varied across datasets, e.g., 56.8% density classification on EMBED and 69.3% BI-RADS accuracy on CBIS-DDSM.
Conclusion: GPT-5 shows promise for mammography VQA but requires domain-specific optimization for clinical applications. The improvement from GPT-4o to GPT-5 indicates potential for LLMs in this field.
Abstract: Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.
[138] MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation
Qian Liang, Yujia Wu, Kuncheng Li, Jiwei Wei, Shiyuan He, Jinyu Guo, Ning Xie
Main category: cs.CV
TL;DR: MM-R1 introduces a framework using cross-modal Chain-of-Thought reasoning to enable personalized image generation with unified MLLMs, avoiding data-intensive fine-tuning.
Details
Motivation: Aligning MLLMs with personalized image generation is challenging due to subject-specific methods and scalability issues.
Method: MM-R1 uses X-CoT reasoning and GRPO for visual grounding and generation, integrating subject representations and user prompts.
Result: MM-R1 achieves high subject fidelity and text alignment in zero-shot personalized image generation.
Conclusion: The framework successfully leverages unified MLLMs for scalable and effective personalized image generation.
Abstract: Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.
[139] Data-Driven Deepfake Image Detection Method – The 2024 Global Deepfake Image Detection Challenge
Xiaoya Zhu, Yibing Nan, Shiguo Lian
Main category: cs.CV
TL;DR: The paper applies Swin Transformer V2-B to Deepfake image detection, employing data augmentation to enhance model generalization, and earned an award of excellence in the competition.
Details
Motivation: Addressing the challenges posed by Deepfake technology in digital security by improving detection methods.
Method: Utilizes Swin Transformer V2-B classification network with online data augmentation and offline sample generation for diverse training samples.
Result: Received an award of excellence in the Deepfake image detection competition.
Conclusion: The approach effectively enhances Deepfake detection, demonstrating the potential of advanced models and data augmentation in tackling digital security threats.
Abstract: With the rapid development of technology in the field of AI, deepfake technology has emerged as a double-edged sword. It has not only created a large amount of AI-generated content but also posed unprecedented challenges to digital security. The task of the competition is to determine whether a face image is a Deepfake image and output its probability score of being a Deepfake image. In the image track competition, our approach is based on the Swin Transformer V2-B classification network. Online data augmentation and offline sample generation methods are employed to enrich the diversity of training samples and increase the generalization ability of the model. Finally, we received the award of excellence in Deepfake image detection.
[140] CoFi: A Fast Coarse-to-Fine Few-Shot Pipeline for Glomerular Basement Membrane Segmentation
Hongjin Fang, Daniel Reisenbüchler, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Main category: cs.CV
TL;DR: CoFi is a coarse-to-fine few-shot segmentation pipeline for GBM in EM images, reducing annotation burden while maintaining accuracy.
Details
Motivation: Supervised deep learning for GBM segmentation requires extensive annotation, making it impractical for clinical use. Few-shot learning struggles with fine details.
Method: CoFi uses a lightweight network trained on three images for coarse segmentation, then refines it with SAM using morphology-aware point prompts.
Result: Achieves 74.54% Dice coefficient and 1.9 FPS, balancing accuracy and speed.
Conclusion: CoFi is efficient, accurate, and suitable for clinical renal pathology applications.
Abstract: Accurate segmentation of the glomerular basement membrane (GBM) in electron microscopy (EM) images is fundamental for quantifying membrane thickness and supporting the diagnosis of various kidney diseases. While supervised deep learning approaches achieve high segmentation accuracy, their reliance on extensive pixel-level annotation renders them impractical for clinical workflows. Few-shot learning can reduce this annotation burden but often struggles to capture the fine structural details necessary for GBM analysis. In this study, we introduce CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline designed for GBM delineation in EM images. CoFi first trains a lightweight neural network using only three annotated images to produce an initial coarse segmentation mask. This mask is then automatically processed to generate high-quality point prompts with morphology-aware pruning, which are subsequently used to guide SAM in refining the segmentation. The proposed method achieved exceptional GBM segmentation performance, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. We demonstrate that CoFi not only alleviates the annotation and computational burdens associated with conventional methods, but also achieves accurate and reliable segmentation results. The pipeline’s speed and annotation efficiency make it well-suited for research and hold strong potential for clinical applications in renal pathology. The pipeline is publicly available at: https://github.com/ddrrnn123/CoFi.
[141] TACR-YOLO: A Real-time Detection Framework for Abnormal Human Behaviors Enhanced with Coordinate and Task-Aware Representations
Xinyi Yin, Wenbo Yuan, Xuecheng Wu, Liangyu Fu, Danlei Huang
Main category: cs.CV
TL;DR: TACR-YOLO is a real-time framework for abnormal human behavior detection, addressing challenges like small objects and task conflicts with novel modules and optimizations, achieving 91.92% mAP on a new dataset.
Details
Motivation: Effective abnormal human behavior detection in special scenarios requires overcoming the limitations of YOLO-based methods, such as small objects, task conflicts, and multi-scale fusion.
Method: Introduces a Coordinate Attention Module, a Task-Aware Attention Module, a Strengthen Neck Network, optimized anchor box sizes, and DIoU-Loss.
Result: Achieves 91.92% mAP on the PABD dataset with competitive speed and robustness.
Conclusion: TACR-YOLO advances abnormal behavior detection, offering practical insights for special scenarios.
Abstract: Abnormal Human Behavior Detection (AHBD) under special scenarios is becoming increasingly crucial. While YOLO-based detection methods excel in real-time tasks, they remain hindered by challenges including small objects, task conflicts, and multi-scale fusion in AHBD. To tackle them, we propose TACR-YOLO, a new real-time framework for AHBD. We introduce a Coordinate Attention Module to enhance small object detection, a Task-Aware Attention Module to deal with classification-regression conflicts, and a Strengthen Neck Network for refined multi-scale fusion, respectively. In addition, we optimize Anchor Box sizes using K-means clustering and deploy DIoU-Loss to improve bounding box regression. The Personnel Anomalous Behavior Detection (PABD) dataset, which includes 8,529 samples across four behavior categories, is also presented. Extensive experimental results indicate that TACR-YOLO achieves 91.92% mAP on PABD, with competitive speed and robustness. Ablation studies highlight the contribution of each improvement. This work provides new insights for abnormal behavior detection under special scenarios, advancing its progress.
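Anchor box optimization with K-means conventionally uses 1 - IoU as the distance over ground-truth (width, height) pairs, which is presumably the recipe behind the tuned anchors here. A self-contained sketch on synthetic boxes:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100):
    """K-means over ground-truth box (width, height) pairs using 1 - IoU
    as the distance, the standard recipe for fitting YOLO anchor sizes
    to a dataset. wh: (num_boxes, 2) array of widths and heights."""
    anchors = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every anchor, both anchored at origin
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(wh[:, None, 1], anchors[None, :, 1])
        union = (wh[:, 0] * wh[:, 1])[:, None] + \
                (anchors[:, 0] * anchors[:, 1])[None, :] - inter
        assign = (1 - inter / union).argmin(axis=1)
        anchors = np.array([wh[assign == i].mean(axis=0) if (assign == i).any()
                            else anchors[i] for i in range(k)])
    return anchors

boxes = np.abs(np.random.randn(500, 2)) * 50 + 10   # synthetic (w, h) pairs
print(kmeans_anchors(boxes, k=9).round(1))
```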
[142] OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring
Ruoxin Xiong, Yanyu Wang, Jiannan Cai, Kaijian Liu, Yuansheng Zhu, Pingbo Tang, Nora El-Gohary
Main category: cs.CV
TL;DR: The paper reviews 51 visual datasets for AI/ML in construction, categorizes them, and proposes a roadmap for future data infrastructure based on FAIR principles.
Details
Motivation: To address the lack of systematic review and categorization of visual datasets in construction, which limits AI/ML applications.
Method: Conducted an extensive search of academic databases and open-data platforms, analyzed 51 datasets, and categorized them using a structured schema.
Result: Created OpenConstruction, an open-source catalog, and identified gaps in existing datasets. Proposed a roadmap for future data infrastructure.
Conclusion: The study supports data-centric AI/ML advancements in construction by improving dataset accessibility and quality.
Abstract: The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community’s ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.
[143] CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models
Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, Xinyuan Chen
Main category: cs.CV
TL;DR: CineTrans is a novel framework for generating coherent multi-shot videos with cinematic transitions, leveraging a new dataset and mask-based control in diffusion models.
Details
Motivation: Current video synthesis lacks stable multi-shot generation, limiting videos to single-shot sequences.
Method: Introduces CineTrans, a framework using a multi-shot video-text dataset (Cine250K) and a mask-based control mechanism in diffusion models for transitions.
Result: CineTrans produces high-quality multi-shot videos with stable transitions, outperforming baselines in evaluations.
Conclusion: CineTrans advances multi-shot video generation with cinematic transitions, validated by specialized metrics.
Abstract: Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.
[144] Automated Building Heritage Assessment Using Street-Level Imagery
Kristina Dabrock, Tim Johansson, Anna Donarelli, Mikael Mangold, Noah Pflugradt, Jann Michael Weinand, Jochen Linßen
Main category: cs.CV
TL;DR: AI tools like GPT improve efficiency in identifying cultural heritage values in buildings, aiding energy conservation without compromising heritage.
Details
Motivation: To quantify energy conservation in buildings while preserving cultural heritage, avoiding costly traditional methods.
Method: Used GPT to detect cultural heritage values in facade images, combined with register data to train ML models for building classification.
Result: Achieved macro F1-scores of 0.71 (combined data) and 0.60 (GPT-only), validated against expert inventory.
Conclusion: The method enhances heritage-aware energy efficiency measures in large-scale refurbishments.
Abstract: Detailed data is required to quantify energy conservation measures in buildings, such as envelope retrofits, without compromising cultural heritage. Novel artificial intelligence tools may improve efficiency in identifying heritage values in buildings compared to costly and time-consuming traditional inventories. In this study, the large language model GPT was used to detect various aspects of cultural heritage value in façade images. Using this data and building register data as features, machine learning models were trained to classify multi-family and non-residential buildings in Stockholm, Sweden. Validation against an expert-created inventory shows a macro F1-score of 0.71 using a combination of register data and features retrieved from GPT, and a score of 0.60 using only GPT-derived data. The presented methodology can contribute to a higher-quality database and thus support careful energy efficiency measures and integrated consideration of heritage value in large-scale energetic refurbishment scenarios.
[145] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Main category: cs.CV
TL;DR: TokLIP is a visual tokenizer that improves multimodal comprehension and generation by combining low-level VQ tokens with high-level semantics, enabling efficient end-to-end training.
Details
Motivation: Existing token-based methods like Chameleon and Emu3 suffer from high computational costs and limited comprehension due to missing high-level semantics.
Method: TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based encoder to capture high-level semantics, disentangling comprehension and generation objectives.
Result: TokLIP achieves exceptional data efficiency, enhances semantic understanding, and improves generative capacity for autoregressive Transformers.
Conclusion: TokLIP is effective for multimodal tasks, offering a balanced approach to comprehension and generation without specialized quantization.
Abstract: Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
[146] Perception in Plan: Coupled Perception and Planning for End-to-End Autonomous Driving
Bozhou Zhang, Jingyu Li, Nan Song, Li Zhang
Main category: cs.CV
TL;DR: VeteranAD introduces a perception-in-plan framework for end-to-end autonomous driving, integrating perception with planning for targeted optimization, achieving state-of-the-art performance.
Details
Motivation: To enhance planning performance by integrating perception into the planning process, enabling targeted perception guided by evolving planning objectives.
Method: Uses a coupled perception and planning framework with multi-mode anchored trajectories as planning priors, adopting an autoregressive strategy for progressive trajectory prediction and targeted perception.
Result: Achieves state-of-the-art performance on NAVSIM and Bench2Drive datasets, demonstrating more accurate and reliable driving behavior.
Conclusion: VeteranAD successfully integrates perception and planning, unlocking the potential of planning-oriented end-to-end methods for autonomous driving.
Abstract: End-to-end autonomous driving has achieved remarkable advancements in recent years. Existing methods primarily follow a perception-planning paradigm, where perception and planning are executed sequentially within a fully differentiable framework for planning-oriented optimization. We further advance this paradigm through a perception-in-plan framework design, which integrates perception into the planning process. This design facilitates targeted perception guided by evolving planning objectives over time, ultimately enhancing planning performance. Building on this insight, we introduce VeteranAD, a coupled perception and planning framework for end-to-end autonomous driving. By incorporating multi-mode anchored trajectories as planning priors, the perception module is specifically designed to gather traffic elements along these trajectories, enabling comprehensive and targeted perception. Planning trajectories are then generated based on both the perception results and the planning priors. To make perception fully serve planning, we adopt an autoregressive strategy that progressively predicts future trajectories while focusing on relevant regions for targeted perception at each step. With this simple yet effective design, VeteranAD fully unleashes the potential of planning-oriented end-to-end methods, leading to more accurate and reliable driving behavior. Extensive experiments on the NAVSIM and Bench2Drive datasets demonstrate that our VeteranAD achieves state-of-the-art performance.
[147] Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition
Feiyue Zhao, Zhichao Zhang
Main category: cs.CV
TL;DR: The paper introduces HGFE, a graph-based framework integrated into CNNs to enhance structural awareness and feature representation, improving performance in visual recognition tasks.
Details
Motivation: CNNs' reliance on grid structures limits their ability to model complex topological relationships and non-local semantics in images.
Method: HGFE uses intra-window graph convolution for local dependencies and inter-window supernode interactions for global semantics, with adaptive frequency modulation to balance signal propagation.
Result: HGFE improves performance on CIFAR-100, PASCAL VOC, VisDrone, CrackSeg, and CarParts datasets.
Conclusion: HGFE is effective, lightweight, and seamlessly integrable into CNNs, enhancing structural representation and recognition performance.
Abstract: Convolutional neural networks (CNNs) have demonstrated strong performance in visual recognition tasks, but their inherent reliance on regular grid structures limits their capacity to model complex topological relationships and non-local semantics within images. To address this limitation, we propose the hierarchical graph feature enhancement (HGFE), a novel framework that integrates graph-based reasoning into CNNs to enhance both structural awareness and feature representation. HGFE builds two complementary levels of graph structures: intra-window graph convolution to capture local spatial dependencies and inter-window supernode interactions to model global semantic relationships. Moreover, we introduce an adaptive frequency modulation module that dynamically balances low-frequency and high-frequency signal propagation, preserving critical edge and texture information while mitigating over-smoothing. The proposed HGFE module is lightweight, end-to-end trainable, and can be seamlessly integrated into standard CNN backbone networks. Extensive experiments on CIFAR-100 (classification), PASCAL VOC and VisDrone (detection), as well as CrackSeg and CarParts (segmentation), validated the effectiveness of the HGFE in improving structural representation and enhancing overall recognition performance.
[148] AIM: Amending Inherent Interpretability via Self-Supervised Masking
Eyad Alshami, Shashank Agnihotri, Bernt Schiele, Margret Keuper
Main category: cs.CV
TL;DR: AIM is a self-supervised method that enhances DNNs’ use of genuine features over spurious ones, improving interpretability and accuracy without extra annotations.
Details
Motivation: Deep neural networks often rely on spurious features, which can harm interpretability and generalization. AIM addresses this by promoting genuine feature utilization.
Method: AIM uses multi-stage features to guide a self-supervised, sample-specific masking process, training interpretable models that faithfully summarize decisions.
Result: AIM improves interpretability (measured by EPG score) and accuracy across diverse datasets like ImageNet100, HardImageNet, and fine-grained benchmarks.
Conclusion: AIM consistently enhances generalization and human-aligned interpretability by prioritizing meaningful features, validated across domains and architectures.
Abstract: It has been observed that deep neural networks (DNNs) often use both genuine as well as spurious features. In this work, we propose “Amending Inherent Interpretability via Self-Supervised Masking” (AIM), a simple yet interestingly effective method that promotes the network’s utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM enables the training of well-performing and inherently interpretable models that faithfully summarize the decision process. We validate AIM across a diverse range of challenging datasets that test both out-of-distribution generalization and fine-grained visual understanding. These include general-purpose classification benchmarks such as ImageNet100, HardImageNet, and ImageWoof, as well as fine-grained classification datasets such as Waterbirds, TravelingBirds, and CUB-200. AIM demonstrates significant dual benefits: interpretability improvements, as measured by the Energy Pointing Game (EPG) score, and accuracy gains over strong baselines. These consistent gains across domains and architectures provide compelling evidence that AIM promotes the use of genuine and meaningful features that directly contribute to improved generalization and human-aligned interpretability.
[149] A Real-time Concrete Crack Detection and Segmentation Model Based on YOLOv11
Shaoze Huang, Qi Liu, Chao Chen, Yuhang Chen
Main category: cs.CV
TL;DR: Proposes YOLOv11-KW-TA-FP, an enhanced YOLOv11n-based model for concrete crack detection, integrating dynamic KWConv, triple attention, and FP-IoU loss, achieving high precision and robustness.
Details
Motivation: Addresses inefficient manual inspection and suboptimal deep learning performance for small-target crack detection in complex backgrounds.
Method: Three-stage optimization: KWConv for dynamic kernel sharing, triple attention for channel-spatial interaction, and FP-IoU loss for adaptive bounding box regression.
Result: Achieves 91.3% precision, 76.6% recall, and 86.4% mAP@50, with robustness in data scarcity and noise.
Conclusion: Provides an efficient, practical solution for automated infrastructure inspection with significant engineering value.
Abstract: Accelerated aging of transportation infrastructure in the rapidly developing Yangtze River Delta region necessitates efficient concrete crack detection, as crack deterioration critically compromises structural integrity and regional economic growth. To overcome the limitations of inefficient manual inspection and the suboptimal performance of existing deep learning models, particularly for small-target crack detection within complex backgrounds, this paper proposes YOLOv11-KW-TA-FP, a multi-task concrete crack detection and segmentation model based on the YOLOv11n architecture. The proposed model integrates a three-stage optimization framework: (1) Embedding dynamic KernelWarehouse convolution (KWConv) within the backbone network to enhance feature representation through a dynamic kernel sharing mechanism; (2) Incorporating a triple attention mechanism (TA) into the feature pyramid to strengthen channel-spatial interaction modeling; and (3) Designing an FP-IoU loss function to facilitate adaptive bounding box regression penalization. Experimental validation demonstrates that the enhanced model achieves significant performance improvements over the baseline, attaining 91.3% precision, 76.6% recall, and 86.4% mAP@50. Ablation studies confirm the synergistic efficacy of the proposed modules. Furthermore, robustness tests indicate stable performance under conditions of data scarcity and noise interference. This research delivers an efficient computer vision solution for automated infrastructure inspection, exhibiting substantial practical engineering value.
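The FP-IoU loss itself is not specified in the summary, so the sketch below shows only the plain IoU regression loss that such adaptive variants build on; treat it as a baseline illustration, not the paper's formulation.

```python
# A minimal IoU-based bounding-box regression loss in PyTorch.
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Boxes as (x1, y1, x2, y2); returns mean (1 - IoU)."""
    # Intersection rectangle between predicted and ground-truth boxes.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    return (1.0 - inter / union.clamp(min=1e-7)).mean()

pred = torch.tensor([[0., 0., 10., 10.]], requires_grad=True)
target = torch.tensor([[2., 2., 12., 12.]])
loss = iou_loss(pred, target)
loss.backward()  # gradients flow back into box regression
```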
[150] Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction
Shilei Wang, Gong Cheng, Pujian Lai, Dong Gao, Junwei Han
Main category: cs.CV
TL;DR: MST introduces lightweight SSE and CSI modules to enhance multi-state features, improving tracking accuracy and robustness with minimal computational overhead.
Details
Motivation: Efficient trackers often sacrifice feature representation for speed, limiting accuracy in complex environments.
Method: Uses MSG for multi-state feature generation, SSE for refinement, and CSI for adaptive aggregation.
Result: Outperforms previous trackers, with a 4.5% AO score improvement on GOT-10K.
Conclusion: MST balances efficiency and accuracy, offering robust tracking with low computational cost.
Abstract: Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often comes at the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at https://github.com/wsumel/MST.
[151] An Efficient Medical Image Classification Method Based on a Lightweight Improved ConvNeXt-Tiny Architecture
Jingsong Xia, Yue Yin, Xiuhan Li
Main category: cs.CV
TL;DR: The paper proposes an improved ConvNeXt-Tiny architecture for efficient and accurate medical image classification in resource-constrained environments, achieving 89.10% accuracy with optimized feature extraction and reduced complexity.
Details
Motivation: To address the challenge of high-accuracy medical image classification in computational resource-limited settings.
Method: Introduces dual global pooling (GAP and GMP) for feature fusion, a lightweight SEVector attention module, and Feature Smoothing Loss for intra-class consistency.
Result: Achieves 89.10% classification accuracy on the test set under CPU-only conditions with stable convergence.
Conclusion: The method provides a feasible and efficient solution for deploying medical imaging analysis models in resource-limited environments.
Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis. However, achieving efficient and high-accuracy image classification in resource-constrained computational environments remains challenging. This study proposes a medical image classification method based on an improved ConvNeXt-Tiny architecture. Through structural optimization and loss function design, the proposed method enhances feature extraction capability and classification performance while reducing computational complexity. Specifically, the method introduces a dual global pooling (Global Average Pooling and Global Max Pooling) feature fusion strategy into the ConvNeXt-Tiny backbone to simultaneously preserve global statistical features and salient response information. A lightweight channel attention module, termed Squeeze-and-Excitation Vector (SEVector), is designed to improve the adaptive allocation of channel weights while minimizing parameter overhead. Additionally, a Feature Smoothing Loss is incorporated into the loss function to enhance intra-class feature consistency and suppress intra-class variance. Under CPU-only conditions (8 threads), the method achieves a maximum classification accuracy of 89.10% on the test set within 10 training epochs, exhibiting a stable convergence trend in loss values. Experimental results demonstrate that the proposed method effectively improves medical image classification performance in resource-limited settings, providing a feasible and efficient solution for the deployment and promotion of medical imaging analysis models.
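The dual-pooling fusion and the SEVector module are concrete enough to sketch. Below is a minimal PyTorch rendering under assumed internal dimensions; the ConvNeXt-Tiny channel width of 768 is standard, everything else is illustrative.

```python
# Sketch of the dual global pooling fusion and an SE-style channel attention
# head ("SEVector"); internal dimensions are assumptions, since the paper
# only describes the components at a high level.
import torch
import torch.nn as nn

class SEVector(nn.Module):
    """Squeeze-and-excitation on a feature vector, with a small bottleneck
    to limit parameter overhead."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C)
        return x * self.fc(x)  # adaptive per-channel re-weighting

class DualPoolHead(nn.Module):
    """Fuse GAP (global statistics) and GMP (salient responses) features."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.se = SEVector(2 * channels)
        self.classifier = nn.Linear(2 * channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:  # (B, C, H, W)
        gap = feat.mean(dim=(2, 3))           # global average pooling
        gmp = feat.amax(dim=(2, 3))           # global max pooling
        fused = torch.cat([gap, gmp], dim=1)  # keep both cues side by side
        return self.classifier(self.se(fused))

head = DualPoolHead(channels=768, num_classes=4)  # ConvNeXt-Tiny final width
logits = head(torch.randn(2, 768, 7, 7))
```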
[152] Reinforcing Video Reasoning Segmentation to Think Before It Segments
Sitong Gong, Lu Zhang, Yunzhi Zhuge, Xu Jia, Pingping Zhang, Huchuan Lu
Main category: cs.CV
TL;DR: Veason-R1 is a specialized LVLM for Video Reasoning Segmentation (VRS) that improves interpretability and performance via structured reasoning, trained with GRPO and CoT initialization, achieving state-of-the-art results.
Details
Motivation: Previous VRS methods using LVLMs lack interpretability and perform suboptimally due to poor spatiotemporal reasoning.
Method: Veason-R1 employs Group Relative Policy Optimization (GRPO) and Chain-of-Thought (CoT) initialization for training, enhancing structured reasoning and spatiotemporal alignment.
Result: Veason-R1 outperforms prior methods on benchmarks (e.g., +1.3 J&F in ReVOS, +10.0 J&F in ReasonVOS) and reduces hallucinations (+8.8 R).
Conclusion: Veason-R1 sets a new standard for VRS by combining structured reasoning with efficient training, achieving robust and interpretable results.
Abstract: Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into
[153] Training-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model
Zuo Zuo, Jiahao Dong, Yanyun Qu, Zongze Wu
Main category: cs.CV
TL;DR: AAG is a training-free anomaly generation framework using Stable Diffusion to create realistic anomalies in specific image regions while preserving other areas, enhancing downstream anomaly detection tasks.
Details
Motivation: Addressing data scarcity in industrial anomaly detection by generating realistic anomalies without extra training data.
Method: Uses Stable Diffusion with Cross-Attention Enhancement (CAE) and Self-Attention Enhancement (SAE) to guide anomaly generation based on masks and text prompts.
Result: Demonstrates effectiveness on MVTec AD and VisA datasets, improving downstream anomaly inspection tasks.
Conclusion: AAG provides a practical solution for anomaly generation, enhancing anomaly detection performance.
Abstract: Industrial anomaly detection (AD) plays a significant role in manufacturing where a long-standing challenge is data scarcity. A growing body of work has emerged to address insufficient anomaly data via anomaly generation. However, these anomaly generation methods suffer from lack of fidelity or need to be trained with extra data. To this end, we propose a training-free anomaly generation framework dubbed AAG, which is based on Stable Diffusion (SD)’s strong generation ability for effective anomaly image generation. Given a normal image, mask and a simple text prompt, AAG can generate realistic and natural anomalies in the specific regions and simultaneously keep contents in other regions unchanged. In particular, we propose Cross-Attention Enhancement (CAE) to re-engineer the cross-attention mechanism within Stable Diffusion based on the given mask. CAE increases the similarity between visual tokens in specific regions and text embeddings, which guides these generated visual tokens in accordance with the text description. Besides, generated anomalies need to be natural and plausible with respect to the object in the given image. We propose Self-Attention Enhancement (SAE), which improves similarity between each normal visual token and anomaly visual tokens. SAE ensures that generated anomalies are coherent with the original pattern. Extensive experiments on MVTec AD and VisA datasets demonstrate the effectiveness of AAG in anomaly generation and its utility. Furthermore, anomaly images generated by AAG can bolster the performance of various downstream anomaly inspection tasks.
[154] TrajSV: A Trajectory-based Model for Sports Video Representations and Applications
Zheng Wang, Shihao Xu, Wei Shi
Main category: cs.CV
TL;DR: TrajSV is a trajectory-based framework for sports analytics, addressing data unavailability, lack of trajectory frameworks, and supervision label needs. It improves sports video retrieval, action spotting, and captioning.
Details
Motivation: To resolve unresolved issues in sports analytics like data unavailability, ineffective trajectory frameworks, and supervision label requirements.
Method: TrajSV includes data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet), using trajectory-enhanced Transformers and unsupervised optimization with triple contrastive loss.
Result: Achieves state-of-the-art performance: ~70% improvement in retrieval, top results in 9/17 action categories, and ~20% better captioning.
Conclusion: TrajSV effectively addresses sports analytics challenges and demonstrates superior performance in multiple applications.
Abstract: Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including (1) data unavailability, (2) lack of an effective trajectory-based framework, and (3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.
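The triple contrastive loss is only named, not defined, in the summary. A standard InfoNCE term of the kind such unsupervised video/clip objectives are usually composed from might look as follows; the pairing of videos with their own clips is an assumption.

```python
# One InfoNCE building block: pull row-aligned (video, clip) embedding pairs
# together and push mismatched pairs apart.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Anchors and positives are row-aligned (B, D) embeddings."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(a.size(0))   # matched pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

clip_emb = torch.randn(8, 256)   # clip representations (CRNet stand-in)
video_emb = torch.randn(8, 256)  # video representations (VRNet stand-in)
loss = info_nce(video_emb, clip_emb)
```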
[155] Causality Matters: How Temporal Information Emerges in Video Language Models
Yumeng Shi, Quanyu Long, Yin Wu, Wenya Wang
Main category: cs.CV
TL;DR: The paper challenges the importance of positional encodings (PEs) in VideoLMs for temporal understanding, revealing that inter-frame attention and causal mechanisms play a more critical role. It proposes efficiency strategies like staged cross-modal attention and temporal exit mechanisms.
Details
Motivation: To address the gap in temporal understanding in VideoLMs, particularly the role of PEs and inter-frame interactions.
Method: Analyzes the impact of PEs and frame sequence reversal, traces temporal information pathways, and proposes staged cross-modal attention and temporal exit mechanisms.
Result: Reveals that temporal reasoning emerges from inter-visual token interactions under causal attention, not PEs. Proposed strategies improve efficiency.
Conclusion: The study provides insights into temporal understanding in VideoLMs and offers practical improvements, paving the way for future model enhancements.
Abstract: Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, still remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This emergent mechanism shows that temporal reasoning emerges from inter-visual token interactions under the constraints of causal attention, which implicitly encodes temporal structure. Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches. To the best of our knowledge, this is the first work to systematically investigate video temporal understanding in VideoLMs, offering insights for future model improvement.
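The paper's central probe, reversing frame order while keeping the positional encodings, is easy to reproduce in miniature. The sketch below uses a toy causal encoder as a stand-in for a real VideoLM; all dimensions are illustrative.

```python
# Toy version of the probe: compare (a) dropping positional encodings against
# (b) reversing frame order while keeping the original positional encodings.
import torch
import torch.nn as nn

torch.manual_seed(0)
T, D = 16, 32                      # frames, feature dim
frames = torch.randn(1, T, D)      # per-frame visual tokens
pos = torch.randn(1, T, D)         # positional encodings

layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
# Causal mask: each frame attends only to earlier frames.
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

def encode(x):
    return layer(x, src_mask=causal)

out_base     = encode(frames + pos)          # original order + PEs
out_no_pe    = encode(frames)                # PEs removed
out_reversed = encode(frames.flip(1) + pos)  # order reversed, PEs kept

# Compare final-frame representations, where temporal cues aggregate.
print("no-PE drift:   ", (out_base[0, -1] - out_no_pe[0, -1]).norm().item())
print("reversal drift:", (out_base[0, -1] - out_reversed[0, -1]).norm().item())
```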
[156] DashCam Video: A complementary low-cost data stream for on-demand forest-infrastructure system monitoring
Durga Joshi, Chandi Witharana, Robert Fahey, Thomas Worthley, Zhe Zhu, Diego Cerrai
Main category: cs.CV
TL;DR: A low-cost, reproducible framework using dashcam video for real-time structural assessment and geolocation of roadside objects, combining depth estimation, error correction, and GPS-based triangulation.
Details
Motivation: To provide a cost-effective, scalable solution for monitoring urban vegetation and infrastructure using readily available dashcam data, addressing limitations of conventional methods like LiDAR.
Method: An end-to-end pipeline integrating monocular depth estimation, depth error correction via gradient-boosted regression, and GPS-based triangulation for geolocation and height estimation.
Result: Achieved high accuracy (R2 = 0.92, MAE = 0.31 for depth correction; geolocation error of 2.83 m, height MAE of 2.09 m for trees, 0.88 m for poles).
Conclusion: The framework offers a fast, real-time, and cost-effective alternative to traditional methods, valuable for utility companies and urban planners.
Abstract: Our study introduces a novel, low-cost, and reproducible framework for real-time, object-level structural assessment and geolocation of roadside vegetation and infrastructure with commonly available but underutilized dashboard camera (dashcam) video data. We developed an end-to-end pipeline that combines monocular depth estimation, depth error correction, and geometric triangulation to generate accurate spatial and structural data from street-level video streams from vehicle-mounted dashcams. Depth maps were first estimated using a state-of-the-art monocular depth model, then refined via a gradient-boosted regression framework to correct underestimations, particularly for distant objects. The depth correction model achieved strong predictive performance (R2 = 0.92, MAE = 0.31 on transformed scale), significantly reducing bias beyond 15 m. Further, object locations were estimated using GPS-based triangulation, while object heights were calculated using pinhole camera geometry. Our method was evaluated under varying conditions of camera placement and vehicle speed. A low-speed vehicle with an inside-mounted camera gave the highest accuracy, with a mean geolocation error of 2.83 m and a mean absolute error (MAE) in height estimation of 2.09 m for trees and 0.88 m for poles. To the best of our knowledge, it is the first framework to combine monocular depth modeling, triangulated GPS-based geolocation, and real-time structural assessment for urban vegetation and infrastructure using consumer-grade video data. Our approach complements conventional RS methods, such as LiDAR and imagery, by offering a fast, real-time, and cost-effective solution for object-level monitoring of vegetation risks and infrastructure exposure, making it especially valuable for utility companies and urban planners aiming for scalable and frequent assessments in dynamic urban environments.
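The two geometric steps are standard and can be sketched directly: height from pinhole geometry (H = Z * h_pixels / f_pixels) and geolocation by intersecting two bearing rays taken from GPS track points. All numeric values below are illustrative.

```python
# Geometry behind the pipeline, under illustrative values.
import numpy as np

# --- Height from pinhole geometry: H = Z * h_pixels / f_pixels ---
f_px = 1400.0   # focal length in pixels (camera intrinsic)
h_px = 220.0    # object height in the image, in pixels
Z = 18.5        # corrected monocular depth to the object, in meters
height_m = Z * h_px / f_px
print(f"estimated height: {height_m:.2f} m")

# --- Geolocation by triangulating two bearings from GPS track points ---
def intersect_bearings(p1, b1, p2, b2):
    """Intersect rays from points p1, p2 (local ENU, meters) with bearings
    b1, b2 (radians, clockwise from north)."""
    d1 = np.array([np.sin(b1), np.cos(b1)])  # unit direction of ray 1
    d2 = np.array([np.sin(b2), np.cos(b2)])
    # Solve p1 + t1*d1 = p2 + t2*d2 for t1, t2.
    A = np.column_stack([d1, -d2])
    t = np.linalg.solve(A, np.asarray(p2) - np.asarray(p1))
    return np.asarray(p1) + t[0] * d1

obj = intersect_bearings(p1=(0.0, 0.0),  b1=np.deg2rad(40.0),
                         p2=(12.0, 0.0), b2=np.deg2rad(-20.0))
print("triangulated object position (E, N):", obj.round(2))
```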
[157] CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion
Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei
Main category: cs.CV
TL;DR: CoreEditor introduces a correspondence-constrained attention mechanism for consistent text-to-3D editing, outperforming prior methods with sharper details and better cross-view consistency.
Details
Motivation: Existing text-driven 3D editing methods often fail to maintain cross-view consistency, leading to insufficient edits and blurry details.
Method: CoreEditor uses a correspondence-constrained attention mechanism and incorporates semantic similarity during denoising for robust multi-view editing. It also includes a selective editing pipeline for user flexibility.
Result: CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
Conclusion: CoreEditor advances text-to-3D editing by ensuring cross-view consistency and offering user control, setting a new benchmark for quality.
Abstract: Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
[158] LoRAtorio: An intrinsic approach to LoRA Skill Composition
Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki
Main category: cs.CV
TL;DR: LoRAtorio is a train-free framework for composing multiple LoRA adapters in text-to-image diffusion models, leveraging intrinsic model behavior and spatial-aware weighting for improved performance.
Details
Motivation: Existing methods struggle with composing multiple LoRA adapters in open-ended settings, limiting personalization of visual concepts.
Method: Uses spatial patches and cosine similarity to construct a weight matrix for weighted aggregation of LoRA outputs, with modifications to classifier-free guidance.
Result: Achieves state-of-the-art performance, with up to 1.3% ClipScore improvement and 72.43% win rate in GPT-4V evaluations.
Conclusion: LoRAtorio effectively generalizes to multiple latent diffusion models, addressing domain drift and enabling dynamic adapter selection.
Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted technique in text-to-image diffusion models, enabling the personalisation of visual concepts such as characters, styles, and objects. However, existing approaches struggle to effectively compose multiple LoRA adapters, particularly in open-ended settings where the number and nature of required skills are not known in advance. In this work, we present LoRAtorio, a novel train-free framework for multi-LoRA composition that leverages intrinsic model behaviour. Our method is motivated by two key observations: (1) LoRA adapters trained on narrow domains produce denoised outputs that diverge from the base model, and (2) when operating out-of-distribution, LoRA outputs show behaviour closer to the base model than when conditioned in distribution. The balance between these two observations allows for exceptional performance in the single LoRA scenario, which nevertheless deteriorates when multiple LoRAs are loaded. Our method operates in the latent space by dividing it into spatial patches and computing cosine similarity between each patch’s predicted noise and that of the base model. These similarities are used to construct a spatially-aware weight matrix, which guides a weighted aggregation of LoRA outputs. To address domain drift, we further propose a modification to classifier-free guidance that incorporates the base model’s unconditional score into the composition. We extend this formulation to a dynamic module selection setting, enabling inference-time selection of relevant LoRA adapters from a large pool. LoRAtorio achieves state-of-the-art performance, showing up to a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, and generalises effectively to multiple latent diffusion models.
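The core scoring step, patch-wise cosine similarity between each LoRA's predicted noise and the base model's, can be sketched as follows. Tensor shapes, the patch size, and the softmax temperature are assumptions; the sign convention follows the paper's observation that in-domain LoRA outputs diverge from the base model.

```python
# Sketch: split each LoRA's predicted noise into spatial patches, measure
# cosine similarity to the base model's prediction per patch, and turn the
# negated similarities into per-patch LoRA weights for aggregation.
import torch
import torch.nn.functional as F

def lora_patch_weights(base_eps, lora_eps, patch=8, temperature=0.1):
    """base_eps: (C, H, W); lora_eps: (N, C, H, W) for N loaded LoRAs."""
    # Unfold into non-overlapping patches: (N, num_patches, C*patch*patch).
    lo = F.unfold(lora_eps, kernel_size=patch, stride=patch).transpose(1, 2)
    ba = F.unfold(base_eps.unsqueeze(0), kernel_size=patch,
                  stride=patch).transpose(1, 2)
    sim = F.cosine_similarity(lo, ba, dim=-1)        # (N, num_patches)
    # Lower similarity to the base = stronger in-domain LoRA signal, so
    # weight by the negated similarity (softmax over LoRAs per patch).
    return torch.softmax(-sim / temperature, dim=0)  # (N, num_patches)

base = torch.randn(4, 64, 64)      # base model noise prediction (latent)
loras = torch.randn(3, 4, 64, 64)  # noise predictions of three LoRAs
w = lora_patch_weights(base, loras)
print(w.shape, w.sum(dim=0)[:4])   # weights sum to 1 per patch
```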
[159] Thyme: Think Beyond Images
Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou
Main category: cs.CV
TL;DR: Thyme introduces a novel paradigm for MLLMs to autonomously generate and execute image processing and computational operations via code, outperforming existing methods in perception and reasoning tasks.
Details
Motivation: To bridge the gap between proprietary models (O3) and open-source work by enabling richer image manipulations and logical reasoning through code.
Method: A two-stage training strategy: SFT on 500K samples for code generation, followed by RL with GRPO-ATS for refined decision-making.
Result: Significant performance gains on nearly 20 benchmarks, especially in high-resolution perception and complex reasoning.
Conclusion: Thyme advances MLLMs by combining autonomous code execution with image processing, offering a robust solution for enhanced reasoning and perception.
Abstract: Following OpenAI’s introduction of the “thinking with images” concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing “think with images” approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
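The adaptive-temperature idea in GRPO-ATS, sampling text tokens with a higher temperature for exploration and code tokens with a lower one for precision, can be sketched with a toy decoding loop. The fence-based mode switch, the temperature values, and the stand-in model are illustrative assumptions.

```python
# Toy decoding loop switching sampling temperature between text and code.
import torch

VOCAB = 100
FENCE_ID = 99  # assumed id of the code-fence token in this toy setup

class ToyLM(torch.nn.Module):
    """Stand-in language model emitting random next-token logits."""
    def forward(self, ids):
        return torch.randn(len(ids), VOCAB)

def sample(logits, temperature):
    return int(torch.multinomial(torch.softmax(logits / temperature, -1), 1))

def generate(model, prompt_ids, max_new=64, t_text=1.0, t_code=0.3):
    ids, in_code = list(prompt_ids), False
    for _ in range(max_new):
        logits = model(ids)[-1]                       # next-token logits
        tok = sample(logits, t_code if in_code else t_text)
        ids.append(tok)
        if tok == FENCE_ID:     # entering or leaving a code block
            in_code = not in_code
    return ids

out = generate(ToyLM(), prompt_ids=[1, 2, 3])
```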
[160] SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models
Nader Zantout, Haochen Zhang, Pujith Kachana, Jinkai Qiu, Guofei Chen, Ji Zhang, Wenshan Wang
Main category: cs.CV
TL;DR: SORT3D is a method for interpreting object-referential language in 3D scenes, leveraging 2D data and LLMs for zero-shot generalization without requiring text-to-3D training data.
Details
Motivation: The challenge lies in grounding objects in 3D with diverse scenes, fine-grained objects, and complex language references, compounded by limited natural language training data in 3D.
Method: SORT3D combines 2D object attributes, a spatial reasoning toolbox, and LLMs for sequential reasoning, enabling zero-shot application.
Result: Achieves state-of-the-art zero-shot performance on view-dependent grounding tasks and works in real-time on autonomous vehicles.
Conclusion: SORT3D effectively addresses the challenges of 3D object grounding with minimal data and generalizes to unseen environments.
Abstract: Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run real-time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation on previously unseen real-world environments. All source code for the system pipeline is publicly released at https://github.com/nzantout/SORT3D.
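A heuristic of the kind such a spatial reasoning toolbox might expose is a view-dependent "left of" test, computed by transforming object centroids into the viewer's frame. The function names and frame conventions below are assumptions.

```python
# Hypothetical spatial predicate: view-dependent "left of".
import numpy as np

def to_viewer_frame(p_world, cam_pos, cam_yaw):
    """Transform a world point into a viewer frame where +x is right and
    +y is forward; cam_yaw is the viewing direction in radians."""
    d = np.asarray(p_world) - np.asarray(cam_pos)
    c, s = np.cos(-cam_yaw), np.sin(-cam_yaw)
    return np.array([c * d[0] - s * d[1], s * d[0] + c * d[1]])

def is_left_of(obj, anchor, cam_pos, cam_yaw):
    """True if obj appears to the viewer's left of the anchor object."""
    return to_viewer_frame(obj, cam_pos, cam_yaw)[0] < \
           to_viewer_frame(anchor, cam_pos, cam_yaw)[0]

# Viewer at origin facing +y; the chair at x=-2 is left of the table at x=+1.
print(is_left_of(obj=(-2.0, 5.0), anchor=(1.0, 5.0),
                 cam_pos=(0.0, 0.0), cam_yaw=0.0))  # True
```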
[161] Blending 3D Geometry and Machine Learning for Multi-View Stereopsis
Vibhas Vats, Md. Alimoor Reza, David Crandall, Soon-heung Jung
Main category: cs.CV
TL;DR: GC MVSNet++ integrates geometric consistency checks during learning, improving efficiency and performance in multi-view stereo tasks.
Details
Motivation: Traditional MVS methods rely on post-processing for geometric consistency, while learning-based methods ignore it during training. This work aims to enforce geometric consistency during learning for better results.
Method: Introduces GC MVSNet++, which enforces multi-view, multi-scale geometric consistency during learning and uses a densely connected cost regularization network with optimized block designs.
Result: Achieves state-of-the-art on DTU and BlendedMVS datasets and ranks second on Tanks and Temples benchmark, with training iterations halved.
Conclusion: GC MVSNet++ is the first to enforce multi-view, multi-scale geometric consistency during learning, significantly improving MVS performance.
Abstract: Traditional multi-view stereo (MVS) methods primarily depend on photometric and geometric consistency constraints. In contrast, modern learning-based algorithms often rely on the plane sweep algorithm to infer 3D geometry, applying explicit geometric consistency (GC) checks only as a post-processing step, with no impact on the learning process itself. In this work, we introduce GC MVSNet++, a novel approach that actively enforces geometric consistency of reference view depth maps across multiple source views (multi view) and at various scales (multi scale) during the learning phase (see Fig. 1). This integrated GC check significantly accelerates the learning process by directly penalizing geometrically inconsistent pixels, effectively halving the number of training iterations compared to other MVS methods. Furthermore, we introduce a densely connected cost regularization network with two distinct block designs, simple and feature dense, optimized to harness dense feature connections for enhanced regularization. Extensive experiments demonstrate that our approach achieves a new state of the art on the DTU and BlendedMVS datasets and secures second place on the Tanks and Temples benchmark. To our knowledge, GC MVSNet++ is the first method to enforce multi-view, multi-scale supervised geometric consistency during learning. Our code is available.
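The geometric consistency check at the heart of the method can be sketched in simplified form: lift reference pixels to 3D with their predicted depth, project them into a source view, and flag pixels whose depth disagrees with the source depth map. Camera conventions and the threshold are assumptions.

```python
# Simplified per-pixel multi-view geometric consistency check.
import numpy as np

def consistency_mask(depth_ref, depth_src, K, R, t, rel_thresh=0.01):
    """depth_*: (H, W); K: (3,3) intrinsics; [R|t] maps ref coords to src."""
    H, W = depth_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, N)

    # Back-project reference pixels to 3D, then transform into the source view.
    pts_ref = np.linalg.inv(K) @ pix * depth_ref.reshape(1, -1)
    pts_src = R @ pts_ref + t.reshape(3, 1)
    proj = K @ pts_src
    z_proj = proj[2]                      # depth each point should have in src
    uv = (proj[:2] / z_proj).round().astype(int)

    ok = (0 <= uv[0]) & (uv[0] < W) & (0 <= uv[1]) & (uv[1] < H) & (z_proj > 0)
    err = np.full(uv.shape[1], np.inf)
    err[ok] = np.abs(depth_src[uv[1, ok], uv[0, ok]] - z_proj[ok]) / z_proj[ok]
    # Consistent where relative depth error is small; such a mask can be used
    # during training to penalize geometrically inconsistent pixels.
    return (err < rel_thresh).reshape(H, W)

K = np.array([[500., 0, 32], [0, 500., 32], [0, 0, 1]])
depth = np.full((64, 64), 5.0)
mask = consistency_mask(depth, depth, K, np.eye(3), np.zeros(3))
print(mask.mean())  # identity pose, identical depths: fully consistent (1.0)
```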
[162] ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction
Hao Liu, Yu Hu, Rakiba Rayhana, Ling Bai, Zheng Liu
Main category: cs.CV
TL;DR: A method using low-cost load cells and image-based signal processing predicts bed-exit intent early, outperforming existing baselines in accuracy and F1 score.
Details
Motivation: Bed-related falls are a significant issue in healthcare, and current alarms often detect falls too late.
Method: Uses a single load cell under a bed leg, converts signals into images (RGB line plot and texture maps), and processes them with a dual-stream Swin Transformer (ViFusionTST).
Result: Achieves 0.885 accuracy and 0.794 F1 score on real-world data, surpassing other time-series methods.
Conclusion: Image-based fusion of load-sensor signals is effective for real-time, privacy-preserving fall prevention.
Abstract: Bed-related falls remain a major source of injury in hospitals and long-term care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only one low-cost load cell mounted under a bed leg. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps (recurrence plot, Markov transition field, and Gramian angular field) that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.
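Two of the texture encodings named in the abstract, the Gramian angular field and the recurrence plot, are compact enough to implement directly. The window length, threshold, and synthetic signal below are illustrative.

```python
# Minimal NumPy versions of two signal-to-image transforms.
import numpy as np

def gramian_angular_field(x):
    """Gramian angular summation field of a 1D signal."""
    x = np.asarray(x, dtype=float)
    # Rescale to [-1, 1], then map each value to an angle.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1
    phi = np.arccos(np.clip(x, -1, 1))
    return np.cos(phi[:, None] + phi[None, :])    # (T, T) image

def recurrence_plot(x, eps=0.1):
    """Binary recurrence plot: 1 where two samples are close."""
    x = np.asarray(x, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])
    return (dist < eps).astype(np.uint8)          # (T, T) image

load = np.sin(np.linspace(0, 6 * np.pi, 128)) + 0.05 * np.random.randn(128)
gaf, rp = gramian_angular_field(load), recurrence_plot(load)
print(gaf.shape, rp.shape)  # two of the image channels fed to the model
```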
[163] Scanpath Prediction in Panoramic Videos via Expected Code Length Minimization
Mu Li, Kanglong Fan, Kede Ma
Main category: cs.CV
TL;DR: A new method for predicting human scanpaths in panoramic videos uses lossy data compression principles, avoiding reliance on ground-truth data and improving accuracy and realism.
Details
Motivation: The challenge lies in predicting human scanpaths due to spherical geometry, multimodality, and output uncertainty. Existing methods often fail to address these fully.
Method: The approach minimizes the expected code length of quantized scanpaths, fitting a discrete conditional probability model via maximum likelihood, using viewport sequences and historical scanpaths as inputs.
Result: The method outperforms others in prediction accuracy and perceptual realism, validated by experiments and psychophysical tests.
Conclusion: The proposed criterion and PID controller-based sampler effectively predict realistic scanpaths without ground-truth reliance, generalizing well to unseen datasets.
Abstract: Predicting human scanpaths when exploring panoramic videos is a challenging task due to the spherical geometry and the multimodality of the input, and the inherent uncertainty and diversity of the output. Most previous methods fail to give a complete treatment of these characteristics, and thus are prone to errors. In this paper, we present a simple new criterion for scanpath prediction based on principles from lossy data compression. This criterion suggests minimizing the expected code length of quantized scanpaths in a training set, which corresponds to fitting a discrete conditional probability model via maximum likelihood. Specifically, the probability model is conditioned on two modalities: a viewport sequence as the deformation-reduced visual input and a set of relative historical scanpaths projected onto respective viewports as the aligned path input. The probability model is parameterized by a product of discretized Gaussian mixture models to capture the uncertainty and the diversity of scanpaths from different users. Most importantly, the training of the probability model does not rely on the specification of “ground-truth” scanpaths for imitation learning. We also introduce a proportional-integral-derivative (PID) controller-based sampler to generate realistic human-like scanpaths from the learned probability model. Experimental results demonstrate that our method consistently produces better quantitative scanpath results in terms of prediction accuracy (by comparing to the assumed “ground-truths”) and perceptual realism (through machine discrimination) over a wide range of prediction horizons. We additionally verify the perceptual realism improvement via a formal psychophysical experiment and the generalization improvement on several unseen panoramic video datasets.
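The criterion can be stated in miniature: the expected code length of a quantized value under a probability model is its negative log2-likelihood, so minimizing it is exactly maximum-likelihood fitting. The sketch below scores one quantized scanpath displacement under a discretized 1D Gaussian mixture; the bin width and mixture parameters are illustrative, and the paper's model is conditional and higher-dimensional.

```python
# Code length of a quantized value under a discretized Gaussian mixture.
import numpy as np
from scipy.stats import norm

def discretized_mixture_prob(x, bin_width, weights, means, stds):
    """P(bin containing x) under a 1D Gaussian mixture, via CDF differences."""
    lo, hi = x - bin_width / 2, x + bin_width / 2
    return sum(w * (norm.cdf(hi, m, s) - norm.cdf(lo, m, s))
               for w, m, s in zip(weights, means, stds))

bin_w = 0.5                                  # quantization step (degrees)
weights, means, stds = [0.7, 0.3], [0.0, 4.0], [1.0, 2.0]

displacement = 3.5                           # one quantized scanpath step
p = discretized_mixture_prob(displacement, bin_w, weights, means, stds)
code_length_bits = -np.log2(p)
print(f"p = {p:.4f}, code length = {code_length_bits:.2f} bits")
# Training minimizes the average of such code lengths over all steps,
# which is exactly fitting the mixture by maximum likelihood.
```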
[164] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-V Team, :, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Main category: cs.CV
TL;DR: GLM-4.1V-Thinking and GLM-4.5V are advanced vision-language models (VLMs) with state-of-the-art performance across diverse tasks, outperforming open-source and some closed-source models.
Details
Motivation: To advance general-purpose multimodal understanding and reasoning by developing a capable vision foundation model and enhancing its potential through innovative training methods.
Method: Large-scale pre-training followed by Reinforcement Learning with Curriculum Sampling (RLCS) to improve performance across tasks like STEM, video understanding, and coding.
Result: GLM-4.5V achieves top performance on 42 benchmarks, surpassing open-source models and competing with closed-source ones like Gemini-2.5-Flash. GLM-4.1V-9B-Thinking also outperforms larger models on 29 benchmarks.
Conclusion: The models demonstrate superior multimodal reasoning and understanding, with open-sourced availability for broader use.
Abstract: We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models and more information are released at https://github.com/zai-org/GLM-V.
[165] Lightweight Attribute Localizing Models for Pedestrian Attribute Recognition
Ashish Jha, Dimitrii Ermilov, Konstantin Sobolev, Anh Huy Phan, Salman Ahmadi-Asl, Naveed Ahmed, Imran Junejo, Zaher AL Aghbari, Thar Baker, Ahmed Mohamed Khedr, Andrzej Cichocki
Main category: cs.CV
TL;DR: A novel method optimizes low-rank layer compression for Pedestrian Attribute Recognition (PAR) to reduce model complexity while preserving performance.
Details
Motivation: DNNs for PAR are over-parameterized and computationally complex, unsuitable for resource-constrained devices. Traditional compression methods fail to preserve gradient direction, leading to inefficiency and accuracy loss.
Method: Proposes optimizing ranks for low-rank layers to align gradient directions with the original model, using the ALM model for rank optimization and CPD-EPC or truncated SVD for compression.
Result: Reduces model complexity while maintaining high performance.
Conclusion: The approach enables efficient compression for PAR tasks by preserving gradient direction and optimizing ranks.
Abstract: Pedestrian Attribute Recognition (PAR) focuses on identifying various attributes in pedestrian images, with key applications in person retrieval, suspect re-identification, and soft biometrics. However, Deep Neural Networks (DNNs) for PAR often suffer from over-parameterization and high computational complexity, making them unsuitable for resource-constrained devices. Traditional tensor-based compression methods typically factorize layers without adequately preserving the gradient direction during compression, leading to inefficient compression and a significant accuracy loss. In this work, we propose a novel approach for determining the optimal ranks of low-rank layers, ensuring that the gradient direction of the compressed model closely aligns with that of the original model. This means that the compressed model effectively preserves the update direction of the full model, enabling more efficient compression for PAR tasks. The proposed procedure optimizes the compression ranks for each layer within the Attribute Localizing Model (ALM), followed by compression using CPD-EPC or truncated SVD. This results in a reduction in model complexity while maintaining high performance.
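The truncated-SVD compression step can be sketched directly: factor a linear layer's weight into two low-rank layers. Choosing the rank so the compressed model's gradient direction stays aligned with the original is the paper's contribution and is not reproduced here; the fixed rank below is illustrative.

```python
# Compress a linear layer by truncated SVD: W (out x in) -> B @ A.
import torch
import torch.nn as nn

def compress_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace W with two layers A: in->rank and B: rank->out."""
    W = layer.weight.data                              # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    A.weight.data = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]  # (rank, in)
    B.weight.data = U[:, :rank] * S[:rank].sqrt()             # (out, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)

fc = nn.Linear(512, 256)
fc_small = compress_linear(fc, rank=32)   # 512*256 -> (512+256)*32 parameters
x = torch.randn(4, 512)
print((fc(x) - fc_small(x)).abs().max())  # low-rank approximation error
```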
[166] Compositional Zero-shot Learning via Progressive Language-based Observations
Lin Li, Guikun Chen, Zhen Wang, Jun Xiao, Long Chen
Main category: cs.CV
TL;DR: The paper introduces Progressive Language-based Observations (PLO) to address challenges in compositional zero-shot learning by dynamically determining observation orders of primitives using vision-language models (VLMs) and large language models (LLMs).
Details
Motivation: The appearance of states or objects varies when combined with different primitives, making compositional recognition difficult. The paper aims to mitigate this variance by predicting compositions based on pre-observed primitives.
Method: PLO uses VLMs and LLMs to dynamically determine observation orders. Two variants are proposed: PLO-VLM (two-step method) and PLO-LLM (multi-step scheme with composition-specific prompts).
Result: Extensive experiments on three datasets show PLO outperforms state-of-the-art methods in compositional recognition.
Conclusion: PLO effectively addresses compositional variance and improves recognition by leveraging language-based observations.
Abstract: Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (state and object) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. There are two key factors: object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state “old” can signify a vintage design for a “car” or an advanced age for a “cat”. In this paper, we argue that these variances can be mitigated by predicting composition categories based on pre-observed primitive. To this end, we propose Progressive Language-based Observations (PLO), which can dynamically determine a better observation order of primitives. These observations comprise a series of concepts or languages that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing classifier dynamically determines the observation order of two primitives. 2) PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observing. Extensive ablations on three challenging datasets demonstrate the superiority of PLO compared with state-of-the-art methods, affirming its abilities in compositional recognition.
[167] Wild2Avatar: Rendering Humans Behind Occlusions
Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli
Main category: cs.CV
TL;DR: Wild2Avatar is a neural rendering method for occluded monocular videos, decoupling scenes into occlusion, human, and background for realistic rendering.
Details
Motivation: Existing methods fail in occluded real-world scenes, requiring clear views. Wild2Avatar addresses this gap.
Method: Uses occlusion-aware scene parameterization and objective functions to decouple and ensure human model completeness.
Result: Effective in rendering humans from occluded in-the-wild videos.
Conclusion: Wild2Avatar successfully tackles occlusion challenges in monocular video rendering.
Abstract: Rendering the visual appearance of moving humans from occluded monocular videos is a challenging task. Most existing research renders 3D humans under ideal conditions, requiring a clear and unobstructed scene. Those methods cannot be used to render humans in real-world scenes where obstacles may block the camera’s view and lead to partial occlusions. In this work, we present Wild2Avatar, a neural rendering approach catered for occluded in-the-wild monocular videos. We propose occlusion-aware scene parameterization for decoupling the scene into three parts - occlusion, human, and background. Additionally, extensive objective functions are designed to help enforce the decoupling of the human from both the occlusion and the background and to ensure the completeness of the human model. We verify the effectiveness of our approach with experiments on in-the-wild videos.
[168] HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection
Han Wang, Zhuoran Wang, Roy Ka-Wei Lee
Main category: cs.CV
TL;DR: HateClipSeg is a large-scale multimodal dataset for hate speech detection in videos, featuring fine-grained annotations and high inter-annotator agreement. It introduces three benchmarking tasks, revealing gaps in current models.
Details
Motivation: The complexity of multimodal content and lack of fine-grained annotations in existing datasets make hate speech detection in videos challenging.
Method: A three-stage annotation process was used to create HateClipSeg, with over 11,714 segments labeled across categories. Three benchmarking tasks were proposed.
Result: Results show significant gaps in current models, indicating the need for advanced multimodal and temporally aware approaches.
Conclusion: HateClipSeg addresses dataset limitations and provides a benchmark for improving hate speech detection in videos.
Abstract: Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or as one of five Offensive categories (Hateful, Insulting, Sexual, Violence, Self-Harm), along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff’s alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.
[169] Effective Message Hiding with Order-Preserving Mechanisms
Gao Yu, Qiu Xuchong, Ye Zihan
Main category: cs.CV
TL;DR: StegaFormer, an MLP-based framework, improves message hiding by preserving bit order and enabling global fusion between modalities, outperforming existing methods in recovery accuracy, capacity, and imperceptibility.
Details
Motivation: Convolutional neural networks struggle with preserving message bit order and addressing modality discrepancies, limiting recovery accuracy in message hiding.
Method: StegaFormer uses Order-Preserving Message Encoder/Decoder (OPME/OPMD) for bit order and Global Message-Image Fusion (GMIF) for cross-modality fusion.
Result: Outperforms state-of-the-art methods on COCO and DIV2K datasets in recovery accuracy, capacity, and imperceptibility.
Conclusion: StegaFormer effectively addresses challenges in message hiding, offering superior performance and promising future applications.
Abstract: Message hiding, a technique that conceals secret message bits within a cover image, aims to achieve an optimal balance among message capacity, recovery accuracy, and imperceptibility. While convolutional neural networks have notably improved message capacity and imperceptibility, achieving high recovery accuracy remains challenging. This challenge arises because convolutional operations struggle to preserve the sequential order of message bits and effectively address the discrepancy between these two modalities. To address this, we propose StegaFormer, an innovative MLP-based framework designed to preserve bit order and enable global fusion between modalities. Specifically, StegaFormer incorporates three crucial components: Order-Preserving Message Encoder (OPME), Decoder (OPMD) and Global Message-Image Fusion (GMIF). OPME and OPMD aim to preserve the order of message bits by segmenting the entire sequence into equal-length segments and incorporating sequential information during encoding and decoding. Meanwhile, GMIF employs a cross-modality fusion mechanism to effectively fuse the features from the two uncorrelated modalities. Experimental results on the COCO and DIV2K datasets demonstrate that StegaFormer surpasses existing state-of-the-art methods in terms of recovery accuracy, message capacity, and imperceptibility. We will make our code publicly available.
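As a rough illustration of the order-preserving idea behind OPME, the sketch below splits a message into equal-length segments and attaches learned positional embeddings before projection; the dimensions and module structure are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class OrderPreservingEncoderSketch(nn.Module):
    """Minimal sketch of the order-preserving idea: split the message into
    equal-length segments and attach positional information so the bit order
    survives encoding. Dimensions are illustrative."""
    def __init__(self, msg_len=1024, num_segments=16, dim=256):
        super().__init__()
        assert msg_len % num_segments == 0
        self.seg_len = msg_len // num_segments
        self.proj = nn.Linear(self.seg_len, dim)                      # per-segment projection
        self.pos_emb = nn.Parameter(torch.randn(num_segments, dim))   # sequential information

    def forward(self, bits):                           # bits: (B, msg_len) in {0, 1}
        b = bits.size(0)
        segs = bits.float().view(b, -1, self.seg_len)  # (B, K, msg_len / K)
        return self.proj(segs) + self.pos_emb          # (B, K, dim), order preserved

x = torch.randint(0, 2, (4, 1024))
feats = OrderPreservingEncoderSketch()(x)
print(feats.shape)  # torch.Size([4, 16, 256])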
[170] Reconstructing Satellites in 3D from Amateur Telescope Images
Zhiming Chang, Boyang Liu, Yifei Xia, Youming Guo, Boxin Shi, He Sun
Main category: cs.CV
TL;DR: A novel computational imaging framework for 3D satellite reconstruction from ground-based images, outperforming NeRF-based methods.
Details
Motivation: Challenges in 3D satellite model reconstruction due to atmospheric turbulence, long distances, limited viewpoints, and low signal-to-noise ratios.
Method: Hybrid image pre-processing pipeline combined with joint pose estimation and 3D reconstruction using Gaussian Splatting and Branch-and-Bound search.
Result: Robust 3D reconstructions validated on synthetic datasets and real observations (Tiangong Space Station, ISS), outperforming NeRF-based methods in SSIM, PSNR, LPIPS, and Chamfer Distance.
Conclusion: The framework enables high-fidelity 3D satellite monitoring from Earth, providing a cost-effective solution for space situational awareness.
Abstract: Monitoring space objects is crucial for space situational awareness, yet reconstructing 3D satellite models from ground-based telescope images is challenging due to atmospheric turbulence, long observation distances, limited viewpoints, and low signal-to-noise ratios. In this paper, we propose a novel computational imaging framework that overcomes these obstacles by integrating a hybrid image pre-processing pipeline with a joint pose estimation and 3D reconstruction module based on controlled Gaussian Splatting (GS) and Branch-and-Bound (BnB) search. We validate our approach on both synthetic satellite datasets and on-sky observations of China’s Tiangong Space Station and the International Space Station, achieving robust 3D reconstructions of low-Earth orbit satellites from ground-based data. Quantitative evaluations using SSIM, PSNR, LPIPS, and Chamfer Distance demonstrate that our method outperforms state-of-the-art NeRF-based approaches, and ablation studies confirm the critical role of each component. Our framework enables high-fidelity 3D satellite monitoring from Earth, offering a cost-effective alternative for space situational awareness. Project page: https://ai4scientificimaging.org/ReconstructingSatellites
[171] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment
Ziheng Jia, Jiaying Qian, Zicheng Zhang, Zijian Chen, Xiongkuo Min
Main category: cs.CV
TL;DR: Refine-IQA introduces a multi-stage RFT framework for IQA, enhancing visual perception and “think” process supervision, achieving top performance.
Details
Motivation: Address gaps in existing RFT-based IQA methods, which lack reward supervision for the “think” process and fine-tune directly on downstream tasks without enhancing low-level visual perception.
Method: Proposes a two-stage approach: Stage-1 uses the Refine-Perception-20K dataset and multi-task rewards to improve visual perception; Stage-2 introduces a probability-difference reward for “think” process supervision.
Result: Refine-IQA models excel in perception and scoring tasks, with strong performance on quality interpreting benchmarks.
Conclusion: The framework effectively enhances IQA by addressing key limitations, demonstrating superior performance and robust “think” capabilities.
Abstract: Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model’s rollouts but provide no reward supervision for the “think” process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model’s native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model’s visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a strategy involving a probability-difference reward for “think” process supervision. The resulting Refine-IQA series models achieve outstanding performance on both perception and scoring tasks. Notably, our paradigm activates a robust “think” (quality interpreting) capability that also attains exceptional results on the corresponding quality-interpreting benchmark.
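One plausible reading of the probability-difference reward is sketched below: the “think” trace is rewarded by how much it shifts the model's probability mass toward the reference answer. This is an illustrative reconstruction, not the paper's exact formulation.

```python
# Hedged sketch: one plausible form of a probability-difference reward, scoring
# a "think" trace by how much it raises the model's probability of the
# reference answer relative to answering without the trace.
def probability_difference_reward(p_answer_with_think: float,
                                  p_answer_without_think: float) -> float:
    """Positive when the reasoning trace helps, negative when it hurts."""
    return p_answer_with_think - p_answer_without_think

# Example: the trace lifts the reference score's probability from 0.42 to 0.71.
print(probability_difference_reward(0.71, 0.42))  # 0.29
```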
[172] FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance
Jiasong Feng, Ao Ma, Jing Wang, Ke Cao, Zhanjie Zhang
Main category: cs.CV
TL;DR: FancyVideo introduces a Cross-frame Textual Guidance Module (CTGM) to enhance text-to-video generation by providing frame-specific textual guidance, achieving state-of-the-art results.
Details
Motivation: Existing text-to-video models lack frame-specific textual guidance, limiting coherent motion generation. FancyVideo aims to address this by improving temporal logic comprehension.
Method: FancyVideo uses CTGM with Temporal Information Injector (TII) and Temporal Affinity Refiner (TAR) to inject and refine frame-specific textual conditions.
Result: Achieves state-of-the-art performance on the EvalCrafter benchmark and supports both text-to-video and image-to-video tasks.
Conclusion: FancyVideo effectively synthesizes dynamic and consistent videos, outperforming existing methods in temporal coherence and motion richness.
Abstract: Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model’s capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII) and Temporal Affinity Refiner (TAR) at the beginning and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark and facilitates the synthesis of dynamic and consistent videos. Note that the T2V process of FancyVideo essentially involves a text-to-image step followed by T+I2V. This means it also supports the generation of videos from user images, i.e., the image-to-video (I2V) task. Extensive experiments show that its I2V performance is also outstanding.
[173] Towards Physically Realizable Adversarial Attacks in Embodied Vision Navigation
Meng Chen, Jiawei Tu, Chao Qi, Yonghao Dang, Feng Zhou, Wei Wei, Jianqin Yin
Main category: cs.CV
TL;DR: The paper proposes a practical adversarial attack method for embodied vision navigation using learnable patches, optimizing for multi-view effectiveness and visual naturalness.
Details
Motivation: Addressing the susceptibility of embodied vision navigation to adversarial attacks, especially 3D physical threats, while overcoming the limitations of existing methods in physical feasibility and multi-view effectiveness.
Method: Attaching adversarial patches to objects with learnable opacity and textures, using multi-view optimization and a two-stage opacity optimization mechanism.
Result: The method reduces navigation success rates by 22.39%, outperforming prior methods in practicality, effectiveness, and naturalness.
Conclusion: The proposed attack method is practical and effective, highlighting vulnerabilities in embodied vision navigation systems.
Abstract: The significant advancements in embodied vision navigation have raised concerns about its susceptibility to adversarial attacks exploiting deep neural networks. Investigating the adversarial robustness of embodied vision navigation is crucial, especially given the threat of 3D physical attacks that could pose risks to human safety. However, existing attack methods for embodied vision navigation often lack physical feasibility due to challenges in transferring digital perturbations into the physical world. Moreover, current physical attacks for object detection struggle to achieve both multi-view effectiveness and visual naturalness in navigation scenarios. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches to objects, where both opacity and textures are learnable. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which optimizes the patch’s texture based on feedback from the vision-based perception model used in navigation. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, in which opacity is fine-tuned after texture optimization. Experimental results demonstrate that our adversarial patches decrease the navigation success rate by an average of 22.39%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: https://github.com/chen37058/Physical-Attacks-in-Embodied-Nav
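The two-stage schedule reads as texture first, opacity second; below is a hedged sketch of that training loop, with the renderer and both loss terms left as placeholder assumptions rather than the paper's actual objectives.

```python
import torch

# Minimal sketch of the two-stage schedule described above: optimize the patch
# texture across sampled viewpoints first, then freeze it and fine-tune opacity
# for inconspicuousness. `render_views` and both losses are placeholders.
texture = torch.rand(3, 128, 128, requires_grad=True)
opacity = torch.full((1, 128, 128), 0.8, requires_grad=True)

def attack_loss(views):       # placeholder: attack success across views
    return views.mean()

def naturalness_loss(alpha):  # placeholder: penalize conspicuous opacity
    return alpha.mean()

def render_views(tex, alpha): # placeholder: object-aware multi-view rendering
    return (tex.unsqueeze(0) * alpha).expand(8, -1, -1, -1)

# Stage 1: texture optimization with opacity held fixed.
opt_tex = torch.optim.Adam([texture], lr=1e-2)
for _ in range(100):
    loss = attack_loss(render_views(texture, opacity.detach()))
    opt_tex.zero_grad(); loss.backward(); opt_tex.step()

# Stage 2: opacity fine-tuning with the texture frozen.
opt_alpha = torch.optim.Adam([opacity], lr=1e-2)
for _ in range(50):
    loss = attack_loss(render_views(texture.detach(), opacity)) \
         + 0.1 * naturalness_loss(opacity)
    opt_alpha.zero_grad(); loss.backward(); opt_alpha.step()
```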
[174] Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation
Hao Zhang, Yongqiang Ma, Wenqi Shao, Ping Luo, Nanning Zheng, Kaipeng Zhang
Main category: cs.CV
TL;DR: The paper introduces HRVMamba, a model combining State Space Models (SSMs) with multi-scale convolutions to efficiently capture long-range dependencies for dense prediction tasks like human pose estimation.
Details
Motivation: To address the inefficiency of Vision Transformers (ViTs) and limitations of existing visual SSMs (weak spatial bias, long-range forgetting, low-resolution outputs) in dense prediction tasks.
Method: Proposes the Dynamic Visual State Space (DVSS) block, integrating multi-scale convolutions and deformable operations for better spatial representation and semantic aggregation. HRVMamba, built with DVSS, is a multi-branch high-resolution architecture.
Result: HRVMamba performs competitively in human pose estimation, image classification, and semantic segmentation against CNN-, ViT-, and SSM-based baselines.
Conclusion: HRVMamba offers an efficient, scalable solution for high-resolution dense prediction tasks by enhancing SSMs with spatial inductive biases and adaptive sampling.
Abstract: Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. Vision Transformers (ViTs) have advanced global modeling through self-attention but suffer from quadratic computational complexity with respect to token count, limiting their efficiency and scalability to high-resolution inputs, especially on mobile and resource-constrained devices. State Space Models (SSMs), exemplified by Mamba, offer an efficient alternative by combining global receptive fields with linear computational complexity, enabling scalable and resource-friendly sequence modeling. However, when applied to dense prediction tasks, existing visual SSMs face key limitations: weak spatial inductive bias, long-range forgetting from hidden state decay, and low-resolution outputs that hinder fine-grained localization. To address these issues, we propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations to enhance local spatial representations and strengthen spatial inductive biases. Through architectural exploration and theoretical analysis, we incorporate deformable operation into the DVSS block, identifying it as an efficient and effective mechanism to enhance semantic aggregation and mitigate long-range forgetting via input-dependent, adaptive spatial sampling. We embed DVSS into a multi-branch high-resolution architecture to build HRVMamba, a novel model for efficient high-resolution representation learning. Extensive experiments on human pose estimation, image classification, and semantic segmentation show that HRVMamba performs competitively against leading CNN-, ViT-, and SSM-based baselines. Code is available at https://github.com/zhanghao5201/PoseVMamba.
[175] ShoulderShot: Generating Over-the-Shoulder Dialogue Videos
Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou, Zenghui Lu, Peng Shu
Main category: cs.CV
TL;DR: ShoulderShot is a framework for generating over-the-shoulder dialogue videos, addressing challenges like character consistency and spatial continuity, outperforming existing methods.
Details
Motivation: Over-the-shoulder dialogue videos are crucial for visual variety and emotional engagement but are underexplored in video generation research.
Method: Combines dual-shot generation with looping video to maintain character consistency and spatial continuity for extended dialogues.
Result: Outperforms existing methods in shot-reverse-shot layout, spatial continuity, and dialogue length flexibility.
Conclusion: ShoulderShot opens new possibilities for practical dialogue video generation, as demonstrated by its superior results.
Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers’ emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at https://shouldershot.github.io.
[176] MUNBa: Machine Unlearning via Nash Bargaining
Jing Wu, Mehrtash Harandi
Main category: cs.CV
TL;DR: The paper proposes a game-theoretic approach to Machine Unlearning (MU), addressing gradient conflicts by modeling MU as a two-player cooperative game, ensuring optimal trade-offs between forgetting and preserving objectives.
Details
Motivation: To resolve gradient conflicts and dominance in MU, which hinder optimal performance when balancing forgetting and preserving objectives.
Method: Reformulates MU as a two-player cooperative game (forgetting and preservation players) and uses Nash bargaining theory to derive a closed-form solution for Pareto optimality.
Result: Outperforms state-of-the-art MU algorithms in tasks like image classification and generation, achieving better trade-offs and robustness.
Conclusion: The game-theoretic approach ensures equilibrium and optimality in MU, improving forgetting precision and generalization preservation.
Abstract: Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm’s effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.
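For intuition about the bargaining objective, the sketch below brute-forces a convex combination of the forget and preserve gradients that maximizes the product of both players' gains; the paper derives a closed-form solution, so this numerical search is purely illustrative.

```python
import numpy as np

def bargain_direction(g_forget, g_preserve, steps=101):
    """Rough numerical illustration of the Nash bargaining idea: pick an update
    direction d = a*g_f + (1-a)*g_p maximizing the product of both players'
    gains <g_f, d> * <g_p, d>. The paper's closed-form solution replaces this
    brute-force search."""
    best_d, best_obj = None, -np.inf
    for a in np.linspace(0.0, 1.0, steps):
        d = a * g_forget + (1 - a) * g_preserve
        gain_f, gain_p = g_forget @ d, g_preserve @ d
        if gain_f > 0 and gain_p > 0:              # both players must benefit
            obj = np.log(gain_f) + np.log(gain_p)  # log of the Nash product
            if obj > best_obj:
                best_obj, best_d = obj, d
    return best_d  # None if no combination benefits both players

g_f = np.array([1.0, 0.2]); g_p = np.array([0.1, 1.0])  # mildly conflicting gradients
print(bargain_direction(g_f, g_p))
```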
[177] GBR: Generative Bundle Refinement for High-fidelity Gaussian Splatting with Enhanced Mesh Reconstruction
Jianing Zhang, Yuchao Zheng, Ziwei Li, Qionghai Dai, Xiaoyun Yuan
Main category: cs.CV
TL;DR: GBR (Generative Bundle Refinement) improves Gaussian splatting for sparse-view 3D scene reconstruction by integrating neural bundle adjustment and generative depth refinement, achieving high-fidelity results with only 4-6 input views.
Details
Motivation: Gaussian splatting struggles with sparse-view inputs due to limited geometric and photometric information, leading to ambiguities in depth, shape, and texture.
Method: GBR combines neural bundle adjustment for initial 3D point maps and point matches, followed by bundle adjustment optimization. It also uses a diffusion-based generative depth refinement module and a multimodal loss function for Gaussian splatting optimization.
Result: GBR outperforms existing methods on sparse-view inputs and successfully reconstructs large-scale scenes like the Pavilion of Prince Teng and the Great Wall with only 6 views.
Conclusion: GBR effectively addresses sparse-view challenges in Gaussian splatting, enabling high-fidelity 3D reconstruction and rendering with minimal input views.
Abstract: Gaussian splatting has gained attention for its efficient representation and rendering of 3D scenes using continuous Gaussian primitives. However, it struggles with sparse-view inputs due to limited geometric and photometric information, causing ambiguities in depth, shape, and texture. To address this, we propose GBR: Generative Bundle Refinement, a method for high-fidelity Gaussian splatting and meshing using only 4-6 input views. GBR integrates a neural bundle adjustment module to enhance geometry accuracy and a generative depth refinement module to improve geometry fidelity. More specifically, the neural bundle adjustment module integrates a foundation network to produce initial 3D point maps and point matches from unposed images, followed by bundle adjustment optimization to improve multiview consistency and point cloud accuracy. The generative depth refinement module employs a diffusion-based strategy to enhance geometric details and fidelity while preserving the scale. Finally, for Gaussian splatting optimization, we propose a multimodal loss function incorporating depth and normal consistency, geometric regularization, and pseudo-view supervision, providing robust guidance under sparse-view conditions. Experiments on widely used datasets show that GBR significantly outperforms existing methods under sparse-view inputs. Additionally, GBR demonstrates the ability to reconstruct and render large-scale real-world scenes, such as the Pavilion of Prince Teng and the Great Wall, with remarkable details using only 6 views.
[178] RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System
Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast
Main category: cs.CV
TL;DR: RL-MoE transforms visual data into privacy-preserving text, balancing accuracy and privacy using MoE and RL, outperforming baselines in protection and utility.
Details
Motivation: Address the conflict between rich visual data needs and privacy rights in ITS, overcoming limitations of existing methods like blurring or encryption.
Method: Combines Mixture-of-Experts (MoE) for scene decomposition with Reinforcement Learning (RL) to optimize text for accuracy and privacy.
Result: Reduces replay attack success to 9.4% on CFP-FP dataset while generating richer text than baselines.
Conclusion: RL-MoE offers a scalable, trustworthy solution for privacy-sensitive AI systems, advancing secure smart city and autonomous vehicle networks.
Abstract: The proliferation of AI-powered cameras in Intelligent Transportation Systems (ITS) creates a severe conflict between the need for rich visual data and the right to privacy. Existing privacy-preserving methods, such as blurring or encryption, are often insufficient due to creating an undesirable trade-off where either privacy is compromised against advanced reconstruction attacks or data utility is critically degraded. To resolve this challenge, we propose RL-MoE, a novel framework that transforms sensitive visual data into privacy-preserving textual descriptions, eliminating the need for direct image transmission. RL-MoE uniquely combines a Mixture-of-Experts (MoE) architecture for nuanced, multi-aspect scene decomposition with a Reinforcement Learning (RL) agent that optimizes the generated text for a dual objective of semantic accuracy and privacy preservation. Extensive experiments demonstrate that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks to just 9.4% on the CFP-FP dataset, while simultaneously generating richer textual content than baseline methods. Our work provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks.
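The dual objective can be pictured as a single scalar reward trading semantic accuracy against privacy leakage; the sketch below is a hedged simplification, with both scoring functions and the weighting left as placeholder assumptions.

```python
# Hedged sketch of the dual-objective reward described above: the RL agent is
# rewarded for semantically accurate text and penalized when the description
# leaks identifying detail. The paper's actual reward design may differ.
def rl_moe_reward(semantic_accuracy: float,
                  privacy_leakage: float,
                  lam: float = 1.0) -> float:
    """semantic_accuracy, privacy_leakage in [0, 1]; higher reward is better."""
    return semantic_accuracy - lam * privacy_leakage

# A description that captures the scene (0.9) with little leakage (0.05):
print(rl_moe_reward(0.9, 0.05))  # 0.85
```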
[179] Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking
You Wu, Yongxin Li, Mengyuan Liu, Xucheng Wang, Xiangyang Yang, Hengzhou Ye, Dan Zeng, Qijun Zhao, Shuiwang Li
Main category: cs.CV
TL;DR: AVTrack is an adaptive computation tracking framework for UAVs, improving efficiency and performance by selectively activating transformer blocks and learning view-invariant representations. AVTrack-MD enhances this with multi-teacher knowledge distillation, boosting speed by 17% without sacrificing performance.
Details
Motivation: Existing transformer-based models for visual tracking struggle with real-time performance on resource-limited devices, especially for UAV tracking.
Method: Proposes AVTrack with an Activation Module (AM) for adaptive computation and mutual information (MI) maximization for view-invariant representations. AVTrack-MD adds multi-teacher knowledge distillation.
Result: AVTrack-MD achieves comparable performance to AVTrack while reducing complexity and increasing tracking speed by over 17%.
Conclusion: AVTrack and AVTrack-MD offer efficient, high-performance solutions for UAV tracking, balancing speed and accuracy.
Abstract: Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, especially for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that adaptively activates transformer blocks through an Activation Module (AM), which dynamically optimizes the ViT architecture by selectively engaging relevant components. To address extreme viewpoint variations, we propose to learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to AVTrack's while reducing model complexity and boosting average tracking speed by over 17%. Code is available at: https://github.com/wuyou3474/AVTrack.
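A minimal sketch of the MI-maximization distillation step, assuming InfoNCE as the MI lower bound and simple mean aggregation of teacher features; the paper's softening and aggregation details may differ.

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(student_feat, teacher_feats, tau=0.1):
    """Sketch of MI-maximization distillation: aggregate features from several
    teachers, then treat matched student/teacher pairs in a batch as positives
    for an InfoNCE objective. Minimizing the cross-entropy below tightens the
    InfoNCE lower bound on MI (bound = log N - CE)."""
    teacher = torch.stack(teacher_feats).mean(dim=0)  # aggregate teachers: (B, D)
    s = F.normalize(student_feat, dim=-1)             # (B, D)
    t = F.normalize(teacher, dim=-1)                  # (B, D)
    logits = s @ t.T / tau                            # (B, B) pairwise similarity
    labels = torch.arange(s.size(0))                  # positives on the diagonal
    return -F.cross_entropy(logits, labels)           # maximize this quantity

student = torch.randn(8, 256)
teachers = [torch.randn(8, 256) for _ in range(3)]
loss = -infonce_mi_lower_bound(student, teachers)     # minimize the negative bound
```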
[180] Preacher: Paper-to-Video Agentic System
Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang
Main category: cs.CV
TL;DR: Preacher is a paper-to-video system that decomposes, summarizes, and reformulates research papers into structured video abstracts, overcoming limitations of current video generation models.
Details
Motivation: Current video generation models lack context, flexibility, and domain-specific knowledge, limiting their effectiveness for paper-to-video tasks.
Method: Preacher uses a top-down approach for decomposition and summarization, followed by bottom-up video generation with Progressive Chain of Thought (P-CoT) for planning.
Result: Preacher generates high-quality video abstracts across five research fields, outperforming existing models.
Conclusion: Preacher addresses key limitations of video generation models, offering a robust solution for paper-to-video tasks.
Abstract: The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/GenVerse/Paper2Video
[181] Towards Consumer-Grade Cybersickness Prediction: Multi-Model Alignment for Real-Time Vision-Only Inference
Yitong Zhu, Zhuowen Liang, Yiming Wu, Tangyao Li, Yuyang Wang
Main category: cs.CV
TL;DR: A scalable framework for predicting cybersickness in VR using non-invasive signals like head motion and eye tracking, achieving 88.4% accuracy comparable to EEG-based methods.
Details
Motivation: Cybersickness hinders VR adoption; existing methods rely on invasive signals (e.g., EEG), which are impractical for consumer use.
Method: Uses a modality-specific graph neural network with a Difference Attention Module and cross-modal alignment to predict cybersickness from non-invasive signals.
Result: Achieves 88.4% accuracy, close to EEG-based methods (89.16%), with real-time latency (90ms).
Conclusion: The framework is practical for consumer-grade VR, balancing accuracy and deployability without invasive hardware.
Abstract: Cybersickness remains a major obstacle to the widespread adoption of immersive virtual reality (VR), particularly in consumer-grade environments. While prior methods rely on invasive signals such as electroencephalography (EEG) for high predictive accuracy, these approaches require specialized hardware and are impractical for real-world applications. In this work, we propose a scalable, deployable framework for personalized cybersickness prediction leveraging only non-invasive signals readily available from commercial VR headsets, including head motion, eye tracking, and physiological responses. Our model employs a modality-specific graph neural network enhanced with a Difference Attention Module to extract temporal-spatial embeddings capturing dynamic changes across modalities. A cross-modal alignment module jointly trains the video encoder to learn personalized traits by aligning video features with sensor-derived representations. Consequently, the model accurately predicts individual cybersickness using only video input during inference. Experimental results show our model achieves 88.4% accuracy, closely matching EEG-based approaches (89.16%), while reducing deployment complexity. With an average inference latency of 90ms, our framework supports real-time applications, ideal for integration into consumer-grade VR platforms without compromising personalization or performance. The code will be released at https://github.com/U235-Aurora/PTGNN.
[182] LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition
Jinghan You, Shanglin Li, Yuanrui Sun, Jiangchuan Wei, Mingyu Guo, Chao Feng, Jiao Ran
Main category: cs.CV
TL;DR: LVFace, a ViT-based face recognition model with Progressive Cluster Optimization (PCO), outperforms CNNs and sets a new benchmark, winning the ICCV 2021 MFR Challenge.
Details
Motivation: CNNs dominate face recognition, but ViTs underperform due to suboptimal training paradigms. LVFace aims to unlock ViT's potential.
Method: LVFace uses PCO, combining negative class sub-sampling, feature expectation penalties, and cluster boundary refinement for robust training.
Result: LVFace surpasses UniFace and TopoFR, achieving state-of-the-art performance and winning the ICCV 2021 MFR Challenge.
Conclusion: LVFace proves ViTs’ efficacy in face recognition, offering scalability and compatibility with other models.
Abstract: Vision Transformers (ViTs) have revolutionized large-scale visual modeling, yet remain underexplored in face recognition (FR) where CNNs still dominate. We identify a critical bottleneck: CNN-inspired training paradigms fail to unlock ViT’s potential, leading to suboptimal performance and convergence instability. To address this challenge, we propose LVFace, a ViT-based FR model that integrates Progressive Cluster Optimization (PCO) to achieve superior results. Specifically, PCO sequentially applies negative class sub-sampling (NCS) for robust and fast feature alignment from random initialization, feature expectation penalties for centroid stabilization, performing cluster boundary refinement through full-batch training without NCS constraints. LVFace establishes a new state-of-the-art face recognition baseline, surpassing leading approaches such as UniFace and TopoFR across multiple benchmarks. Extensive experiments demonstrate that LVFace delivers consistent performance gains, while exhibiting scalability to large-scale datasets and compatibility with mainstream VLMs and LLMs. Notably, LVFace secured 1st place in the ICCV 2021 Masked Face Recognition (MFR)-Ongoing Challenge (March 2025), proving its efficacy in real-world scenarios. Project is available at https://github.com/bytedance/LVFace.
[183] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks
Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang
Main category: cs.CV
TL;DR: PTQAT is a hybrid quantization method combining PTQ and QAT, selectively fine-tuning critical layers for efficiency and accuracy, outperforming QAT-only baselines.
Details
Motivation: Address the inefficiency and performance issues of PTQ and QAT by proposing a hybrid approach.
Method: Select critical layers for QAT fine-tuning and apply PTQ to others, focusing on layers with smaller output discrepancies.
Result: Achieves better accuracy than QAT-only methods with fewer fine-tuned weights, supporting various bit widths and architectures.
Conclusion: PTQAT offers an efficient and universal solution for model quantization, balancing speed and accuracy.
Abstract: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model’s quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction, while fine-tuning fewer weights.
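The counter-intuitive selection rule (fine-tune the layers whose outputs change least under quantization) can be sketched as follows; `quantize_fn` and the cached per-layer calibration inputs are assumptions for illustration, not the paper's API.

```python
import torch
import torch.nn as nn

def select_layers_for_qat(model, calib_inputs, quantize_fn, k):
    """Sketch of PTQAT's selection rule: measure each layer's output
    discrepancy before vs. after quantization on calibration data and pick
    the k layers with the *smallest* discrepancy for QAT fine-tuning; the
    remaining layers stay PTQ-only. `calib_inputs` maps layer names to cached
    layer inputs; `quantize_fn` returns a quantized copy of a layer."""
    discrepancies = {}
    for name, layer in model.named_modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            with torch.no_grad():
                x = calib_inputs[name]
                err = (layer(x) - quantize_fn(layer)(x)).abs().mean().item()
            discrepancies[name] = err
    # Ascending sort: smallest output discrepancy first.
    return sorted(discrepancies, key=discrepancies.get)[:k]
```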
[184] FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing
Jinya Sakurai, Issei Sato
Main category: cs.CV
TL;DR: FairT2I is a framework using LLMs to detect and mitigate social biases in Text-to-Image models, improving diversity while maintaining image quality.
Details
Motivation: Address ethical concerns of societal biases in T2I models, which are amplified when AI-generated content reinforces stereotypes.
Method: Combines an LLM-based bias detection module and an attribute rebalancing module to fine-tune T2I models.
Result: Significantly reduces bias and enhances diversity in generated images, even detecting subtle biases.
Conclusion: FairT2I effectively mitigates biases and introduces a new benchmark dataset for evaluating bias in T2I models.
Abstract: The proliferation of Text-to-Image (T2I) models has revolutionized content creation, providing powerful tools for diverse applications ranging from artistic expression to educational material development and marketing. Despite these technological advancements, significant ethical concerns arise from these models’ reliance on large-scale datasets that often contain inherent societal biases. These biases are further amplified when AI-generated content is included in training data, potentially reinforcing and perpetuating stereotypes in the generated outputs. In this paper, we introduce FairT2I, a novel framework that harnesses large language models to detect and mitigate social biases in T2I generation. Our framework comprises two key components: (1) an LLM-based bias detection module that identifies potential social biases in generated images based on text prompts, and (2) an attribute rebalancing module that fine-tunes sensitive attributes within the T2I model to mitigate identified biases. Our extensive experiments across various T2I models and datasets show that FairT2I can significantly reduce bias while maintaining high-quality image generation. We conducted both qualitative user studies and quantitative non-parametric analyses in the generated image feature space, building upon the occupational dataset introduced in the Stable Bias study. Our results show that FairT2I successfully mitigates social biases and enhances the diversity of sensitive attributes in generated images. We further demonstrate, using the P2 dataset, that our framework can detect subtle biases that are challenging for human observers to perceive, extending beyond occupation-related prompts. On the basis of these findings, we introduce a new benchmark dataset for evaluating bias in T2I models.
[185] Introducing Unbiased Depth into 2D Gaussian Splatting for High-accuracy Surface Reconstruction
Yixin Yang, Yang Zhou, Hui Huang
Main category: cs.CV
TL;DR: The paper improves 2D Gaussian Splatting (2DGS) by addressing its failure on glossy surfaces through a novel depth convergence loss and rectified depth criterion, achieving better reconstruction quality.
Details
Motivation: 2DGS struggles with glossy surfaces due to reflection discontinuity, leading to visible holes. The authors aim to fix this issue.
Method: Replace depth distortion loss with depth convergence loss and rectify the depth criterion for surface determination.
Result: Significant improvement in reconstruction quality, with more complete and accurate surfaces compared to 2DGS.
Conclusion: The proposed method effectively addresses the limitations of 2DGS on glossy surfaces, enhancing reconstruction accuracy.
Abstract: Recently, 2D Gaussian Splatting (2DGS) has demonstrated superior geometry reconstruction quality than the popular 3DGS by using 2D surfels to approximate thin surfaces. However, it falls short when dealing with glossy surfaces, resulting in visible holes in these areas. We find that the reflection discontinuity causes the issue. To fit the jump from diffuse to specular reflection at different viewing angles, depth bias is introduced in the optimized Gaussian primitives. To address that, we first replace the depth distortion loss in 2DGS with a novel depth convergence loss, which imposes a strong constraint on depth continuity. Then, we rectify the depth criterion in determining the actual surface, which fully accounts for all the intersecting Gaussians along the ray. Qualitative and quantitative evaluations across various datasets reveal that our method significantly improves reconstruction quality, with more complete and accurate surfaces than 2DGS. Code is available at https://github.com/XiaoXinyyx/Unbiased_Surfel.
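For context, surface depth in splatting pipelines is typically derived from alpha-composited quantities along each ray; a standard formulation (not necessarily the paper's rectified criterion) is:

```latex
% Alpha-composited depth along a ray in Gaussian splatting: the expected depth
% is the opacity-weighted sum of per-Gaussian depths, sorted front to back.
\hat{d}(\mathbf{r}) \;=\; \sum_{i=1}^{N} w_i \, d_i,
\qquad
w_i \;=\; \alpha_i \prod_{j<i} \bigl(1 - \alpha_j\bigr)
```

where d_i and α_i are the depth and opacity of the i-th Gaussian intersected by ray r. Per the abstract, the paper's rectified criterion fully accounts for all intersecting Gaussians along the ray when determining the actual surface.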
[186] Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation
Hongyu Wen, Yiming Zuo, Venkat Subramanian, Patrick Chen, Jia Deng
Main category: cs.CV
TL;DR: The paper introduces LayeredDepth, a dataset for multi-layer depth estimation of transparent objects, including a real-world benchmark and a synthetic data generator. It shows that existing depth estimation methods struggle with transparency and that training on synthetic data improves performance.
Details
Motivation: Understanding multi-layer depth of transparent objects is crucial for real-world applications, but existing methods fail to handle transparency effectively.
Method: The authors created a real-world benchmark (1,500 images) and a synthetic data generator (15,300 images) for multi-layer depth estimation. They evaluated state-of-the-art methods and trained baseline models on synthetic data.
Result: Training on synthetic data improved performance, with accuracy on transparent objects increasing from 55.14% to 75.20%.
Conclusion: LayeredDepth provides valuable resources for advancing multi-layer depth estimation, demonstrating the effectiveness of synthetic data for training.
Abstract: Transparent objects are common in daily life, and understanding their multi-layer depth information – perceiving both the transparent surface and the objects behind it – is crucial for real-world applications that interact with transparent materials. In this paper, we introduce LayeredDepth, the first dataset with multi-layer depth annotations, including a real-world benchmark and a synthetic data generator, to support the task of multi-layer depth estimation. Our real-world benchmark consists of 1,500 images from diverse scenes, and evaluating state-of-the-art depth estimation methods on it reveals that they struggle with transparent objects. The synthetic data generator is fully procedural and capable of providing training data for this task with an unlimited variety of objects and scene compositions. Using this generator, we create a synthetic dataset with 15,300 images. Baseline models trained solely on this synthetic dataset produce good cross-domain multi-layer depth estimation. Fine-tuning state-of-the-art single-layer depth models on it substantially improves their performance on transparent objects, with quadruplet accuracy on our benchmark increased from 55.14% to 75.20%. All images and validation annotations are available under CC0 at https://layereddepth.cs.princeton.edu.
[187] AFR-CLIP: Enhancing Zero-Shot Industrial Anomaly Detection with Stateless-to-Stateful Anomaly Feature Rectification
Jingyi Yuan, Chenqiang Gao, Pengyu Jie, Xuan Xia, Shangri Huang, Wanquan Liu
Main category: cs.CV
TL;DR: AFR-CLIP improves zero-shot anomaly detection by rectifying CLIP’s features with image-guided textual prompts and multi-scale enhancements.
Details
Motivation: Existing CLIP-based methods align with object categories, not anomalies, limiting effectiveness.
Method: AFR-CLIP uses image-guided textual rectification, self-prompting, and multi-patch feature aggregation.
Result: Outperforms existing methods on eleven benchmarks in industrial and medical domains.
Conclusion: AFR-CLIP advances zero-shot anomaly detection by addressing CLIP’s limitations.
Abstract: Recently, zero-shot anomaly detection (ZSAD) has emerged as a pivotal paradigm for industrial inspection and medical diagnostics, detecting defects in novel objects without requiring any target-dataset samples during training. Existing CLIP-based ZSAD methods generate anomaly maps by measuring the cosine similarity between visual and textual features. However, CLIP’s alignment with object categories instead of their anomalous states limits its effectiveness for anomaly detection. To address this limitation, we propose AFR-CLIP, a CLIP-based anomaly feature rectification framework. AFR-CLIP first performs image-guided textual rectification, embedding the implicit defect information from the image into a stateless prompt that describes the object category without indicating any anomalous state. The enriched textual embeddings are then compared with two pre-defined stateful (normal or abnormal) embeddings, and their text-on-text similarity yields the anomaly map that highlights defective regions. To further enhance perception of multi-scale features and complex anomalies, we introduce self-prompting (SP) and multi-patch feature aggregation (MPFA) modules. Extensive experiments are conducted on eleven anomaly detection benchmarks across industrial and medical domains, demonstrating AFR-CLIP’s superiority in ZSAD.
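The text-on-text comparison can be sketched as below: the image-rectified stateless embedding is scored against the two pre-defined stateful embeddings, and a softmax over the similarities yields an anomaly probability. How the embeddings are produced is outside this hedged sketch, and the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def text_on_text_anomaly_score(rectified_emb, normal_emb, abnormal_emb, tau=0.07):
    """Sketch of the text-on-text comparison: cosine similarity of the
    rectified stateless embedding against normal/abnormal stateful embeddings,
    softmaxed into an anomaly probability per location."""
    e = F.normalize(rectified_emb, dim=-1)
    sims = torch.stack([
        (e * F.normalize(normal_emb, dim=-1)).sum(-1),    # similarity to "normal"
        (e * F.normalize(abnormal_emb, dim=-1)).sum(-1),  # similarity to "abnormal"
    ], dim=-1) / tau
    return sims.softmax(dim=-1)[..., 1]  # probability of "abnormal"

# Per-patch embeddings -> per-patch values of an anomaly map.
patches = torch.randn(14 * 14, 512)
score = text_on_text_anomaly_score(patches, torch.randn(512), torch.randn(512))
print(score.shape)  # torch.Size([196])
```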
[188] Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
Yishen Liu, Shengda Luo, Zishao Zhong, Hudan Pan
Main category: cs.CV
TL;DR: CA-TriNet, a multimodal model combining transformers and Multi-LSTM, improves medical report generation by addressing data similarity and overfitting issues.
Details
Motivation: General large models struggle with medical report accuracy due to data repetition and similarity, leading to overfitting.
Method: Proposes CA-TriNet, integrating Co-Attention (linking vision and text transformers) and Triple-LSTM (refining sentences with image objects).
Result: Outperforms state-of-the-art models on three public datasets, even surpassing pre-trained large language models in some metrics.
Conclusion: CA-TriNet effectively addresses challenges in medical report generation, offering superior performance.
Abstract: Medical report generation requires specialized expertise that general large models often fail to accurately capture. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, resulting in a tendency to overfit. In this paper, we therefore propose the Co-Attention Triple-LSTM Network (CA-TriNet), a multimodal deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with similarities, augmented by an adaptive weight operator to capture and amplify image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations over three public datasets have demonstrated that CA-TriNet outperforms state-of-the-art models in terms of comprehensive ability, even surpassing pre-trained large language models on some metrics.
[189] Towards Generalizable Forgery Detection and Reasoning
Yueying Gao, Dongliang Chang, Bingyao Yu, Haotian Qin, Muxi Diao, Lei Chen, Kongming Liang, Zhanyu Ma
Main category: cs.CV
TL;DR: The paper introduces a unified Forgery Detection and Reasoning task (FDR-Task) using Multi-Modal Large Language Models (MLLMs) to detect and explain AI-generated images, supported by a new dataset (MMFR-Dataset) and the FakeReasoning framework.
Details
Motivation: Addressing the challenges of detecting AI-generated images due to domain gaps among generative models and the inadequacy of traditional saliency-based methods.
Method: Proposes FakeReasoning, a framework with a dual-branch visual encoder (CLIP and DINO), Forgery-Aware Feature Fusion Module, and Classification Probability Mapper.
Result: FakeReasoning outperforms state-of-the-art methods in detection and reasoning tasks, demonstrating robust generalization.
Conclusion: The approach effectively unifies detection and reasoning, providing accurate and interpretable results for AI-generated image detection.
Abstract: Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we formulate detection and explanation as a unified Forgery Detection and Reasoning task (FDR-Task), leveraging Multi-Modal Large Language Models (MLLMs) to provide accurate detection through reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 120K images across 10 generative models, with 378K reasoning annotations on forgery attributes, enabling comprehensive evaluation of the FDR-Task. Furthermore, we propose FakeReasoning, a forgery detection and reasoning framework with three key components: 1) a dual-branch visual encoder that integrates CLIP and DINO to capture both high-level semantics and low-level artifacts; 2) a Forgery-Aware Feature Fusion Module that leverages DINO’s attention maps and cross-attention mechanisms to guide MLLMs toward forgery-related clues; 3) a Classification Probability Mapper that couples language modeling and forgery detection, enhancing overall performance. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks.
[190] Marmot: Object-Level Self-Correction via Multi-Agent Reasoning
Jiayang Sun, Hongbo Wang, Jie Cao, Huaibo Huang, Ran He
Main category: cs.CV
TL;DR: Marmot is a framework using multi-agent reasoning to improve image-text alignment in diffusion models by decomposing tasks into object-centric subtasks and mitigating distortions.
Details
Motivation: Diffusion models struggle with accurate counting, attributes, and spatial relationships in multi-object scenes, and existing MLLM-based solutions have limitations.
Method: Marmot employs an Object-Aware Agent for task decomposition, an Object Correction System for reliable editing, and a Pixel-Domain Stitching Smoother for distortion-free integration.
Result: Marmot enhances accuracy in object counting, attributes, and spatial relationships in image generation.
Conclusion: Marmot effectively addresses challenges in multi-object scene generation, improving reliability and efficiency.
Abstract: While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. One potential solution involves employing Multimodal Large Language Model (MLLM) as an AI agent to construct a self-correction framework. However, these approaches heavily rely on the capabilities of the MLLMs used, often fail to account for all objects within the image, and suffer from cumulative distortions during multi-round editing processes. To address these challenges, we propose Marmot, a novel and generalizable framework that leverages Multi-Agent Reasoning for Multi-Object Self-Correcting to enhance image-text alignment. First, we employ a large language model as an Object-Aware Agent to perform object-level divide-and-conquer, automatically decomposing self-correction tasks into object-centric subtasks based on image descriptions. For each subtask, we construct an Object Correction System featuring a decision-execution-verification mechanism that operates exclusively on a single object’s segmentation mask or the bounding boxes of object pairs, effectively mitigating inter-object interference and enhancing editing reliability. To efficiently integrate correction results from subtasks while avoiding cumulative distortions from multi-stage editing, we propose a Pixel-Domain Stitching Smoother, which employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtasks, significantly improving runtime efficiency while preventing distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
[191] Physics-Guided Image Dehazing Diffusion
Shijun Zhou, Baojie Fan, Jiandong Tian
Main category: cs.CV
TL;DR: IDDM is a diffusion model for image dehazing that bridges the gap between synthetic and real-world data by incorporating the atmospheric scattering model into noise diffusion.
Details
Motivation: Current dehazing algorithms trained on synthetic data struggle with real-world scenarios due to domain gaps.
Method: IDDM integrates the atmospheric scattering model into noise diffusion, using a gradual haze formation process to train a denoising Unet.
Result: IDDM effectively restores real-world hazy images despite synthetic training, outperforming state-of-the-art methods.
Conclusion: IDDM demonstrates strong domain generalization and practical dehazing performance, validated by extensive experiments.
Abstract: Due to the domain gap between real-world and synthetic hazy images, current data-driven dehazing algorithms trained on synthetic datasets perform well on synthetic data but struggle to generalize to real-world scenarios. To address this challenge, we propose Image Dehazing Diffusion Models (IDDM), a novel diffusion process that incorporates the atmospheric scattering model into noise diffusion. IDDM aims to use the gradual haze formation process to help the denoising Unet robustly learn the distribution of clear images from the conditional input hazy images. We design a specialized training strategy centered around IDDM. Diffusion models are leveraged to bridge the domain gap from synthetic to real-world, while the atmospheric scattering model provides physical guidance for haze formation. During the forward process, IDDM simultaneously introduces haze and noise into clear images, and then robustly separates them during the sampling process. By training with physics-guided information, IDDM exhibits domain generalization and effectively restores real-world hazy images despite being trained on synthetic datasets. Extensive experiments demonstrate the effectiveness of our method through both quantitative and qualitative comparisons with state-of-the-art approaches.
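The physical guidance here is the classic atmospheric scattering model, which expresses a hazy observation as a transmission-weighted mix of the clear scene and global airlight:

```latex
% Atmospheric scattering model: observed hazy intensity I at pixel x mixes the
% clear scene radiance J with the global airlight A via the transmission t.
I(x) \;=\; J(x)\,t(x) \;+\; A\,\bigl(1 - t(x)\bigr),
\qquad
t(x) \;=\; e^{-\beta\, d(x)}
```

where t is the transmission, β the scattering coefficient, and d(x) the scene depth. Per the abstract, the forward diffusion injects haze (guided by this model) and noise into clear images simultaneously, and the sampling process separates them.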
[192] LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding
Feng Xiao, Hongbin Xu, Guocan Zhao, Wenxiong Kang
Main category: cs.CV
TL;DR: A novel 3D visual grounding framework improves relational perception by constructing language-guided scene graphs and leveraging pre-trained 2D semantics for multi-modal 3D encoding.
Details
Motivation: Address the challenge of distinguishing similar objects in 3D scenes using natural language descriptions, which current methods overlook by ignoring referred object modeling.
Method: Proposes a dual-branch visual encoder with pre-trained 2D semantics and graph attention for cross-modal interaction, enhancing object representation and scene graph structure.
Result: Outperforms state-of-the-art methods on benchmarks, particularly in handling multiple similar distractors.
Conclusion: The framework effectively aligns 3D visual content with textual descriptions, improving relational perception in 3D visual grounding.
Abstract: 3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the described spatial relationships. Current methods attempt to achieve cross-modal understanding in complex scenes via a target-centered learning mechanism, ignoring the modeling of referred objects. We propose a novel 3D visual grounding framework that constructs language-guided scene graphs with referred object discrimination to improve relational perception. The framework incorporates a dual-branch visual encoder that leverages pre-trained 2D semantics to enhance and supervise the multi-modal 3D encoding. Furthermore, we employ graph attention to promote relationship-oriented information fusion in cross-modal interaction. The learned object representations and scene graph structure enable effective alignment between 3D visual content and textual descriptions. Experimental results on popular benchmarks demonstrate our superior performance compared to state-of-the-art methods, especially in handling the challenges of multiple similar distractors.
[193] PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging
Quoc-Huy Trinh, Minh-Van Nguyen, Jung Zeng, Ulas Bagci, Debesh Jha
Main category: cs.CV
TL;DR: PRS-Med integrates vision-language models for medical image segmentation and spatial reasoning, outperforming existing methods and introducing the MMRS dataset.
Details
Motivation: Existing methods lack natural language interaction and spatial reasoning capabilities in medical image segmentation.
Method: PRS-Med combines vision-language models with segmentation to produce masks and spatial reasoning outputs, using the MMRS dataset for training.
Result: PRS-Med excels in segmentation accuracy and position reasoning across six imaging modalities.
Conclusion: PRS-Med enhances doctor-system interaction via natural language, with released resources to advance spatially-aware medical AI.
Abstract: Recent advancements in prompt-based medical image segmentation have enabled clinicians to identify tumors using simple input like bounding boxes or text prompts. However, existing methods face challenges when doctors need to interact through natural language or when position reasoning is required - understanding spatial relationships between anatomical structures and pathologies. We present PRS-Med, a framework that integrates vision-language models with segmentation capabilities to generate both accurate segmentation masks and corresponding spatial reasoning outputs. Additionally, we introduce the MMRS dataset (Multimodal Medical in Positional Reasoning Segmentation), which provides diverse, spatially-grounded question-answer pairs to address the lack of position reasoning data in medical imaging. PRS-Med demonstrates superior performance across six imaging modalities (CT, MRI, X-ray, ultrasound, endoscopy, RGB), significantly outperforming state-of-the-art methods in both segmentation accuracy and position reasoning. Our approach enables intuitive doctor-system interaction through natural language, facilitating more efficient diagnoses. Our dataset pipeline, model, and codebase will be released to foster further research in spatially-aware multimodal reasoning for medical applications.
[194] ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos
Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar
Main category: cs.CV
TL;DR: The paper introduces ImpliHateVid, a novel dataset for implicit hate speech detection in videos, and proposes a two-stage contrastive learning framework for multimodal hate speech detection.
Details
Motivation: Existing research lacks focus on video-based hate speech detection, especially for implicit hate, prompting the creation of a dedicated dataset and method.
Method: A two-stage contrastive learning framework is used: first, modality-specific encoders (audio, text, image) are trained; second, cross-encoders refine multimodal representations. Additional features (sentiment, emotion, captions) are incorporated.
Result: The method is evaluated on ImpliHateVid and HateMM datasets, showing effectiveness in detecting hateful content in videos and validating the dataset’s significance.
Conclusion: The proposed framework and dataset advance video-based hate speech detection, particularly for implicit hate, demonstrating the value of multimodal contrastive learning.
Abstract: Existing research has primarily focused on text- and image-based hate speech detection, while video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using a contrastive loss over the concatenated features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. We evaluate our method on two datasets, ImpliHateVid for implicit hate speech detection and the HateMM dataset for general hate speech detection in videos, demonstrating the effectiveness of the proposed multimodal contrastive learning for hateful content detection in videos and the significance of our dataset.
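As a rough illustration of the stage-one objective, the sketch below shows a standard symmetric InfoNCE loss between two modality encoders; the exact pairing and feature-concatenation scheme in the paper may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of modality embeddings.

    z_a, z_b: (B, D) embeddings of the same B videos from two encoders
    (e.g., audio and text); matching rows are positives, all others negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Contrast in both directions and average
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```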
[195] MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks
Zonglin Wu, Yule Xue, Yaoyao Feng, Xiaolong Wang, Yiren Song
Main category: cs.CV
TL;DR: The paper introduces MCA-Bench, a unified benchmark for evaluating CAPTCHA security across diverse modalities, using a shared vision-language model to assess vulnerabilities and propose design improvements.
Details
Motivation: The lack of a standardized, large-scale benchmark for evaluating CAPTCHA security across different modalities motivates the creation of MCA-Bench.
Method: MCA-Bench integrates various CAPTCHA types into a single protocol, fine-tuning specialized cracking agents for each category using a shared vision-language model.
Result: Experiments show MCA-Bench effectively maps CAPTCHA vulnerabilities, analyzing challenge complexity, interaction depth, and solvability.
Conclusion: The paper proposes design principles for CAPTCHA hardening and identifies open challenges, promoting systematic benchmarking and collaboration.
Abstract: As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. However, existing CAPTCHA schemes encompass a diverse range of modalities – from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions – yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision-language model backbone, we fine-tune specialized cracking agents for each CAPTCHA category, enabling consistent, cross-modal assessments. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings, and crucially offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration. Datasets and code are available online.
[196] PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments
Minghao Zou, Qingtian Zeng, Yongping Miao, Shangkun Liu, Zilong Wang, Hantao Liu, Wei Zhou
Main category: cs.CV
TL;DR: PhysLab is a new video dataset for fine-grained visual parsing in educational scenarios, addressing gaps in annotation granularity, domain coverage, and procedural guidance.
Details
Motivation: Existing datasets lack granularity, domain coverage (especially in education), and procedural guidance, hindering fine-grained scene understanding and reasoning.
Method: Introduces PhysLab, a dataset with 620 long-form videos of students conducting physics experiments, featuring multilevel annotations for various vision tasks.
Result: PhysLab supports tasks like action recognition, object detection, and HOI analysis, with strong baselines and evaluations highlighting key challenges.
Conclusion: PhysLab aims to advance visual parsing, intelligent classroom systems, and integration of computer vision with educational technologies.
Abstract: Visual parsing of images and videos is critical for a wide range of real-world applications. However, progress in this field is constrained by limitations of existing datasets: (1) insufficient annotation granularity, which impedes fine-grained scene understanding and high-level reasoning; (2) limited coverage of domains, particularly a lack of datasets tailored for educational scenarios; and (3) lack of explicit procedural guidance, with minimal logical rules and insufficient representation of structured task process. To address these gaps, we introduce PhysLab, the first video dataset that captures students conducting complex physics experiments. The dataset includes four representative experiments that feature diverse scientific instruments and rich human-object interaction (HOI) patterns. PhysLab comprises 620 long-form videos and provides multilevel annotations that support a variety of vision tasks, including action recognition, object detection, HOI analysis, etc. We establish strong baselines and perform extensive evaluations to highlight key challenges in the parsing of procedural educational videos. We expect PhysLab to serve as a valuable resource for advancing fine-grained visual parsing, facilitating intelligent classroom systems, and fostering closer integration between computer vision and educational technologies. The dataset and the evaluation toolkit are publicly available at https://github.com/ZMH-SDUST/PhysLab.
[197] Learning Camera-Agnostic White-Balance Preferences
Luxi Zhao, Mahmoud Afifi, Michael S. Brown
Main category: cs.CV
TL;DR: The paper introduces a lightweight method to transform neutral white balance corrections into aesthetically preferred ones, ensuring consistency across different cameras.
Details
Motivation: Commercial AWB systems prioritize aesthetic preferences over accurate color correction, and existing learning-based methods struggle with generalization across camera sensors. This paper addresses aesthetic consistency in AWB.
Method: The authors propose a post-illuminant-estimation mapping that transforms neutral corrections into aesthetic ones in a camera-agnostic space. The model is lightweight (~500 parameters) and efficient (0.024 ms runtime).
Result: The method achieves state-of-the-art performance on a dataset of 771 smartphone images from three cameras, with minimal computational overhead.
Conclusion: The proposed approach enables consistent and stylized color rendering across unseen cameras while remaining compatible with existing cross-camera AWB techniques.
Abstract: The image signal processor (ISP) pipeline in modern cameras consists of several modules that transform raw sensor data into visually pleasing images in a display color space. Among these, the auto white balance (AWB) module is essential for compensating for scene illumination. However, commercial AWB systems often strive to compute aesthetic white-balance preferences rather than accurate neutral color correction. While learning-based methods have improved AWB accuracy, they typically struggle to generalize across different camera sensors – an issue for smartphones with multiple cameras. Recent work has explored cross-camera AWB, but most methods remain focused on achieving neutral white balance. In contrast, this paper is the first to address aesthetic consistency by learning a post-illuminant-estimation mapping that transforms neutral illuminant corrections into aesthetically preferred corrections in a camera-agnostic space. Once trained, our mapping can be applied after any neutral AWB module to enable consistent and stylized color rendering across unseen cameras. Our proposed model is lightweight – containing only ~500 parameters – and runs in just 0.024 milliseconds on a typical flagship mobile CPU. Evaluated on a dataset of 771 smartphone images from three different cameras, our method achieves state-of-the-art performance while remaining fully compatible with existing cross-camera AWB techniques, introducing minimal computational and memory overhead.
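For scale, a post-illuminant-estimation mapping of this size could look like the sketch below: a residual MLP over a camera-agnostic chromaticity representation with a few hundred parameters. The architecture and input space are assumptions in the spirit of the abstract, not the authors' exact design.

```python
import torch.nn as nn

class AestheticWBMap(nn.Module):
    """Tiny mapping from a neutral illuminant estimate to an aesthetic one.

    Operates in an assumed 2-D camera-agnostic chromaticity space; the
    parameter count (354 here) stays in the few-hundred range of the paper.
    """
    def __init__(self, dim=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, neutral_chroma):
        # Predict a residual so the identity (neutral) mapping is easy to learn
        return neutral_chroma + self.net(neutral_chroma)
```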
[198] Zero-Shot Anomaly Detection with Dual-Branch Prompt Selection
Zihan Wang, Samira Ebrahimi Kahou, Narges Armanfard
Main category: cs.CV
TL;DR: PILOT introduces a dual-branch prompt learning mechanism and label-free test-time adaptation to improve zero-shot anomaly detection under domain shifts.
Details
Motivation: Existing ZSAD methods struggle with domain shifts due to limited training data and lack of generalization.
Method: PILOT uses a dual-branch prompt learning mechanism and label-free test-time adaptation with pseudo-labels.
Result: PILOT achieves state-of-the-art performance on 13 benchmarks for anomaly detection and localization under domain shift.
Conclusion: PILOT effectively addresses domain shift challenges in ZSAD, outperforming existing methods.
Abstract: Zero-shot anomaly detection (ZSAD) enables identifying and localizing defects in unseen categories by relying solely on generalizable features rather than requiring any labeled examples of anomalies. However, existing ZSAD methods, whether using fixed or learned prompts, struggle under domain shifts because their training data are derived from limited training domains and fail to generalize to new distributions. In this paper, we introduce PILOT, a framework designed to overcome these challenges through two key innovations: (1) a novel dual-branch prompt learning mechanism that dynamically integrates a pool of learnable prompts with structured semantic attributes, enabling the model to adaptively weight the most relevant anomaly cues for each input image; and (2) a label-free test-time adaptation strategy that updates the learnable prompt parameters using high-confidence pseudo-labels from unlabeled test data. Extensive experiments on 13 industrial and medical benchmarks demonstrate that PILOT achieves state-of-the-art performance in both anomaly detection and localization under domain shift.
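The label-free adaptation step can be pictured as follows: only the prompt parameters receive gradients, and only predictions above a confidence threshold are trusted as pseudo-labels. The two-way normal/anomalous head and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(model, prompt_params, loader, conf_thresh=0.9, lr=1e-4):
    """Sketch of label-free test-time adaptation with pseudo-labels."""
    opt = torch.optim.SGD(prompt_params, lr=lr)    # update prompts only
    for images in loader:
        logits = model(images)                     # (B, 2) normal/anomaly logits
        conf, pseudo = logits.softmax(dim=-1).max(dim=-1)
        keep = conf > conf_thresh                  # trust confident predictions
        if keep.any():
            loss = F.cross_entropy(logits[keep], pseudo[keep])
            opt.zero_grad()
            loss.backward()
            opt.step()
```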
[199] DSConv: Dynamic Splitting Convolution for Pansharpening
Xuanyu Liu, Bonan An
Main category: cs.CV
TL;DR: The paper proposes DSConv, a dynamic splitting convolution method with attention for pansharpening, enhancing feature extraction and network performance.
Details
Motivation: Existing pansharpening methods rely on standard convolutions, lacking adaptability to inter-pixel correlations in remote sensing images.
Method: Introduces DSConv, which dynamically splits convolution kernels with attention, selecting positions of interest and splitting kernels into smaller ones.
Result: DSConv improves generalization, optimization, and feature representation, achieving state-of-the-art performance in pansharpening.
Conclusion: DSConv is superior and versatile, with rigorous experiments validating its effectiveness and optimal usage conditions.
Abstract: Aiming to obtain a high-resolution image, pansharpening fuses a multi-spectral (MS) image with a panchromatic (PAN) image; this low-level vision task remains significant and challenging in contemporary research. Most existing approaches rely predominantly on standard convolutions, with few exploring adaptive convolutions, which are effective owing to the inter-pixel correlations of remote sensing images. In this paper, we propose a novel strategy, named DSConv, for dynamically splitting convolution kernels in conjunction with attention: it selects positions of interest and splits the original convolution kernel into multiple smaller kernels. The proposed DSConv more effectively extracts features at different positions within the receptive field, enhancing the network’s generalization, optimization, and feature representation capabilities. Building upon this methodology, we further develop the concept of dynamic splitting convolution and provide a novel network architecture for pansharpening that accomplishes the task more efficiently. Adequate and fair experiments illustrate the effectiveness and the state-of-the-art performance attained by DSConv. Comprehensive and rigorous discussions establish the superiority and optimal usage conditions of DSConv.
[200] SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing
Heyi Sun, Cong Wang, Tian-Xing Xu, Jingwei Huang, Di Kang, Chunchao Guo, Song-Hai Zhang
Main category: cs.CV
TL;DR: SVG-Head introduces a hybrid representation for head avatars using 3D Gaussians and disentangled textures, enabling high-fidelity rendering and real-time appearance editing.
Details
Motivation: The challenge lies in creating editable, high-fidelity head avatars due to implicit representations and entangled geometry-appearance modeling.
Method: SVG-Head combines surface and volumetric Gaussians bound to a FLAME mesh, using texture images for appearance and a mesh-aware UV mapping method.
Result: Achieves high-fidelity rendering and real-time texture editing, outperforming existing methods on the NeRSemble dataset.
Conclusion: SVG-Head is a breakthrough for editable head avatars, offering explicit texture control and real-time performance.
Abstract: Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, benefiting many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians: surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between the 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but is also the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.
[201] Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
Main category: cs.CV
TL;DR: M3-Agent is a multimodal agent with long-term memory, processing visual/auditory inputs to build episodic and semantic memory. It outperforms baselines on M3-Bench, a new QA benchmark.
Details
Motivation: To advance multimodal agents with human-like long-term memory and evaluate their reasoning capabilities.
Method: Uses reinforcement learning to train M3-Agent, which autonomously performs multi-turn reasoning and retrieves memory. Evaluated on M3-Bench (real-world and web-sourced videos).
Result: M3-Agent outperforms baselines (Gemini-1.5-pro and GPT-4o) by 6.7%, 7.7%, and 5.3% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively.
Conclusion: M3-Agent advances multimodal agents with human-like memory and provides practical design insights. Model, code, and data are publicly available.
Abstract: We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot’s perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent
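A toy version of an entity-centric memory with episodic and semantic entries might look like the sketch below; `embed` is an assumed text-embedding function, and the real system organizes multimodal entries with far richer structure.

```python
from collections import defaultdict
import numpy as np

class EntityMemory:
    """Toy entity-centric long-term memory with similarity-based retrieval."""

    def __init__(self, embed):
        self.embed = embed                     # assumed text -> np.ndarray encoder
        self.entries = defaultdict(list)       # entity -> [(kind, text, vector)]

    def write(self, entity, kind, text):
        # kind: "episodic" (a timestamped observation) or "semantic" (a fact)
        self.entries[entity].append((kind, text, self.embed(text)))

    def retrieve(self, query, top_k=5):
        q = self.embed(query)
        scored = [
            (float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-8)),
             entity, kind, text)
            for entity, items in self.entries.items()
            for kind, text, v in items
        ]
        return sorted(scored, reverse=True)[:top_k]
```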
[202] Reverse Convolution and Its Applications to Image Restoration
Xuhong Huang, Shiqi Liu, Kai Zhang, Ying Tai, Jian Yang, Hui Zeng, Lei Zhang
Main category: cs.CV
TL;DR: The paper introduces a novel depthwise reverse convolution operator to address the lack of a true inverse for convolution in neural networks, proposing ConverseNet for image restoration tasks.
Details
Motivation: Existing transposed convolution does not truly invert convolution, limiting neural network design. The paper aims to fill this gap with a mathematically sound reverse operator.
Method: The authors formulate a regularized least-squares problem to create a depthwise reverse convolution operator, integrating it into a Transformer-like block with layer normalization and GELU activation.
Result: ConverseNet variants outperform conventional methods in Gaussian denoising, super-resolution, and deblurring, validating the operator’s effectiveness.
Conclusion: The proposed reverse convolution operator is a promising building block for neural architectures, potentially inspiring new deep learning operators.
Abstract: Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution (a.k.a. deconvolution) does not serve as a true inverse of convolution due to inherent differences in their mathematical formulations. To date, no reverse convolution operator has been established as a standard component in neural architectures. In this paper, we propose a novel depthwise reverse convolution operator as an initial attempt to effectively reverse depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. Building upon this operator, we further construct a reverse convolution block by combining it with layer normalization, 1×1 convolution, and GELU activation, forming a Transformer-like structure. The proposed operator and block can directly replace conventional convolution and transposed convolution layers in existing architectures, leading to the development of ConverseNet. Corresponding to typical image restoration models such as DnCNN, SRResNet and USRNet, we train three variants of ConverseNet for Gaussian denoising, super-resolution and deblurring, respectively. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as a basic building module. We hope this work could pave the way for developing new operators in deep model design and applications.
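Under circular boundary conditions, a regularized least-squares problem of the form argmin_x ||k ⊛ x − y||² + λ||x||² has the classic per-channel closed form X = conj(K)·Y / (|K|² + λ) in the Fourier domain. The sketch below shows that closed form for a depthwise kernel; it is one plausible solver for such a problem, not necessarily the paper's exact formulation.

```python
import torch

def reverse_depthwise_conv(y, kernel, lam=1e-2):
    """Regularized least-squares 'reverse' of a depthwise (circular) convolution.

    y:      (B, C, H, W) output of a depthwise convolution
    kernel: (C, kh, kw) one kernel per channel
    """
    B, C, H, W = y.shape
    kh, kw = kernel.shape[-2:]
    k_pad = torch.zeros(C, H, W, dtype=y.dtype, device=y.device)
    k_pad[:, :kh, :kw] = kernel
    # Center the kernel so the solution is not spatially shifted
    k_pad = torch.roll(k_pad, shifts=(-(kh // 2), -(kw // 2)), dims=(-2, -1))
    K = torch.fft.rfft2(k_pad)                  # (C, H, W//2 + 1)
    Y = torch.fft.rfft2(y)                      # (B, C, H, W//2 + 1)
    X = torch.conj(K) * Y / (K.abs() ** 2 + lam)
    return torch.fft.irfft2(X, s=(H, W))
```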
[203] UI-Venus Technical Report: Building High-performance UI Agents with RFT
Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang
Main category: cs.CV
TL;DR: UI-Venus is a multimodal LLM-based UI agent using screenshots as input, achieving SOTA performance in UI grounding and navigation with reinforcement finetuning and novel self-evolving techniques.
Details
Motivation: To advance UI interaction by creating a high-performing, open-source agent that relies solely on screenshots, surpassing existing models in grounding and navigation tasks.
Method: Uses reinforcement fine-tuning (RFT) based on Qwen2.5-VL, with reward functions, data cleaning, and Self-Evolving Trajectory History Alignment & Sparse Action Enhancement for navigation.
Result: Achieves 94.1%/50.8% (7B) and 95.3%/61.9% (72B) on Screenspot-V2/Pro benchmarks and 49.1%/65.9% success on AndroidWorld, outperforming prior SOTA models.
Conclusion: UI-Venus sets new benchmarks for UI agents, offering open-source tools, data protocols, and a self-evolving framework to spur further research.
Abstract: We present UI-Venus, a native UI agent that takes only screenshots as input, built on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To demonstrate UI-Venus’s summarization and planning abilities, we also evaluate it on AndroidWorld, an online UI navigation arena, where our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data-cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data-cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope will encourage further research and development in the community. Code is available at https://github.com/inclusionAI/UI-Venus.
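The abstract does not spell out the reward functions; a deliberately simple grounding reward in their spirit might check whether a predicted click lands inside the target element's box, with a small penalty for malformed actions:

```python
def grounding_reward(pred, bbox):
    """Toy grounding reward (illustrative, not UI-Venus's actual function).

    pred: dict like {"x": 0.41, "y": 0.87} in normalized screen coordinates
    bbox: (x0, y0, x1, y1) ground-truth element box, also normalized
    """
    try:
        x, y = float(pred["x"]), float(pred["y"])
    except (KeyError, TypeError, ValueError):
        return -0.1                            # malformed action output
    x0, y0, x1, y1 = bbox
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0
```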
cs.AI
[204] Grounding Rule-Based Argumentation Using Datalog
Martin Diller, Sarah Alice Gaggl, Philipp Hanisch, Giuseppina Monterosso, Fritz Rauschenbach
Main category: cs.AI
TL;DR: The paper proposes an intelligent grounding procedure for first-order ASPIC+ to manage exponential growth in input theories, using Datalog translation and simplifications, with empirical validation.
Details
Motivation: Existing approaches for rule-based argumentation in ASPIC+ lack support for first-order rules, requiring inefficient grounding steps.
Method: Translate first-order ASPIC+ into Datalog, query for ground substitutions, and apply ASPIC+-specific simplifications to avoid unnecessary grounding.
Result: The proposed method maintains manageable grounding size and preserves reasoning correctness, with empirical scalability.
Conclusion: The intelligent grounding procedure effectively addresses the limitations of first-order ASPIC+ reasoning, demonstrating practical scalability.
Abstract: ASPIC+ is one of the main general frameworks for rule-based argumentation for AI. Although first-order rules are commonly used in ASPIC+ examples, most existing approaches to reason over rule-based argumentation only support propositional rules. To enable reasoning over first-order instances, a preliminary grounding step is required. As groundings can lead to an exponential increase in the size of the input theories, intelligent procedures are needed. However, there is a lack of dedicated solutions for ASPIC+. Therefore, we propose an intelligent grounding procedure that keeps the size of the grounding manageable while preserving the correctness of the reasoning process. To this end, we translate the first-order ASPIC+ instance into a Datalog program and query a Datalog engine to obtain ground substitutions to perform the grounding of rules and contraries. Additionally, we propose simplifications specific to the ASPIC+ formalism to avoid grounding of rules that have no influence on the reasoning process. Finally, we performed an empirical evaluation of a prototypical implementation to show scalability.
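The core grounding query — enumerate all substitutions that make a rule body true against the current fact base — can be emulated with a naive join, as in this sketch (a real Datalog engine does this far more efficiently, with indexing and semi-naive evaluation):

```python
def ground_substitutions(body, facts):
    """Enumerate substitutions grounding a rule body against a set of facts.

    body:  list of atoms, e.g. [("edge", "X", "Y"), ("edge", "Y", "Z")]
           (strings starting with an uppercase letter are variables)
    facts: set of ground atoms, e.g. {("edge", "a", "b"), ("edge", "b", "c")}
    """
    subs = [{}]
    for pred, *args in body:
        extended = []
        for s in subs:
            for fact in facts:
                if fact[0] != pred or len(fact) - 1 != len(args):
                    continue
                s2, ok = dict(s), True
                for a, v in zip(args, fact[1:]):
                    if a[0].isupper():          # variable: bind or check binding
                        if s2.setdefault(a, v) != v:
                            ok = False
                            break
                    elif a != v:                # constant must match exactly
                        ok = False
                        break
                if ok:
                    extended.append(s2)
        subs = extended
    return subs
```

For the example in the docstring, the only grounding is {X: a, Y: b, Z: c}.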
[205] From Individual to Multi-Agent Algorithmic Recourse: Minimizing the Welfare Gap via Capacitated Bipartite Matching
Zahra Khotanlou, Kate Larson, Amir-Hossein Karimi
Main category: cs.AI
TL;DR: The paper introduces a multi-agent algorithmic recourse framework to address the limitations of single-individual and single-model approaches, optimizing social welfare in many-to-many interactions.
Details
Motivation: Existing algorithmic recourse methods focus on single individuals and models, neglecting the multi-agent nature of real-world systems where stakeholders compete for limited resources.
Method: The framework models many-to-many interactions as a capacitated weighted bipartite matching problem, optimizing for social welfare through a three-layer optimization approach.
Result: Experiments show the framework achieves near-optimal welfare with minimal system modifications, balancing individual actionability and collective outcomes.
Conclusion: This work extends algorithmic recourse to system-level design, offering a practical solution for higher social welfare while maintaining individual feasibility.
Abstract: Decision makers are increasingly relying on machine learning in sensitive situations. In such settings, algorithmic recourse aims to provide individuals with actionable and minimally costly steps to reverse unfavorable AI-driven decisions. While existing research predominantly focuses on single-individual (i.e., seeker) and single-model (i.e., provider) scenarios, real-world applications often involve multiple interacting stakeholders. Optimizing outcomes for seekers under an individual welfare approach overlooks the inherently multi-agent nature of real-world systems, where individuals interact and compete for limited resources. To address this, we introduce a novel framework for multi-agent algorithmic recourse that accounts for multiple recourse seekers and recourse providers. We model this many-to-many interaction as a capacitated weighted bipartite matching problem, where matches are guided by both recourse cost and provider capacity. Edge weights, reflecting recourse costs, are optimized for social welfare while quantifying the welfare gap between individual welfare and this collectively feasible outcome. We propose a three-layer optimization framework: (1) basic capacitated matching, (2) optimal capacity redistribution to minimize the welfare gap, and (3) cost-aware optimization balancing welfare maximization with capacity adjustment costs. Experimental validation on synthetic and real-world datasets demonstrates that our framework enables many-to-many algorithmic recourse to achieve near-optimal welfare with minimal modification of system settings. This work extends algorithmic recourse from individual recommendations to system-level design, providing a tractable path toward higher social welfare while maintaining individual actionability.
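The capacitated matching itself is a textbook min-cost-flow instance; a sketch with networkx (assuming integer costs and enough total provider capacity to serve every seeker) could be:

```python
import networkx as nx

def match_seekers_to_providers(costs, capacities):
    """Capacitated weighted bipartite matching via min-cost flow.

    costs:      dict {(seeker, provider): integer recourse cost}
    capacities: dict {provider: max number of seekers it can serve}
    """
    seekers = {s for s, _ in costs}
    G = nx.DiGraph()
    G.add_node("src", demand=-len(seekers))     # supply: every seeker matched
    G.add_node("sink", demand=len(seekers))
    for s in seekers:
        G.add_edge("src", s, capacity=1, weight=0)
    for (s, p), c in costs.items():
        G.add_edge(s, p, capacity=1, weight=c)  # edge weight = recourse cost
    for p, cap in capacities.items():
        G.add_edge(p, "sink", capacity=cap, weight=0)
    flow = nx.min_cost_flow(G)
    return {s: p for (s, p) in costs if flow[s].get(p, 0) > 0}
```

The paper's outer layers then redistribute capacities and trade welfare against adjustment costs on top of this base matching.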
[206] Learn to optimize for automatic proton PBS treatment planning for H&N cancers
Qingqing Wang, Liqiang Xiao, Chang Chang
Main category: cs.AI
TL;DR: A data-driven inverse optimizer integrated into a PPO-based framework automates proton PBS treatment planning for H&N cancers, improving efficiency and plan quality.
Details
Motivation: Manual treatment planning for H&N cancers is time-consuming and relies heavily on human expertise. Automating this process can enhance efficiency and plan quality.
Method: The framework combines a PPO-based virtual planner for objective parameter adjustment and a Transformer-based L2O inverse optimizer for machine-deliverable MU values. Techniques from LLMs address scalability.
Result: The L2O optimizer improves effectiveness by 22.97% and efficiency by 36.41% over L-BFGSB. Plans generated in ~2.55 hours match or surpass human-generated plans in OAR sparing and target coverage.
Conclusion: The proposed framework successfully automates treatment planning, reducing reliance on human expertise while maintaining or improving plan quality.
Abstract: Proton PBS treatment planning for H&N cancers involves numerous conflicting objectives, requiring significant effort from human planners to balance and satisfy multiple clinical goals during planning. To achieve this, experience-demanding objective parameter adjustment and computationally expensive inverse optimization are performed iteratively. Extensive efforts have been made to automatically adjust objective parameters, but the most time-consuming component, i.e., inverse optimization, still relies heavily on theory-driven approaches. We propose a data-driven inverse optimizer and integrate it into a PPO-based automatic treatment planning framework to automatically generate high-quality plans within a clinically acceptable planning time. The inverse optimizer is an L2O method that predicts update steps by learning from the task-specific data distribution. For the first time, we integrate techniques designed for long-context processing, originally developed for LLMs, into a Transformer-based L2O framework to address the scalability issue of existing L2O methods. The PPO framework functions as an outer-loop virtual planner, autonomously adjusting objective parameters through a policy network, and the dose predictor is used to initialize objective parameters. The inner-loop L2O inverse optimizer computes machine-deliverable MU values based on objectives refined by the PPO policy network. Data from 97 patients were collected for this study; compared with L-BFGSB, our L2O-based inverse optimizer improves effectiveness and efficiency by 22.97% and 36.41%, respectively. In conjunction with the PPO-based learned virtual planner, plans generated by our framework within an average of 2.55 hours show improved or comparable OAR sparing with superior target coverage for patients with different prescription dose levels, numbers of target volumes, beam angles, etc., compared with human-generated plans.
[207] On Strong and Weak Admissibility in Non-Flat Assumption-Based Argumentation
Matti Berthold, Lydia Blümel, Anna Rapberger
Main category: cs.AI
TL;DR: The paper explores strong and weak admissibility in assumption-based argumentation (ABA), extending these notions to non-flat ABA and analyzing their properties and shortcomings.
Details
Motivation: To broaden the understanding of admissibility notions in ABA, particularly strong and weak admissibility, and their application to non-flat ABA frameworks.
Method: Uses abstract bipolar set-based argumentation frameworks (BSAFs) to study strong and weak admissibility, introducing preferred, complete, and grounded semantics for non-flat ABA.
Result: Demonstrates that modularization properties are maintained under classical, strong, and weak admissibility, but also identifies shortcomings in these semantics.
Conclusion: The study extends admissibility notions to non-flat ABA, highlighting both their strengths and limitations, and suggests potential improvements.
Abstract: In this work, we broaden the investigation of admissibility notions in the context of assumption-based argumentation (ABA). More specifically, we study two prominent alternatives to the standard notion of admissibility from abstract argumentation, namely strong and weak admissibility, and introduce the respective preferred, complete and grounded semantics for general (sometimes called non-flat) ABA. To do so, we use abstract bipolar set-based argumentation frameworks (BSAFs) as a formal playground, since they concisely capture the relations between assumptions and are expressive enough to represent general non-flat ABA frameworks, as recently shown. While weak admissibility has recently been investigated for a restricted fragment of ABA in which assumptions cannot be derived (flat ABA), strong admissibility has not been investigated for ABA so far. We introduce strong admissibility for ABA and investigate desirable properties. We furthermore extend the recent investigations of weak admissibility in the flat ABA fragment to the non-flat case. We show that the central modularization property is maintained under classical, strong, and weak admissibility. We also show that strongly and weakly admissible semantics in non-flat ABA share some of the shortcomings of standard admissible semantics, and we discuss ways to address these.
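For reference, classical admissibility — the baseline that the strong and weak variants refine — is easy to state over an abstract attack relation: a set is admissible iff it is conflict-free and defends each of its members.

```python
def is_admissible(S, attacks):
    """Classical admissibility in an abstract argumentation framework.

    S:       candidate set of arguments
    attacks: set of (attacker, target) pairs
    """
    S = set(S)
    conflict_free = not any((a, b) in attacks for a in S for b in S)
    defended = all(
        any((d, attacker) in attacks for d in S)   # some member counter-attacks
        for (attacker, target) in attacks
        if target in S
    )
    return conflict_free and defended
```

The strong and weak notions studied in the paper replace this defense condition with stricter or more permissive requirements, lifted to assumption sets via BSAFs.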
[208] Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information
Youcheng Huang, Bowen Qin, Chen Huang, Duanyu Feng, Xi Yang, Wenqiang Lei
Main category: cs.AI
TL;DR: The paper introduces a new dataset to evaluate Large Reasoning Models (LRMs) on incomplete problems, revealing their inability to proactively ask for missing information and highlighting issues like overthinking and hallucination.
Details
Motivation: To address the gap in evaluating LRMs on incomplete problems, as genuine intelligence requires proactive information-seeking, not just problem-solving.
Method: Proposes a new dataset with incomplete problems and evaluates LRMs systematically, identifying their limitations and behaviors.
Result: LRMs fail to ask for missing information and exhibit overthinking and hallucination. Supervised fine-tuning shows potential but faces challenges.
Conclusion: The study provides insights for developing genuinely intelligent LRMs, emphasizing the need for proactive abilities beyond problem-solving.
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, such an evaluation setup leaves a critical gap, since a genuinely intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when problems lack sufficient information, enabling proactivity in responding to users’ requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover behaviors related to overthinking and hallucination in LRMs, and highlight the potential and challenges of supervised fine-tuning in learning such an ability. We hope to provide new insights into developing LRMs with genuine intelligence, rather than mere problem solvers.
[209] SAGE: Scale-Aware Gradual Evolution for Continual Knowledge Graph Embedding
Yifei Li, Lingling Zhang, Hang Yan, Tianzhe Zhao, Zihan Ma, Muye Huang, Jun Liu
Main category: cs.AI
TL;DR: SAGE is a scale-aware gradual evolution framework for continual knowledge graph embedding (CKGE) that adapts embedding dimensions to update scales and balances old and new knowledge, outperforming existing methods.
Details
Motivation: Real-world knowledge graphs (KGs) evolve dynamically, but existing CKGE methods fail to handle varying update scales and lack systematic evaluation.
Method: SAGE determines embedding dimensions based on update scales, expands the embedding space, and uses Dynamic Distillation to balance old and new knowledge.
Result: SAGE outperforms baselines with improvements of 1.38% in MRR, 1.25% in H@1, and 1.6% in H@10, and shows optimal performance with adaptive dimensions.
Conclusion: SAGE demonstrates the importance of adaptive embedding dimensions in CKGE and provides a robust framework for dynamic KG updates.
Abstract: Traditional knowledge graph (KG) embedding methods aim to represent entities and relations in a low-dimensional space, primarily focusing on static graphs. However, real-world KGs are dynamically evolving with the constant addition of entities, relations and facts. To address this dynamic nature of KGs, several continual knowledge graph embedding (CKGE) methods have been developed to efficiently update KG embeddings to accommodate new facts while maintaining learned knowledge. As KGs grow at different rates and scales in real-world scenarios, existing CKGE methods often fail to consider the varying scales of updates and lack systematic evaluation throughout the entire update process. In this paper, we propose SAGE, a scale-aware gradual evolution framework for CKGE. Specifically, SAGE first determines the embedding dimensions based on the update scales and expands the embedding space accordingly. The Dynamic Distillation mechanism is further employed to balance the preservation of learned knowledge and the incorporation of new facts. We conduct extensive experiments on seven benchmarks, and the results show that SAGE consistently outperforms existing baselines, with a notable improvement of 1.38% in MRR, 1.25% in H@1 and 1.6% in H@10. Furthermore, experiments comparing SAGE with methods using fixed embedding dimensions show that SAGE achieves optimal performance on every snapshot, demonstrating the importance of adaptive embedding dimensions in CKGE. The code for SAGE is publicly available at: https://github.com/lyfxjtu/Dynamic-Embedding.
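A minimal reading of the Dynamic Distillation term is a penalty pinning old entities' embeddings (after dimension expansion) to their previous values; the column-slicing convention below is an illustrative assumption about how old coordinates survive the expansion.

```python
import torch.nn.functional as F

def sage_loss(task_loss, new_emb, old_emb, old_ids, lam=0.1):
    """Sketch: new-fact training loss plus a distillation anchor on old entities.

    new_emb: (N_new, d_new) current embedding table (d_new >= d_old)
    old_emb: (N_old, d_old) frozen snapshot from the previous KG version
    old_ids: indices of the old entities inside the new table
    """
    # Assume old entities keep their first d_old coordinates after expansion
    distill = F.mse_loss(new_emb[old_ids, : old_emb.size(1)], old_emb)
    return task_loss + lam * distill
```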
[210] CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks
Songqin Nong, Jingxuan Xu, Sheng Zhou, Jianfeng Chen, Xiaoxuan Tang, Tao Jiang, Wenhao Xu
Main category: cs.AI
TL;DR: CRAFT-GUI, a curriculum learning framework using GRPO, improves RL in GUI tasks by addressing task difficulty variation and reward granularity, outperforming prior methods by 5.6-10.3%.
Details
Motivation: Current RL methods for GUI tasks treat training data uniformly and use coarse rewards, limiting adaptability and policy efficiency.
Method: Proposes CRAFT-GUI, combining GRPO for curriculum learning and a nuanced reward function with rule-based and model-judged signals.
Result: Achieves 5.6% and 10.3% improvements on Android Control and internal benchmarks, respectively.
Conclusion: Integrating RL with curriculum learning enhances GUI task performance, validated by empirical results.
Abstract: As autonomous agents become adept at understanding and interacting with graphical user interface (GUI) environments, a new era of automated task execution is emerging. Recent studies have demonstrated that Reinforcement Learning (RL) can effectively enhance agents’ performance in dynamic interactive GUI environments. However, these methods face two key limitations: (1) they overlook the significant variation in difficulty across different GUI tasks by treating the entire training data as a uniform set, which hampers the agent’s ability to adapt its learning process; and (2) most approaches collapse task-specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates. To address these limitations, we propose CRAFT-GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories. To enable more fine-grained policy optimization, we design a reward function that combines simple rule-based signals with model-judged evaluation, providing richer and more nuanced feedback during training. Experimental results demonstrate that our method achieves significant improvements over previous state-of-the-art approaches, outperforming them by 5.6% on the public benchmark Android Control and by 10.3% on our internal online benchmark. These findings empirically validate the effectiveness of integrating reinforcement learning with curriculum learning in GUI interaction tasks.
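GRPO's group-relative update and the mixed reward are easy to sketch: rewards from several rollouts of the same task combine rule-based and judge-based signals, then are normalized within the group, so no value network is needed. The equal weighting below is an assumption.

```python
import torch

def grpo_advantages(rule_scores, judge_scores, w_rule=0.5, eps=1e-6):
    """Group-relative advantages from a mixed rule/model-judged reward.

    rule_scores, judge_scores: (G,) rewards for G rollouts of one task.
    """
    rewards = w_rule * rule_scores + (1.0 - w_rule) * judge_scores
    # Normalize within the group: no learned critic required
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```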
[211] AIM-Bench: Evaluating Decision-making Biases of Agentic LLM as Inventory Manager
Xuhua Zhao, Yuxuan Xie, Caihua Chen, Yuxiang Sun
Main category: cs.AI
TL;DR: The paper introduces AIM-Bench to evaluate LLM agents’ decision-making in uncertain supply chain scenarios, revealing biases similar to humans and suggesting mitigation strategies.
Details
Motivation: To explore LLM agents’ capabilities and biases in inventory decision-making under uncertainty, addressing gaps in current understanding.
Method: Developed AIM-Bench, a benchmark for testing LLM agents in inventory replenishment experiments, and analyzed decision biases and mitigation strategies.
Result: Found LLMs exhibit human-like biases (e.g., pull-to-centre, bullwhip effects) and identified cognitive reflection and information sharing as potential mitigations.
Conclusion: Highlights the need to address LLM biases in inventory decisions and suggests pathways for human-centred decision support systems in supply chains.
Abstract: Recent advances in mathematical reasoning and the long-term planning capabilities of large language models (LLMs) have precipitated the development of agents, which are being increasingly leveraged in business operations processes. Decision models to optimize inventory levels are one of the core elements of operations management. However, the capabilities of LLM agents in making inventory decisions in uncertain contexts, as well as their decision-making biases (e.g., the framing effect), remain largely unexplored. This prompts concerns regarding the capacity of LLM agents to effectively address real-world problems, as well as the potential implications of any biases that may be present. To address this gap, we introduce AIM-Bench, a novel benchmark designed to assess the decision-making behaviour of LLM agents in uncertain supply chain management scenarios through a diverse series of inventory replenishment experiments. Our results reveal that different LLMs typically exhibit varying degrees of decision bias similar to those observed in human beings. In addition, we explore strategies to mitigate the pull-to-centre effect and the bullwhip effect, namely cognitive reflection and the implementation of information sharing. These findings underscore the need for careful consideration of the potential biases in deploying LLMs in inventory decision-making scenarios. We hope that these insights will pave the way for mitigating human decision bias and developing human-centred decision support systems for supply chains.
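The pull-to-centre effect the paper probes has a crisp reference point in the newsvendor model: the optimal order is the critical-fractile quantile, and biased agents drift from it toward mean demand. A sketch of that measurement, assuming Gaussian demand and no salvage value:

```python
from statistics import NormalDist

def pull_to_centre(orders, mean, std, unit_cost, price):
    """How far average orders drift from the newsvendor optimum toward the mean.

    Optimal order: q* = F^{-1}(cu / (cu + co)), with underage cost
    cu = price - unit_cost and overage cost co = unit_cost.
    Returns ~0.0 for optimal ordering, ~1.0 for fully mean-anchored ordering.
    """
    cu, co = price - unit_cost, unit_cost
    q_star = NormalDist(mean, std).inv_cdf(cu / (cu + co))
    avg_order = sum(orders) / len(orders)
    return (avg_order - q_star) / (mean - q_star)
```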
[212] Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps
Kangyu Wang, Hongliang He, Lin Liu, Ruiqi Liang, Zhenzhong Lan, Jianguo Li
Main category: cs.AI
TL;DR: Inclusion Arena is a live leaderboard for ranking LLMs and MLLMs using human feedback from real-world applications, employing innovative ranking methods for reliability and stability.
Details
Motivation: Existing benchmarks for LLMs and MLLMs rely on static datasets or general-domain prompts, failing to reflect real-world performance. Inclusion Arena addresses this gap by integrating human feedback from practical usage scenarios.
Method: The platform uses pairwise model comparisons in natural user interactions, enhanced by the Bradley-Terry model with Placement Matches (for cold-start ratings) and Proximity Sampling (to prioritize comparisons between similarly capable models).
Result: Inclusion Arena provides reliable rankings, higher data transitivity, and reduced manipulation risks compared to traditional benchmarks.
Conclusion: By linking foundation models to real-world applications, Inclusion Arena aims to advance LLM and MLLM development for practical, user-centric use.
Abstract: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://doraemon.alipay.com/model-ranking.
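Both ranking ingredients can be pictured in a few lines: an online Bradley-Terry (logistic) update per battle, and a sampler that weights candidate pairs by rating proximity. The learning rate and temperature are illustrative assumptions, not the platform's actual settings.

```python
import math
import random

def bt_update(r_a, r_b, a_won, lr=0.05):
    """One online Bradley-Terry rating update after a pairwise battle."""
    p_a = 1.0 / (1.0 + math.exp(r_b - r_a))        # P(A beats B) under BT
    grad = (1.0 if a_won else 0.0) - p_a
    return r_a + lr * grad, r_b - lr * grad

def proximity_sample(ratings, temperature=0.5):
    """Pick a battle pair, favoring similarly rated models (more informative)."""
    names = list(ratings)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    weights = [math.exp(-abs(ratings[a] - ratings[b]) / temperature)
               for a, b in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]
```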
[213] Landmark-Assisted Monte Carlo Planning
David H. Chan, Mark Roberts, Dana S. Nau
Main category: cs.AI
TL;DR: The paper formalizes probabilistic landmarks for stochastic domains, adapting UCT to use them as subgoals in MDPs, improving performance in online probabilistic planning.
Details
Motivation: Landmarks are underutilized in stochastic domains despite their success in classical planning, prompting their formalization and application in MDPs.
Method: Probabilistic landmarks are formalized, and the UCT algorithm is adapted to balance greedy landmark achievement with long-term goal achievement.
Result: Well-chosen landmarks significantly boost UCT performance in benchmark domains, though the optimal balance between greedy and long-term goals varies by problem.
Conclusion: Landmarks offer valuable guidance for anytime algorithms solving MDPs, enhancing planning efficiency in stochastic settings.
Abstract: Landmarks – conditions that must be satisfied at some point in every solution plan – have contributed to major advancements in classical planning, but they have seldom been used in stochastic domains. We formalize probabilistic landmarks and adapt the UCT algorithm to leverage them as subgoals to decompose MDPs; core to the adaptation is balancing between greedy landmark achievement and final goal achievement. Our results in benchmark domains show that well-chosen landmarks can significantly improve the performance of UCT in online probabilistic planning, while the best balance of greedy versus long-term goal achievement is problem-dependent. The results suggest that landmarks can provide helpful guidance for anytime algorithms solving MDPs.
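One way the greedy-versus-final-goal balance could enter UCT is as a bonus term in child selection, alongside the usual UCB exploration term; the node interface and bonus form below are assumptions, not the paper's exact adaptation.

```python
import math

def uct_select(node, landmark_bonus=0.5, c=1.4):
    """UCB1 child selection with an extra bonus for landmark-achieving actions.

    Assumes each child exposes .visits, .total_value, and a boolean
    .achieves_landmark for an as-yet-unachieved landmark.
    """
    def score(child):
        if child.visits == 0:
            return float("inf")                 # expand unvisited children first
        exploit = child.total_value / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        greedy = landmark_bonus if child.achieves_landmark else 0.0
        return exploit + explore + greedy
    return max(node.children, key=score)
```

Tuning `landmark_bonus` is exactly the problem-dependent greedy/long-term trade-off the results point to.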
[214] Inspire or Predict? Exploring New Paradigms in Assisting Classical Planners with Large Language Models
Wenkai Yu, Jianhang Tang, Yang Zhang, Shanjiang Tang, Kebing Jin, Hankz Hankui Zhuo
Main category: cs.AI
TL;DR: The paper proposes an LLM-assisted planner with problem decomposition to address large-scale planning challenges, using LLM4Inspire for general guidance and LLM4Predict for domain-specific knowledge.
Details
Motivation: Large-scale planning problems suffer from state-space explosion, and prior LLM-based approaches lack domain-specific integration for valid plans.
Method: The planner decomposes problems into sub-tasks and uses LLM4Inspire (general knowledge) and LLM4Predict (domain-specific knowledge) to guide the process.
Result: Empirical validation shows LLM4Predict, with domain-specific knowledge, outperforms LLM4Inspire in pruning search spaces and locating feasible solutions.
Conclusion: Integrating domain-specific knowledge into LLMs (LLM4Predict) is particularly effective for large-scale planning problems.
Abstract: Addressing large-scale planning problems has become one of the central challenges in the planning community, stemming from the state-space explosion caused by growing numbers of objects and actions. Recently, researchers have explored the effectiveness of leveraging Large Language Models (LLMs) to generate helpful actions and states to prune the search space. However, prior works have largely overlooked integrating LLMs with domain-specific knowledge to ensure valid plans. In this paper, we propose a novel LLM-assisted planner integrated with problem decomposition, which first decomposes large planning problems into multiple simpler sub-tasks. We then explore two novel paradigms for utilizing LLMs, i.e., LLM4Inspire and LLM4Predict, to assist problem decomposition, where LLM4Inspire provides heuristic guidance according to general knowledge and LLM4Predict employs domain-specific knowledge to infer intermediate conditions. We empirically validate the effectiveness of our planner across multiple domains, demonstrating the benefit of search-space partitioning when solving large-scale planning problems. The experimental results show that LLMs effectively locate feasible solutions when pruning the search space, and that infusing domain-specific knowledge into LLMs, i.e., LLM4Predict, holds particular promise compared with LLM4Inspire, which offers only the general knowledge within LLMs.
[215] Sophisticated Learning: A novel algorithm for active learning during model-based planning
Rowan Hodson, Bruce Bassett, Charel van Hoof, Benjamin Rosman, Mark Solms, Jonathan P. Shock, Ryan Smith
Main category: cs.AI
TL;DR: Sophisticated Learning (SL) enhances Active Inference by integrating active parameter learning, outperforming SI and BARL in early trials and converging faster.
Details
Motivation: To improve decision-making under uncertainty by embedding active parameter learning within the Sophisticated Inference framework.
Method: SL updates beliefs about model parameters during tree-search, enabling counterfactual reasoning. Tested against SI and BARL in a seasonal foraging task.
Result: SL agents survived longer in early trials (8.2% vs. SI, 35% vs. BARL) and converged 40% faster than SI.
Conclusion: Active learning in multi-step planning significantly enhances performance under uncertainty, validating Active Inference’s utility for biological behavior modeling.
Abstract: We introduce Sophisticated Learning (SL), a planning-to-learn algorithm that embeds active parameter learning inside the Sophisticated Inference (SI) tree-search framework of Active Inference. Unlike SI – which optimizes beliefs about hidden states – SL also updates beliefs about model parameters within each simulated branch, enabling counterfactual reasoning about how future observations would improve subsequent planning. We compared SL with Bayes-adaptive Reinforcement Learning (BARL) agents as well as with its parent algorithm, SI. Using a biologically inspired seasonal foraging task in which resources shift probabilistically over a 10x10 grid, we designed experiments that forced agents to balance probabilistic reward harvesting against information gathering. In early trials, where rapid learning is vital, SL agents survive, on average, 8.2% longer than SI and 35% longer than Bayes-adaptive Reinforcement Learning. While both SL and SI showed equal convergence performance, SL reached this convergence 40% faster than SI. Additionally, SL showed robust out-performance of other algorithms in altered environment configurations. Our results show that incorporating active learning into multi-step planning materially improves decision making under radical uncertainty, and reinforces the broader utility of Active Inference for modeling biologically relevant behavior.
[216] MetaAgents: Large Language Model Based Agents for Decision-Making on Teaming
Yuan Li, Lichao Sun, Yixuan Zhang
Main category: cs.AI
TL;DR: The paper introduces MetaAgents, a framework for social simulations using LLM-based agents, focusing on teaming in task-oriented events like job fairs. It evaluates their performance and identifies limitations.
Details
Motivation: To explore LLMs’ underexplored abilities in teaming for task-oriented social events, aiming to mimic human-like social behaviors and efficient team formation.
Method: Develops MetaAgents, a framework with LLM-based agents, and uses a job fair case study to analyze team assembly and skill-matching behaviors through quantitative and qualitative evaluations.
Result: LLM-based agents perform well in rational decision-making for team formation but face limitations in complex tasks.
Conclusion: The study offers insights into LLMs’ role in social simulations, highlighting their potential and current constraints.
Abstract: Significant advancements have occurred in the application of Large Language Models (LLMs) for social simulations. Despite this, their abilities to perform teaming in task-oriented social events are underexplored. Such capabilities are crucial if LLMs are to effectively mimic human-like social behaviors and form efficient teams to solve tasks. To bridge this gap, we introduce MetaAgents, a social simulation framework populated with LLM-based agents. MetaAgents facilitates agent engagement in conversations and a series of decision-making processes within social contexts, serving as an appropriate platform for investigating interactions and interpersonal decision-making of agents. In particular, we construct a job fair environment as a case study to scrutinize the team assembly and skill-matching behaviors of LLM-based agents. We take advantage of both quantitative metric evaluation and qualitative text analysis to assess their teaming abilities at the job fair. Our evaluation demonstrates that LLM-based agents perform competently in making rational decisions to develop efficient teams. However, we also identify limitations that hinder their effectiveness in more complex team assembly tasks. Our work provides valuable insights into the role and evolution of LLMs in task-oriented social simulations.
[217] Tool-Planner: Task Planning with Clusters across Multiple Tools
Yanming Liu, Xinyue Peng, Jiannan Cao, Yuwei Zhang, Xuhong Zhang, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu Du
Main category: cs.AI
TL;DR: Tool-Planner enhances LLMs’ tool learning by grouping tools into toolkits, improving planning stability and execution efficiency.
Details
Motivation: Address challenges in tool learning for LLMs, such as unstable planning due to redundant error correction and difficulty in designing correct plans among multiple tools.
Method: Propose Tool-Planner, a framework that groups tools into toolkits and allows LLMs to plan across them, adjusting tools when errors occur.
Result: High pass and win rates across datasets, optimizing planning for models like GPT-4 and Claude 3.
Conclusion: Tool-Planner showcases potential for improving tool learning in LLMs by addressing key challenges.
Abstract: Large language models (LLMs) have demonstrated exceptional reasoning capabilities, enabling them to solve various complex problems. Recently, this ability has been applied to the paradigm of tool learning. Tool learning involves providing examples of tool usage and their corresponding functions, allowing LLMs to formulate plans and demonstrate the process of invoking and executing each tool. LLMs can address tasks that they cannot complete independently, thereby enhancing their potential across different tasks. However, this approach faces two key challenges. First, redundant error correction leads to unstable planning and long execution times. Second, designing a correct plan among multiple tools is also a challenge in tool learning. To address these issues, we propose Tool-Planner, a task-processing framework based on toolkits. Tool-Planner groups tools whose APIs implement the same function into toolkits and allows LLMs to plan across the various toolkits. When a tool error occurs, the language model can reselect and adjust tools based on the toolkit. Experiments show that our approach achieves high pass and win rates across different datasets and optimizes the planning scheme for tool learning in models such as GPT-4 and Claude 3, showcasing the potential of our method. Our code is public at https://github.com/OceannTwT/Tool-Planner
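A toy sketch may make the toolkit mechanism concrete. The tool names, the simulated failure, and the `run_tool` stub below are invented for illustration; only the grouping-and-reselection pattern follows the abstract.

```python
# Illustrative sketch of the toolkit idea: tools implementing the same
# function are grouped, planning happens at the toolkit level, and a
# failing tool is swapped for a sibling from the same toolkit.
from collections import defaultdict

TOOLS = [  # (tool name, function it implements) -- invented entries
    ("serpapi_search", "web_search"), ("bing_search", "web_search"),
    ("wolfram_eval", "calculator"), ("python_eval", "calculator"),
]

def build_toolkits(tools):
    kits = defaultdict(list)
    for name, function in tools:
        kits[function].append(name)
    return kits

def run_tool(name, query):
    if name == "serpapi_search":          # simulate one tool failing
        raise RuntimeError("quota exceeded")
    return f"{name} answered: {query}"

def execute_plan(plan, kits, query):
    for function in plan:                 # plan is a sequence of toolkits
        for tool in kits[function]:       # reselect within the toolkit on error
            try:
                print(run_tool(tool, query))
                break
            except RuntimeError as err:
                print(f"{tool} failed ({err}); trying a sibling tool")

execute_plan(["web_search", "calculator"], build_toolkits(TOOLS), "2+2 sites")
```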
[218] Sketch Decompositions for Classical Planning via Deep Reinforcement Learning
Michael Aichmüller, Hector Geffner
Main category: cs.AI
TL;DR: The paper proposes a deep reinforcement learning (DRL) approach to learn sketch decompositions for planning tasks, addressing scalability and expressivity limitations of existing methods.
Details
Motivation: Identifying common subgoal structures is crucial for long-horizon goals in planning and reinforcement learning. Existing methods using sketches face scalability and expressivity issues.
Method: The problem is formulated as a DRL task, where general policies are learned in a modified planning problem using IW$(k)$ searches.
Result: The DRL-derived decompositions are evaluated across domains, showing effectiveness in solving problems via greedy IW$(k)$ searches, though lacking interpretable rule-based sketches.
Conclusion: The DRL approach successfully addresses scalability and expressivity but sacrifices interpretability, though decompositions remain understandable.
Abstract: In planning and reinforcement learning, the identification of common subgoal structures across problems is important when goals are to be achieved over long horizons. Recently, it has been shown that such structures can be expressed as feature-based rules, called sketches, over a number of classical planning domains. These sketches split problems into subproblems which then become solvable in low polynomial time by a greedy sequence of IW$(k)$ searches. Methods for learning sketches using feature pools and min-SAT solvers have been developed, yet they face two key limitations: scalability and expressivity. In this work, we address these limitations by formulating the problem of learning sketch decompositions as a deep reinforcement learning (DRL) task, where general policies are sought in a modified planning problem where the successor states of a state s are defined as those reachable from s through an IW$(k)$ search. The sketch decompositions obtained through this method are experimentally evaluated across various domains, and problems are regarded as solved by the decomposition when the goal is reached through a greedy sequence of IW$(k)$ searches. While our DRL approach for learning sketch decompositions does not yield interpretable sketches in the form of rules, we demonstrate that the resulting decompositions can often be understood in a crisp manner.
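Because the approach is built around IW$(k)$ searches, a compact IW(1) implementation is a useful reference: breadth-first search that prunes any state which makes no new atom true. The one-dimensional toy domain below is purely illustrative.

```python
# A compact IW(1) search: states whose atoms have all been seen before are
# pruned, which keeps the search polynomial on low-width problems.
# Toy domain: move a token along a line from position 0 to position 4.
from collections import deque

def atoms(state):
    """A state's set of size-1 'facts'."""
    return {("at", state)}

def successors(state):
    return [s for s in (state - 1, state + 1) if 0 <= s <= 10]

def iw1(start, is_goal):
    seen_atoms, frontier = set(atoms(start)), deque([(start, [])])
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path + [state]
        for nxt in successors(state):
            novel = atoms(nxt) - seen_atoms      # novelty-1 test
            if novel:                            # keep only novel states
                seen_atoms |= novel
                frontier.append((nxt, path + [state]))
    return None                                  # problem has width > 1

print(iw1(0, lambda s: s == 4))  # -> [0, 1, 2, 3, 4]
```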
[219] Learning to Be A Doctor: Searching for Effective Medical Agent Architectures
Yangyang Zhuang, Wenjia Jiang, Jiayu Zhang, Ze Yang, Joey Tianyi Zhou, Chi Zhang
Main category: cs.AI
TL;DR: The paper introduces an automated framework for designing medical agent architectures, addressing the limitations of static workflows in existing systems. It uses a hierarchical search space and graph-based structures to enhance adaptability and diagnostic accuracy.
Details
Motivation: Existing medical agent systems rely on static workflows, which lack flexibility for diverse diagnostic needs and emerging clinical scenarios. Inspired by AutoML, the paper aims to automate medical agent design.
Method: The framework defines a hierarchical agent search space for dynamic workflow adaptation. It conceptualizes agents as graph-based architectures with diverse node types and supports iterative self-improvement using diagnostic feedback.
Result: Experiments on skin disease diagnosis show the method effectively evolves workflows and improves diagnostic accuracy over time.
Conclusion: This work presents the first fully automated framework for medical agent architecture design, offering a scalable and adaptable solution for clinical deployment.
Abstract: Large Language Model (LLM)-based agents have demonstrated strong capabilities across a wide range of tasks, and their application in the medical domain holds particular promise due to the demand for high generalizability and reliance on interdisciplinary knowledge. However, existing medical agent systems often rely on static, manually crafted workflows that lack the flexibility to accommodate diverse diagnostic requirements and adapt to emerging clinical scenarios. Motivated by the success of automated machine learning (AutoML), this paper introduces a novel framework for the automated design of medical agent architectures. Specifically, we define a hierarchical and expressive agent search space that enables dynamic workflow adaptation through structured modifications at the node, structural, and framework levels. Our framework conceptualizes medical agents as graph-based architectures composed of diverse, functional node types and supports iterative self-improvement guided by diagnostic feedback. Experimental results on skin disease diagnosis tasks demonstrate that the proposed method effectively evolves workflow structures and significantly enhances diagnostic accuracy over time. This work represents the first fully automated framework for medical agent architecture design and offers a scalable, adaptable foundation for deploying intelligent agents in real-world clinical environments.
[220] CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking
Yuehao Huang, Liang Liu, Shuangming Lei, Yukai Ma, Hao Su, Jianbiao Mei, Pengxiang Zhao, Yaqing Gu, Yong Liu, Jiajun Lv
Main category: cs.AI
TL;DR: CogDDN is a VLM-based framework for demand-driven navigation that integrates human-like cognitive mechanisms, outperforming traditional methods by 15% in navigation accuracy.
Details
Motivation: To address the limitations of data-driven DDN methods, which lack generalization in unseen scenarios, by emulating human cognitive processes.
Method: Combines fast and slow thinking systems, semantic alignment of objects, and dual-process decision-making (Heuristic and Analytic Processes) with CoT reasoning.
Result: Achieves 15% better navigation accuracy than single-view camera-only methods in AI2Thor simulations.
Conclusion: CogDDN enhances adaptability and performance in unknown environments by mimicking human cognition.
Abstract: Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at https://yuehaohuang.github.io/CogDDN/.
[221] AirTrafficGen: Configurable Air Traffic Scenario Generation with Large Language Models
Dewi Sid William Gould, George De Ath, Ben Carvell, Nick Pepper
Main category: cs.AI
TL;DR: Automating ATC scenario generation using LLMs to overcome manual design limitations.
Details
Motivation: Manual ATC scenario design is time-consuming and limits diversity, necessitating automation.
Method: Uses LLMs with a graph-based representation of sector topology for scenario generation.
Result: LLMs like Gemini 2.5 Pro and GPT-5 generate realistic, high-traffic scenarios with iterative refinement.
Conclusion: LLMs offer scalable, automated ATC scenario generation, enhancing training diversity and volume.
Abstract: The manual design of scenarios for Air Traffic Control (ATC) training is a demanding and time-consuming bottleneck that limits the diversity of simulations available to controllers. To address this, we introduce a novel, end-to-end approach, $\texttt{AirTrafficGen}$, that leverages large language models (LLMs) to automate and control the generation of complex ATC scenarios. Our method uses a purpose-built, graph-based representation to encode sector topology (including airspace geometry, routes, and fixes) into a format LLMs can process. Through rigorous benchmarking, we show that state-of-the-art models like Gemini 2.5 Pro, OpenAI o3, GPT-oss-120b and GPT-5 can generate high-traffic scenarios while maintaining operational realism. Our engineered prompting enables fine-grained control over interaction presence, type, and location. Initial findings suggest these models are also capable of iterative refinement, correcting flawed scenarios based on simple textual feedback. This approach provides a scalable alternative to manual scenario design, addressing the need for a greater volume and variety of ATC training and validation simulations. More broadly, this work showcases the potential of LLMs for complex planning in safety-critical domains.
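As a rough sketch of the graph-based encoding idea, the snippet below serializes a tiny sector graph into a generation prompt. The JSON schema and prompt wording are assumptions, not the paper's actual format.

```python
# Hypothetical sketch: encode a sector's topology as a small graph and
# serialize it into the text an LLM scenario-generation prompt could use.
import json

sector = {
    "fixes": ["ALPHA", "BRAVO", "CHARLIE"],            # named waypoints
    "routes": [["ALPHA", "BRAVO"], ["BRAVO", "CHARLIE"]],
    "geometry": {"floor_ft": 10000, "ceiling_ft": 30000},
}

def sector_to_prompt(sector, n_aircraft=6, interaction="crossing"):
    graph_txt = json.dumps(sector, indent=2)
    return (
        "You are an ATC scenario generator.\n"
        f"Sector topology (graph):\n{graph_txt}\n"
        f"Generate {n_aircraft} aircraft whose routes follow the graph edges "
        f"and include exactly one {interaction} interaction near BRAVO. "
        "Output one aircraft per line as: callsign, route, entry_time_s, FL."
    )

print(sector_to_prompt(sector))
```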
[222] IRL-VLA: Training a Vision-Language-Action Policy via Reward World Model
Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, Jijun Wang, Zichong Gu, Hao Jiang, Li Sun
Main category: cs.AI
TL;DR: The paper introduces IRL-VLA, a novel closed-loop reinforcement learning framework for autonomous driving, addressing limitations of imitation learning and simulation reliance.
Details
Motivation: Existing VLA models suffer from suboptimal performance due to open-loop imitation learning and computational inefficiencies in closed-loop training with simulated sensors.
Method: A three-stage approach: (1) pretrain a VLA policy via imitation learning, (2) build a lightweight reward world model via inverse reinforcement learning, (3) enhance planning with PPO-based reinforcement learning.
Result: Achieves top performance in NAVSIM v2 benchmark and ranks 1st runner-up in CVPR2025 Autonomous Grand Challenge.
Conclusion: IRL-VLA accelerates VLA research for closed-loop autonomous driving by balancing safety, comfort, and efficiency.
Abstract: Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to capture the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework built on an \textbf{I}nverse \textbf{R}einforcement \textbf{L}earning reward world model with a self-built VLA approach. Our framework proceeds in a three-stage paradigm: in the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design reward-world-model-guided reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance safety incidents, driving comfort, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and placed first runner-up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
[223] DSperse: A Framework for Targeted Verification in Zero-Knowledge Machine Learning
Dan Ivanov, Tristan Freiberg, Shirin Shahabi, Jonathan Gold, Haruna Isah
Main category: cs.AI
TL;DR: DSperse is a modular framework for distributed ML inference with cryptographic verification, enabling targeted verification of subcomputations for cost efficiency.
Details
Motivation: To address the high cost and rigidity of full-model circuitization in distributed zero-knowledge ML by enabling flexible, targeted verification.
Method: Uses strategically chosen verifiable segments (“slices”) in the inference pipeline, enforced via audit, replication, or incentives, and evaluates with multiple proving systems.
Result: Empirical results report memory usage, runtime, and circuit behavior under sliced and unsliced configurations.
Conclusion: DSperse offers scalable, targeted verification strategies, aligning proof boundaries with model structure for diverse deployment needs.
Abstract: DSperse is a modular framework for distributed machine learning inference with strategic cryptographic verification. Operating within the emerging paradigm of distributed zero-knowledge machine learning, DSperse avoids the high cost and rigidity of full-model circuitization by enabling targeted verification of strategically chosen subcomputations. These verifiable segments, or “slices”, may cover part or all of the inference pipeline, with global consistency enforced through audit, replication, or economic incentives. This architecture supports a pragmatic form of trust minimization, localizing zero-knowledge proofs to the components where they provide the greatest value. We evaluate DSperse using multiple proving systems and report empirical results on memory usage, runtime, and circuit behavior under sliced and unsliced configurations. By allowing proof boundaries to align flexibly with the model’s logical structure, DSperse supports scalable, targeted verification strategies suited to diverse deployment needs.
[224] PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning
Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu
Main category: cs.AI
TL;DR: PASS introduces a multimodal framework for Chest X-Ray reasoning, addressing trust, safety, and efficiency issues in agentic systems by adaptively sampling workflows and annotating decision paths with probabilities.
Details
Motivation: Existing agentic systems lack transparency, multimodal integration, and efficiency, posing risks in healthcare tasks like CXR reasoning.
Method: PASS uses a probabilistic supernet to sample workflows, leverages task-conditioned distributions, and employs a three-stage training procedure for optimization.
Result: PASS outperforms baselines in accuracy, AUC, and efficiency, validated across benchmarks.
Conclusion: PASS advances interpretable, adaptive, and multimodal medical AI, setting a new paradigm for safety-critical agentic systems.
Abstract: Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust in decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, AUC, LLM-J.) while balancing computational costs, driving a paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems.
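A toy sketch conveys the supernet-sampling idea of probability-annotated decision paths; the layer structure, tool names, and early-exit rule below are assumptions made for illustration.

```python
# Toy sketch of probability-annotated workflow sampling: at each supernet
# layer, sample one tool from a task-conditioned distribution and record
# its probability, which supports post-hoc audits of the decision path.
import numpy as np

rng = np.random.default_rng(0)
LAYERS = [["segmenter", "detector"], ["report_drafter", "vqa_tool", "EXIT"]]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_workflow(task_embedding):
    path = []
    for layer, tools in enumerate(LAYERS):
        # Task-conditioned logits; a random projection stands in here.
        logits = rng.normal(size=len(tools)) + task_embedding[layer]
        probs = softmax(logits)
        idx = rng.choice(len(tools), p=probs)
        path.append((tools[idx], round(float(probs[idx]), 3)))
        if tools[idx] == "EXIT":       # dynamic early exit for efficiency
            break
    return path                         # probability-annotated trajectory

print(sample_workflow(task_embedding=np.array([0.2, -0.1])))
```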
cs.SD
[225] Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection
Chongyang Gao, Marco Postiglione, Isabel Gortner, Sarit Kraus, V. S. Subrahmanian
Main category: cs.SD
TL;DR: Current audio deepfake detectors perform poorly in real-world scenarios. The P$^{2}$V dataset addresses this by including realistic challenges like identity-consistent transcripts, noise, and advanced cloning. Tests show existing detectors lose 43% performance on P$^{2}$V, while P$^{2}$V-trained models remain robust.
Details
Motivation: Existing audio deepfake detectors fail in real-world conditions, highlighting the need for a more robust dataset and evaluation framework.
Method: Introduces P$^{2}$V, a dataset with identity-consistent transcripts, environmental/adversarial noise, and advanced voice cloning (2020-2025). Evaluates 22 detectors on P$^{2}$V.
Result: Existing detectors lose 43% performance on P$^{2}$V. Adversarial perturbations degrade performance by 16%, and advanced cloning reduces detectability by 20-30%. P$^{2}$V-trained models outperform others.
Conclusion: P$^{2}$V sets a new benchmark for robust audio deepfake detection, exposing vulnerabilities in current methods and improving real-world applicability.
Abstract: Current audio deepfake detectors cannot be trusted. While they excel on controlled benchmarks, they fail when tested in the real world. We introduce Perturbed Public Voices (P$^{2}$V), an IRB-approved dataset capturing three critical aspects of malicious deepfakes: (1) identity-consistent transcripts via LLMs, (2) environmental and adversarial noise, and (3) state-of-the-art voice cloning (2020-2025). Experiments reveal alarming vulnerabilities of 22 recent audio deepfake detectors: models trained on current datasets lose 43% performance when tested on P$^{2}$V, with performance measured as the mean of F1 score on deepfake audio, AUC, and 1-EER. Simple adversarial perturbations induce up to 16% performance degradation, while advanced cloning techniques reduce detectability by 20-30%. In contrast, P$^{2}$V-trained models maintain robustness against these attacks while generalizing to existing datasets, establishing a new benchmark for robust audio deepfake detection. P$^{2}$V will be publicly released upon acceptance by a conference/journal.
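The composite score used above, the mean of deepfake F1, AUC, and 1-EER, can be reconstructed as a small helper; the 0.5 decision threshold for F1 is an assumption.

```python
# Composite detection score: mean of F1 on deepfake audio, AUC, and 1-EER.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

def composite_score(y_true, y_score, threshold=0.5):
    """y_true: 1 = deepfake, 0 = real; y_score: predicted deepfake probability."""
    f1 = f1_score(y_true, (np.asarray(y_score) >= threshold).astype(int))
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]  # standard EER estimate
    return (f1 + auc + (1 - eer)) / 3.0

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55]
print(round(composite_score(y_true, y_score), 3))
```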
[226] LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters
Haomin Zhang, Kristin Qi, Shuxin Yang, Zihao Chen, Chaofan Ding, Xinhan Di
Main category: cs.SD
TL;DR: LD-LAudio-V1 improves long-form video-to-audio generation with dual lightweight adapters and a clean dataset, reducing artifacts and inconsistencies while boosting performance metrics.
Details
Motivation: Existing methods struggle with long-form audio generation and rely on noisy datasets, limiting quality and synchronization.
Method: Introduces LD-LAudio-V1, extending state-of-the-art models with dual lightweight adapters and a clean, annotated dataset.
Result: Significant improvements in metrics like FD, KL, IS, and semantic relevance, with reduced splicing artifacts and energy discrepancies.
Conclusion: LD-LAudio-V1 advances long-form audio generation, supported by a high-quality dataset for future research.
Abstract: Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: $FD_{\text{passt}}$ 450.00 $\rightarrow$ 327.29 (+27.27%), $FD_{\text{panns}}$ 34.88 $\rightarrow$ 22.68 (+34.98%), $FD_{\text{vgg}}$ 3.75 $\rightarrow$ 1.28 (+65.87%), $KL_{\text{panns}}$ 2.49 $\rightarrow$ 2.07 (+16.87%), $KL_{\text{passt}}$ 1.78 $\rightarrow$ 1.53 (+14.04%), $IS_{\text{panns}}$ 4.17 $\rightarrow$ 4.30 (+3.12%), $IB_{\text{score}}$ 0.25 $\rightarrow$ 0.28 (+12.00%), $Energy\Delta10\text{ms}$ 0.3013 $\rightarrow$ 0.1349 (+55.23%), $Energy\Delta10\text{ms(vs.GT)}$ 0.0531 $\rightarrow$ 0.0288 (+45.76%), and $\text{Sem. Rel.}$ 2.73 $\rightarrow$ 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at https://github.com/deepreasonings/long-form-video2audio.
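The paper's adapter design is not detailed here, so the sketch below shows a generic lightweight bottleneck adapter of the kind commonly inserted into a frozen backbone; the dimensions and zero-initialized up-projection are conventional choices, not confirmed specifics.

```python
# Generic lightweight adapter sketch: residual down-project / up-project
# module; only these few weights would be trained on top of a frozen
# video-to-audio backbone.
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (batch, time, dim)
        return x + self.up(self.act(self.down(x)))

frozen_features = torch.randn(2, 1000, 768)   # long-form temporal features
adapter = LightweightAdapter()
print(adapter(frozen_features).shape)          # torch.Size([2, 1000, 768])
n_trainable = sum(p.numel() for p in adapter.parameters())
print(f"{n_trainable} trainable parameters")   # tiny next to the backbone
```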
[227] Benchmarking Prosody Encoding in Discrete Speech Tokens
Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu
Main category: cs.SD
TL;DR: The paper analyzes how well discrete tokens from SSL models capture prosodic features in speech, aiming to guide better token design.
Details
Motivation: Discrete tokens from SSL models are used in speech tasks, but their ability to encode prosody is understudied. This work fills that gap.
Method: The study evaluates prosodic encoding by testing token sensitivity to artificially modified prosody.
Result: Findings provide insights into how discrete tokens handle prosodic information.
Conclusion: The analysis offers practical guidelines for designing discrete tokens to better capture prosody in speech tasks.
Abstract: Recently, discrete tokens derived from self-supervised learning (SSL) models via k-means clustering have been actively studied as pseudo-text in speech language models and as efficient intermediate representations for various tasks. However, these discrete tokens are typically learned in advance, separately from the training of language models or downstream tasks. As a result, choices related to discretization, such as the SSL model used or the number of clusters, must be made heuristically. In particular, speech language models are expected to understand and generate responses that reflect not only the semantic content but also prosodic features. Yet, there has been limited research on the ability of discrete tokens to capture prosodic information. To address this gap, this study conducts a comprehensive analysis of prosodic encoding based on the tokens' sensitivity to artificially modified prosody, aiming to provide practical guidelines for designing discrete tokens.
[228] Mitigating Category Imbalance: Fosafer System for the Multimodal Emotion and Intent Joint Understanding Challenge
Honghong Wang, Yankai Wang, Dejun Zhang, Jing Deng, Rong Zheng
Main category: cs.SD
TL;DR: The paper presents the Fosafer system for joint emotion and intent recognition in Mandarin, addressing category imbalance with data augmentation, a novel loss function, and modal dropout.
Details
Motivation: To tackle the challenge of joint emotion and intent recognition in Mandarin, especially the issue of category imbalance.
Method: Uses data augmentation across text, video, and audio, introduces the Sample-Weighted Focal Contrastive loss, fine-tunes the Hubert model, and employs modal dropout and plurality voting.
Result: Achieves second-best performance in the Track 2 Mandarin challenge.
Conclusion: The proposed method effectively addresses category imbalance and modal competition, demonstrating strong performance.
Abstract: This paper presents the Fosafer approach to the Track 2 Mandarin task of the Multimodal Emotion and Intent Joint Understanding challenge, which focuses on achieving joint recognition of emotion and intent in Mandarin despite the issue of category imbalance. To alleviate this issue, we use a variety of data augmentation techniques across the text, video, and audio modalities. Additionally, we introduce the Sample-Weighted Focal Contrastive loss, designed to address the challenges of recognizing minority-class samples and those that are semantically similar but difficult to distinguish. Moreover, we fine-tune the Hubert model to adapt it to joint emotion and intent recognition. To mitigate modal competition, we introduce a modal dropout strategy. For the final predictions, a plurality-voting approach is used to determine the results. The experimental results demonstrate the effectiveness of our method, which achieves the second-best performance in the Track 2 Mandarin challenge.
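A minimal sketch of the modal-dropout idea, assuming a simple late-fusion setup (the drop rate and concatenation fusion are illustrative choices):

```python
# Modal dropout sketch: during training, occasionally zero an entire
# modality's features so no single modality dominates the fused
# representation; at inference, all modalities pass through unchanged.
import torch

def modal_dropout(text, video, audio, p=0.2, training=True):
    feats = []
    for x in (text, video, audio):
        if training and torch.rand(()) < p:
            x = torch.zeros_like(x)       # drop the whole modality
        feats.append(x)
    return torch.cat(feats, dim=-1)       # simple late fusion

text = torch.randn(4, 128)
video = torch.randn(4, 128)
audio = torch.randn(4, 128)
print(modal_dropout(text, video, audio).shape)  # torch.Size([4, 384])
```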
[229] Speech Emotion Recognition Using Fine-Tuned DWFormer: A Study on Track 1 of the IERP Challenge 2024
Honghong Wang, Xupeng Jia, Jing Deng, Rong Zheng
Main category: cs.SD
TL;DR: The paper discusses Fosafer’s winning approach in the IERP Challenge 2024 Track 1, focusing on audio-based emotion recognition by integrating personality traits, data augmentation, and score fusion with a pre-trained model.
Details
Motivation: To improve emotion recognition by incorporating personality traits, addressing individual differences in emotional expression.
Method: Fine-tuned the pre-trained DWFormer model using audio features, data augmentation, and score fusion strategies.
Result: Achieved first place in the IERP Challenge 2024 Track 1.
Conclusion: Integrating personality traits and advanced techniques like data augmentation and score fusion enhances emotion recognition performance.
Abstract: The field of artificial intelligence has a strong interest in the topic of emotion recognition. The majority of existing emotion recognition models aim to enhance the precision of discrete emotion label prediction. Given the direct relationship between human personality and emotion, as well as the significant inter-individual differences in subjective emotional expression, the IERP Challenge 2024 incorporates personality traits into emotion recognition research. This paper presents the Fosafer submissions to Track 1 of the IERP Challenge 2024. This task primarily concerns the recognition of emotions in audio, while also providing text and audio features. In Track 1, we used exclusively audio-based features and fine-tuned a pre-trained speech emotion recognition model, DWFormer, through the integration of data augmentation and score fusion strategies, achieving first place among the participating teams.
[230] Pretrained Conformers for Audio Fingerprinting and Retrieval
Kemal Altwlkany, Elmedin Selmanovic, Sead Delalic
Main category: cs.SD
TL;DR: A self-supervised contrastive learning framework trains conformer-based encoders for audio retrieval, achieving state-of-the-art results with 3-second audio segments and robustness to distortions.
Details
Motivation: To leverage conformers' ability to capture local and global interactions for improved audio retrieval tasks.
Method: Utilizes a self-supervised contrastive learning framework to train conformer-based encoders for generating unique audio embeddings.
Result: State-of-the-art performance in audio retrieval, robustness to distortions (noise, reverb, temporal stretching), and immunity to temporal misalignments.
Conclusion: The approach is effective, reproducible, and publicly available, demonstrating strong generalization to unseen data.
Abstract: Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.
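The abstract does not specify the loss, but a standard NT-Xent contrastive objective of the kind used in such self-supervised frameworks looks like this (batch size, embedding dimension, and temperature are illustrative):

```python
# NT-Xent contrastive loss sketch: two distorted views of the same audio
# segment should embed close together; other segments in the batch serve
# as negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same segments."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # (2B, dim)
    sim = z @ z.T / temperature                          # cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    batch = z1.shape[0]
    # The positive of sample i is sample i + B (and vice versa).
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

z1 = torch.randn(8, 128)  # e.g., clean segments' embeddings
z2 = torch.randn(8, 128)  # the same segments with noise/reverb applied
print(nt_xent(z1, z2).item())
```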
[231] L3AC: Towards a Lightweight and Lossless Audio Codec
Linwei Zhai, Han Ding, Cui Zhao, Fei Wang, Ge Wang, Wang Zhi, Wei Xi
Main category: cs.SD
TL;DR: L3AC is a lightweight neural audio codec using a single quantizer and efficient architecture to match or outperform leading codecs with reduced computational overhead.
Details
Motivation: Existing neural audio codecs are resource-intensive and complex, limiting practicality. L3AC aims to simplify and improve efficiency.
Method: L3AC employs streamlined convolutional networks, local Transformer modules, and a novel TConv structure for multi-scale acoustic variation capture.
Result: L3AC matches or exceeds leading codecs in reconstruction quality while reducing computational overhead by an order of magnitude.
Conclusion: L3AC offers a practical, efficient solution for high-fidelity audio compression and generative modeling, with open-source availability.
Abstract: Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and provide discrete tokens for generative modeling. However, leading approaches often rely on resource-intensive models and complex multi-quantizer architectures, limiting their practicality in real-world applications. In this work, we introduce L3AC, a lightweight neural audio codec that addresses these challenges by leveraging a single quantizer and a highly efficient architecture. To enhance reconstruction fidelity while minimizing model complexity, L3AC explores streamlined convolutional networks and local Transformer modules, alongside TConv, a novel structure designed to capture acoustic variations across multiple temporal scales. Despite its compact design, extensive experiments across diverse datasets demonstrate that L3AC matches or exceeds the reconstruction quality of leading codecs while reducing computational overhead by an order of magnitude. The single-quantizer design further enhances its adaptability for downstream tasks. The source code is publicly available at https://github.com/zhai-lw/L3AC.
[232] Neurodyne: Neural Pitch Manipulation with Representation Learning and Cycle-Consistency GAN
Yicheng Gu, Chaoren Wang, Zhizheng Wu, Lauri Juvela
Main category: cs.SD
TL;DR: Neurodyne improves neural-network-based pitch manipulation by using adversarial representation learning and cycle-consistency training, enhancing synthesis quality and preserving singer identity.
Details
Motivation: Current neural-network-based pitch-manipulation systems suffer from inaccurate feature disentanglement and lack of paired training data, limiting their performance.
Method: Neurodyne employs adversarial representation learning for pitch-independent latent representation and cycle-consistency training to implicitly create paired data.
Result: Experiments show improved synthesis quality in global-key and template-based pitch manipulation while retaining the original singer identity.
Conclusion: Neurodyne effectively addresses the limitations of existing systems, offering superior pitch manipulation with enhanced quality and identity preservation.
Abstract: Pitch manipulation is the process of producers adjusting the pitch of an audio segment to a specific key and intonation, which is essential in music production. Neural-network-based pitch-manipulation systems have been popular in recent years due to their superior synthesis quality compared to classical DSP methods. However, their performance is still limited due to their inaccurate feature disentanglement using source-filter models and the lack of paired in- and out-of-tune training data. This work proposes Neurodyne to address these issues. Specifically, Neurodyne uses adversarial representation learning to learn a pitch-independent latent representation to avoid inaccurate disentanglement and cycle-consistency training to create paired training data implicitly. Experimental results on global-key and template-based pitch manipulation demonstrate the effectiveness of the proposed system, marking improved synthesis quality while maintaining the original singer identity.
cs.LG
[233] A Cooperative Game-Based Multi-Criteria Weighted Ensemble Approach for Multi-Class Classification
DongSeong-Yoon
Main category: cs.LG
TL;DR: The paper proposes a cooperative game-based method for multi-criteria ensemble weighting to improve performance by considering diverse classifier pre-information.
Details
Motivation: Existing ensemble weighting methods use only one evaluation criterion, limiting realistic model performance.
Method: Proposes a cooperative game approach to simultaneously reflect diverse classifier pre-information in weights.
Result: Applied to the OpenML-CC18 dataset suite, outperforming existing weighting methods.
Conclusion: The method enhances ensemble performance by better weight distribution through multi-criteria consideration.
Abstract: Since the Fourth Industrial Revolution, AI technology has been widely used in many fields, but several limitations remain to be overcome, including overfitting/underfitting, class imbalance, and the limited representation (hypothesis space) inherent to each model. As a method to overcome these problems, ensembling, commonly known as model combining, is extensively used in the field of machine learning. Among ensemble learning methods, voting ensembles have been studied with various weighting schemes, showing performance improvements. However, existing methods that reflect the pre-information of classifiers in the weights consider only one evaluation criterion, which limits the range of information that should realistically be taken into account. Therefore, this paper proposes a method of making decisions through cooperative games in multi-criteria situations. With this method, various types of information known beforehand about the classifiers can be considered and reflected simultaneously, leading to an appropriate weight distribution and performance improvement. Machine learning algorithms were applied to the OpenML-CC18 dataset suite and compared with existing ensemble weighting methods. The experimental results showed superior performance compared to other weighting methods.
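The paper's exact game formulation is not reproduced above; as a rough illustration of the cooperative-game ingredient, the sketch below treats classifiers as players and turns the Shapley values of a toy coalition-value function into ensemble weights.

```python
# Cooperative-game sketch: classifiers are players, a coalition's value is
# its (toy) ensemble accuracy, and each classifier's Shapley value becomes
# its voting weight. The ACC table is invented; in practice it would be
# measured on validation data.
from itertools import combinations
from math import factorial

ACC = {
    (): 0.0, ("a",): 0.70, ("b",): 0.65, ("c",): 0.60,
    ("a", "b"): 0.78, ("a", "c"): 0.74, ("b", "c"): 0.70,
    ("a", "b", "c"): 0.80,
}
PLAYERS = ("a", "b", "c")

def shapley(player):
    n, total = len(PLAYERS), 0.0
    others = [p for p in PLAYERS if p != player]
    for r in range(n):
        for coal in combinations(others, r):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            with_p = tuple(sorted(coal + (player,)))
            total += weight * (ACC[with_p] - ACC[tuple(sorted(coal))])
    return total

weights = {p: shapley(p) for p in PLAYERS}
norm = sum(weights.values())
print({p: round(w / norm, 3) for p, w in weights.items()})
```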
[234] Apriel-Nemotron-15B-Thinker
Shruthan Radhakrishna, Soham Parikh, Gopal Sarda, Anil Turkkan, Quaizar Vohra, Raymond Li, Dhruv Jhamb, Kelechi Ogueji, Aanjaneya Shukla, Oluwanifemi Bamgbose, Toby Liang, Luke Kumar, Oleksiy Ostapenko, Shiva Krishna Reddy Malay, Aman Tiwari, Tara Bogavelli, Vikas Yadav, Jash Mehta, Saloni Mittal, Akshay Kalkunte, Pulkit Pattnaik, Khalil Slimi, Anirudh Sreeram, Jishnu Nair, Akintunde Oladipo, Shashank Maiya, Khyati Mahajan, Rishabh Maheshwary, Masoud Hashemi, Sai Rajeswar Mudumba, Sathwik Tejaswi Madhusudhan, Torsten Scholak, Sebastien Paquet, Sagar Davasam, Srinivas Sunkara
Main category: cs.LG
TL;DR: Apriel-Nemotron-15B-Thinker is a 15B-parameter model that matches or outperforms larger 32B-parameter models while using half the memory, trained via a four-stage pipeline.
Details
Motivation: Address the impractical memory and computational costs of large language models (LLMs) in enterprise settings by developing a smaller yet competitive model.
Method: Four-stage training pipeline: Base Model upscaling, Continual Pre-training, Supervised Fine-tuning (SFT), and Reinforcement Learning using GRPO.
Result: Matches or exceeds performance of 32B-parameter models (e.g., o1-mini, QWQ32B, EXAONE-Deep-32B) with half the memory footprint.
Conclusion: Apriel-Nemotron-15B-Thinker offers a practical, efficient alternative to larger LLMs without sacrificing performance.
Abstract: While large language models (LLMs) have achieved remarkable reasoning capabilities across domains like code, math, and other enterprise tasks, their significant memory and computational costs often preclude their use in practical enterprise settings. To this end, we introduce Apriel-Nemotron-15B-Thinker, a 15-billion-parameter model in the ServiceNow Apriel SLM series that achieves performance competitive with medium-sized state-of-the-art models such as o1-mini, QWQ32B, and EXAONE-Deep-32B while maintaining only half the memory footprint of those alternatives. The Apriel-Nemotron-15B-Thinker model is trained in a four-stage pipeline: 1) base model upscaling, 2) continual pre-training, 3) supervised fine-tuning (SFT), and 4) reinforcement learning using GRPO. Comprehensive evaluations across a diverse suite of benchmarks consistently demonstrate that our Apriel-Nemotron-15B-Thinker model matches or exceeds the performance of its 32-billion-parameter counterparts, despite being less than half their size.
[235] Towards Efficient Prompt-based Continual Learning in Distributed Medical AI
Gyutae Oh, Jitae Shin
Main category: cs.LG
TL;DR: A prompt-based continual learning (PCL) method is proposed to address data-sharing constraints in medical AI, improving accuracy and reducing computational costs.
Details
Motivation: Ethical and institutional barriers limit data sharing in healthcare, making centralized learning impractical. Traditional methods overfit and forget knowledge, while medical data distributions shift.
Method: PCL uses a unified prompt pool with minimal expansion, freezing subsets of prompts to reduce overhead, and a novel regularization term for balancing retention and adaptation.
Result: Experiments on diabetic retinopathy datasets show PCL improves accuracy by 10% and F1-score by 9 points over state-of-the-art methods, with lower inference costs.
Conclusion: PCL enables sustainable medical AI advances, supporting real-time diagnosis and telemedicine in distributed healthcare.
Abstract: Modern AI models achieve state-of-the-art performance with large-scale, high-quality datasets; however, ethical, social, and institutional constraints in the medical domain severely restrict data sharing, rendering centralized learning nearly impossible. Each institution must incrementally update models using only local data. Traditional training overfits new samples and suffers from catastrophic forgetting, losing previously acquired knowledge. Medical data distributions also shift due to varying diagnostic equipment and demographics. Although continual learning (CL) has advanced, most methods address natural images, leaving medical-domain-specific CL underexplored. We propose a prompt-based continual learning (PCL) approach featuring a unified prompt pool with a minimal expansion strategy: by expanding and freezing a subset of prompts, our method reduces computational overhead, and a novel regularization term balances retention and adaptation. Experiments on three diabetic retinopathy datasets (Aptos2019, LI2019, and Diabetic Retinopathy Detection) show that our model improves final classification accuracy by at least 10% and F1-score by 9 points over state-of-the-art approaches while lowering inference cost. We anticipate this study will drive sustainable medical AI advances, enabling real-time diagnosis, patient monitoring, and telemedicine applications in distributed healthcare. Code will be released upon acceptance.
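A minimal sketch of the expand-and-freeze prompt pool described above (pool sizes, prompt dimension, and the freezing schedule are assumptions):

```python
# Expand-and-freeze prompt pool sketch: when a new institution/task arrives,
# a few new prompts are added and all older prompts stop receiving
# gradients, bounding compute and protecting prior knowledge.
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    def __init__(self, dim=768, n_init=4):
        super().__init__()
        self.dim = dim
        self.prompts = nn.ParameterList()
        self.expand(n_init)

    def expand(self, n_new):
        for p in self.prompts:            # freeze everything learned so far
            p.requires_grad_(False)
        for _ in range(n_new):            # minimal expansion for the new task
            self.prompts.append(nn.Parameter(torch.randn(self.dim) * 0.02))

    def forward(self, x):                 # x: (batch, seq_len, dim)
        stack = torch.stack(list(self.prompts))               # (n_prompts, dim)
        stack = stack.unsqueeze(0).expand(x.shape[0], -1, -1)
        return torch.cat([stack, x], dim=1)                   # prepend prompts

pool = PromptPool()
pool.expand(n_new=2)                       # a new institution/task arrives
print(pool(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 22, 768])
trainable = sum(p.requires_grad for p in pool.prompts)
print(f"{trainable} of {len(pool.prompts)} prompts are trainable")  # 2 of 6
```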
[236] Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis
Xinyi Li, Sai Wang, Yutian Lin, Yu Wu, Yi Yang
Main category: cs.LG
TL;DR: Retro-Expert is an interpretable retrosynthesis framework combining LLMs and specialized models via reinforcement learning, outperforming existing methods and providing expert-aligned explanations.
Details
Motivation: Existing retrosynthesis models lack interpretability and logical decision-making, limiting their practical utility.
Method: Combines specialized models for shallow reasoning, LLMs for critical reasoning, and reinforcement learning for decision-policy optimization.
Result: Outperforms both LLM-based and specialized models and provides actionable chemical insights.
Conclusion: Retro-Expert bridges the gap between AI predictions and expert-aligned chemical reasoning.
Abstract: Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on a static pattern-matching paradigm, which limits their capacity for effective logical decision-making and leads to black-box predictions. Building on this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models perform shallow reasoning to construct a high-quality chemical decision space, (2) LLM-driven critical reasoning generates predictions and the corresponding interpretable reasoning path, and (3) reinforcement learning optimizes the interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.
[237] BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
Main category: cs.LG
TL;DR: BeyondWeb is a synthetic data generation framework that outperforms existing synthetic datasets, offering faster training and better performance, while highlighting the complexity of optimizing synthetic data quality.
Details
Motivation: The diminishing returns from scaling data quantity in LLM pretraining and the lack of understanding of synthetic data quality factors motivated the development of BeyondWeb.
Method: BeyondWeb is introduced as a synthetic data generation framework, extending traditional web-scale datasets and optimizing multiple factors for high-quality synthetic data.
Result: BeyondWeb outperforms state-of-the-art synthetic datasets (Cosmopedia and Nemotron-Synth) by up to 5.1pp and 2.6pp, respectively, and trains up to 7.7x faster than open web data.
Conclusion: Generating high-quality synthetic data requires joint optimization of many factors, with BeyondWeb demonstrating transformative improvements over naive approaches.
Abstract: Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC’s high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there’s no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
[238] Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques
Main category: cs.LG
TL;DR: Self-RedTeam introduces an online self-play reinforcement learning method for LM safety alignment, enabling dynamic co-adaptation between attacker and defender agents, outperforming static approaches in attack diversity and robustness.
Details
Motivation: Address the mismatch in conventional LM safety alignment, where reactive patching lags behind emerging threats, by enabling proactive co-evolution of attackers and defenders.
Method: Uses a two-player zero-sum game framework with alternating attacker and defender roles, adjudicated by a reward LM, and incorporates hidden Chain-of-Thought for private planning.
Result: Achieves +21.8% SBERT in attack diversity and +65.5% robustness on WildJailBreak compared to static methods.
Conclusion: Proposes a shift from reactive patching to proactive co-evolution, enabling scalable and autonomous LM safety improvement via MARL.
Abstract: Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch – attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles – generating adversarial prompts and safeguarding against them – while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
[239] Match & Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models
Basile Lewandowski, Robert Birke, Lydia Y. Chen
Main category: cs.LG
TL;DR: The paper introduces M&C, a framework for selecting the best pretrained text-to-image (T2I) model for fine-tuning on a target dataset without exhaustive testing.
Details
Motivation: Public pretrained T2I models are widely available, but selecting the best one for fine-tuning is challenging due to lack of performance indicators.
Method: M&C uses a matching graph with nodes for models/datasets and edges for performance/similarity, along with a predictive model leveraging graph embeddings.
Result: M&C predicts the best fine-tuning model 61.3% of the time and a close alternative otherwise, outperforming baselines.
Conclusion: M&C effectively addresses the model selection problem for T2I fine-tuning, improving efficiency and performance.
Abstract: Text-to-image (T2I) models based on diffusion and transformer architectures are advancing rapidly. They are often pretrained on large corpora and openly shared on model platforms such as HuggingFace. Users can then build AI applications, e.g., for generating media content, by adopting pretrained T2I models and fine-tuning them on the target dataset. While public pretrained T2I models facilitate the democratization of the models, users face a new challenge: which model can best be fine-tuned for the target data domain? Model selection is well addressed in classification tasks, but little is known for (pretrained) T2I models and their performance indication on the target domain. In this paper, we propose the first model selection framework, M&C, which enables users to efficiently choose a pretrained T2I model from a model platform without exhaustively fine-tuning them all on the target dataset. The core of M&C is a matching graph, which consists of: (i) nodes of available models and profiled datasets, and (ii) edges of model-data and data-data pairs capturing fine-tuning performance and data similarity, respectively. We then build a model that, based on model/data features and, critically, graph-embedding features extracted from the matching graph, predicts which model will achieve the best quality after fine-tuning for the target domain. We evaluate M&C on choosing across ten T2I models for 32 datasets against three baselines. Our results show that M&C successfully predicts the best model for fine-tuning in 61.3% of the cases and a closely performing model in the rest.
[240] CURE: Critical-Token-Guided Re-concatenation for Entropy-collapse Prevention
Qingbin Li, Rongkun Xue, Jie Wang, Ming Zhou, Zhi Li, Xiaofeng Ji, Yongqi Wang, Miao Liu, Zheming Yang, Minghui Qiu, Jing Yang
Main category: cs.LG
TL;DR: CURE introduces a two-stage framework to prevent entropy collapse in RLVR, balancing exploration and exploitation for improved performance in LLMs.
Details
Motivation: Address the issue of low diversity and entropy collapse in prior RLVR pipelines due to static initial-state sampling.
Method: CURE uses a two-stage approach: high-entropy critical token regeneration for exploration and static sampling for exploitation.
Result: Achieves a 5% performance gain on math benchmarks and maintains high entropy.
Conclusion: CURE outperforms other RLVR methods, establishing state-of-the-art performance in accuracy and entropy.
Abstract: Recent advances in Reinforcement Learning with Verified Reward (RLVR) have driven the emergence of more sophisticated cognitive behaviors in large language models (LLMs), thereby enhancing their reasoning capabilities. However, in prior RLVR pipelines, the repeated use of static initial-state sampling drawn exactly from the dataset distribution during each sampling phase produced overly deterministic, low-diversity model behavior, which manifested as rapid entropy collapse and hindered sustained performance gains during prolonged training. To address this issue, we introduce CURE (Critical-token-gUided Re-concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. Specifically, in the first stage, to deliberately steer the model toward novel yet coherent contexts, we re-generate at high-entropy critical tokens and jointly optimize the original and the branched trajectories. Comparison with vanilla DAPO shows that this regeneration process achieves better performance on math reasoning tasks while sustaining a high level of entropy for exploration. In the second stage, we continue training with static initial-state sampling by DAPO, intentionally placing the model in a familiar state to gradually strengthen exploitation. Extensive experiments on Qwen-2.5-Math-7B show that, compared to other RLVR methods, CURE achieves a 5% performance gain across six math benchmarks, establishing state-of-the-art performance in both entropy and accuracy. A series of experiments further validate the effectiveness of our approach. Code is available at https://github.com/CURE-Project/CURE.
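The critical-token step can be sketched directly: compute per-position entropy of the policy's token distribution and branch at the most uncertain positions. The top-k selection and toy logits below are assumptions for illustration.

```python
# Critical-token selection sketch: per-position entropy of the policy's
# next-token distribution marks where a rollout is most uncertain; those
# positions are where re-generation would start new branches.
import torch
import torch.nn.functional as F

def token_entropies(logits):
    """logits: (seq_len, vocab) -> per-token entropy in nats."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def critical_positions(logits, k=2):
    return torch.topk(token_entropies(logits), k=k).indices.sort().values

torch.manual_seed(0)
rollout_logits = torch.randn(12, 50)       # one sampled trajectory (toy)
print(critical_positions(rollout_logits))  # high-entropy branch points
# Re-concatenation: keep the prefix up to each position, then resample the
# continuation to obtain diverse-but-coherent trajectories for training.
```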
[241] Quantization vs Pruning: Insights from the Strong Lottery Ticket Hypothesis
Aakash Kumar, Emanuele Natale
Main category: cs.LG
TL;DR: The paper extends the Strong Lottery Ticket Hypothesis (SLTH) to quantized neural networks, proving exact representation of discrete networks and optimal bounds on overparameterization.
Details
Motivation: To bridge the gap in theoretical understanding of quantization in neural networks, especially for low-precision networks, by leveraging insights from the Random Subset Sum Problem.
Method: Builds on Borgs et al.’s work on the Number Partitioning Problem to derive new theoretical results for the Random Subset Sum Problem in a quantized setting, then extends SLTH to finite-precision networks.
Result: Demonstrates exact representation of discrete neural networks in the quantized setting and proves optimal bounds on the required overparameterization.
Conclusion: The work successfully extends SLTH to quantized networks, providing theoretical foundations for efficient low-precision network design.
Abstract: Quantization is an essential technique for making neural networks more efficient, yet our theoretical understanding of it remains limited. Previous works demonstrated that extremely low-precision networks, such as binary networks, can be constructed by pruning large, randomly-initialized networks, and showed that the ratio between the size of the original and the pruned networks is at most polylogarithmic. The specific pruning method they employed inspired a line of theoretical work known as the Strong Lottery Ticket Hypothesis (SLTH), which leverages insights from the Random Subset Sum Problem. However, these results primarily address the continuous setting and cannot be applied to extend SLTH results to the quantized setting. In this work, we build on foundational results by Borgs et al. on the Number Partitioning Problem to derive new theoretical results for the Random Subset Sum Problem in a quantized setting. Using these results, we then extend the SLTH framework to finite-precision networks. While prior work on SLTH showed that pruning allows approximation of a certain class of neural networks, we demonstrate that, in the quantized setting, the analogous class of target discrete neural networks can be represented exactly, and we prove optimal bounds on the necessary overparameterization of the initial network as a function of the precision of the target network.
[242] Zono-Conformal Prediction: Zonotope-Based Uncertainty Quantification for Regression and Classification Tasks
Laura Lützow, Michael Eichelbeck, Mykel J. Kochenderfer, Matthias Althoff
Main category: cs.LG
TL;DR: Zono-conformal prediction introduces prediction zonotopes for efficient, data-light uncertainty quantification with coverage guarantees, outperforming interval-based methods.
Details
Motivation: Current conformal prediction methods are computationally expensive, data-intensive, and limited to intervals, failing to capture multi-dimensional dependencies.
Method: Zono-conformal prediction uses zonotopic uncertainty sets integrated into the base predictor, identified via a single linear program, applicable to nonlinear predictors like neural networks.
Result: Zono-conformal predictors are less conservative than interval-based methods while maintaining similar test coverage, with probabilistic guarantees and outlier detection.
Conclusion: Zono-conformal prediction offers a computationally efficient, data-light alternative to traditional conformal prediction, with improved flexibility and performance.
Abstract: Conformal prediction is a popular uncertainty quantification method that augments a base predictor with prediction sets with statistically valid coverage guarantees. However, current methods are often computationally expensive and data-intensive, as they require constructing an uncertainty model before calibration. Moreover, existing approaches typically represent the prediction sets with intervals, which limits their ability to capture dependencies in multi-dimensional outputs. We address these limitations by introducing zono-conformal prediction, a novel approach inspired by interval predictor models and reachset-conformant identification that constructs prediction zonotopes with assured coverage. By placing zonotopic uncertainty sets directly into the model of the base predictor, zono-conformal predictors can be identified via a single, data-efficient linear program. While we can apply zono-conformal prediction to arbitrary nonlinear base predictors, we focus on feed-forward neural networks in this work. Aside from regression tasks, we also construct optimal zono-conformal predictors in classification settings where the output of an uncertain predictor is a set of possible classes. We provide probabilistic coverage guarantees and present methods for detecting outliers in the identification data. In extensive numerical experiments, we show that zono-conformal predictors are less conservative than interval predictor models and standard conformal prediction methods, while achieving a similar coverage over the test data.
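For readers unfamiliar with the set representation: a zonotope is the affine image of a unit box, {c + G xi : ||xi||_inf <= 1}, so checking whether an observed output falls inside a prediction zonotope reduces to a small feasibility LP. A toy sketch (our illustration, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linprog

def in_zonotope(p, center, G):
    """Check whether p lies in {center + G @ xi : ||xi||_inf <= 1} by
    solving a small feasibility LP (zero objective)."""
    m = G.shape[1]
    res = linprog(c=np.zeros(m), A_eq=G, b_eq=p - center,
                  bounds=[(-1, 1)] * m, method="highs")
    return res.status == 0   # feasible -> p is covered by the prediction set

# A 2-D prediction zonotope with three generators: unlike an axis-aligned
# interval box, it can capture correlated uncertainty across outputs.
center = np.array([1.0, 0.0])
G = np.array([[0.5, 0.2, 0.0],
              [0.0, 0.3, 0.4]])
print(in_zonotope(np.array([1.2, 0.1]), center, G))   # True
print(in_zonotope(np.array([3.0, 3.0]), center, G))   # False
```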
[243] Learning with Confidence
Oliver Ethan Richardson
Main category: cs.LG
TL;DR: The paper formalizes the concept of ‘confidence’ in learning and belief updates, distinguishing it from probability or likelihood. It provides axiomatic foundations, measurement methods, and representations for confidence-based learning, including vector fields and loss functions.
Details
Motivation: To clarify and formalize the often-misunderstood concept of confidence in learning, distinguishing it from related probabilistic measures like probability or likelihood.
Method: The paper axiomatizes confidence, proposes two canonical ways to measure it, and proves its representability. It also derives compact representations using vector fields and loss functions under additional assumptions.
Result: Confidence can be formally represented and measured, and Bayes Rule emerges as a special case of an optimizing learner with linear expectation loss.
Conclusion: The work provides a rigorous framework for understanding confidence in learning, unifying various existing concepts and offering new representations for confidence-based belief updates.
Abstract: We characterize a notion of confidence that arises in learning or updating beliefs: the amount of trust one has in incoming information and its impact on the belief state. This learner’s confidence can be used alongside (and is easily mistaken for) probability or likelihood, but it is fundamentally a different concept – one that captures many familiar concepts in the literature, including learning rates and number of training epochs, Shafer’s weight of evidence, and Kalman gain. We formally axiomatize what it means to learn with confidence, give two canonical ways of measuring confidence on a continuum, and prove that confidence can always be represented in this way. Under additional assumptions, we derive more compact representations of confidence-based learning in terms of vector fields and loss functions. These representations induce an extended language of compound “parallel” observations. We characterize Bayes Rule as the special case of an optimizing learner whose loss representation is a linear expectation.
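To make the final claim concrete, the standard Gibbs variational characterization of Bayesian updating, sketched below in our own notation (not necessarily the paper's), shows how a confidence parameter can scale the loss term, with Bayes' rule recovered at a particular confidence level:

```latex
% Gibbs variational sketch (our notation): belief update as optimization,
% with confidence eta scaling a linear-expectation loss against a KL
% "cost of moving" from the prior pi.
\[
  q^{*} \;=\; \operatorname*{arg\,min}_{q}\;
  \eta\,\mathbb{E}_{q(\theta)}\!\bigl[-\log p(x \mid \theta)\bigr]
  \;+\; \mathrm{KL}\!\left(q \,\middle\|\, \pi\right)
\]
% At eta = 1 the minimizer is the Bayes posterior,
% q^*(theta) \propto pi(theta) p(x | theta); other values of eta act as a
% learning rate, i.e., a confidence placed in the observation x.
```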
[244] Conditional Independence Estimates for the Generalized Nonparanormal
Ujas Shah, Manuel Lladser, Rebecca Morrison
Main category: cs.LG
TL;DR: The paper shows that for a class of non-Gaussian distributions (generalized nonparanormal), conditional independence structure can still be inferred from the precision matrix under certain conditions, similar to Gaussian cases. It also provides an efficient algorithm for this purpose.
Details
Motivation: To extend the understanding of independence structures beyond Gaussian distributions, particularly for non-Gaussian cases derived from diagonal transformations of Gaussians.
Method: The paper introduces the generalized nonparanormal class and proposes a computationally efficient algorithm to infer conditional independence from such data.
Result: The algorithm effectively recovers conditional independence structures, validated by synthetic experiments and real-world data applications.
Conclusion: The generalized nonparanormal framework and proposed algorithm successfully extend conditional independence inference to a broader class of non-Gaussian distributions.
Abstract: For general non-Gaussian distributions, the covariance and precision matrices do not encode the independence structure of the variables, as they do for the multivariate Gaussian. This paper builds on previous work to show that for a class of non-Gaussian distributions – those derived from diagonal transformations of a Gaussian – information about the conditional independence structure can still be inferred from the precision matrix, provided the data meet certain criteria, analogous to the Gaussian case. We call such transformations of the Gaussian the generalized nonparanormal. The functions that define these transformations are, in a broad sense, arbitrary. We also provide a simple and computationally efficient algorithm that leverages this theory to recover conditional independence structure from generalized nonparanormal data. The effectiveness of the proposed algorithm is demonstrated via synthetic experiments and applications to real-world data.
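A toy illustration of the underlying phenomenon (our sketch, not the paper's algorithm): data generated by marginal monotone transformations of a sparse-precision Gaussian can be rank-Gaussianized, after which zeros of the estimated precision matrix again reflect conditional independence.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)

# Ground-truth sparse precision: X1-X2 and X2-X3 dependent, X1 _||_ X3 | X2.
Theta = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])
Z = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Theta), size=20000)

# Diagonal (marginal) monotone transformations -> non-Gaussian data.
X = np.column_stack([Z[:, 0]**3, np.exp(Z[:, 1]), Z[:, 2]**3])

# Normal-scores transform: map each margin back to Gaussian via ranks.
U = (rankdata(X, axis=0) - 0.5) / X.shape[0]
Z_hat = norm.ppf(U)

P = np.linalg.inv(np.cov(Z_hat, rowvar=False))
print(np.round(P, 2))   # the (1,3) entry is near zero: X1 _||_ X3 | X2
```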
[245] SHLIME: Foiling adversarial attacks fooling SHAP and LIME
Sam Chauhan, Estelle Duguet, Karthik Ramakrishnan, Hugh Van Deventer, Jack Kruger, Ranjan Subbaraman
Main category: cs.LG
TL;DR: The paper investigates the vulnerability of LIME and SHAP to adversarial manipulation and evaluates strategies to improve their robustness in detecting biases in black-box models.
Details
Motivation: Post hoc explanation methods like LIME and SHAP are widely used but can be manipulated to hide biases, posing risks in high-stakes applications.
Method: The study replicates the COMPAS experiment, introduces a modular testing framework, and evaluates ensemble configurations of LIME/SHAP for robustness against bias concealment.
Result: Certain ensemble configurations significantly improve bias detection compared to original methods.
Conclusion: Enhanced LIME/SHAP configurations can improve transparency in deploying high-stakes machine learning systems.
Abstract: Post hoc explanation methods, such as LIME and SHAP, provide interpretable insights into black-box classifiers and are increasingly used to assess model biases and generalizability. However, these methods are vulnerable to adversarial manipulation, potentially concealing harmful biases. Building on the work of Slack et al. (2020), we investigate the susceptibility of LIME and SHAP to biased models and evaluate strategies for improving robustness. We first replicate the original COMPAS experiment to validate prior findings and establish a baseline. We then introduce a modular testing framework enabling systematic evaluation of augmented and ensemble explanation approaches across classifiers of varying performance. Using this framework, we assess multiple LIME/SHAP ensemble configurations on out-of-distribution models, comparing their resistance to bias concealment against the original methods. Our results identify configurations that substantially improve bias detection, highlighting their potential for enhancing transparency in the deployment of high-stakes machine learning systems.
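One way an ensemble can resist bias concealment is rank aggregation across explainer runs; the sketch below is our simplification with synthetic attribution vectors, not the paper's framework.

```python
import numpy as np

def ensemble_rank(attributions):
    """Aggregate feature attributions from several explainer runs (e.g.,
    LIME and SHAP under different perturbation seeds) by averaging ranks;
    rank 0 = most important. An adversarial model must fool the whole
    ensemble to hide a sensitive feature from the consensus ranking."""
    ranks = [np.argsort(np.argsort(-np.abs(a))) for a in attributions]
    return np.mean(ranks, axis=0)

# Three honest runs agree feature 0 dominates; one run is fooled.
attr = [np.array([0.90, 0.10, 0.05]),
        np.array([0.80, 0.15, 0.10]),
        np.array([0.85, 0.20, 0.10]),
        np.array([0.05, 0.70, 0.20])]   # adversarially perturbed run
print(ensemble_rank(attr))   # [0.5, 0.75, 1.75]; feature 0 keeps the best rank
```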
[246] Abundance-Aware Set Transformer for Microbiome Sample Embedding
Hyunwoo Yoo, Gail Rosen
Main category: cs.LG
TL;DR: The paper proposes an abundance-aware Set Transformer method for microbiome sample representation, outperforming traditional averaging and unweighted methods in classification tasks.
Details
Motivation: Prior methods for microbiome sample representation often ignore taxa abundance, which is biologically significant. This work aims to incorporate abundance information for more robust embeddings.
Method: An abundance-aware Set Transformer is used to weight sequence embeddings by their relative abundance, applying self-attention-based aggregation without altering the model architecture.
Result: The method outperforms average pooling and unweighted Set Transformers in microbiome classification tasks, sometimes achieving perfect performance.
Conclusion: The approach demonstrates the value of abundance-aware aggregation for biologically informed microbiome representation, marking a novel integration of abundance into Transformer-based embeddings.
Abstract: Representing microbiome samples as inputs to LLMs is essential for downstream tasks such as phenotype prediction and environmental classification. While prior studies have explored embedding-based representations of each microbiome sample, most rely on simple averaging over sequence embeddings, often overlooking the biological importance of taxa abundance. In this work, we propose an abundance-aware variant of the Set Transformer to construct fixed-size sample-level embeddings by weighting sequence embeddings according to their relative abundance. Without modifying the model architecture, we replicate embedding vectors proportional to their abundance and apply self-attention-based aggregation. Our method outperforms average pooling and unweighted Set Transformers on real-world microbiome classification tasks, achieving perfect performance in some cases. These results demonstrate the utility of abundance-aware aggregation for robust and biologically informed microbiome representation. To the best of our knowledge, this is one of the first approaches to integrate sequence-level abundance into Transformer-based sample embeddings.
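A minimal PyTorch sketch of the core idea. For a single attention-pooling head, adding log relative abundance to the attention logits is mathematically equivalent to the paper's replication of embeddings in proportion to abundance (the full Set Transformer has more machinery; names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

class AbundancePool(nn.Module):
    """Attention pooling where each sequence embedding's logit is offset
    by its log relative abundance. Attention over a multiset with c copies
    of an embedding weights it by c * exp(score), i.e., score + log c, so
    this offset mimics abundance-proportional replication without
    enlarging the set."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))

    def forward(self, emb: torch.Tensor, abundance: torch.Tensor) -> torch.Tensor:
        # emb: (n_taxa, dim); abundance: (n_taxa,) summing to 1
        logits = emb @ self.query + torch.log(abundance + 1e-12)
        weights = torch.softmax(logits, dim=0)
        return weights @ emb               # fixed-size sample embedding (dim,)

pool = AbundancePool(dim=64)
taxa_emb = torch.randn(120, 64)                   # hypothetical sequence embeddings
abund = torch.rand(120); abund /= abund.sum()     # relative abundances
sample_vec = pool(taxa_emb, abund)
```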
[247] A Feasibility Experiment on the Application of Predictive Coding to Instant Messaging Corpora
Thanasis Schoinas, Ghulam Qadir
Main category: cs.LG
TL;DR: The paper proposes a cost-effective predictive coding solution for instant messages using data grouping, feature selection, logistic regression, and dimensionality reduction, tested on Instant Bloomberg data.
Details
Motivation: Addressing the challenges of classifying informal and small-sized instant messages in legal document classification.
Method: Group messages into day chats, apply feature selection and logistic regression, and use dimensionality reduction focused on quantitative features.
Result: Improved baseline model performance and demonstrated cost savings.
Conclusion: The workflow provides an economically feasible and effective solution for predictive coding in instant messages.
Abstract: Predictive coding, the term used in the legal industry for document classification using machine learning, presents additional challenges when the dataset comprises instant messages, due to their informal nature and smaller sizes. In this paper, we exploit a data management workflow to group messages into day chats, followed by feature selection and a logistic regression classifier to provide an economically feasible predictive coding solution. We also improve the solution’s baseline model performance by dimensionality reduction, with focus on quantitative features. We test our methodology on an Instant Bloomberg dataset, rich in quantitative information. In parallel, we provide an example of the cost savings of our approach.
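A minimal sklearn sketch of the described workflow, with hypothetical column names: group messages into day chats, then vectorize, reduce dimensionality, and classify.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Hypothetical message table: chat_id, date, text, responsive (the label).
msgs = pd.DataFrame({
    "chat_id": ["a", "a", "b", "b"],
    "date": ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"],
    "text": ["px 101.2 confirmed", "thanks, booked",
             "lunch later?", "did you confirm the px?"],
    "responsive": [1, 1, 0, 1],
})

# Group messages into day chats so each training unit has enough text.
day_chats = (msgs.groupby(["chat_id", "date"])
                 .agg(text=("text", " ".join), label=("responsive", "max"))
                 .reset_index())

clf = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=1)),
    ("svd", TruncatedSVD(n_components=2)),   # dimensionality reduction step
    ("lr", LogisticRegression()),
])
clf.fit(day_chats["text"], day_chats["label"])
```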
[248] Relative Advantage Debiasing for Watch-Time Prediction in Short-Video Recommendation
Emily Liu, Kuan Han, Minfeng Zhan, Bocheng Zhao, Guanyu Mu, Yang Song
Main category: cs.LG
TL;DR: A framework to debias watch time in video recommendations by comparing it to reference distributions, improving accuracy and robustness.
Details
Motivation: Raw watch times are influenced by confounding factors, distorting preference signals and causing biased recommendations.
Method: A two-stage architecture separates distribution estimation from preference learning, using quantile-based signals and distributional embeddings.
Result: Offline and online experiments show significant improvements in recommendation accuracy and robustness.
Conclusion: The proposed framework effectively addresses biases in watch time, enhancing recommendation quality.
Abstract: Watch time is widely used as a proxy for user satisfaction in video recommendation platforms. However, raw watch times are influenced by confounding factors such as video duration, popularity, and individual user behaviors, potentially distorting preference signals and resulting in biased recommendation models. We propose a novel relative advantage debiasing framework that corrects watch time by comparing it to empirically derived reference distributions conditioned on user and item groups. This approach yields a quantile-based preference signal and introduces a two-stage architecture that explicitly separates distribution estimation from preference learning. Additionally, we present distributional embeddings to efficiently parameterize watch-time quantiles without requiring online sampling or storage of historical data. Both offline and online experiments demonstrate significant improvements in recommendation accuracy and robustness compared to existing baseline methods.
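The quantile-based preference signal is easy to state concretely: rank each watch time within its reference group. A pandas sketch with hypothetical grouping keys:

```python
import pandas as pd

# Hypothetical logs: watch_time plus the keys used to build reference
# distributions (e.g., a duration bucket and a user activity level).
logs = pd.DataFrame({
    "user_group": ["heavy", "heavy", "light", "light", "light"],
    "duration_bucket": ["short", "short", "short", "long", "long"],
    "watch_time": [12.0, 45.0, 30.0, 200.0, 60.0],
})

# Relative-advantage label: the quantile of each watch time within its
# (user_group, duration_bucket) reference distribution. A value of 0.9
# means the user watched longer than 90% of comparable impressions,
# regardless of how long or popular the video itself is.
logs["preference"] = (logs.groupby(["user_group", "duration_bucket"])["watch_time"]
                          .rank(pct=True))
print(logs)
```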
[249] Compressive Meta-Learning
Daniel Mas Montserrat, David Bonet, Maria Perera, Xavier Giró-i-Nieto, Alexander G. Ioannidis
Main category: cs.LG
TL;DR: The paper introduces a Compressive Meta-Learning framework that improves efficiency and accuracy in parameter-learning by meta-learning encoding and decoding stages using neural networks.
Details
Motivation: The need for fast, efficient, and privacy-friendly parameter-learning techniques due to large-scale datasets, and the limitations of current randomized, data-independent compressive learning methods.
Method: Proposes a framework that meta-learns encoding and decoding stages using neural networks, applied to tasks like compressive PCA, ridge regression, k-means, and autoencoders.
Result: The framework provides faster and more accurate systems than current state-of-the-art approaches.
Conclusion: The Compressive Meta-Learning framework demonstrates significant potential for improving compressive learning methods across various applications.
Abstract: The rapid expansion in the size of new datasets has created a need for fast and efficient parameter-learning techniques. Compressive learning is a framework that enables efficient processing by using random, non-linear features to project large-scale databases onto compact, information-preserving representations whose dimensionality is independent of the number of samples and can be easily stored, transferred, and processed. These database-level summaries are then used to decode parameters of interest from the underlying data distribution without requiring access to the original samples, offering an efficient and privacy-friendly learning framework. However, both the encoding and decoding techniques are typically randomized and data-independent, failing to exploit the underlying structure of the data. In this work, we propose a framework that meta-learns both the encoding and decoding stages of compressive learning methods by using neural networks that provide faster and more accurate systems than the current state-of-the-art approaches. To demonstrate the potential of the presented Compressive Meta-Learning framework, we explore multiple applications – including neural network-based compressive PCA, compressive ridge regression, compressive k-means, and autoencoders.
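For context, a classical (random, data-independent) compressive-learning encoder is just an averaged random-feature map; the paper's contribution is to replace this fixed encoder and the matching decoder with meta-learned networks. A sketch of the classical baseline:

```python
import numpy as np

rng = np.random.default_rng(1)

def compressive_sketch(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Classical compressive-learning encoder: average random Fourier
    features over the dataset. The sketch size is independent of the
    number of samples; Compressive Meta-Learning replaces both the random
    projection W and the decoder (sketch -> parameters) with learned nets."""
    proj = X @ W.T                                                # (n, m)
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=1)  # (n, 2m)
    return feats.mean(axis=0)                                     # (2m,) summary

X = rng.normal(size=(10_000, 8))   # dataset to be summarized
W = rng.normal(size=(64, 8))       # random frequencies -> 128-dim sketch
z = compressive_sketch(X, W)
# z can be stored or shared instead of X; parameters of interest (e.g.,
# k-means centroids) are then decoded from z without touching raw samples.
```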
[250] Predictive Multimodal Modeling of Diagnoses and Treatments in EHR
Cindy Shih-Ting Huang, Clarence Boon Liang Ng, Marek Rei
Main category: cs.LG
TL;DR: The paper proposes a multimodal system for early forecasting of ICD codes using clinical notes and tabular data, outperforming current methods.
Details
Motivation: Early forecasting of ICD codes can improve health risk identification, treatment suggestions, and resource allocation.
Method: A multimodal system fuses clinical notes and tabular events using pre-trained encoders, feature pooling, cross-modal attention, and a weighted temporal loss.
Result: The model outperforms state-of-the-art systems in early prediction of ICD codes.
Conclusion: The proposed multimodal approach effectively enhances early forecasting of ICD codes.
Abstract: While the ICD code assignment problem has been widely studied, most works have focused on post-discharge document classification. Models for early forecasting of this information could be used for identifying health risks, suggesting effective treatments, or optimizing resource allocation. To address the challenge of predictive modeling using the limited information at the beginning of a patient stay, we propose a multimodal system to fuse clinical notes and tabular events captured in electronic health records. The model integrates pre-trained encoders, feature pooling, and cross-modal attention to learn optimal representations across modalities and balance their presence at every temporal point. Moreover, we present a weighted temporal loss that adjusts its contribution at each point in time. Experiments show that these strategies enhance the early prediction model, outperforming the current state-of-the-art systems.
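A sketch of what a weighted temporal loss can look like (our guess at one plausible form, not the paper's exact formula):

```python
import torch
import torch.nn.functional as F

def weighted_temporal_loss(logits_per_t, targets, weights):
    """Multi-label BCE evaluated at every prediction time, scaled by a
    per-timestep weight so that, e.g., later windows with more observed
    data contribute more (or less) to training.

    logits_per_t: (T, n_codes) predictions at T points in the stay
    targets:      (n_codes,) ground-truth ICD indicator vector
    weights:      (T,) non-negative per-timestep weights
    """
    per_t = torch.stack([
        F.binary_cross_entropy_with_logits(logits_per_t[t], targets)
        for t in range(logits_per_t.shape[0])
    ])
    return (weights * per_t).sum() / weights.sum()

logits = torch.randn(6, 50)            # 6 temporal points, 50 ICD codes
y = (torch.rand(50) > 0.9).float()
w = torch.linspace(0.5, 1.5, 6)        # hypothetical upweighting of later steps
loss = weighted_temporal_loss(logits, y, w)
```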
[251] Hybrid-Hierarchical Fashion Graph Attention Network for Compatibility-Oriented and Personalized Outfit Recommendation
Sajjad Saed, Babak Teimourpour
Main category: cs.LG
TL;DR: FGAT, a new framework using graph neural networks and attention mechanisms, improves fashion recommendations by modeling outfit compatibility and user preferences together.
Details
Motivation: The fashion industry's growth makes it hard for users to find compatible items. Existing systems treat outfit compatibility and personalized recommendations separately, missing key interactions.
Method: FGAT constructs a hierarchical graph of users, outfits, and items, integrating visual and textual features with a graph attention mechanism to dynamically weight node importance.
Result: FGAT outperforms baseline models like HFGN on the POG dataset, improving precision, HR, recall, NDCG, and accuracy.
Conclusion: Combining multimodal features with hierarchical graphs and attention mechanisms enhances personalized fashion recommendation accuracy and efficiency.
Abstract: The rapid expansion of the fashion industry and the growing variety of products have made it challenging for users to find compatible items on e-commerce platforms. Effective fashion recommendation systems are crucial for filtering irrelevant items and suggesting suitable ones. However, simultaneously addressing outfit compatibility and personalized recommendations remains a significant challenge, as these aspects are often treated independently in existing studies, often overlooking the complex interactions between items and user preferences. This research introduces a new framework named FGAT, inspired by the HFGN model, which leverages graph neural networks and graph attention mechanisms to tackle this issue. The proposed framework constructs a three-tier hierarchical graph of users, outfits, and items, integrating visual and textual features to simultaneously model outfit compatibility and user preferences. A graph attention mechanism dynamically weights node importance during representation propagation, enabling the capture of key interactions and generating precise representations for both user preferences and outfit compatibility. Evaluated on the POG dataset, FGAT outperforms baseline models such as HFGN, achieving improved results in precision, HR, recall, NDCG, and accuracy. These results demonstrate that combining multimodal visual-textual features with a hierarchical graph structure and attention mechanisms significantly enhances the accuracy and efficiency of personalized fashion recommendation systems.
[252] Quantization through Piecewise-Affine Regularization: Optimization and Statistical Guarantees
Jianhao Ma, Lin Xiao
Main category: cs.LG
TL;DR: The paper explores piecewise-affine regularization (PAR) for quantization in supervised learning, showing its effectiveness in overparameterized regimes, deriving proximal mappings for various PARs, and providing statistical guarantees for PAR-regularized linear regression.
Details
Motivation: Addressing the challenges of optimization over discrete or quantized variables by leveraging PAR for flexible modeling and computation.
Method: Theoretical analysis of PAR in overparameterized regimes, derivation of proximal mappings for different PAR types, and application of optimization methods like proximal gradient and ADMM.
Result: Critical points in overparameterized regimes exhibit high quantization; closed-form proximal mappings enable efficient optimization; PAR achieves statistical guarantees similar to classical regularizations.
Conclusion: PAR offers a robust framework for quantization in supervised learning, combining theoretical insights with practical optimization and statistical guarantees.
Abstract: Optimization problems over discrete or quantized variables are very challenging in general due to the combinatorial nature of their search space. Piecewise-affine regularization (PAR) provides a flexible modeling and computational framework for quantization based on continuous optimization. In this work, we focus on the setting of supervised learning and investigate the theoretical foundations of PAR from optimization and statistical perspectives. First, we show that in the overparameterized regime, where the number of parameters exceeds the number of samples, every critical point of the PAR-regularized loss function exhibits a high degree of quantization. Second, we derive closed-form proximal mappings for various (convex, quasi-convex, and non-convex) PARs and show how to solve PAR-regularized problems using the proximal gradient method, its accelerated variant, and the Alternating Direction Method of Multipliers. Third, we study statistical guarantees of PAR-regularized linear regression problems; specifically, we can approximate classical formulations of $\ell_1$-, squared $\ell_2$-, and nonconvex regularizations using PAR and obtain similar statistical guarantees with quantized solutions.
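To see why PAR yields quantized solutions, consider one simple piecewise-affine instance, the scaled distance to a uniform grid; its proximal mapping snaps any value within lambda of a grid point exactly onto it. A numpy sketch (our illustration; the paper derives mappings for a broader family of convex, quasi-convex, and non-convex PARs):

```python
import numpy as np

def prox_grid_par(v, lam, delta):
    """Proximal mapping of the piecewise-affine regularizer
    r(x) = lam * min_k |x - k*delta| (distance to a uniform grid).
    Values within lam of their nearest grid point snap exactly onto it;
    all others shrink toward the grid by lam. Exact snapping is what makes
    critical points of PAR-regularized losses highly quantized."""
    q = np.round(v / delta) * delta          # nearest quantization level
    gap = v - q
    shrunk = v - lam * np.sign(gap)          # move toward the grid
    return np.where(np.abs(gap) <= lam, q, shrunk)

w = np.array([0.48, 0.61, 1.02, -0.27])
print(prox_grid_par(w, lam=0.05, delta=0.5))   # -> [0.5, 0.56, 1.0, -0.32]
```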
[253] CTRL Your Shift: Clustered Transfer Residual Learning for Many Small Datasets
Gauri Jain, Dominik Rothenhäusler, Kirk Bansak, Elisabeth Paulson
Main category: cs.LG
TL;DR: CTRL is a meta-learning method improving accuracy and preserving source-level heterogeneity in ML tasks with diverse data sources.
Details
Motivation: Address challenges like distributional shifts and varying sample sizes across data sources in ML tasks, e.g., asylum resettlement programs.
Method: Combines cross-domain residual learning and adaptive pooling/clustering (CTRL).
Result: Outperforms benchmarks on 5 datasets, including a Swiss asylum program dataset.
Conclusion: CTRL effectively balances data quantity and quality, enhancing prediction reliability across diverse sources.
Abstract: Machine learning (ML) tasks often utilize large-scale data that is drawn from several distinct sources, such as different locations, treatment arms, or groups. In such settings, practitioners often desire predictions that not only exhibit good overall accuracy, but also remain reliable within each source and preserve the differences that matter across sources. For instance, several asylum and refugee resettlement programs now use ML-based employment predictions to guide where newly arriving families are placed within a host country, which requires generating informative and differentiated predictions for many and often small source locations. However, this task is made challenging by several common characteristics of the data in these settings: the presence of numerous distinct data sources, distributional shifts between them, and substantial variation in sample sizes across sources. This paper introduces Clustered Transfer Residual Learning (CTRL), a meta-learning method that combines the strengths of cross-domain residual learning and adaptive pooling/clustering in order to simultaneously improve overall accuracy and preserve source-level heterogeneity. We provide theoretical results that clarify how our objective navigates the trade-off between data quantity and data quality. We evaluate CTRL alongside other state-of-the-art benchmarks on 5 large-scale datasets. This includes a dataset from the national asylum program in Switzerland, where the algorithmic geographic assignment of asylum seekers is currently being piloted. CTRL consistently outperforms the benchmarks across several key metrics and when using a range of different base learners.
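A schematic of the two ingredients, cross-domain residual learning plus source clustering, in sklearn (our simplification with synthetic sources; CTRL's actual objective and clustering rule differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in: many small sources with shifted linear relationships.
n_sources = 30
Xs = [rng.normal(size=(rng.integers(20, 200), 5)) for _ in range(n_sources)]
ys = [X @ np.ones(5) + s * 0.1 + rng.normal(size=len(X))
      for s, X in enumerate(Xs)]

# Step 1: a pooled ("global") model borrows strength across all sources.
global_model = Ridge().fit(np.vstack(Xs), np.concatenate(ys))

# Step 2: cluster sources by a cheap signature (here, mean residual) so
# similar sources share a residual learner instead of each fitting alone.
signatures = np.array([[np.mean(y - global_model.predict(X))]
                       for X, y in zip(Xs, ys)])
cluster_of = KMeans(n_clusters=5, n_init=10).fit_predict(signatures)

# Step 3: per-cluster residual models correct the global fit, preserving
# heterogeneity across sources without starving the small ones.
residual_models = {}
for k in range(5):
    idx = [i for i in range(n_sources) if cluster_of[i] == k]
    Xk = np.vstack([Xs[i] for i in idx])
    rk = np.concatenate([ys[i] - global_model.predict(Xs[i]) for i in idx])
    residual_models[k] = Ridge().fit(Xk, rk)
```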
[254] Towards the Next-generation Bayesian Network Classifiers
Huan Zhang, Daokun Zhang, Kexin Meng, Geoffrey I. Webb
Main category: cs.LG
TL;DR: The paper proposes NeuralKDB, a neural version of KDB, to address limitations of Bayesian network classifiers by learning distributional representations for feature values, improving high-order dependency modeling.
Details
Motivation: Bayesian network classifiers struggle with high-order feature dependency due to parameter explosion and data sparsity, limiting their performance on complex real-world data.
Method: The authors extend KDB into NeuralKDB, using a neural network to learn distributional representations and parameterize conditional probabilities, trained via stochastic gradient descent.
Result: NeuralKDB outperforms conventional Bayesian classifiers and other competitive methods on 60 UCI datasets, excelling in high-order dependency modeling.
Conclusion: NeuralKDB effectively addresses the limitations of Bayesian network classifiers by leveraging distributional representations, demonstrating superior performance.
Abstract: Bayesian network classifiers provide a feasible solution to tabular data classification, with a number of merits like high time and memory efficiency, and great explainability. However, due to the parameter explosion and data sparsity issues, Bayesian network classifiers are restricted to low-order feature dependency modeling, making them struggle in extrapolating the occurrence probabilities of complex real-world data. In this paper, we propose a novel paradigm to design high-order Bayesian network classifiers, by learning distributional representations for feature values, as what has been done in word embedding and graph representation learning. The learned distributional representations are encoded with the semantic relatedness between different features through their observed co-occurrence patterns in training data, which then serve as a hallmark to extrapolate the occurrence probabilities of new test samples. As a classifier design realization, we remake the K-dependence Bayesian classifier (KDB) by extending it into a neural version, i.e., NeuralKDB, where a novel neural network architecture is designed to learn distributional representations of feature values and parameterize the conditional probabilities between interdependent features. A stochastic gradient descent based algorithm is designed to train the NeuralKDB model efficiently. Extensive classification experiments on 60 UCI datasets demonstrate that the proposed NeuralKDB classifier excels in capturing high-order feature dependencies and significantly outperforms the conventional Bayesian network classifiers, as well as other competitive classifiers, including two neural network based classifiers without distributional representation learning.
[255] Mitigating Modality Quantity and Quality Imbalance in Multimodal Online Federated Learning
Heqiang Wang, Weihong Yang, Xiaoxiong Zhong, Jia Zhou, Fangming Liu, Weizhe Zhang
Main category: cs.LG
TL;DR: The paper introduces the Modality Quantity and Quality Rebalanced (QQR) algorithm to address imbalance issues in Multimodal Online Federated Learning (MMO-FL) for IoT data.
Details
Motivation: The need for efficient distributed learning paradigms in IoT due to data heterogeneity and device limitations, coupled with challenges like modality imbalance (QQI).
Method: Proposes the QQR algorithm, a prototype learning-based method, to rebalance modality quantity and quality during training.
Result: QQR outperforms benchmarks on real-world datasets under imbalance conditions.
Conclusion: QQR effectively mitigates modality imbalance in MMO-FL, enhancing learning performance for IoT applications.
Abstract: The Internet of Things (IoT) ecosystem produces massive volumes of multimodal data from diverse sources, including sensors, cameras, and microphones. With advances in edge intelligence, IoT devices have evolved from simple data acquisition units into computationally capable nodes, enabling localized processing of heterogeneous multimodal data. This evolution necessitates distributed learning paradigms that can efficiently handle such data. Furthermore, the continuous nature of data generation and the limited storage capacity of edge devices demand an online learning framework. Multimodal Online Federated Learning (MMO-FL) has emerged as a promising approach to meet these requirements. However, MMO-FL faces new challenges due to the inherent instability of IoT devices, which often results in modality quantity and quality imbalance (QQI) during data collection. In this work, we systematically investigate the impact of QQI within the MMO-FL framework and present a comprehensive theoretical analysis quantifying how both types of imbalance degrade learning performance. To address these challenges, we propose the Modality Quantity and Quality Rebalanced (QQR) algorithm, a prototype learning based method designed to operate in parallel with the training process. Extensive experiments on two real-world multimodal datasets show that the proposed QQR algorithm consistently outperforms benchmarks under modality imbalance conditions with promising learning performance.
[256] A Semi-supervised Generative Model for Incomplete Multi-view Data Integration with Missing Labels
Yiyang Shen, Weiran Wang
Main category: cs.LG
TL;DR: A semi-supervised generative model is proposed to handle missing views and labels in multi-view learning, outperforming existing methods in predictive and imputation tasks.
Details
Motivation: Multi-view learning often faces missing views and labels, and prior probabilistic approaches lack the ability to leverage unlabeled data.
Method: The proposed model maximizes the likelihood of unlabeled samples and uses cross-view mutual information maximization in a shared latent space.
Result: The model achieves superior predictive and imputation performance on image and multi-omics data with missing views and limited labels.
Conclusion: The semi-supervised approach effectively integrates labeled and unlabeled data, improving performance in multi-view learning scenarios.
Abstract: Multi-view learning is widely applied to real-life datasets, such as multiple omics biological data, but it often suffers from both missing views and missing labels. Prior probabilistic approaches addressed the missing view problem by using a product-of-experts scheme to aggregate representations from present views and achieved superior performance over deterministic classifiers, using the information bottleneck (IB) principle. However, the IB framework is inherently fully supervised and cannot leverage unlabeled data. In this work, we propose a semi-supervised generative model that utilizes both labeled and unlabeled samples in a unified framework. Our method maximizes the likelihood of unlabeled samples to learn a latent space shared with the IB on labeled data. We also perform cross-view mutual information maximization in the latent space to enhance the extraction of shared information across views. Compared to existing approaches, our model achieves better predictive and imputation performance on both image and multi-omics data with missing views and limited labeled samples.
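The product-of-experts aggregation the abstract refers to has a closed form for Gaussian posteriors: precisions add and means are precision-weighted, and absent views are simply dropped. A numpy sketch (ignoring the prior expert):

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Product-of-experts fusion of Gaussian posteriors from the views
    that are present: precisions add, and the fused mean is
    precision-weighted. Missing views are simply omitted, which is what
    makes PoE a natural fit for incomplete multi-view data."""
    precisions = [np.exp(-lv) for lv in logvars]
    total_prec = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total_prec
    return mu, -np.log(total_prec)     # fused mean and log-variance

# Two of three views observed for this sample:
mu_fused, logvar_fused = poe_fuse(
    mus=[np.array([0.2, 1.0]), np.array([0.4, 0.8])],
    logvars=[np.array([0.0, 0.0]), np.array([-1.0, 0.0])],
)
```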
[257] SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning
Weijian Mai, Jiamin Wu, Yu Zhu, Zhouheng Yao, Dongzhan Zhou, Andrew F. Luo, Qihao Zheng, Wanli Ouyang, Chunfeng Song
Main category: cs.LG
TL;DR: SynBrain is a generative framework for modeling the probabilistic transformation of visual stimuli to neural responses, outperforming existing methods in fMRI synthesis and adaptability.
Details
Motivation: Existing deterministic methods fail to model biological variability while maintaining functional consistency in visual-to-neural mapping.
Method: SynBrain uses BrainVAE for probabilistic neural representations and a Semantic-to-Neural Mapper for high-fidelity fMRI synthesis.
Result: SynBrain excels in subject-specific encoding, few-shot adaptation, and improves fMRI-to-image decoding.
Conclusion: SynBrain captures biological variability and functional consistency, offering interpretable insights into neural responses.
Abstract: Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. The code will be made publicly available.
[258] Quantum-Boosted High-Fidelity Deep Learning
Feng-ao Wang, Shaobo Chen, Yao Xuan, Junwei Liu, Qi Gao, Hongdong Zhu, Junjie Hou, Lixin Yuan, Jinyu Cheng, Chenxin Yi, Hai Wei, Yin Ma, Tao Xu, Kai Wen, Yixue Li
Main category: cs.LG
TL;DR: The paper introduces the Quantum Boltzmann Machine-Variational Autoencoder (QBM-VAE), a hybrid quantum-classical model that outperforms Gaussian-based deep learning models in capturing complex biological data structures.
Details
Motivation: Current probabilistic deep learning relies on simplistic Gaussian priors, which fail to capture complex, non-Gaussian data landscapes, especially in demanding domains like biology.
Method: The QBM-VAE leverages a quantum processor for efficient sampling from the Boltzmann distribution, integrating it as a prior in a deep generative model.
Result: Applied to million-scale single-cell datasets, QBM-VAE outperforms Gaussian-based models (VAE, SCVI) in tasks like omics data integration, cell-type classification, and trajectory inference.
Conclusion: The work demonstrates a practical quantum advantage in deep learning and provides a blueprint for hybrid quantum AI models, enhancing scientific discovery capabilities.
Abstract: A fundamental limitation of probabilistic deep learning is its predominant reliance on Gaussian priors. This simplistic assumption prevents models from accurately capturing the complex, non-Gaussian landscapes of natural data, particularly in demanding domains like complex biological data, severely hindering the fidelity of the model for scientific discovery. The physically-grounded Boltzmann distribution offers a more expressive alternative, but it is computationally intractable on classical computers. To date, quantum approaches have been hampered by the insufficient qubit scale and operational stability required for the iterative demands of deep learning. Here, we bridge this gap by introducing the Quantum Boltzmann Machine-Variational Autoencoder (QBM-VAE), a large-scale and long-time stable hybrid quantum-classical architecture. Our framework leverages a quantum processor for efficient sampling from the Boltzmann distribution, enabling its use as a powerful prior within a deep generative model. Applied to million-scale single-cell datasets from multiple sources, the QBM-VAE generates a latent space that better preserves complex biological structures, consistently outperforming conventional Gaussian-based deep learning models like VAE and SCVI in essential tasks such as omics data integration, cell-type classification, and trajectory inference. It also provides a typical example of introducing a physics prior into deep learning to drive the model toward scientific discovery capabilities that break through data limitations. This work provides a demonstration of a practical quantum advantage in deep learning on a large-scale scientific problem and offers a transferable blueprint for developing hybrid quantum AI models.
[259] Meta-learning Structure-Preserving Dynamics
Cheng Jing, Uvini Balasuriya Mudiyanselage, Woojin Cho, Minju Jo, Anthony Gruber, Kookjin Lee
Main category: cs.LG
TL;DR: A modulation-based meta-learning framework is introduced to enhance structure-preserving models for dynamical systems, enabling scalable and generalizable learning without requiring explicit system knowledge or costly retraining.
Details
Motivation: Existing structure-preserving models are limited by fixed configurations and costly retraining for new parameters, while meta-learning approaches face instability or poor generalization.
Method: The paper proposes a modulation-based meta-learning framework that conditions models on latent representations of system parameters, avoiding explicit optimization and gray-box knowledge.
Result: Experiments show accurate predictions in few-shot learning while maintaining physical constraints for stability and generalization across parameter space.
Conclusion: The framework offers a scalable and generalizable solution for modeling parametric families of dynamical systems without compromising physical constraints.
Abstract: Structure-preserving approaches to dynamics modeling have demonstrated great potential for modeling physical systems due to their strong inductive biases that enforce conservation laws and dissipative behavior. However, the resulting models are typically trained for fixed system configurations, requiring explicit knowledge of system parameters as well as costly retraining for each new set of parameters – a major limitation in many-query or parameter-varying scenarios. Meta-learning offers a potential solution, but existing approaches like optimization-based meta-learning often suffer from training instability or limited generalization capability. Inspired by ideas from computer vision, we introduce a modulation-based meta-learning framework that directly conditions structure-preserving models on compact latent representations of potentially unknown system parameters, avoiding the need for gray-box system knowledge and explicit optimization during adaptation. Through the application of novel modulation strategies to parametric energy-conserving and dissipative systems, we enable scalable and generalizable learning across parametric families of dynamical systems. Experiments on standard benchmark problems demonstrate that our approach achieves accurate predictions in few-shot learning settings, without compromising on the essential physical constraints necessary for dynamical stability and effective generalization performance across parameter space.
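Modulation-based conditioning is commonly realized with FiLM-style scale-and-shift layers driven by the latent code; a PyTorch sketch of that mechanism (our illustration, omitting the structure-preserving parameterization itself):

```python
import torch
import torch.nn as nn

class FiLMModulated(nn.Module):
    """A small hypernetwork maps a latent code z for the (possibly
    unknown) system parameters to per-feature scales and shifts applied
    inside the dynamics model. Adapting to a new system then only
    requires inferring the compact code z, not retraining the backbone."""
    def __init__(self, state_dim=4, hidden=64, code_dim=8):
        super().__init__()
        self.inp = nn.Linear(state_dim, hidden)
        self.film = nn.Linear(code_dim, 2 * hidden)     # -> (gamma, beta)
        self.out = nn.Linear(hidden, state_dim)

    def forward(self, x, z):
        h = torch.tanh(self.inp(x))
        gamma, beta = self.film(z).chunk(2, dim=-1)
        return self.out(torch.tanh(gamma * h + beta))   # modulated vector field

model = FiLMModulated()
x = torch.randn(32, 4)       # batch of states
z = torch.randn(32, 8)       # latent codes, one per system instance
dxdt = model(x, z)
```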
[260] Borrowing From the Future: Enhancing Early Risk Assessment through Contrastive Learning
Minghui Sun, Matthew M. Engelhard, Benjamin A. Goldstein
Main category: cs.LG
TL;DR: The paper introduces BFF, a contrastive multi-modal framework to improve early-stage pediatric risk assessments by leveraging later-stage data.
Details
Motivation: Early-stage risk assessments in pediatrics are less precise but clinically desirable, prompting the need for improved prediction performance.
Method: BFF treats each time window as a modality, using contrastive learning to borrow signals from later stages for earlier predictions.
Result: BFF shows consistent improvements in early risk assessments across two pediatric outcome prediction tasks.
Conclusion: BFF effectively enhances early-stage predictions by leveraging future data, validated on real-world tasks.
Abstract: Risk assessments for a pediatric population are often conducted across multiple stages. For example, clinicians may evaluate risks prenatally, at birth, and during Well-Child visits. Although predictions made at later stages typically achieve higher precision, it is clinically desirable to make reliable risk assessments as early as possible. Therefore, this study focuses on improving prediction performance in early-stage risk assessments. Our solution, Borrowing From the Future (BFF), is a contrastive multi-modal framework that treats each time window as a distinct modality. In BFF, a model is trained on all data available over time while performing each risk assessment using only up-to-date information. This contrastive framework allows the model to "borrow" informative signals from later stages (e.g., Well-Child visits) to implicitly supervise the learning at earlier stages (e.g., prenatal/birth stages). We validate BFF on two real-world pediatric outcome prediction tasks, demonstrating consistent improvements in early risk assessments. The code is available at https://github.com/scotsun/bff.
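Treating time windows as modalities suggests a standard InfoNCE objective between paired early- and late-stage embeddings; a sketch of that loss (our rendering, assuming paired per-patient embeddings):

```python
import torch
import torch.nn.functional as F

def infonce(z_early, z_late, tau=0.1):
    """Contrastive objective between time-window 'modalities': a patient's
    early representation should be closest to that same patient's
    later-stage representation within the batch. This is how later-stage
    signal can implicitly supervise the early-stage encoder."""
    z1 = F.normalize(z_early, dim=-1)
    z2 = F.normalize(z_late, dim=-1)
    logits = z1 @ z2.T / tau                  # (B, B) similarity matrix
    labels = torch.arange(z1.shape[0])        # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = infonce(torch.randn(16, 128), torch.randn(16, 128))
```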
[261] How Causal Abstraction Underpins Computational Explanation
Atticus Geiger, Jacqueline Harding, Thomas Icard
Main category: cs.LG
TL;DR: The paper explores how causal abstraction theory can explain cognitive behavior and computational implementation, linking classical philosophy themes to modern machine learning.
Details
Motivation: To understand how systems implement computations over representations, using causal abstraction as a framework.
Method: Uses causal abstraction theory and draws parallels with deep learning and artificial neural networks.
Result: Proposes an account of computational implementation based on causal abstraction and examines representation’s role.
Conclusion: Suggests that generalization and prediction are key areas for exploring these issues further.
Abstract: Explanations of cognitive behavior often appeal to computations over representations. What does it take for a system to implement a given computation over suitable representational vehicles within that system? We argue that the language of causality – and specifically the theory of causal abstraction – provides a fruitful lens on this topic. Drawing on current discussions in deep learning with artificial neural networks, we illustrate how classical themes in the philosophy of computation and cognition resurface in contemporary machine learning. We offer an account of computational implementation grounded in causal abstraction, and examine the role for representation in the resulting picture. We argue that these issues are most profitably explored in connection with generalization and prediction.
[262] Air Quality PM2.5 Index Prediction Model Based on CNN-LSTM
Zicheng Guo, Shuqi Wu, Meixing Zhu, He Guandi
Main category: cs.LG
TL;DR: A hybrid CNN-LSTM model for PM2.5 prediction outperforms traditional methods but requires high computational resources.
Details
Motivation: Accurate PM2.5 prediction is crucial for environmental and public health due to climate change.
Method: Combines CNN for spatial features and LSTM for temporal dependencies using Beijing’s industrial area data (2010-2015).
Result: Achieves RMSE of 5.236, better than traditional models, but computationally intensive.
Conclusion: The model shows promise for real-world use but needs optimization for scalability and handling diverse factors.
Abstract: With the intensification of global climate change, accurate prediction of air quality indicators, especially PM2.5 concentration, has become increasingly important in fields such as environmental protection, public health, and urban management. To address this, we propose an air quality PM2.5 index prediction model based on a hybrid CNN-LSTM architecture. The model effectively combines Convolutional Neural Networks (CNN) for local spatial feature extraction and Long Short-Term Memory (LSTM) networks for modeling temporal dependencies in time series data. Using a multivariate dataset collected from an industrial area in Beijing between 2010 and 2015 – which includes hourly records of PM2.5 concentration, temperature, dew point, pressure, wind direction, wind speed, and precipitation – the model predicts the average PM2.5 concentration over 6-hour intervals. Experimental results show that the model achieves a root mean square error (RMSE) of 5.236, outperforming traditional time series models in both accuracy and generalization. This demonstrates its strong potential in real-world applications such as air pollution early warning systems. However, due to the complexity of multivariate inputs, the model demands high computational resources, and its ability to handle diverse atmospheric factors still requires optimization. Future work will focus on enhancing scalability and expanding support for more complex multivariate weather prediction tasks.
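A minimal PyTorch rendering of the architecture class described (layer sizes are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Minimal CNN-LSTM for multivariate air-quality series: Conv1d
    extracts local patterns across the 7 input variables (PM2.5,
    temperature, dew point, pressure, wind direction/speed, precipitation),
    and the LSTM models longer-range temporal dependencies before the
    regression head."""
    def __init__(self, n_features=7, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # mean PM2.5 over the next 6 h

    def forward(self, x):                  # x: (batch, time, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # back to (B, T, C)
        out, _ = self.lstm(h)
        return self.head(out[:, -1])       # predict from the last timestep

model = CNNLSTM()
pred = model(torch.randn(8, 48, 7))        # 48 hourly steps of 7 variables
```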
[263] Enhancing Interactive Voting-Based Map Matching: Improving Efficiency and Robustness for Heterogeneous GPS Trajectories
William Alemanni, Arianna Burzacchi, Davide Colombi, Elena Giarratano
Main category: cs.LG
TL;DR: An enhanced Interactive Voting-Based Map Matching algorithm improves GPS trajectory reconstruction accuracy, handles varying sampling rates, and integrates trajectory imputation and OpenStreetMap assets for broader applicability.
Details
Motivation: To reconstruct GPS trajectories accurately regardless of input data quality and extend the original algorithm's capabilities for diverse real-world scenarios.
Method: Enhancements include trajectory imputation, a distance-bounded interactive voting strategy, and modifications for missing road network data, leveraging OpenStreetMap assets.
Result: The improved algorithm maintains the original’s strengths while significantly broadening its applicability and reducing computational complexity.
Conclusion: The enhanced algorithm offers high-accuracy trajectory reconstruction and adaptability to diverse geographic regions and data conditions.
Abstract: This paper presents an enhanced version of the Interactive Voting-Based Map Matching algorithm, designed to efficiently process trajectories with varying sampling rates. The main aim is to reconstruct GPS trajectories with high accuracy, independent of input data quality. Building upon the original algorithm, developed exclusively for aligning GPS signals to road networks, we extend its capabilities by integrating trajectory imputation. Our improvements also include the implementation of a distance-bounded interactive voting strategy to reduce computational complexity, as well as modifications to address missing data in the road network. Furthermore, we incorporate a custom-built asset derived from OpenStreetMap, enabling this approach to be smoothly applied in any geographic region covered by OpenStreetMap’s road network. These advancements preserve the core strengths of the original algorithm while significantly extending its applicability to diverse real-world scenarios.
[264] Graph Neural Diffusion via Generalized Opinion Dynamics
Asela Hevapathige, Asiri Wijesinghe, Ahad N. Zehmakan
Main category: cs.LG
TL;DR: GODNF is a novel GNN framework addressing limitations in existing diffusion-based methods by unifying opinion dynamics models for adaptable, interpretable, and efficient graph learning.
Details
Motivation: Existing diffusion-based GNNs lack adaptability, depth, and theoretical understanding, prompting the need for a more flexible and interpretable framework.
Method: GODNF integrates opinion dynamics models into a trainable diffusion mechanism, enabling heterogeneous diffusion patterns and dynamic neighborhood influence.
Result: Theoretical analysis shows diverse convergence, and empirical tests confirm GODNF outperforms state-of-the-art GNNs in node classification and influence estimation.
Conclusion: GODNF successfully addresses key limitations, offering a versatile and interpretable solution for diffusion-based GNNs.
Abstract: There has been a growing interest in developing diffusion-based Graph Neural Networks (GNNs), building on the connections between message passing mechanisms in GNNs and physical diffusion processes. However, existing methods suffer from three critical limitations: (1) they rely on homogeneous diffusion with static dynamics, limiting adaptability to diverse graph structures; (2) their depth is constrained by computational overhead and diminishing interpretability; and (3) theoretical understanding of their convergence behavior remains limited. To address these challenges, we propose GODNF, a Generalized Opinion Dynamics Neural Framework, which unifies multiple opinion dynamics models into a principled, trainable diffusion mechanism. Our framework captures heterogeneous diffusion patterns and temporal dynamics via node-specific behavior modeling and dynamic neighborhood influence, while ensuring efficient and interpretable message propagation even at deep layers. We provide a rigorous theoretical analysis demonstrating GODNF’s ability to model diverse convergence configurations. Extensive empirical evaluations of node classification and influence estimation tasks confirm GODNF’s superiority over state-of-the-art GNNs.
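To ground the opinion-dynamics connection: a classical Friedkin-Johnsen update anchored at the initial features already gives a stable, non-oversmoothing diffusion layer. A PyTorch sketch (our illustration; GODNF's trainable mechanism generalizes beyond this single model):

```python
import torch
import torch.nn as nn

class FJDiffusionLayer(nn.Module):
    """A trainable diffusion step in the spirit of opinion dynamics, based
    on the classical Friedkin-Johnsen model: each node mixes its neighbors'
    states with its own initial features, with a learnable per-node
    susceptibility s in (0, 1). Anchoring on x0 prevents the over-smoothing
    that plagues deep homogeneous diffusion."""
    def __init__(self, n_nodes):
        super().__init__()
        self.s_logit = nn.Parameter(torch.zeros(n_nodes))

    def forward(self, x, x0, adj_norm):
        s = torch.sigmoid(self.s_logit).unsqueeze(-1)   # (N, 1) susceptibility
        return s * (adj_norm @ x) + (1 - s) * x0

# Toy usage on a 4-node path graph with row-normalized adjacency.
A = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                  [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
adj_norm = A / A.sum(dim=1, keepdim=True)
layer = FJDiffusionLayer(n_nodes=4)
x0 = torch.randn(4, 16)
x = x0
for _ in range(8):                 # deep diffusion remains stable
    x = layer(x, x0, adj_norm)
```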
[265] Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing
Ruicheng Xian, Yuxuan Wan, Han Zhao
Main category: cs.LG
TL;DR: A framework for fair classification using closed-weight LLMs via strategic prompting and post-hoc lightweight classifier training.
Details
Motivation: Addressing group fairness in high-stakes applications using LLMs, especially closed-weight models like GPT-4, where traditional fairness methods are inapplicable.
Method: Treats LLM as a feature extractor, uses strategic prompts for fairness criteria, and trains a lightweight fair classifier on extracted features.
Result: Demonstrates strong accuracy-fairness tradeoffs on five datasets, outperforming traditional methods like head-tuning or raw feature training.
Conclusion: Proposed framework is data-efficient and effective for fair classification with closed-weight LLMs.
Abstract: Instruction fine-tuned large language models (LLMs) enable a simple zero-shot or few-shot prompting paradigm, also known as in-context learning, for building prediction models. This convenience, combined with continued advances in LLM capability, has the potential to drive their adoption across a broad range of domains, including high-stakes applications where group fairness – preventing disparate impacts across demographic groups – is essential. The majority of existing approaches to enforcing group fairness on LLM-based classifiers rely on traditional fair algorithms applied via model fine-tuning or head-tuning on final-layer embeddings, but they are no longer applicable to closed-weight LLMs under the in-context learning setting, which include some of the most capable commercial models today, such as GPT-4, Gemini, and Claude. In this paper, we propose a framework for deriving fair classifiers from closed-weight LLMs via prompting: the LLM is treated as a feature extractor, and features are elicited from its probabilistic predictions (e.g., token log probabilities) using prompts strategically designed for the specified fairness criterion to obtain sufficient statistics for fair classification; a fair algorithm is then applied to these features to train a lightweight fair classifier in a post-hoc manner. Experiments on five datasets, including three tabular ones, demonstrate strong accuracy-fairness tradeoffs for the classifiers derived by our framework from both open-weight and closed-weight LLMs; in particular, our framework is data-efficient and outperforms fair classifiers trained on LLM embeddings (i.e., head-tuning) or from scratch on raw tabular features.
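End-to-end, the recipe is: elicit features from the LLM's probabilistic predictions, fit a lightweight classifier, and post-process for the fairness criterion. A sketch with synthetic features standing in for prompt-elicited log probabilities, using per-group thresholds for demographic parity as the stand-in fair algorithm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: 'feats' are per-example features elicited from a
# closed-weight LLM (e.g., token log-probabilities under several prompts);
# 'group' is the protected attribute; synthetic here for illustration.
n = 1000
feats = rng.normal(size=(n, 6))
group = rng.integers(0, 2, size=n)
y = (feats[:, 0] + 0.5 * group + rng.normal(size=n) > 0).astype(int)

clf = LogisticRegression().fit(feats, y)
scores = clf.predict_proba(feats)[:, 1]

# Post-hoc demographic parity: accept the same top fraction in each group.
target_rate = 0.3
pred = np.zeros(n, dtype=int)
for g in (0, 1):
    mask = group == g
    thr = np.quantile(scores[mask], 1 - target_rate)
    pred[mask] = (scores[mask] >= thr).astype(int)

print([pred[group == g].mean() for g in (0, 1)])  # ~0.3 in both groups
```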
[266] Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble
Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng
Main category: cs.LG
TL;DR: The paper investigates adversarial robustness in Spiking Neural Networks (SNNs) using temporal ensembling, proposing the Robust Temporal self-Ensemble (RTE) framework to enhance sub-network robustness and reduce adversarial transferability.
Details
Motivation: SNNs are promising for energy-efficient computing, but their adversarial robustness is poorly understood. The study aims to address vulnerabilities in temporal sub-networks and adversarial transfer across time.
Method: The proposed RTE framework unifies robustness and temporal diversity objectives into a single loss function, using stochastic sampling for efficient optimization.
Result: RTE outperforms existing methods in robust-accuracy trade-off and reshapes SNNs’ internal robustness landscape for more resilient decision boundaries.
Conclusion: The study emphasizes temporal structure’s role in adversarial learning and provides a foundation for robust SNNs.
Abstract: Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation uncovers two critical but underexplored challenges: the fragility of individual temporal sub-networks and the tendency for adversarial vulnerabilities to transfer across time. To overcome these limitations, we propose Robust Temporal self-Ensemble (RTE), a training framework that improves the robustness of each sub-network while reducing the temporal transferability of adversarial perturbations. RTE integrates both objectives into a unified loss and employs a stochastic sampling strategy for efficient optimization. Extensive experiments across multiple benchmarks demonstrate that RTE consistently outperforms existing training methods in the robust-accuracy trade-off. Additional analyses reveal that RTE reshapes the internal robustness landscape of SNNs, leading to more resilient and temporally diversified decision boundaries. Our study highlights the importance of temporal structure in adversarial learning and offers a principled foundation for building robust spiking models.
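As a rough illustration of the two objectives (per-sub-network robustness plus reduced temporal transferability), the sketch below combines an adversarial cross-entropy term over a randomly sampled subset of timesteps with a penalty on cross-timestep agreement of adversarial predictions; the specific diversity term and weighting are our assumptions, not RTE's exact loss.

```python
import torch
import torch.nn.functional as F

def temporal_ensemble_loss(adv_logits_per_step, labels, k=4, lam=0.1):
    """adv_logits_per_step: list of [B, C] logits on adversarial inputs, one per timestep."""
    k = min(k, len(adv_logits_per_step))
    idx = torch.randperm(len(adv_logits_per_step))[:k].tolist()  # stochastic sampling
    robust = sum(F.cross_entropy(adv_logits_per_step[t], labels) for t in idx) / k
    # Penalize agreement of adversarial predictions across timesteps to curb
    # temporal transfer of perturbations (an illustrative diversity term).
    probs = [F.softmax(adv_logits_per_step[t], dim=-1) for t in idx]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    agree = sum(F.cosine_similarity(probs[i], probs[j], dim=-1).mean() for i, j in pairs)
    return robust + lam * agree / max(1, len(pairs))
```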
[267] Generalize across Homophily and Heterophily: Hybrid Spectral Graph Pre-Training and Prompt Tuning
Haitong Luo, Suhang Wang, Weiyao Zhang, Ruiqi Meng, Xuying Meng, Yujun Zhang
Main category: cs.LG
TL;DR: The paper introduces HS-GPPT, a framework for graph pre-training and prompt-tuning that addresses spectral misalignment in real-world graphs, improving knowledge transfer across varying homophily.
Details
Motivation: Existing methods fail to handle diverse spectral distributions in graphs due to reliance on homophily-based low-frequency knowledge, limiting effective adaptation under limited supervision.
Method: Proposes HS-GPPT, using a hybrid spectral filter backbone and local-global contrastive learning for spectral knowledge acquisition, and prompt graphs for spectral alignment.
Result: Extensive experiments show HS-GPPT’s effectiveness in transductive and inductive learning settings.
Conclusion: HS-GPPT bridges spectral gaps, enabling efficient knowledge transfer across diverse graph homophily and heterophily.
Abstract: Graph "pre-training and prompt-tuning" aligns downstream tasks with pre-trained objectives to enable efficient knowledge transfer under limited supervision. However, existing methods rely on homophily-based low-frequency knowledge, failing to handle diverse spectral distributions in real-world graphs with varying homophily. Our theoretical analysis reveals a spectral specificity principle: optimal knowledge transfer requires alignment between pre-trained spectral filters and the intrinsic spectrum of downstream graphs. Under limited supervision, large spectral gaps between pre-training and downstream tasks impede effective adaptation. To bridge this gap, we propose the HS-GPPT model, a novel framework that ensures spectral alignment throughout both pre-training and prompt-tuning. We utilize a hybrid spectral filter backbone and local-global contrastive learning to acquire abundant spectral knowledge. Then we design prompt graphs to align the spectral distribution with pretexts, facilitating spectral knowledge transfer across homophily and heterophily. Extensive experiments validate the effectiveness under both transductive and inductive learning settings. Our code is available at https://anonymous.4open.science/r/HS-GPPT-62D2/.
[268] RegimeNAS: Regime-Aware Differentiable Architecture Search With Theoretical Guarantees for Financial Trading
Prathamesh Devadiga, Yashmitha Shailesh
Main category: cs.LG
TL;DR: RegimeNAS is a differentiable architecture search framework for cryptocurrency trading, integrating market regime awareness with Bayesian search, dynamic neural modules, and a multi-objective loss function, outperforming benchmarks significantly.
Details
Motivation: Addressing the limitations of static deep learning models in dynamic financial environments by embedding market regime awareness into the NAS process.
Method: Uses a Bayesian search space, dynamically activated neural modules (Volatility, Trend, Range blocks), and a multi-objective loss function with market-specific penalties and Lipschitz constraints. Regime identification employs multi-head attention.
Result: Achieves 80.3% Mean Absolute Error reduction and faster convergence (9 vs. 50+ epochs) compared to traditional recurrent baselines.
Conclusion: Highlights the importance of embedding domain-specific knowledge (e.g., market regimes) in NAS for robust financial models.
Abstract: We introduce RegimeNAS, a novel differentiable architecture search framework specifically designed to enhance cryptocurrency trading performance by explicitly integrating market regime awareness. Addressing the limitations of static deep learning models in highly dynamic financial environments, RegimeNAS features three core innovations: (1) a theoretically grounded Bayesian search space optimizing architectures with provable convergence properties; (2) specialized, dynamically activated neural modules (Volatility, Trend, and Range blocks) tailored for distinct market conditions; and (3) a multi-objective loss function incorporating market-specific penalties (e.g., volatility matching, transition smoothness) alongside mathematically enforced Lipschitz stability constraints. Regime identification leverages multi-head attention across multiple timeframes for improved accuracy and uncertainty estimation. Rigorous empirical evaluation on extensive real-world cryptocurrency data demonstrates that RegimeNAS significantly outperforms state-of-the-art benchmarks, achieving an 80.3% Mean Absolute Error reduction compared to the best traditional recurrent baseline and converging substantially faster (9 vs. 50+ epochs). Ablation studies and regime-specific analysis confirm the critical contribution of each component, particularly the regime-aware adaptation mechanism. This work underscores the imperative of embedding domain-specific knowledge, such as market regimes, directly within the NAS process to develop robust and adaptive models for challenging financial applications.
[269] Conformal Prediction Meets Long-tail Classification
Shuqi Liu, Jianguo Huang, Luke Ong
Main category: cs.LG
TL;DR: The paper introduces Tail-Aware Conformal Prediction (TACP) to address imbalanced coverage in long-tail label distributions, ensuring better reliability for minority classes.
Details
Motivation: Existing CP methods often over-cover head classes and under-cover tail classes, compromising reliability for minority classes despite marginal coverage guarantees.
Method: Proposes TACP to leverage the long-tail structure and reduce the head-tail coverage gap, with an extension (sTACP) that uses reweighting for balanced coverage.
Result: Theoretical analysis confirms TACP reduces head-tail coverage gap; experiments on benchmark datasets validate its effectiveness.
Conclusion: TACP and sTACP improve coverage balance across classes, enhancing reliability for minority classes in long-tail distributions.
Abstract: Conformal Prediction (CP) is a popular method for uncertainty quantification that converts a pretrained model’s point prediction into a prediction set, with the set size reflecting the model’s confidence. Although existing CP methods are guaranteed to achieve marginal coverage, they often exhibit imbalanced coverage across classes under long-tail label distributions, tending to over-cover the head classes at the expense of under-covering the remaining tail classes. This under-coverage is particularly concerning, as it undermines the reliability of the prediction sets for minority classes, even with coverage ensured on average. In this paper, we propose the Tail-Aware Conformal Prediction (TACP) method to mitigate the under-coverage of the tail classes by utilizing the long-tail structure and narrowing the head-tail coverage gap. Theoretical analysis shows that it consistently achieves a smaller head-tail coverage gap than standard methods. To further improve coverage balance across all classes, we introduce an extension of TACP: soft TACP (sTACP) via a reweighting mechanism. The proposed framework can be combined with various non-conformity scores, and experiments on multiple long-tail benchmark datasets demonstrate the effectiveness of our methods.
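For orientation, the sketch below shows plain split conformal prediction calibrated separately on head and tail classes, a simple way to shrink the head-tail coverage gap in the spirit of (though not identical to) TACP.

```python
import numpy as np

def grouped_thresholds(cal_probs, cal_labels, head_classes, alpha=0.1):
    """Nonconformity score: 1 - probability assigned to the true class."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    is_head = np.isin(cal_labels, head_classes)
    def qhat(s):  # finite-sample-corrected (1 - alpha) quantile
        n = len(s)
        return np.quantile(s, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return {True: qhat(scores[is_head]), False: qhat(scores[~is_head])}

def prediction_set(probs, thr, head_classes):
    """Include class c whenever its score clears that group's threshold."""
    head = set(head_classes)
    return [c for c, p in enumerate(probs) if 1.0 - p <= thr[c in head]]
```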
[270] NeMo: A Neuron-Level Modularizing-While-Training Approach for Decomposing DNN Models
Xiaohan Bi, Binhang Qi, Hailong Sun, Xiang Gao, Yue Yu, Xiaojun Liang
Main category: cs.LG
TL;DR: NeMo introduces a scalable and generalizable modularizing-while-training (MwT) approach for DNNs, outperforming existing methods in accuracy and efficiency.
Details
Motivation: The prohibitive costs of DNN model construction and the inefficiency of indiscriminate model reuse drive the need for better modularization techniques.
Method: NeMo operates at the neuron level, uses contrastive learning, and employs a composite loss function to modularize DNNs during training.
Result: NeMo achieves 1.72% higher module classification accuracy and 58.10% smaller module size compared to state-of-the-art methods.
Conclusion: NeMo is a promising solution for scalable and generalizable DNN modularization, applicable to diverse architectures like Transformers and CNNs.
Abstract: With the growing incorporation of deep neural network (DNN) models into modern software systems, the prohibitive construction costs have become a significant challenge. Model reuse has been widely applied to reduce training costs, but indiscriminately reusing entire models may incur significant inference overhead. Consequently, DNN modularization has gained attention, enabling module reuse by decomposing DNN models. The emerging modularizing-while-training (MwT) paradigm, which incorporates modularization into training, outperforms modularizing-after-training approaches. However, existing MwT methods focus on small-scale CNN models at the convolutional kernel level and struggle with diverse DNNs and large-scale models, particularly Transformer-based models. To address these limitations, we propose NeMo, a scalable and generalizable MwT approach. NeMo operates at the neuron level, a fundamental component common to all DNNs, ensuring applicability to Transformers and various architectures. We design a contrastive learning-based modular training method with an effective composite loss function, enabling scalability to large-scale models. Comprehensive experiments on two Transformer-based models and four CNN models across two classification datasets demonstrate NeMo’s superiority over state-of-the-art MwT methods. Results show average gains of 1.72% in module classification accuracy and 58.10% reduction in module size, demonstrating efficacy across both CNN and large-scale Transformer-based models. A case study on open-source projects shows NeMo’s potential benefits in practical scenarios, offering a promising approach for scalable and generalizable DNN modularization.
[271] A Global Dataset of Location Data Integrity-Assessed Reforestation Efforts
Angela John, Selvyn Allotey, Till Koebe, Alexandra Tyukavina, Ingmar Weber
Main category: cs.LG
TL;DR: The study addresses reliability concerns in afforestation/reforestation projects by introducing a global dataset whose site locations are validated with a standardized Location Data Integrity Score (LDIS) derived from satellite imagery.
Details
Motivation: To enhance accountability in voluntary carbon markets by validating self-reported data on afforestation/reforestation projects.
Method: Compiled a global dataset from primary and secondary sources, including satellite imagery, and introduced the LDIS score for location integrity.
Result: 79% of georeferenced sites failed at least 1 LDIS indicator; 15% lacked machine-readable geodata.
Conclusion: The dataset improves transparency and serves as valuable training data for computer vision tasks.
Abstract: Afforestation and reforestation are popular strategies for mitigating climate change by enhancing carbon sequestration. However, the effectiveness of these efforts is often self-reported by project developers, or certified through processes with limited external validation. This leads to concerns about data reliability and project integrity. In response to increasing scrutiny of voluntary carbon markets, this study presents a dataset on global afforestation and reforestation efforts compiled from primary (meta-)information and augmented with time-series satellite imagery and other secondary data. Our dataset covers 1,289,068 planting sites from 45,628 projects spanning 33 years. Since any remote sensing-based validation effort relies on the integrity of a planting site’s geographic boundary, this dataset introduces a standardized assessment of the provided site-level location information, which we summarize in one easy-to-communicate key indicator: LDIS – the Location Data Integrity Score. We find that approximately 79% of the georeferenced planting sites monitored fail on at least 1 out of 10 LDIS indicators, while 15% of the monitored projects lack machine-readable georeferenced data in the first place. In addition to enhancing accountability in the voluntary carbon market, the presented dataset also holds value as training data for, e.g., computer vision-related tasks with millions of linked Sentinel-2 and Planetscope satellite images.
[272] Harmonized Gradient Descent for Class Imbalanced Data Stream Online Learning
Han Zhou, Hongpeng Yin, Xuanhong Deng, Yuyu Huang, Hao Ren
Main category: cs.LG
TL;DR: The paper introduces the Harmonized Gradient Descent (HGD) algorithm to address imbalanced data streams by equalizing gradient norms across classes, achieving balanced online learning without extra requirements.
Details
Motivation: Real-world data streams often exhibit class imbalance, and existing methods like resampling or reweighting are limited. The paper aims to tackle imbalance through gradient descent modification.
Method: Proposes HGD, which balances gradient norms across classes during training, mitigating under-fitting for minor classes. It requires no data buffer, extra parameters, or prior knowledge.
Result: Theoretical analysis shows HGD achieves a sub-linear regret bound. Experiments confirm its efficiency and effectiveness compared to existing methods.
Conclusion: HGD is a streamlined, effective solution for imbalanced data stream learning, applicable to any gradient-based model.
Abstract: Many real-world data are sequentially collected over time and often exhibit skewed class distributions, resulting in imbalanced data streams. While existing approaches have explored several strategies, such as resampling and reweighting, for imbalanced data stream learning, our work distinguishes itself by addressing the imbalance problem through training modification, particularly focusing on gradient descent techniques. We introduce the harmonized gradient descent (HGD) algorithm, which aims to equalize the norms of gradients across different classes. By ensuring the gradient norm balance, HGD mitigates under-fitting for minor classes and achieves balanced online learning. Notably, HGD operates in a streamlined implementation process, requiring no data-buffer, extra parameters, or prior knowledge, making it applicable to any learning models utilizing gradient descent for optimization. Theoretical analysis, based on a few common and mild assumptions, shows that HGD achieves a satisfactory sub-linear regret bound. The proposed algorithm is compared with commonly used online imbalance learning methods under several imbalanced data stream scenarios. Extensive experimental evaluations demonstrate the efficiency and effectiveness of HGD in learning imbalanced data streams.
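A bare-bones reading of the update rule, sketched below: compute one gradient per class present in the mini-batch, rescale each to a common norm, and apply the sum. This is our sketch of the idea, not the authors' released code.

```python
import torch

def harmonized_step(model, loss_fn, x, y, opt):
    """One HGD-style update: per-class gradients rescaled to a common norm."""
    opt.zero_grad()
    params = list(model.parameters())
    class_grads, norms = [], []
    for c in y.unique():
        m = (y == c)
        g = torch.autograd.grad(loss_fn(model(x[m]), y[m]), params)
        class_grads.append(g)
        norms.append(torch.sqrt(sum(t.pow(2).sum() for t in g)))
    target = torch.stack(norms).mean()  # harmonize: equalize gradient norms
    for i, p in enumerate(params):
        p.grad = sum((target / (n + 1e-12)) * g[i] for g, n in zip(class_grads, norms))
    opt.step()
```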
[273] ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu, TingTing Gao
Main category: cs.LG
TL;DR: The paper introduces entropy-based strategies (ETMR and EAR) to improve test-time reinforcement learning (TTRL) in LLMs, addressing challenges like high inference costs and overconfidence, and achieving a 68% relative Pass@1 improvement on AIME 2024.
Details
Motivation: Current LLMs rely on annotated data and struggle in unsupervised settings. TTRL offers self-optimization but faces issues like high costs and overconfidence.
Method: Proposes ETMR and EAR, entropy-based mechanisms to balance exploration and exploitation in TTRL.
Result: Llama3.1-8B achieves a 68% relative improvement in Pass@1 on AIME 2024 while consuming only 60% of the rollout token budget.
Conclusion: The entropy-based approach enhances TTRL efficiency, diversity, and robustness, advancing unsupervised learning for open-domain reasoning.
Abstract: Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, including high inference costs due to parallel rollouts and early-stage estimation bias that fosters overconfidence, reducing output diversity and causing performance plateaus. To address these challenges, we introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in the Pass@1 metric on the AIME 2024 benchmark, while consuming only 60 percent of the rollout token budget. This highlights our method’s ability to effectively optimize the trade-off between inference efficiency, diversity, and estimation robustness, thereby advancing unsupervised reinforcement learning for open-domain reasoning tasks.
[274] PTSM: Physiology-aware and Task-invariant Spatio-temporal Modeling for Cross-Subject EEG Decoding
Changhong Jing, Yan Liu, Shuqiang Wang, Bruce X. B. Yu, Gong Chen, Zhejing Hu, Zhi Zhang, Yanyan Shen
Main category: cs.LG
TL;DR: PTSM is a novel EEG decoding framework that combines personalized and shared spatio-temporal patterns for robust cross-subject performance without calibration.
Details
Motivation: Addressing inter-subject variability and the lack of subject-invariant representations in EEG decoding for BCIs.
Method: Uses a dual-branch masking mechanism, factorized spatio-temporal masks, and information-theoretic constraints for disentangled embeddings.
Result: Outperforms state-of-the-art baselines in zero-shot generalization on motor imagery datasets.
Conclusion: PTSM effectively balances personalized and transferable decoding, proving useful in non-stationary neurophysiological settings.
Abstract: Cross-subject electroencephalography (EEG) decoding remains a fundamental challenge in brain-computer interface (BCI) research due to substantial inter-subject variability and the scarcity of subject-invariant representations. This paper proposed PTSM (Physiology-aware and Task-invariant Spatio-temporal Modeling), a novel framework for interpretable and robust EEG decoding across unseen subjects. PTSM employs a dual-branch masking mechanism that independently learns personalized and shared spatio-temporal patterns, enabling the model to preserve individual-specific neural characteristics while extracting task-relevant, population-shared features. The masks are factorized across temporal and spatial dimensions, allowing fine-grained modulation of dynamic EEG patterns with low computational overhead. To further address representational entanglement, PTSM enforces information-theoretic constraints that decompose latent embeddings into orthogonal task-related and subject-related subspaces. The model is trained end-to-end via a multi-objective loss integrating classification, contrastive, and disentanglement objectives. Extensive experiments on cross-subject motor imagery datasets demonstrate that PTSM achieves strong zero-shot generalization, outperforming state-of-the-art baselines without subject-specific calibration. Results highlight the efficacy of disentangled neural representations for achieving both personalized and transferable decoding in non-stationary neurophysiological settings.
[275] Fusing Rewards and Preferences in Reinforcement Learning
Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser
Main category: cs.LG
TL;DR: DFA is a reinforcement learning algorithm combining individual rewards and pairwise preferences into a single update rule, outperforming SAC and RLHF baselines in stability and performance.
Details
Motivation: To improve reinforcement learning by integrating both rewards and preferences, avoiding a separate reward-modeling step and leveraging human or synthesized preferences.
Method: DFA uses policy log-probabilities to model the preference probability, incorporating preferences from human annotators or synthesized online from Q-values. It minimizes a preference loss under a Bradley-Terry model.
Result: DFA matches or exceeds SAC in six control environments and outperforms RLHF baselines in a stochastic GridWorld, nearing oracle performance with true rewards.
Conclusion: DFA effectively combines rewards and preferences, offering stable training and superior performance over existing methods.
Abstract: We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses both individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy’s log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human-annotators (at state-level or trajectory-level) or be synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we prove that minimizing DFA’s preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments and demonstrates a more stable training process. With only a semi-synthetic preference dataset under Bradley-Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.
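The core of the preference side is compact enough to state directly: under a Bradley-Terry model, the probability that trajectory A is preferred over B is a sigmoid of the difference in policy log-probabilities, giving the loss sketched below (the trajectory-level aggregation and the temperature `beta` are our simplifications).

```python
import torch
import torch.nn.functional as F

def bradley_terry_preference_loss(logp_preferred, logp_rejected, beta=1.0):
    """logp_*: sum of log pi(a_t | s_t) over each trajectory, shape [B].

    Minimizing this drives P(preferred > rejected) = sigmoid(beta * diff) toward 1,
    using the policy's own log-probabilities in place of a learned reward model.
    """
    return -F.logsigmoid(beta * (logp_preferred - logp_rejected)).mean()
```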
[276] Minimizing Surrogate Losses for Decision-Focused Learning using Differentiable Optimization
Jayanta Mandi, Ali İrfan Mahmutoğulları, Senne Berden, Tias Guns
Main category: cs.LG
TL;DR: The paper addresses the issue of zero gradients in gradient-based decision-focused learning (DFL) for linear programs (LPs) by proposing surrogate loss minimization, even with differentiable optimization layers. This approach achieves comparable or better regret than existing methods while reducing training time.
Details
Motivation: Existing gradient-based DFL methods for LPs face zero gradients almost everywhere, limiting their effectiveness. The paper aims to improve decision regret and training efficiency.
Method: The authors propose minimizing surrogate losses instead of directly minimizing regret, even when using differentiable optimization layers. They test this approach with DYS-Net, a differentiable optimization technique for LPs.
Result: Experiments show surrogate loss minimization achieves regret comparable to or better than existing methods, with significantly reduced training time.
Conclusion: Minimizing surrogate losses with differentiable optimization layers, like DYS-Net, is an effective and efficient approach for DFL in LPs.
Abstract: Decision-focused learning (DFL) trains a machine learning (ML) model to predict parameters of an optimization problem, to directly minimize decision regret, i.e., maximize decision quality. Gradient-based DFL requires computing the derivative of the solution to the optimization problem with respect to the predicted parameters. However, for many optimization problems, such as linear programs (LPs), the gradient of the regret with respect to the predicted parameters is zero almost everywhere. Existing gradient-based DFL approaches for LPs try to circumvent this issue in one of two ways: (a) smoothing the LP into a differentiable optimization problem by adding a quadratic regularizer and then minimizing the regret directly or (b) minimizing surrogate losses that have informative (sub)gradients. In this paper, we show that the former approach still results in zero gradients, because even after smoothing the regret remains constant across large regions of the parameter space. To address this, we propose minimizing surrogate losses – even when a differentiable optimization layer is used and regret can be minimized directly. Our experiments demonstrate that minimizing surrogate losses allows differentiable optimization layers to achieve regret comparable to or better than surrogate-loss based DFL methods. Further, we demonstrate that this also holds for DYS-Net, a recently proposed differentiable optimization technique for LPs, that computes approximate solutions and gradients through operations that can be performed using feedforward neural network layers. Because DYS-Net executes the forward and the backward pass very efficiently, by minimizing surrogate losses using DYS-Net, we are able to attain regret on par with the state-of-the-art while reducing training time by a significant margin.
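One widely used surrogate with informative subgradients is the SPO+ loss of Elmachtoub and Grigas; the sketch below (ours, for a minimization LP over a fixed feasible set, solved with scipy) shows why the subgradient stays non-zero even where the regret itself is flat.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp(c, A_ub, b_ub):
    """argmin_w c @ w  s.t.  A_ub @ w <= b_ub, w >= 0."""
    return linprog(c, A_ub=A_ub, b_ub=b_ub).x

def spo_plus(c_hat, c_true, A_ub, b_ub):
    """SPO+ surrogate loss and its subgradient w.r.t. the predicted cost c_hat."""
    w_true = solve_lp(c_true, A_ub, b_ub)
    w_inner = solve_lp(2 * c_hat - c_true, A_ub, b_ub)  # maximizes (c - 2c_hat) @ w
    loss = (c_true - 2 * c_hat) @ w_inner + 2 * c_hat @ w_true - c_true @ w_true
    return loss, 2 * (w_true - w_inner)  # informative (generally non-zero) subgradient
```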
[277] A Remedy for Over-Squashing in Graph Learning via Forman-Ricci Curvature based Graph-to-Hypergraph Structural Lifting
Michael Banf, Dominik Filipiak, Max Schattauer, Liliya Imasheva
Main category: cs.LG
TL;DR: The paper proposes a structural lifting strategy using Forman-Ricci curvature to enhance Graph Neural Networks (GNNs) by addressing information distortion in long-distance message passing.
Details
Motivation: Real-world systems like social or biological networks involve complex interactions best represented by higher-order structures, which current GNNs struggle to model effectively.
Method: Introduces a lifting technique based on Forman-Ricci curvature to transform graph data into more expressive topologies before applying GNNs.
Result: The approach mitigates over-squashing by revealing network backbones and preserving structural properties, improving information flow across large distances.
Conclusion: The proposed curvature-based lifting strategy enhances GNN performance by better capturing higher-order interactions in complex networks.
Abstract: Graph Neural Networks are highly effective at learning from relational data, leveraging node and edge features while maintaining the symmetries inherent to graph structures. However, many real-world systems, such as social or biological networks, exhibit complex interactions that are more naturally represented by higher-order topological domains. The emerging field of Geometric and Topological Deep Learning (TDL) addresses this challenge by introducing methods that utilize and benefit from higher-order structures. Central to TDL is the concept of lifting, which transforms data representations from basic graph forms to more expressive topologies before the application of GNN models for learning. In this work, we propose a structural lifting strategy using Forman-Ricci curvature, which defines an edge-based network characteristic based on Riemannian geometry. Curvature reveals local and global properties of a graph, such as a network’s backbones, i.e., coarse, structure-preserving graph geometries that form connections between major communities - most suitably represented as hyperedges to model information flows between clusters across large distances in the network. To this end, our approach provides a remedy to the problem of information distortion in message passing across long distances and graph bottlenecks - a phenomenon known in graph learning as over-squashing.
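The edge characteristic driving the lifting is cheap to compute: for an unweighted graph, the (augmented) Forman-Ricci curvature of an edge reduces to a degree-and-triangle count, as in this small sketch (the threshold used to flag "backbone" candidates is illustrative, not from the paper).

```python
import networkx as nx

def forman_curvature(G: nx.Graph, augmented: bool = True) -> dict:
    """Edge-wise Forman-Ricci curvature: F(u,v) = 4 - deg(u) - deg(v) (+ 3 * #triangles)."""
    curv = {}
    for u, v in G.edges():
        c = 4 - G.degree(u) - G.degree(v)
        if augmented:  # triangle-corrected (augmented) variant
            c += 3 * len(list(nx.common_neighbors(G, u, v)))
        curv[(u, v)] = c
    return curv

G = nx.karate_club_graph()
curv = forman_curvature(G)
bottlenecks = [e for e, c in curv.items() if c <= -5]  # strongly negative: bridge-like edges
```

Strongly negative edges tend to mark bridges and bottlenecks between communities, which is what makes curvature a natural signal for deciding where hyperedges should be introduced.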
[278] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
Main category: cs.LG
TL;DR: CHORD unifies SFT and RL via dynamic weighting, improving stability and performance in LLM refinement.
Details
Motivation: Addresses the risk of disrupting model patterns and overfitting when integrating SFT and RL.
Method: Proposes CHORD, a framework with dynamic weighting of SFT as an auxiliary objective in RL, using a dual-control mechanism for holistic and granular learning.
Result: CHORD achieves stable, efficient learning and outperforms baselines by harmonizing off-policy expert data with on-policy exploration.
Conclusion: CHORD effectively integrates SFT and RL, demonstrating significant improvements and inspiring further research.
Abstract: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data’s influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
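In schematic form, the dual-control objective can be read as follows; this is our reading, with an illustrative token-wise weight and annealing schedule, not the released implementation at the linked repository.

```python
import torch

def chord_style_loss(rl_loss, expert_token_logps, step, total_steps):
    """rl_loss: scalar on-policy RL objective; expert_token_logps: [N] log-probs of expert tokens."""
    mu = max(0.0, 1.0 - step / total_steps)  # global coefficient: off-policy -> on-policy
    p = expert_token_logps.exp()
    w = p * (1.0 - p)                        # example token-wise weight (our assumption)
    sft = -(w * expert_token_logps).sum() / w.sum().clamp_min(1e-8)
    return rl_loss + mu * sft                # SFT as a dynamically weighted auxiliary term
```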
[279] Generative Co-Design of Antibody Sequences and Structures via Black-Box Guidance in a Shared Latent Space
Yinghua Yao, Yuangang Pan, Xixian Chen
Main category: cs.LG
TL;DR: LEAD is a sequence-structure co-design framework that optimizes antibody CDRs in a shared latent space, improving efficiency and performance over methods that operate in the raw data space.
Details
Motivation: Existing methods for optimizing antibody CDRs are inefficient due to raw data space operations, prompting the need for a more efficient approach.
Method: LEAD optimizes sequence and structure in their shared latent space using a black-box guidance strategy for non-differentiable property evaluators.
Result: LEAD achieves superior optimization performance, reducing query consumption by half while outperforming baselines in property optimization.
Conclusion: LEAD offers an efficient and effective solution for antibody CDR optimization, with potential real-world applicability.
Abstract: Advancements in deep generative models have enabled the joint modeling of antibody sequence and structure, given the antigen-antibody complex as context. However, existing approaches for optimizing complementarity-determining regions (CDRs) to improve developability properties operate in the raw data space, leading to excessively costly evaluations due to the inefficient search process. To address this, we propose LatEnt blAck-box Design (LEAD), a sequence-structure co-design framework that optimizes both sequence and structure within their shared latent space. Optimizing shared latent codes can not only break through the limitations of existing methods, but also ensure synchronization of different modality designs. Particularly, we design a black-box guidance strategy to accommodate real-world scenarios where many property evaluators are non-differentiable. Experimental results demonstrate that our LEAD achieves superior optimization performance for both single and multi-property objectives. Notably, LEAD reduces query consumption by a half while surpassing baseline methods in property optimization. The code is available at https://github.com/EvaFlower/LatEnt-blAck-box-Design.
[280] Robust Convolution Neural ODEs via Contractivity-promoting regularization
Muhammad Zakwan, Liang Xu, Giancarlo Ferrari-Trecate
Main category: cs.LG
TL;DR: The paper proposes using contraction theory to enhance the robustness of Convolutional Neural Ordinary Differential Equations (NODEs) against input noise and adversarial attacks.
Details
Motivation: Neural networks are vulnerable to input noise and adversarial attacks, prompting the need for more robust models.
Method: The authors introduce contractive Convolutional NODEs, leveraging contraction theory to ensure exponential convergence of trajectories. Robustness is induced via Jacobian-based regularization or weight regularization for slope-restricted activation functions.
Result: The method is tested on MNIST and FashionMNIST datasets with corrupted images, demonstrating improved robustness.
Conclusion: Contractive Convolutional NODEs, trained with specific regularization, effectively enhance robustness against noise and adversarial attacks.
Abstract: Neural networks can be fragile to input noise and adversarial attacks. In this work, we consider Convolutional Neural Ordinary Differential Equations (NODEs), a family of continuous-depth neural networks represented by dynamical systems, and propose to use contraction theory to improve their robustness. For a contractive dynamical system two trajectories starting from different initial conditions converge to each other exponentially fast. Contractive Convolutional NODEs can enjoy increased robustness as slight perturbations of the features do not cause a significant change in the output. Contractivity can be induced during training by using a regularization term involving the Jacobian of the system dynamics. To reduce the computational burden, we show that it can also be promoted using carefully selected weight regularization terms for a class of NODEs with slope-restricted activation functions. The performance of the proposed regularizers is illustrated through benchmark image classification tasks on MNIST and FashionMNIST datasets, where images are corrupted by different kinds of noise and attacks.
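As a sketch of the Jacobian-based route (the costlier of the two options described; the weight-based surrogates exist precisely to avoid it), one can penalize the largest eigenvalue of the symmetrized Jacobian of the dynamics at sampled states. The penalty form and margin `c` below are our assumptions.

```python
import torch
from torch.func import jacrev, vmap

def contractivity_penalty(f, x_batch, c=0.1):
    """Penalize states where the symmetrized Jacobian of f has eigenvalues above -c."""
    J = vmap(jacrev(f))(x_batch)                 # [B, d, d] per-sample Jacobians
    S = 0.5 * (J + J.transpose(-1, -2))          # symmetric part governs contraction
    lam_max = torch.linalg.eigvalsh(S)[..., -1]  # largest eigenvalue, per sample
    return torch.relu(lam_max + c).mean()        # zero iff lam_max <= -c everywhere

# Example: regularize simple fully-connected dynamics f(x) = tanh(W x).
W = torch.nn.Parameter(0.1 * torch.randn(8, 8))
penalty = contractivity_penalty(lambda x: torch.tanh(W @ x), torch.randn(32, 8))
```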
[281] Multi-Sensory Cognitive Computing for Learning Population-level Brain Connectivity
Mayssa Soussia, Mohamed Ali Mahjoub, Islem Rekik
Main category: cs.LG
TL;DR: mCOCO is a novel framework using Reservoir Computing to create interpretable, efficient, and cognitively rich connectional brain templates (CBTs) from BOLD signals, outperforming existing methods.
Details
Motivation: Existing CBT learning methods lack interpretability, are computationally expensive, and ignore cognitive aspects. mCOCO addresses these gaps.
Method: mCOCO uses Reservoir Computing to map BOLD signals into functional connectomes, aggregates them into a CBT, and integrates multi-sensory inputs for cognitive traits.
Result: mCOCO-based CBT outperforms GNN-based methods in centeredness, discriminativeness, topological soundness, and memory retention.
Conclusion: mCOCO offers a superior, interpretable, and efficient approach to CBT generation with cognitive capabilities.
Abstract: The generation of connectional brain templates (CBTs) has recently garnered significant attention for its potential to identify unique connectivity patterns shared across individuals. However, existing methods for CBT learning such as conventional machine learning and graph neural networks (GNNs) are hindered by several limitations. These include: (i) poor interpretability due to their black-box nature, (ii) high computational cost, and (iii) an exclusive focus on structure and topology, overlooking the cognitive capacity of the generated CBT. To address these challenges, we introduce mCOCO (multi-sensory COgnitive COmputing), a novel framework that leverages Reservoir Computing (RC) to learn population-level functional CBT from BOLD (Blood-Oxygen-level-Dependent) signals. RC’s dynamic system properties allow for tracking state changes over time, enhancing interpretability and enabling the modeling of brain-like dynamics, as demonstrated in prior literature. By integrating multi-sensory inputs (e.g., text, audio, and visual data), mCOCO captures not only structure and topology but also how brain regions process information and adapt to cognitive tasks such as sensory processing, all in a computationally efficient manner. Our mCOCO framework consists of two phases: (1) mapping BOLD signals into the reservoir to derive individual functional connectomes, which are then aggregated into a group-level CBT - an approach, to the best of our knowledge, not previously explored in functional connectivity studies - and (2) incorporating multi-sensory inputs through a cognitive reservoir, endowing the CBT with cognitive traits. Extensive evaluations show that our mCOCO-based template significantly outperforms GNN-based CBT in terms of centeredness, discriminativeness, topological soundness, and multi-sensory memory retention. Our source code is available at https://github.com/basiralab/mCOCO.
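Phase 1 rests on the standard echo-state update at the heart of Reservoir Computing; the sketch below runs a fixed random reservoir over a multivariate signal (dimensions and spectral radius are illustrative), after which region-wise connectivity can be read off the reservoir-driven states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 16, 200                      # e.g. 16 signal channels, 200 reservoir units
W_in = rng.normal(0.0, 0.5, (n_res, n_in))
W = rng.normal(0.0, 1.0, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1 (echo-state property)

def run_reservoir(U):
    """U: [T, n_in] input time series -> [T, n_res] reservoir state trajectory."""
    x, states = np.zeros(n_res), []
    for u in U:
        x = np.tanh(W_in @ u + W @ x)      # fixed, untrained recurrent update
        states.append(x.copy())
    return np.array(states)

states = run_reservoir(rng.normal(size=(100, n_in)))
```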
[282] Informative Post-Hoc Explanations Only Exist for Simple Functions
Eric Günther, Balázs Szabados, Robi Bhattacharjee, Sebastian Bordt, Ulrike von Luxburg
Main category: cs.LG
TL;DR: The paper critiques local post-hoc explanation algorithms for complex ML models, showing they often fail to provide meaningful insights without strong assumptions. It introduces a framework for informative explanations and derives conditions under which certain algorithms become informative.
Details
Motivation: To address the lack of theoretical guarantees for explanation algorithms in complex models and rigorously evaluate their informativeness.
Method: Introduces a learning-theory-based framework defining informative explanations as those that reduce the complexity of the space of plausible decision functions. Analyzes popular algorithms under this framework.
Result: Many popular explanation algorithms (e.g., gradient, SHAP, anchor) are non-informative for complex models unless strong conditions are met.
Conclusion: The findings challenge the practicality of current explanation methods for high-stakes AI applications and suggest modifications to improve informativeness.
Abstract: Many researchers have suggested that local post-hoc explanation algorithms can be used to gain insights into the behavior of complex machine learning models. However, theoretical guarantees about such algorithms only exist for simple decision functions, and it is unclear whether and under which assumptions similar results might exist for complex models. In this paper, we introduce a general, learning-theory-based framework for what it means for an explanation to provide information about a decision function. We call an explanation informative if it serves to reduce the complexity of the space of plausible decision functions. With this approach, we show that many popular explanation algorithms are not informative when applied to complex decision functions, providing a rigorous mathematical rejection of the idea that it should be possible to explain any model. We then derive conditions under which different explanation algorithms become informative. These are often stronger than what one might expect. For example, gradient explanations and counterfactual explanations are non-informative with respect to the space of differentiable functions, and SHAP and anchor explanations are not informative with respect to the space of decision trees. Based on these results, we discuss how explanation algorithms can be modified to become informative. While the proposed analysis of explanation algorithms is mathematical, we argue that it holds strong implications for the practical applicability of these algorithms, particularly for auditing, regulation, and high-risk applications of AI.
[283] Calibrated and uncertain? Evaluating uncertainty estimates in binary classification models
Aurora Grefsrud, Nello Blaser, Trygve Buanes
Main category: cs.LG
TL;DR: The paper evaluates six probabilistic machine learning algorithms for uncertainty estimation, finding all are well-calibrated but deep learning methods fail to reflect uncertainty for out-of-distribution data.
Details
Motivation: To address the challenge of uncertainty quantification in complex data models like deep learning, ensuring validity in scientific discovery.
Method: Uses approximate Bayesian inference and empirical tests on synthetic datasets to assess six algorithms for class probability and uncertainty estimation.
Result: All algorithms are well-calibrated, but deep learning-based ones do not consistently show increased uncertainty for out-of-distribution data.
Conclusion: The study clarifies uncertainty estimation methods, aiding researchers in developing robust techniques for scientific modeling.
Abstract: Rigorous statistical methods, including parameter estimation with accompanying uncertainties, underpin the validity of scientific discovery, especially in the natural sciences. With increasingly complex data models such as deep learning techniques, uncertainty quantification has become exceedingly difficult and a plethora of techniques have been proposed. In this case study, we use the unifying framework of approximate Bayesian inference combined with empirical tests on carefully created synthetic classification datasets to investigate qualitative properties of six different probabilistic machine learning algorithms for class probability and uncertainty estimation: (i) a neural network ensemble, (ii) neural network ensemble with conflictual loss, (iii) evidential deep learning, (iv) a single neural network with Monte Carlo Dropout, (v) Gaussian process classification and (vi) a Dirichlet process mixture model. We check if the algorithms produce uncertainty estimates which reflect commonly desired properties, such as being well calibrated and exhibiting an increase in uncertainty for out-of-distribution data points. Our results indicate that all algorithms are well calibrated, but none of the deep learning based algorithms provide uncertainties that consistently reflect lack of experimental evidence for out-of-distribution data points. We hope our study may serve as a clarifying example for researchers developing new methods of uncertainty estimation for scientific data-driven modeling.
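Among the six estimators, Monte Carlo Dropout is the easiest to reproduce: keep dropout stochastic at test time, average several passes, and score uncertainty by predictive entropy, which, per the results above, need not rise off-distribution. A minimal sketch (assuming a dropout-only model, since `train()` would also affect batch-norm layers):

```python
import torch

def mc_dropout_predict(model, x, T=50):
    """Average T stochastic forward passes and report predictive entropy."""
    model.train()  # keeps dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(T)]).mean(0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return probs, entropy  # higher entropy = more uncertain prediction
```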
[284] Predicting and Explaining Traffic Crash Severity Through Crash Feature Selection
Andrea Castellani, Zacharias Papadovasilakis, Giorgos Papoutsoglou, Mary Cole, Brian Bautsch, Tobias Rodemann, Ioannis Tsamardinos, Angela Harden
Main category: cs.LG
TL;DR: The study introduces a dataset of Ohio crashes (2017-2022) and uses AutoML and explainable AI to identify key risk factors for severe crashes, achieving 85.6% AUC-ROC.
Details
Motivation: Motor vehicle crashes are a major cause of injury/death, requiring data-driven methods to mitigate severity.
Method: Combines AutoML (JADBio) and SHAP for feature selection and interpretation, using Ridge Logistic Regression.
Result: Model achieved 85.6% AUC-ROC, identifying 17 key features (e.g., location type, speed). Alcohol/drugs were less influential.
Conclusion: Provides a scalable, interpretable framework for traffic safety policy, supporting Vision Zero.
Abstract: Motor vehicle crashes remain a leading cause of injury and death worldwide, necessitating data-driven approaches to understand and mitigate crash severity. This study introduces a curated dataset of more than 3 million people involved in accidents in Ohio over six years (2017-2022), aggregated to more than 2.3 million vehicle-level records for predictive analysis. The primary contribution is a transparent and reproducible methodology that combines Automated Machine Learning (AutoML) and explainable artificial intelligence (AI) to identify and interpret key risk factors associated with severe crashes. Using the JADBio AutoML platform, predictive models were constructed to distinguish between severe and non-severe crash outcomes. The models underwent rigorous feature selection across stratified training subsets, and their outputs were interpreted using SHapley Additive exPlanations (SHAP) to quantify the contribution of individual features. A final Ridge Logistic Regression model achieved an AUC-ROC of 85.6% on the training set and 84.9% on a hold-out test set, with 17 features consistently identified as the most influential predictors. Key features spanned demographic, environmental, vehicle, human, and operational categories, including location type, posted speed, minimum occupant age, and pre-crash action. Notably, certain traditionally emphasized factors, such as alcohol or drug impairment, were less influential in the final model compared to environmental and contextual variables. Emphasizing methodological rigor and interpretability over mere predictive performance, this study offers a scalable framework to support Vision Zero with aligned interventions and advanced data-informed traffic safety policy.
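The modeling/interpretation step is conventional enough to sketch end-to-end with public libraries; synthetic stand-in data below (JADBio's feature selection is a commercial platform and is omitted), so this mirrors the pipeline rather than reproducing the study.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the 17 selected crash features (severe vs. non-severe outcome).
X, y = make_classification(n_samples=2000, n_features=17, random_state=0)
model = LogisticRegression(penalty="l2", max_iter=1000).fit(X, y)  # ridge-penalized

explainer = shap.LinearExplainer(model, X)
shap_values = explainer.shap_values(X[:200])
# Mean |SHAP| per feature approximates the global importance ranking reported above.
importance = np.abs(shap_values).mean(axis=0)
```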
[285] Towards Faithful Class-level Self-explainability in Graph Neural Networks by Subgraph Dependencies
Fanzhen Liu, Xiaoxiao Ma, Jian Yang, Alsharif Abuadbba, Kristen Moore, Surya Nepal, Cecile Paris, Quan Z. Sheng, Jia Wu
Main category: cs.LG
TL;DR: GraphOracle is a self-explainable GNN framework for class-level explanations, outperforming prior methods like ProtGNN and PGIB in fidelity, explainability, and scalability.
Details
Motivation: Enhancing the interpretability of GNNs for safe and fair deployment, addressing the gap in evaluating class-level explanations.
Method: Jointly learns a GNN classifier and discriminative subgraphs for each class, using integrated training and masking-based evaluation.
Result: GraphOracle achieves superior performance in fidelity and scalability, while prior methods fail in class-level explanations.
Conclusion: GraphOracle is a practical solution for faithful class-level self-explainability in GNNs, avoiding computational bottlenecks.
Abstract: Enhancing the interpretability of graph neural networks (GNNs) is crucial to ensure their safe and fair deployment. Recent work has introduced self-explainable GNNs that generate explanations as part of training, improving both faithfulness and efficiency. Some of these models, such as ProtGNN and PGIB, learn class-specific prototypes, offering a potential pathway toward class-level explanations. However, their evaluations focus solely on instance-level explanations, leaving open the question of whether these prototypes meaningfully generalize across instances of the same class. In this paper, we introduce GraphOracle, a novel self-explainable GNN framework designed to generate and evaluate class-level explanations for GNNs. Our model jointly learns a GNN classifier and a set of structured, sparse subgraphs that are discriminative for each class. We propose a novel integrated training that captures graph-subgraph-prediction dependencies efficiently and faithfully, validated through a masking-based evaluation strategy. This strategy enables us to retroactively assess whether prior methods like ProtGNN and PGIB deliver effective class-level explanations. Our results show that they do not. In contrast, GraphOracle achieves superior fidelity, explainability, and scalability across a range of graph classification tasks. We further demonstrate that GraphOracle avoids the computational bottlenecks of previous methods, like Monte Carlo Tree Search, by using entropy-regularized subgraph selection and lightweight random walk extraction, enabling faster and more scalable training. These findings position GraphOracle as a practical and principled solution for faithful class-level self-explainability in GNNs.
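The masking-based check is easy to state: a faithful class-level subgraph should be necessary (removing it collapses confidence in the class) and sufficient (keeping only it preserves the prediction). The sketch below assumes a PyG-style `model(x, edge_index)` interface and is our paraphrase, not the authors' evaluation code.

```python
import torch

def class_level_fidelity(model, x, edge_index, edge_mask_c, target_class):
    """edge_mask_c: boolean [E] mask marking the class-c explanation subgraph."""
    with torch.no_grad():
        full = model(x, edge_index).softmax(-1)[:, target_class]
        drop = model(x, edge_index[:, ~edge_mask_c]).softmax(-1)[:, target_class]
        keep = model(x, edge_index[:, edge_mask_c]).softmax(-1)[:, target_class]
    # fidelity+ (necessity): confidence lost when the subgraph is removed;
    # fidelity- (sufficiency): confidence lost when only the subgraph remains.
    return (full - drop).mean().item(), (full - keep).mean().item()
```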
[286] DiCriTest: Testing Scenario Generation for Decision-Making Agents Considering Diversity and Criticality
Qitong Chu, Yufeng Yue, Danya Yao, Huaxin Pei
Main category: cs.LG
TL;DR: A dual-space guided testing framework improves critical scenario generation by balancing diversity and criticality, outperforming existing methods.
Details
Motivation: The need for effective safety verification in dynamic environments due to the growing use of decision-making agents.
Method: A dual-space framework coordinating scenario parameter space and agent behavior space, using hierarchical representation and adaptive mode switching.
Result: Improves critical scenario generation by 56.23% and enhances diversity, outperforming baselines.
Conclusion: The framework effectively addresses the challenge of balancing diversity and criticality in scenario generation.
Abstract: The growing deployment of decision-making agents in dynamic environments increases the demand for safety verification. While critical testing scenario generation has emerged as an appealing verification methodology, effectively balancing diversity and criticality remains a key challenge for existing methods, particularly due to local optima entrapment in high-dimensional scenario spaces. To address this limitation, we propose a dual-space guided testing framework that coordinates scenario parameter space and agent behavior space, aiming to generate testing scenarios considering diversity and criticality. Specifically, in the scenario parameter space, a hierarchical representation framework combines dimensionality reduction and multi-dimensional subspace evaluation to efficiently localize diverse and critical subspaces. This guides dynamic coordination between two generation modes: local perturbation and global exploration, optimizing critical scenario quantity and diversity. Complementarily, in the agent behavior space, agent-environment interaction data are leveraged to quantify behavioral criticality/diversity and adaptively support generation mode switching, forming a closed feedback loop that continuously enhances scenario characterization and exploration within the parameter space. Experiments show our framework improves critical scenario generation by an average of 56.23% and demonstrates greater diversity under novel parameter-behavior co-driven metrics when tested on five decision-making agents, outperforming state-of-the-art baselines.
[287] Finite-Width Neural Tangent Kernels from Feynman Diagrams
Max Guillen, Philipp Misof, Jan E. Gerken
Main category: cs.LG
TL;DR: The paper introduces Feynman diagrams to compute finite-width corrections for Neural Tangent Kernels (NTKs), simplifying algebraic manipulations and enabling layer-wise recursive relations for training dynamics.
Details
Motivation: To address the limitations of infinite-width NTKs, which lack key training properties like NTK evolution and feature learning, by incorporating finite-width effects.
Method: Uses Feynman diagrams to compute finite-width corrections for NTK statistics, enabling recursive relations for preactivations, NTKs, and higher-derivative tensors.
Result: Demonstrates feasibility by extending stability results to NTKs and proving no finite-width corrections for scale-invariant nonlinearities like ReLU. Validated numerically.
Conclusion: The framework provides a practical tool for analyzing finite-width effects in NTKs, with applications in understanding training dynamics and network stability.
Abstract: Neural tangent kernels (NTKs) are a powerful tool for analyzing deep, non-linear neural networks. In the infinite-width limit, NTKs can easily be computed for most common architectures, yielding full analytic control over the training dynamics. However, at infinite width, important properties of training such as NTK evolution or feature learning are absent. Nevertheless, finite width effects can be included by computing corrections to the Gaussian statistics at infinite width. We introduce Feynman diagrams for computing finite-width corrections to NTK statistics. These dramatically simplify the necessary algebraic manipulations and enable the computation of layer-wise recursive relations for arbitrary statistics involving preactivations, NTKs and certain higher-derivative tensors (dNTK and ddNTK) required to predict the training dynamics at leading order. We demonstrate the feasibility of our framework by extending stability results for deep networks from preactivations to NTKs and proving the absence of finite-width corrections for scale-invariant nonlinearities such as ReLU on the diagonal of the Gram matrix of the NTK. We validate our results with numerical experiments.
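The object whose finite-width statistics the diagrams organize is the empirical NTK, Theta(x1, x2) = sum over parameters p of (df(x1)/dp)(df(x2)/dp)^T; numerically it is a Jacobian contraction, sketched here via torch.func for a small ReLU network (architecture and sizes are illustrative).

```python
import torch
from torch.func import functional_call, jacrev, vmap

net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
params = dict(net.named_parameters())

def f(p, x):
    """Single-sample forward pass as a pure function of the parameters."""
    return functional_call(net, p, (x.unsqueeze(0),)).squeeze(0)

def empirical_ntk(x1, x2):
    """Theta[i, o, j, q] = sum over params of d f_o(x1_i)/dp . d f_q(x2_j)/dp."""
    j1 = vmap(jacrev(f), (None, 0))(params, x1)  # per-sample parameter Jacobians
    j2 = vmap(jacrev(f), (None, 0))(params, x2)
    return sum(torch.einsum('iop,jqp->iojq', a.flatten(2), b.flatten(2))
               for a, b in zip(j1.values(), j2.values()))

theta = empirical_ntk(torch.randn(8, 4), torch.randn(8, 4))  # shape [8, 1, 8, 1]
```

Note the choice of ReLU here ties back to the paper's result: for such scale-invariant nonlinearities, the diagonal of the NTK Gram matrix is shown to receive no finite-width corrections.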
[288] Physics-Informed Diffusion Models for Unsupervised Anomaly Detection in Multivariate Time Series
Juhi Soni, Markus Lange-Hegermann, Stefan Windmann
Main category: cs.LG
TL;DR: An unsupervised anomaly detection method using a physics-informed diffusion model for multivariate time series data, improving F1 scores and data diversity.
Details
Motivation: To enhance anomaly detection in time series by incorporating physics-dependent temporal distribution learning via a weighted physics-informed loss.
Method: Uses a weighted physics-informed loss with a static weight schedule during diffusion model training to better approximate the underlying data distribution.
Result: Outperforms baselines and prior physics-informed or data-driven models, showing improved F1 scores, data diversity, and log-likelihood.
Conclusion: The physics-informed diffusion model is effective for unsupervised anomaly detection in time series, surpassing existing methods.
Abstract: We propose an unsupervised anomaly detection approach based on a physics-informed diffusion model for multivariate time series data. Over the past years, diffusion model has demonstrated its effectiveness in forecasting, imputation, generation, and anomaly detection in the time series domain. In this paper, we present a new approach for learning the physics-dependent temporal distribution of multivariate time series data using a weighted physics-informed loss during diffusion model training. A weighted physics-informed loss is constructed using a static weight schedule. This approach enables a diffusion model to accurately approximate underlying data distribution, which can influence the unsupervised anomaly detection performance. Our experiments on synthetic and real-world datasets show that physics-informed training improves the F1 score in anomaly detection; it generates better data diversity and log-likelihood. Our model outperforms baseline approaches, additionally, it surpasses prior physics-informed work and purely data-driven diffusion models on a synthetic dataset and one real-world dataset while remaining competitive on others.
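In schematic form, the training objective is the usual denoising loss plus a statically weighted physics residual. Everything below is a placeholder sketch: the paper plugs in system-specific physics, and its exact weight schedule may differ.

```python
import torch

def example_physics(x):
    """Placeholder governing equation: dx/dt = -0.5 x (exponential decay)."""
    return -0.5 * x

def physics_informed_loss(eps_pred, eps_true, x0_pred, t, T=1000, w_max=0.5):
    """x0_pred: [B, L, D] reconstructed series; t: [B] diffusion timesteps."""
    denoise = (eps_pred - eps_true).pow(2).mean()
    # Finite-difference residual of the governing equation on the reconstruction
    # (unit time step assumed for simplicity).
    residual = (x0_pred[:, 1:] - x0_pred[:, :-1]
                - example_physics(x0_pred[:, :-1])).pow(2).mean()
    w = w_max * (1.0 - t.float() / T).mean()  # static (non-learned) weight schedule
    return denoise + w * residual
```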
[289] A Comprehensive Perspective on Explainable AI across the Machine Learning Workflow
George Paterakis, Andrea Castellani, George Papoutsoglou, Tobias Rodemann, Ioannis Tsamardinos
Main category: cs.LG
TL;DR: The paper introduces Holistic Explainable AI (HXAI), a framework embedding explanations at every stage of the data-analysis workflow, tailored to users, and unifying six components into a taxonomy.
Details
Motivation: Address the opacity of AI models by providing a user-centric, end-to-end explainability framework that covers all stages of the workflow.
Method: Proposes HXAI, a taxonomy unifying six components (data, analysis set-up, learning process, model output, model quality, communication channel) and aligns them with user needs. Includes a 112-item question bank and survey of tools.
Result: Identifies characteristics for clear, actionable explanations and demonstrates how AI agents with large-language models can translate technical details into stakeholder-specific narratives.
Conclusion: HXAI advances transparency, trustworthiness, and responsible AI deployment by integrating multidisciplinary insights and real-world lessons.
Abstract: Artificial intelligence is reshaping science and industry, yet many users still regard its models as opaque “black boxes”. Conventional explainable artificial-intelligence methods clarify individual predictions but overlook the upstream decisions and downstream quality checks that determine whether insights can be trusted. In this work, we present Holistic Explainable Artificial Intelligence (HXAI), a user-centric framework that embeds explanation into every stage of the data-analysis workflow and tailors those explanations to users. HXAI unifies six components (data, analysis set-up, learning process, model output, model quality, communication channel) into a single taxonomy and aligns each component with the needs of domain experts, data analysts and data scientists. A 112-item question bank covers these needs; our survey of contemporary tools highlights critical coverage gaps. Grounded in theories of human explanation, principles from human-computer interaction and findings from empirical user studies, HXAI identifies the characteristics that make explanations clear, actionable and cognitively manageable. A comprehensive taxonomy operationalises these insights, reducing terminological ambiguity and enabling rigorous coverage analysis of existing toolchains. We further demonstrate how AI agents that embed large-language models can orchestrate diverse explanation techniques, translating technical artifacts into stakeholder-specific narratives that bridge the gap between AI developers and domain experts. Departing from traditional surveys or perspective articles, this work melds concepts from multiple disciplines, lessons from real-world projects and a critical synthesis of the literature to advance a novel, end-to-end viewpoint on transparency, trustworthiness and responsible AI deployment.
[290] DFed-SST: Building Semantic- and Structure-aware Topologies for Decentralized Federated Graph Learning
Lianshuai Guo, Zhongzheng Yuan, Xunkai Li, Yinlin Zhu, Meixia Qu, Wenyu Wang
Main category: cs.LG
TL;DR: DFed-SST is a decentralized federated graph learning framework with adaptive communication, addressing the limitations of existing DFL and FGL methods by leveraging local subgraph topology.
Details
Motivation: Existing DFL and FGL methods fail to address local subgraph topology or decentralization benefits, creating a gap in efficient model aggregation for heterogeneous data.
Method: DFed-SST uses a dual-topology adaptive communication mechanism to dynamically optimize inter-client communication based on local subgraph features.
Result: Experiments on eight datasets show DFed-SST outperforms baselines with a 3.26% average accuracy improvement.
Conclusion: DFed-SST bridges the gap in decentralized federated graph learning, offering superior performance by adapting to local topology.
Abstract: Decentralized Federated Learning (DFL) has emerged as a robust distributed paradigm that circumvents the single-point-of-failure and communication bottleneck risks of centralized architectures. However, a significant challenge arises as existing DFL optimization strategies, primarily designed for tasks such as computer vision, fail to address the unique topological information inherent in the local subgraph. Notably, while Federated Graph Learning (FGL) is tailored for graph data, it is predominantly implemented in a centralized server-client model, failing to leverage the benefits of decentralization. To bridge this gap, we propose DFed-SST, a decentralized federated graph learning framework with adaptive communication. The core of our method is a dual-topology adaptive communication mechanism that leverages the unique topological features of each client’s local subgraph to dynamically construct and optimize the inter-client communication topology. This allows our framework to guide model aggregation efficiently in the face of heterogeneity. Extensive experiments on eight real-world datasets consistently demonstrate the superiority of DFed-SST, achieving a 3.26% improvement in average accuracy over baseline methods.
[291] Nested Operator Inference for Adaptive Data-Driven Learning of Reduced-order Models
Nicole Aretz, Karen Willcox
Main category: cs.LG
TL;DR: The paper introduces a nested Operator Inference (OpInf) method for creating reduced-order models (ROMs) from high-dimensional data, improving accuracy and efficiency over standard OpInf.
Details
Motivation: To enhance ROM learning by leveraging hierarchical reduced spaces and enabling dynamic updates, addressing limitations of standard OpInf.
Method: A nested OpInf approach iteratively constructs initial guesses prioritizing dominant modes, warm-starting from learned models for versatility.
Result: Achieves 4x smaller error than standard OpInf in a heat conduction problem and 3% error with 19,000x speed-up in an ice sheet model.
Conclusion: Nested OpInf outperforms standard methods in accuracy and efficiency, enabling dynamic ROM updates for complex systems.
Abstract: This paper presents a data-driven, nested Operator Inference (OpInf) approach for learning physics-informed reduced-order models (ROMs) from snapshot data of high-dimensional dynamical systems. The approach exploits the inherent hierarchy within the reduced space to iteratively construct initial guesses for the OpInf learning problem that prioritize the interactions of the dominant modes. The initial guess computed for any target reduced dimension corresponds to a ROM with provably smaller or equal snapshot reconstruction error than with standard OpInf. Moreover, our nested OpInf algorithm can be warm-started from previously learned models, enabling versatile application scenarios involving dynamic basis and model form updates. We demonstrate the performance of our algorithm on a cubic heat conduction problem, with nested OpInf achieving a four times smaller error than standard OpInf at a comparable offline time. Further, we apply nested OpInf to a large-scale, parameterized model of the Greenland ice sheet where, despite model form approximation errors, it learns a ROM with, on average, 3% error and computational speed-up factor above 19,000.
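For context, a compact sketch of standard (non-nested) Operator Inference on illustrative random snapshots; the nested variant additionally warm-starts this least-squares fit from lower-dimensional solutions. Dimensions and `dt` are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 201))   # snapshot matrix: state dim x time steps
dt = 0.01
U, _, _ = np.linalg.svd(X, full_matrices=False)
V = U[:, :10]                          # POD basis, reduced dimension r = 10
Xh = V.T @ X                           # reduced trajectories, shape (r, nt)
Xdot = np.gradient(Xh, dt, axis=1)     # approximate reduced time derivatives

# Fit xdot ~ A x + H (x kron x) by linear least squares on the reduced data.
quad = np.einsum('it,jt->ijt', Xh, Xh).reshape(100, -1)
D = np.vstack([Xh, quad])              # data matrix, shape (r + r^2, nt)
O = Xdot @ np.linalg.pinv(D)           # stacked operators [A, H]
A, H = O[:, :10], O[:, 10:]
```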
[292] SeamlessFlow: A Trainer Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling
Jinghui Wang, Shaojie Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Xiaojiang Zhang, Minglei Zhang, Jiarong Zhang, Wenhao Zhuang, Yuchen Cao, Wankang Bao, Haimo Li, Zheng Lin, Huiming Wang, Haoyang Huang, Zongxian Feng, Zizheng Zhan, Ken Deng, Wen Xiang, Huaixi Tang, Kun Wu, Mengtong Li, Mengfei Xie, Junyi Peng, Haotian Zhang, Bin Chen, Bing Yu
Main category: cs.LG
TL;DR: SeamlessFlow is a server-based RL framework addressing industrial-scale challenges by decoupling training from agent execution and optimizing GPU utilization.
Details
Motivation: To tackle challenges in industrial-scale RL, such as complex agent execution flows and inefficient GPU usage, while ensuring stability and scalability.
Method: Introduces a data plane for decoupling RL training from agents and a tag-driven scheduling paradigm for efficient resource utilization.
Result: Achieves high throughput, stability, and scalability, making it suitable for complex RL tasks like multi-agent and long-horizon scenarios.
Conclusion: SeamlessFlow effectively balances performance and stability, proving ideal for large-scale RL deployments.
Abstract: We introduce SeamlessFlow, a server-based reinforcement learning (RL) framework that addresses two core challenges in industrial-scale RL: (1) decoupling RL training from the complex execution flow of agents; (2) maximizing GPU utilization with minimal idle time while preserving the stability and scalability required for large-scale deployments. First, SeamlessFlow introduces a data plane that decouples the RL trainer from diverse, complex agent implementations while sustaining high throughput. A central trajectory manager maintains complete interaction histories and supports partial rollout, allowing rollout to pause for weight updates and resume seamlessly, keeping agents unaware of service interruptions. Second, we propose a tag-driven scheduling paradigm that abstracts hardware into capability-tagged resources, unifying colocated and disaggregated architectures. Based on this, SeamlessFlow introduces a spatiotemporal multiplexing pipeline that dynamically reassigns idle training nodes to rollout in a train-rollout separated setup, eliminating pipeline bubbles and fully exploiting heterogeneous cluster resources. By combining these innovations, SeamlessFlow delivers both stability and high performance, making it well suited for multi-agent, long-horizon, and other complex RL tasks.
[293] Optimal CO2 storage management considering safety constraints in multi-stakeholder multi-site CCS projects: a game theoretic perspective
Jungang Chen, Seyyed A. Hosseini
Main category: cs.LG
TL;DR: The paper explores CCS stakeholder dynamics using Markov games and multi-agent reinforcement learning to optimize CO2 storage management.
Details
Motivation: CCS projects involve diverse stakeholders with conflicting interests, requiring collaborative solutions for effective management.
Method: A Markov game framework and multi-agent reinforcement learning with safety constraints are used to model stakeholder interactions.
Result: The framework effectively manages CO2 storage with multiple stakeholders, leveraging surrogate models for computational efficiency.
Conclusion: Collaborative coalition structures are essential for optimal CCS project management, as demonstrated by the proposed framework.
Abstract: Carbon capture and storage (CCS) projects typically involve a diverse array of stakeholders or players from public, private, and regulatory sectors, each with different objectives and responsibilities. Given the complexity, scale, and long-term nature of CCS operations, determining whether individual stakeholders can independently maximize their interests or whether collaborative coalition agreements are needed remains a central question for effective CCS project planning and management. CCS projects are often implemented in geologically connected sites, where shared geological features such as pressure space and reservoir pore capacity can lead to competitive behavior among stakeholders. Furthermore, CO2 storage sites are often located in geologically mature basins that previously served as sites for hydrocarbon extraction or wastewater disposal in order to leverage existing infrastructures, which makes unilateral optimization even more complicated and unrealistic. In this work, we propose a paradigm based on Markov games to quantitatively investigate how different coalition structures affect the goals of stakeholders. We frame this multi-stakeholder multi-site problem as a multi-agent reinforcement learning problem with safety constraints. Our approach enables agents to learn optimal strategies while compliant with safety regulations. We present an example where multiple operators are injecting CO2 into their respective project areas in a geologically connected basin. To address the high computational cost of repeated simulations of high-fidelity models, a previously developed surrogate model based on the Embed-to-Control (E2C) framework is employed. Our results demonstrate the effectiveness of the proposed framework in addressing optimal management of CO2 storage when multiple stakeholders with various objectives and goals are involved.
[294] Uncertainty-Aware Adaptation of Large Language Models for Protein-Protein Interaction Analysis
Sanket Jantre, Tianle Wang, Gilchan Park, Kriti Chopra, Nicholas Jeon, Xiaoning Qian, Nathan M. Urban, Byung-Jun Yoon
Main category: cs.LG
TL;DR: The paper introduces an uncertainty-aware LLM adaptation for protein-protein interaction (PPI) analysis, using fine-tuned LLaMA-3 and BioMedGPT models with LoRA ensembles and Bayesian LoRA for reliability.
Details
Motivation: PPIs are crucial for understanding complex diseases, but LLM predictions often lack reproducibility due to uncertainty. This study aims to address this gap.
Method: Fine-tuned LLaMA-3 and BioMedGPT models, integrated with LoRA ensembles and Bayesian LoRA for uncertainty quantification (UQ).
Result: Competitive PPI identification performance across disease contexts, with improved reliability and reproducibility.
Conclusion: Uncertainty-aware LLM adaptation enhances trustworthiness in computational biology, supporting precision medicine and biomedical research.
Abstract: Identification of protein-protein interactions (PPIs) helps derive cellular mechanistic understanding, particularly in the context of complex conditions such as neurodegenerative disorders, metabolic syndromes, and cancer. Large Language Models (LLMs) have demonstrated remarkable potential in predicting protein structures and interactions via automated mining of vast biomedical literature; yet their inherent uncertainty remains a key challenge for deriving reproducible findings, critical for biomedical applications. In this study, we present an uncertainty-aware adaptation of LLMs for PPI analysis, leveraging fine-tuned LLaMA-3 and BioMedGPT models. To enhance prediction reliability, we integrate LoRA ensembles and Bayesian LoRA models for uncertainty quantification (UQ), ensuring confidence-calibrated insights into protein behavior. Our approach achieves competitive performance in PPI identification across diverse disease contexts while addressing model uncertainty, thereby enhancing trustworthiness and reproducibility in computational biology. These findings underscore the potential of uncertainty-aware LLM adaptation for advancing precision medicine and biomedical research.
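A small sketch of the ensemble half of this recipe; the list-of-adapted-models setup and HuggingFace-style `.logits` interface are assumptions of the illustration, not the paper's code:

```python
import torch

@torch.no_grad()
def lora_ensemble_predict(models, batch):
    """Average class probabilities over several LoRA-adapted copies of the
    same backbone; their disagreement serves as an uncertainty proxy."""
    probs = torch.stack([m(**batch).logits.softmax(dim=-1) for m in models])
    return probs.mean(dim=0), probs.std(dim=0)  # prediction, uncertainty
```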
[295] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs
Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Main category: cs.LG
TL;DR: Omni-DPO improves Direct Preference Optimization (DPO) by adaptively weighting preference pairs based on quality and learning dynamics, outperforming baselines in textual and mathematical tasks.
Details
Motivation: Existing DPO methods treat all preference pairs uniformly, ignoring quality and learning utility, leading to suboptimal performance.
Method: Omni-DPO introduces a dual-perspective framework weighting samples by inherent quality and model performance during training.
Result: Omni-DPO outperforms baselines, notably beating Claude 3 Opus by 6.7 points on Arena-Hard and excelling in mathematical reasoning.
Conclusion: Omni-DPO enhances data utilization and performance, demonstrating effectiveness and robustness across tasks.
Abstract: Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model’s evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model’s learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
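A hedged sketch of the dual-perspective idea: a standard DPO objective whose per-pair losses are scaled by a data-quality weight and a learning-dynamics weight. Omni-DPO's actual weighting functions differ; the weight arguments here are placeholders:

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                      quality_w, dynamics_w, beta=0.1):
    """DPO margin loss with per-pair reweighting. Inputs are summed log-probs
    of chosen (w) and rejected (l) responses under policy and reference."""
    margins = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    per_pair = -F.logsigmoid(margins)                   # vanilla DPO loss per pair
    return (quality_w * dynamics_w * per_pair).mean()   # dual-perspective weights
```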
[296] Recent Advances in Generative AI for Healthcare Applications
Yasin Shokrollahi, Jose Colmenarez, Wenxi Liu, Sahar Yarmohammadtoosky, Matthew M. Nikahd, Pengfei Dong, Xianqi Li, Linxia Gu
Main category: cs.LG
TL;DR: The paper reviews generative AI’s impact on healthcare, focusing on diffusion and transformer models, their applications, limitations, and future research directions.
Details
Motivation: To synthesize recent advances in healthcare applications of generative AI and highlight its transformative potential.
Method: A comprehensive review of generative AI technologies, particularly diffusion and transformer models, in healthcare applications.
Result: Generative AI has significantly improved medical imaging, diagnostics, drug design, and clinical workflows, but challenges remain.
Conclusion: The paper serves as a reference for researchers and practitioners, outlining the state of the art and future opportunities in generative AI for healthcare.
Abstract: The rapid advancement of Artificial Intelligence (AI) has catalyzed revolutionary changes across various sectors, notably in healthcare. In particular, generative AI, led by diffusion models and transformer architectures, has enabled significant breakthroughs in medical imaging (including image reconstruction, image-to-image translation, generation, and classification), protein structure prediction, clinical documentation, diagnostic assistance, radiology interpretation, clinical decision support, medical coding, and billing, as well as drug design and molecular representation. These innovations have enhanced clinical diagnosis, data reconstruction, and drug synthesis. This review paper aims to offer a comprehensive synthesis of recent advances in healthcare applications of generative AI, with an emphasis on diffusion and transformer models. Moreover, we discuss current capabilities, highlight existing limitations, and outline promising research directions to address emerging challenges. Serving as both a reference for researchers and a guide for practitioners, this work offers an integrated view of the state of the art, its impact on healthcare, and its future potential.
[297] Exploring Superior Function Calls via Reinforcement Learning
Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, Chenyi Zhuang
Main category: cs.LG
TL;DR: A novel reinforcement learning framework improves function calling in LLMs by addressing exploration, reasoning, and parameter verification challenges, achieving 86.02% accuracy.
Details
Motivation: Current training methods for LLMs in function calling lack robust reasoning and rely on superficial patterns, limiting real-world deployment.
Method: A two-stage data preparation pipeline with iterative LLM evaluation and AST validation, combined with strategic entropy-based exploration in reinforcement learning.
Result: Achieves 86.02% accuracy on the Berkeley Function Calling Leaderboard, outperforming standard GRPO by 6% in complex scenarios.
Conclusion: The framework enhances function calling performance, especially for code-pretrained models, and will be released to the community.
Abstract: Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance Group Relative Policy Optimization (GRPO) through strategic entropy-based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all code, models, and datasets to benefit the community.
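A loose sketch of what an entropy-augmented, group-relative policy loss can look like; shapes, normalisation, and coefficients are illustrative rather than the paper's exact recipe:

```python
import torch

def grpo_entropy_loss(logprobs, rewards, entropy, ent_coef=0.01):
    """Group-relative advantages (rewards normalised within the sampled
    group) plus an entropy bonus that keeps exploration alive."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    policy_loss = -(logprobs * adv.detach()).mean()
    return policy_loss - ent_coef * entropy.mean()
```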
[298] JMA: a General Algorithm to Craft Nearly Optimal Targeted Adversarial Example
Benedetta Tondi, Wei Guo, Niccolò Pancino, Mauro Barni
Main category: cs.LG
TL;DR: The paper proposes JMA, a theoretically sound targeted attack on deep learning classifiers, optimizing a Jacobian-induced Mahalanobis distance for efficiency and generality across output encodings.
Details
Motivation: Existing targeted adversarial attacks are suboptimal and limited to one-hot encoding, lacking generality and efficiency.
Method: JMA minimizes a Jacobian-induced Mahalanobis distance using Wolfe duality, solving a Non-Negative Least Squares problem.
Result: JMA is effective across various output encodings and multi-label scenarios, outperforming existing attacks.
Conclusion: JMA provides an optimal, efficient, and general solution for crafting targeted adversarial examples.
Abstract: Most of the approaches proposed so far to craft targeted adversarial examples against Deep Learning classifiers are highly suboptimal and typically rely on increasing the likelihood of the target class, thus implicitly focusing on one-hot encoding settings. In this paper, a more general, theoretically sound, targeted attack is proposed, which resorts to the minimization of a Jacobian-induced Mahalanobis distance term, taking into account the effort (in the input space) required to move the latent space representation of the input sample in a given direction. The minimization is solved by exploiting the Wolfe duality theorem, reducing the problem to the solution of a Non-Negative Least Squares (NNLS) problem. The proposed algorithm (referred to as JMA) provides an optimal solution to a linearised version of the adversarial example problem originally introduced by Szegedy et al. The results of the experiments confirm the generality of the proposed attack, which is proven to be effective under a wide variety of output encoding schemes. Notably, JMA is also effective in a multi-label classification scenario, capable of inducing a targeted modification of up to half the labels in complex multi-label classification scenarios, a capability that is out of reach of all the attacks proposed so far. As a further advantage, JMA requires very few iterations, making it more efficient than existing methods.
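A loose numerical illustration (not the paper's exact derivation) of the reduction the abstract names: after linearisation, the dual problem becomes a non-negative least-squares solve over the Jacobian's Gram matrix. The Jacobian and target direction below are random stand-ins:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
J = rng.standard_normal((10, 784))   # Jacobian of latent outputs w.r.t. input
delta_z = rng.standard_normal(10)    # desired move in latent space
G = J @ J.T                          # Jacobian-induced (Mahalanobis) metric

lam, _ = nnls(G, delta_z)            # non-negative dual variables
delta_x = J.T @ lam                  # candidate input perturbation
```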
[299] SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression
Mohammad Mozaffari, Amir Yazdanbakhsh, Maryam Mehri Dehnavi
Main category: cs.LG
TL;DR: SLIM is a one-shot compression framework combining quantization, sparsity, and low-rank approximation, outperforming prior methods in accuracy and efficiency.
Details
Motivation: Addressing the trade-off between computational cost and accuracy in LLM compression, SLIM eliminates retraining while improving performance.
Method: SLIM integrates probabilistic quantization (SLIM-Quant), semi-structured sparsity, and low-rank adapters with a novel saliency function.
Result: SLIM improves accuracy by up to 5.66% (LLaMA-2-7B) and achieves up to 4.3x speedup on GPUs, with 0.23x memory reduction.
Conclusion: SLIM offers a unified, efficient compression solution for LLMs, with optional PEFT for further accuracy gains.
Abstract: Conventional model compression techniques for LLMs address high memory consumption and slow inference challenges but typically require computationally expensive retraining to preserve accuracy. In contrast, one-shot compression methods eliminate retraining cost, but struggle to achieve accuracy comparable to dense models. This paper presents SLIM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process. First, we formulate the quantization process using a probabilistic approach (SLIM-Quant) that enables us to apply uniform quantization. Then, we use an existing one-shot pruning method to apply semi-structured sparsity on top of the quantized weights. Finally, to compensate for the introduced aggregated quantization and sparsity error, we use a novel saliency function with unique invertible and additive features that enables us to mathematically compute the value of low-rank adapters. SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. Models compressed with SLIM achieve up to 4.3x and 3.8x speedups on Nvidia RTX3060 and A100 GPUs, respectively. Additionally, they achieve up to 0.23x end-to-end memory reduction in comparison to their dense counterparts. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLIM without fine-tuning.
[300] Data Diversity as Implicit Regularization: How Does Diversity Shape the Weight Space of Deep Neural Networks?
Yang Ba, Michelle V. Mancenido, Rong Pan
Main category: cs.LG
TL;DR: The paper explores how data diversity impacts deep neural networks’ weight space, revealing similarities with dropout and proposing a metric to compare traditional and synthetic data augmentation benefits.
Details
Motivation: To understand the mechanism by which diverse training data improves model robustness and generalization, particularly in relation to other regularization techniques like dropout and weight decay.
Method: Uses Random Matrix Theory for spectral analysis of weight spaces in models trained with data augmentation, dropout, and weight decay.
Result: Increasing data diversity alters weight spectral distribution similarly to dropout, more than weight decay. A metric is proposed to compare traditional and synthetic data augmentation benefits.
Conclusion: Data diversity’s impact on weight space resembles dropout, and the proposed metric helps evaluate augmentation methods.
Abstract: Data augmentation that introduces diversity into the input data has long been used in training deep learning models. It has demonstrated benefits in improving robustness and generalization, practically aligning well with other regularization strategies such as dropout and weight decay. However, the underlying mechanism of how diverse training data contributes to model improvements remains unknown. In this paper, we investigate the impact of data diversity on the weight space of deep neural networks using Random Matrix Theory. Through spectral analysis and comparing models trained with data augmentation, dropout, and weight decay, we reveal that increasing data diversity alters the weight spectral distribution similarly to other regularization techniques, while displaying a pattern more closely aligned with dropout than with weight decay. Building on these insights, we propose a metric to explain and compare the benefits of diversity introduced by traditional data augmentations and those achieved through synthetic data.
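A minimal example of the kind of spectral diagnostic used in this line of work: the eigenvalue spectrum of a layer's correlation matrix, whose bulk shape and outliers are compared across regularization regimes. The random matrix below stands in for trained weights:

```python
import numpy as np

def weight_spectrum(W):
    """Eigenvalues of the correlation matrix W^T W / n; Random Matrix Theory
    compares their bulk (e.g. Marchenko-Pastur) and outliers across runs."""
    return np.linalg.eigvalsh(W.T @ W / W.shape[0])

W = np.random.randn(512, 256) / np.sqrt(512)  # stand-in for a trained layer
print(weight_spectrum(W)[-5:])                # largest eigenvalues (outliers)
```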
[301] Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping
Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan “Honza” Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko
Main category: cs.LG
TL;DR: The paper introduces the first benchmark for federated learning (FL) with differential privacy (DP) in automatic speech recognition (ASR), addressing challenges like gradient heterogeneity in large transformer models. It proposes per-layer clipping and layer-wise gradient normalization, achieving strong privacy guarantees with minimal performance drop.
Details
Motivation: The gap in applying FL with DP to ASR due to challenges in training large models, particularly gradient heterogeneity, motivates this work. No prior work provides a competitive solution for this context.
Method: The approach involves per-layer clipping and layer-wise gradient normalization to mitigate clipping bias and gradient heterogeneity, supported by theoretical analysis.
Result: Empirical results show FL with DP is viable under strong privacy guarantees (e.g., (7.2, $10^{-9}$)-DP) with only a 1.3% drop in word error rate for high population scales.
Conclusion: The work establishes a practical FL with DP benchmark for ASR, with broader implications for privacy-preserving FL in large models across domains.
Abstract: While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge, no existing work establishes a competitive, practical recipe for FL with DP in the context of ASR. To address this gap, we establish \textbf{the first benchmark for FL with DP in end-to-end ASR}. Our approach centers on per-layer clipping and layer-wise gradient normalization: theoretical analysis reveals that these techniques together mitigate clipping bias and gradient heterogeneity across layers in deeper models. Consistent with these theoretical insights, our empirical results show that FL with DP is viable under strong privacy guarantees, provided a population of at least several million users. Specifically, we achieve user-level (7.2, $10^{-9}$)-DP (resp. (4.5, $10^{-9}$)-DP) with only a 1.3% (resp. 4.6%) absolute drop in word error rate when extrapolating to high (resp. low) population scales for FL with DP in ASR. Although our experiments focus on ASR, the underlying principles we uncover - particularly those concerning gradient heterogeneity and layer-wise gradient normalization - offer broader guidance for designing scalable, privacy-preserving FL algorithms for large models across domains. Code of all experiments and benchmarks is available at https://github.com/apple/ml-pfl4asr.
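A compact sketch of the layer-wise treatment described above, with per-sample accounting and federated aggregation omitted for brevity; clip and noise levels are illustrative:

```python
import torch

def per_layer_clip_and_noise(model, clip=1.0, sigma=0.5):
    """Clip each layer's gradient against its own norm bound, then add
    Gaussian noise scaled to that bound, rather than clipping one
    concatenated global gradient."""
    for p in model.parameters():
        if p.grad is None:
            continue
        scale = (clip / (p.grad.norm() + 1e-12)).clamp(max=1.0)
        p.grad.mul_(scale)                                         # layer-wise clip
        p.grad.add_(torch.randn_like(p.grad), alpha=sigma * clip)  # DP noise
```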
[302] Language-Based Bayesian Optimization Research Assistant (BORA)
Abdoulatif Cissé, Xenophon Evangelopoulos, Vladimir V. Gusev, Andrew I. Cooper
Main category: cs.LG
TL;DR: A hybrid framework combining Large Language Models (LLMs) and Bayesian optimization (BO) is proposed to enhance high-dimensional, non-convex optimization by leveraging domain knowledge and reducing human bias.
Details
Motivation: Addressing the challenges of high-dimensional, non-convex optimization landscapes and human confirmation bias in scientific experiments.
Method: Uses LLMs to contextualize BO, blending stochastic inference with domain knowledge for better search space exploration.
Result: Validated on synthetic benchmarks (up to 15 variables) and real-world tasks, showing improved optimization performance.
Conclusion: The hybrid LLM-BO framework effectively enhances optimization by integrating domain knowledge and reducing bias, validated in practical applications.
Abstract: Many important scientific problems involve multivariate optimization coupled with slow and laborious experimental measurements. These complex, high-dimensional searches can be defined by non-convex optimization landscapes that resemble needle-in-a-haystack surfaces, leading to entrapment in local minima. Contextualizing optimizers with human domain knowledge is a powerful approach to guide searches to localized fruitful regions. However, this approach is susceptible to human confirmation bias and it is also challenging for domain experts to keep track of the rapidly expanding scientific literature. Here, we propose the use of Large Language Models (LLMs) for contextualizing Bayesian optimization (BO) via a hybrid optimization framework that intelligently and economically blends stochastic inference with domain knowledge-based insights from the LLM, which is used to suggest new, better-performing areas of the search space for exploration. Our method fosters user engagement by offering real-time commentary on the optimization progress, explaining the reasoning behind the search strategies. We validate the effectiveness of our approach on synthetic benchmarks with up to 15 independent variables and demonstrate the ability of LLMs to reason in four real-world experimental tasks where context-aware suggestions boost optimization performance substantially.
[303] Discovering Invariant Neighborhood Patterns for Heterophilic Graphs
Jinluan Yang, Ruihao Zhang, Zhengyu Chen, Teng Xiao, Yueyang Wang, Fei Wu, Kun Kuang
Main category: cs.LG
TL;DR: The paper introduces INPL to address distribution shifts in non-homophilous graphs, using ANP and INHGL modules for adaptive neighborhood learning and invariant representation.
Details
Motivation: Existing graph neural networks assume homophily, which fails in real-world non-homophilous graphs, causing complex distribution shifts.
Method: Proposes INPL with ANP for adaptive neighborhood learning and INHGL for invariant representation.
Result: INPL achieves state-of-the-art performance on real-world non-homophilous graphs.
Conclusion: INPL effectively addresses distribution shifts in non-homophilous graphs.
Abstract: This paper studies the problem of distribution shifts on non-homophilous graphs. Most existing graph neural network methods rely on the homophilous assumption that nodes from the same class are more likely to be linked. However, such assumptions of homophily do not always hold in real-world graphs, which leads to more complex distribution shifts unaccounted for in previous methods. The distribution shifts of neighborhood patterns are much more diverse on non-homophilous graphs. We propose a novel Invariant Neighborhood Pattern Learning (INPL) approach to alleviate the distribution shifts problem on non-homophilous graphs. Specifically, we propose the Adaptive Neighborhood Propagation (ANP) module to capture the adaptive neighborhood information, which could alleviate the neighborhood pattern distribution shifts problem on non-homophilous graphs. We propose an Invariant Non-Homophilous Graph Learning (INHGL) module to constrain the ANP and learn invariant graph representation on non-homophilous graphs. Extensive experimental results on real-world non-homophilous graphs show that INPL could achieve state-of-the-art performance for learning on large non-homophilous graphs.
[304] A Spectral Framework for Evaluating Geodesic Distances Between Graphs
Soumen Sikder Shuvo, Ali Aghdaei, Zhuo Feng
Main category: cs.LG
TL;DR: The paper introduces Graph Geodesic Distance (GGD), a spectral framework for quantifying graph dissimilarities, outperforming existing metrics like TMD in graph classification and extending to GNN stability and dataset distance analysis.
Details
Motivation: To address the challenge of quantifying differences between graph data samples, especially when node features are incomplete, by leveraging spectral properties.
Method: Uses spectral graph matching for node correspondence, solves a generalized eigenvalue problem for geodesic distance, and employs spectral graph coarsening for size mismatch.
Result: GGD effectively captures structural differences, outperforms TMD in classification, and proves versatile in GNN stability and dataset distance tasks.
Conclusion: GGD is a robust, versatile metric for graph comparison, with applications extending beyond classification to broader machine learning problems.
Abstract: This paper presents a spectral framework for quantifying the differentiation between graph data samples by introducing a novel metric named Graph Geodesic Distance (GGD). For two different graphs with the same number of nodes, our framework leverages a spectral graph matching procedure to find node correspondence so that the geodesic distance between them can be subsequently computed by solving a generalized eigenvalue problem associated with their Laplacian matrices. For graphs of different sizes, a resistance-based spectral graph coarsening scheme is introduced to reduce the size of the larger graph while preserving the original spectral properties. We show that the proposed GGD metric can effectively quantify dissimilarities between two graphs by encapsulating their differences in key structural (spectral) properties, such as effective resistances between nodes, cuts, and the mixing time of random walks. Through extensive experiments comparing with state-of-the-art metrics, such as the latest Tree-Mover’s Distance (TMD), the proposed GGD metric demonstrates significantly improved performance for graph classification, particularly when only partial node features are available. Furthermore, we extend the application of GGD beyond graph classification to stability analysis of GNNs and the quantification of distances between datasets, highlighting its versatility in broader machine learning contexts.
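A small sketch of the distance computation once node correspondence is fixed: solve the generalized eigenproblem over the two Laplacians and summarise the spectrum. The summary used here is illustrative, not the paper's exact GGD:

```python
import numpy as np
from scipy.linalg import eigh

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(0)
A1 = np.triu((rng.random((30, 30)) < 0.2).astype(float), 1); A1 += A1.T
A2 = np.triu((rng.random((30, 30)) < 0.2).astype(float), 1); A2 += A2.T

eps = 1e-6 * np.eye(30)   # small ridge keeps the singular Laplacians definite
w = eigh(laplacian(A1) + eps, laplacian(A2) + eps, eigvals_only=True)
print(np.abs(np.log(w)).sum())  # illustrative spectral dissimilarity summary
```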
[305] Incorporating Arbitrary Matrix Group Equivariance into KANs
Lexiang Hu, Yisen Wang, Zhouchen Lin
Main category: cs.LG
TL;DR: EKAN enhances KANs by incorporating matrix group equivariance, improving performance on symmetry-related tasks with fewer parameters.
Details
Motivation: Spline functions in KANs may not respect symmetry, a key prior in ML. EKAN addresses this to broaden KANs' applicability.
Method: EKAN uses gated spline basis functions and equivariant linear weights, with a lift layer to align input and feature spaces.
Result: EKAN outperforms baselines in symmetry tasks (e.g., particle scattering, three-body problem) with fewer parameters, reducing test MSE significantly.
Conclusion: EKAN extends KANs’ utility to symmetry-critical domains, achieving state-of-the-art results efficiently.
Abstract: Kolmogorov-Arnold Networks (KANs) have seen great success in scientific domains thanks to spline activation functions, becoming an alternative to Multi-Layer Perceptrons (MLPs). However, spline functions may not respect symmetry in tasks, which is crucial prior knowledge in machine learning. In this paper, we propose Equivariant Kolmogorov-Arnold Networks (EKAN), a method for incorporating arbitrary matrix group equivariance into KANs, aiming to broaden their applicability to more fields. We first construct gated spline basis functions, which form the EKAN layer together with equivariant linear weights, and then define a lift layer to align the input space of EKAN with the feature space of the dataset, thereby building the entire EKAN architecture. Compared with baseline models, EKAN achieves higher accuracy with smaller datasets or fewer parameters on symmetry-related tasks, such as particle scattering and the three-body problem, often reducing test MSE by several orders of magnitude. Even in non-symbolic formula scenarios, such as top quark tagging with three jet constituents, EKAN achieves comparable results with state-of-the-art equivariant architectures using fewer than 40% of the parameters, while KANs do not outperform MLPs as expected. Code and data are available at https://github.com/hulx2002/EKAN .
[306] Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks
Edan Kinderman, Itay Hubara, Haggai Maron, Daniel Soudry
Main category: cs.LG
TL;DR: FS-Merge is a novel method for merging large transformers trained on different tasks, outperforming traditional merging and KD in efficiency and performance.
Details
Motivation: Addressing the challenge of merging large transformers trained on different tasks from distinct initializations, where traditional methods fail and KD is inefficient.
Method: Introduces FS-Merge, which trains a SuperNet with frozen original models using feature reconstruction, then folds it back to a single model size.
Result: FS-Merge achieves SOTA results on MLPs and transformers across tasks, modalities, and low-data scenarios.
Conclusion: FS-Merge is a simple, data-efficient, and high-performing solution for merging large transformers, especially in low-data settings.
Abstract: Recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks into a single multi-task model. While most works focus on the simpler setup of merging NNs initialized from a common pre-trained network, we target the harder problem of merging large transformers trained on different tasks from distinct initializations. We show that traditional merging methods fail catastrophically in this setup, while Knowledge Distillation (KD) achieves much better results, though at a higher cost. However, KD is data-inefficient, as it does not exploit the original models’ weights. To solve this, we introduce “Foldable SuperNet Merge” (FS-Merge), which trains a SuperNet containing the original models (with frozen weights) using a feature reconstruction objective. After training, the SuperNet is folded back to the size of a single original model. FS-Merge is simple, data-efficient, has a computational cost comparable to KD, and is proven to have superior expressiveness compared to traditional merging methods on MLP models. It achieves SOTA results when tested on MLPs and transformers across various sizes, tasks, modalities, and distribution shifts, especially in low-data scenarios.
[307] Perfect Counterfactuals in Imperfect Worlds: Modelling Noisy Implementation of Actions in Sequential Algorithmic Recourse
Yueqing Xuan, Kacper Sokol, Mark Sanderson, Jeffrey Chan
Main category: cs.LG
TL;DR: The paper introduces ROSE, a robust sequential recourse generator for tabular data, addressing imperfect implementation of recourse steps due to noise and variability.
Details
Motivation: Existing recourse methods assume single-step implementation and uniform noise, which is unrealistic. Recourse often involves sequential steps, increasing noise and implementation difficulty.
Method: The problem is framed as a Markov Decision Process (MDP) with noise adhering to local data geometry. ROSE generates sequential recourse steps robust to accumulated noise.
Result: ROSE balances robustness and cost, ensuring high success rates for users while maintaining low recourse cost, sparsity, and computational efficiency.
Conclusion: ROSE effectively models realistic noise and sequential recourse, outperforming existing methods in robustness and practicality.
Abstract: Algorithmic recourse suggests actions to individuals who have been adversely affected by automated decision-making, helping them to achieve the desired outcome. Knowing the recourse, however, does not guarantee that users can implement it perfectly, either due to environmental variability or personal choices. Recourse generation should thus anticipate its sub-optimal or noisy implementation. While several approaches construct recourse that is robust to small perturbations – e.g., arising due to its noisy implementation – they assume that the entire recourse is implemented in a single step, thus model the noise as one-off and uniform. But these assumptions are unrealistic since recourse often entails multiple sequential steps, which makes it harder to implement and subject to increasing noise. In this work, we consider recourse under plausible noise that adheres to the local data geometry and accumulates at every step of the way. We frame this problem as a Markov Decision Process and demonstrate that such a distribution of plausible noise satisfies the Markov property. We then propose the RObust SEquential (ROSE) recourse generator for tabular data; our method produces a series of steps leading to the desired outcome even when they are implemented imperfectly. Given plausible modelling of sub-optimal human actions and greater recourse robustness to accumulated uncertainty, ROSE provides users with a high chance of success while maintaining low recourse cost. Empirical evaluation shows that our algorithm effectively navigates the inherent trade-off between recourse robustness and cost while ensuring its sparsity and computational efficiency.
[308] Embedding Safety into RL: A New Take on Trust Region Methods
Nikola Milosevic, Johannes Müller, Nico Scherf
Main category: cs.LG
TL;DR: C-TRPO ensures safe RL training by reshaping policy space, reducing violations while maintaining performance.
Details
Motivation: Existing RL methods either compromise reward or safety; C-TRPO aims to balance both.
Method: Introduces C-TRPO, reshaping policy space to ensure safety, with theoretical analysis and comparisons to TRPO, NPG, and CPO.
Result: Experiments show reduced constraint violations without sacrificing returns.
Conclusion: C-TRPO effectively balances safety and performance in RL training.
Abstract: Reinforcement Learning (RL) agents can solve diverse tasks but often exhibit unsafe behavior. Constrained Markov Decision Processes (CMDPs) address this by enforcing safety constraints, yet existing methods either sacrifice reward maximization or allow unsafe training. We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space geometry to ensure trust regions contain only safe policies, guaranteeing constraint satisfaction throughout training. We analyze its theoretical properties and connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
[309] Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth
Xinyu Yuan, Yan Qiao, Meng Li, Zhenchun Wei, Cuiying Feng, Zonghui Wang, Wenzhi Chen
Main category: cs.LG
TL;DR: UCL-sketch is a learning-based method for frequency estimation in data streams, offering high accuracy and speed without needing ground truth or offline training.
Details
Motivation: Traditional sketches provide only coarse estimates, and learning-augmented methods often require offline training or labels, limiting real-time applicability. UCL-sketch addresses these gaps.
Method: UCL-sketch uses online training without ground truth and a scalable architecture with compressive sensing, ensuring fast updates and low error bounds.
Result: UCL-sketch outperforms prior methods in accuracy and speed, nearly matching an omniscient oracle under tight memory constraints and achieving 500x faster decoding.
Conclusion: UCL-sketch is a practical, high-performance solution for real-time frequency estimation, with publicly available code for further research.
Abstract: Estimating the frequency of items on high-volume, fast data streams has been extensively studied in many areas, such as database and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies and/or labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, limiting their suitability for real-time processing despite offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture leveraging logically structured estimation buckets to scale to real-world data streams. The UCL-sketch, which utilizes compressive sensing (CS), converges to an estimator that provably yields an error bound far lower than that of prior works, without sacrificing the speed of processing. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches regarding per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To help further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.
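For context, the classical count-min sketch below is the kind of hashed-bucket estimator that UCL-sketch's learned, equation-based decoding improves upon:

```python
import numpy as np

class CountMinSketch:
    """Hashed counter buckets giving a coarse overestimate of per-key counts."""
    def __init__(self, depth=4, width=1024, seed=0):
        self.salts = np.random.default_rng(seed).integers(1, 2**31, size=depth)
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.width = width

    def _cols(self, key):
        return [hash((int(s), key)) % self.width for s in self.salts]

    def update(self, key, count=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row, col] += count

    def query(self, key):
        return min(self.table[row, col] for row, col in enumerate(self._cols(key)))

cms = CountMinSketch()
for k in "aaabbc":
    cms.update(k)
print(cms.query("a"))  # >= 3; hash collisions only ever inflate the estimate
```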
[310] A Closer Look at Multimodal Representation Collapse
Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu
Main category: cs.LG
TL;DR: The paper investigates modality collapse in multimodal fusion models, identifies its cause as noisy feature entanglement, and proposes a solution using cross-modal knowledge distillation and basis reallocation.
Details
Motivation: To understand and address modality collapse, where models ignore some modalities in multimodal fusion, limiting their effectiveness.
Method: The study analyzes the phenomenon, proves cross-modal knowledge distillation helps, and introduces an algorithm for explicit basis reallocation.
Result: Experiments on multimodal benchmarks confirm the theoretical findings and effectiveness of the proposed solution.
Conclusion: The work provides a theoretical and practical framework to prevent modality collapse, enhancing multimodal fusion models.
Abstract: We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
[311] Vulnerability of Text-Matching in ML/AI Conference Reviewer Assignments to Collusions
Jhih-Yi Hsieh, Aditi Raghunathan, Nihar B. Shah
Main category: cs.LG
TL;DR: The paper reveals vulnerabilities in ML/AI conference reviewer assignment systems, showing how collusion rings can exploit text-matching algorithms even without bid manipulation, and suggests improvements.
Details
Motivation: To expose flaws in the automated reviewer assignment process, particularly the text-matching component, which is assumed secure but can be exploited by collusion rings.
Method: Analyzed how colluding reviewers and authors can manipulate text-matching algorithms to get assigned target papers, identifying specific vulnerabilities.
Result: Demonstrated that collusion rings can bypass text-matching safeguards, undermining the integrity of reviewer assignments.
Conclusion: Proposes enhancements to the reviewer assignment system to mitigate exploitation by collusion rings.
Abstract: In the peer review process of top-tier machine learning (ML) and artificial intelligence (AI) conferences, reviewers are assigned to papers through automated methods. These assignment algorithms consider two main factors: (1) reviewers’ expressed interests indicated by their bids for papers, and (2) reviewers’ domain expertise inferred from the similarity between the text of their previously published papers and the submitted manuscripts. A significant challenge these conferences face is the existence of collusion rings, where groups of researchers manipulate the assignment process to review each other’s papers, providing positive evaluations regardless of their actual quality. Most efforts to combat collusion rings have focused on preventing bid manipulation, under the assumption that the text similarity component is secure. In this paper, we demonstrate that even in the absence of bidding, colluding reviewers and authors can exploit the machine learning based text-matching component of reviewer assignment used at top ML/AI venues to get assigned their target paper. We also highlight specific vulnerabilities within this system and offer suggestions to enhance its robustness.
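A toy illustration of the attack surface: similarity-based assignment scores a submission against each reviewer's past papers, so phrasing planted in a colluder's corpus inflates their score without any bidding. Requires scikit-learn; the documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviewer_corpora = [
    "graph neural networks message passing oversmoothing",
    "diffusion models score matching image synthesis",
]
submission = "score matching diffusion sampler for image synthesis"

vec = TfidfVectorizer().fit(reviewer_corpora + [submission])
scores = cosine_similarity(vec.transform([submission]),
                           vec.transform(reviewer_corpora))[0]
print(scores.argmax())  # reviewer 1 wins the assignment on text alone
```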
[312] A Survey on Pre-Trained Diffusion Model Distillations
Xuhui Fan, Zhangkai Wu, Hongyu Wu
Main category: cs.LG
TL;DR: A survey on distillation methods for Diffusion Models (DMs) to improve efficiency and reduce computational costs.
Details
Motivation: DMs, while powerful, require large storage and computational resources, prompting the need for distillation methods to create smaller, efficient models.
Method: Reviews distillation methods from three angles: output loss distillation, trajectory distillation, and adversarial distillation.
Result: Systematic categorization and analysis of distillation techniques for DMs.
Conclusion: Highlights current challenges and suggests future research directions for efficient DM distillation.
Abstract: Diffusion Models (DMs) have emerged as the dominant approach in Generative Artificial Intelligence (GenAI), owing to their remarkable performance in tasks such as text-to-image synthesis. However, practical DMs, such as stable diffusion, are typically trained on massive datasets and thus usually require large storage. At the same time, many steps may be required, i.e., recursively evaluating the trained neural network, to generate a high-quality image, which results in significant computational costs during sample generation. As a result, distillation methods on pre-trained DMs have become widely adopted practices to develop smaller, more efficient models capable of rapid, few-step generation in low-resource environments. Since these distillation methods have been developed from different perspectives, there is an urgent need for a systematic survey, particularly from a methodological perspective. In this survey, we review distillation methods through three aspects: output loss distillation, trajectory distillation and adversarial distillation. We also discuss current challenges and outline future research directions in the conclusion.
[313] Text-to-Level Diffusion Models With Various Text Encoders for Super Mario Bros
Jacob Schrum, Olivia Kilday, Emilio Salas, Bess Hagan, Reid Williams
Main category: cs.LG
TL;DR: The paper explores text-to-level generation using diffusion models, addressing practical challenges like caption/level pairs and playability. It compares results with other models and introduces a GUI for level design.
Details
Motivation: To advance text-to-level generation in tile-based games by addressing gaps in using diffusion models, such as the need for caption/level pairs and playable level generation.
Method: Automatically assigns captions to datasets, trains diffusion models with pretrained and simple transformer text encoders, and evaluates caption overlap, diversity, and playability.
Result: The best-performing diffusion model uses a simple transformer, trains faster than complex encoders, and competes with other models like Five-Dollar Model and MarioGPT.
Conclusion: Simple transformer-based text encoders are effective for text-to-level generation, and a GUI facilitates practical level design.
Abstract: Recent research shows how diffusion models can unconditionally generate tile-based game levels, but use of diffusion models for text-to-level generation is underexplored. There are practical considerations for creating a usable model: caption/level pairs are needed, as is a text embedding model, and a way of generating entire playable levels, rather than individual scenes. We present strategies to automatically assign descriptive captions to an existing dataset, and train diffusion models using both pretrained text encoders and simple transformer models trained from scratch. Captions are automatically assigned to generated scenes so that the degree of overlap between input and output captions can be compared. We also assess the diversity and playability of the resulting level scenes. Results are compared with an unconditional diffusion model and a generative adversarial network, as well as the text-to-level approaches Five-Dollar Model and MarioGPT. Notably, the best diffusion model uses a simple transformer model for text embedding, and takes less time to train than diffusion models employing more complex text encoders, indicating that reliance on larger language models is not necessary. We also present a GUI allowing designers to construct long levels from model-generated scenes.
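A toy version of the automatic caption-assignment step, deriving a caption from tile counts in a text-encoded scene; the tile vocabulary and phrasing are invented for illustration:

```python
SCENE = [
    "----------",
    "---?---E--",
    "XXXXXXXXXX",
]

def caption(scene):
    """Build a descriptive caption from the tiles present in the scene."""
    tiles = "".join(scene)
    parts = []
    if "E" in tiles:
        parts.append("some enemies" if tiles.count("E") > 1 else "one enemy")
    if "?" in tiles:
        parts.append("question blocks")
    if "X" in tiles:
        parts.append("a full floor")
    return "level with " + ", ".join(parts)

print(caption(SCENE))  # -> level with one enemy, question blocks, a full floor
```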
[314] An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models
Binxu Wang, Cengiz Pehlevan
Main category: cs.LG
TL;DR: The paper analyzes the evolution of generated distributions in diffusion models, revealing a universal inverse-variance spectral law and the impact of convolution on learning dynamics.
Details
Motivation: To understand how the generated distribution evolves during diffusion model training and the role of data covariance in learning dynamics.Method: Develops an analytical framework using Gaussian-equivalence, solves gradient-flow dynamics for linear and convolutional denoisers, and integrates probability-flow ODEs. Extends analysis to deep linear networks and circulant convolutions.
Result: Exposes a universal inverse-variance spectral law (τ∝λ⁻¹) and shows that convolution introduces unique biases, reshaping learning dynamics. Experiments confirm findings.
Conclusion: Data covariance governs the order and speed of learning in diffusion models, with local convolution introducing distinct biases, warranting further investigation.
Abstract: We develop an analytical framework for understanding how the generated distribution evolves during diffusion model training. Leveraging a Gaussian-equivalence principle, we solve the full-batch gradient-flow dynamics of linear and convolutional denoisers and integrate the resulting probability-flow ODE, yielding analytic expressions for the generated distribution. The theory exposes a universal inverse-variance spectral law: the time for an eigen- or Fourier mode to match its target variance scales as $\tau\propto\lambda^{-1}$, so high-variance (coarse) structure is mastered orders of magnitude sooner than low-variance (fine) detail. Extending the analysis to deep linear networks and circulant full-width convolutions shows that weight sharing merely multiplies learning rates, accelerating but not eliminating the bias, whereas local convolution introduces a qualitatively different bias. Experiments on Gaussian and natural-image datasets confirm that the spectral law persists in deep MLP-based UNets. Convolutional U-Nets, however, display rapid near-simultaneous emergence of many modes, implicating local convolution in reshaping learning dynamics. These results underscore how data covariance governs the order and speed with which diffusion models learn, and they call for deeper investigation of the unique inductive biases introduced by local convolution.
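The inverse-variance law is easy to check numerically in the decoupled setting the theory describes: if each eigenmode of the data covariance relaxes independently under gradient flow, the time to learn a mode scales as $1/\lambda$. A toy sketch under that assumption (the closed-form dynamics here are illustrative, not the paper's full solution):

```python
import numpy as np

# For a linear model under gradient flow, each eigenmode of the data covariance
# decouples and relaxes as w(t) = 1 - exp(-lam * t) (illustrative dynamics).
lams = np.array([10.0, 1.0, 0.1])          # eigenvalues: coarse -> fine structure
t = np.linspace(0, 100, 100001)
for lam in lams:
    w = 1 - np.exp(-lam * t)               # mode amplitude over training time
    t95 = t[np.searchsorted(w, 0.95)]      # first time the mode is 95% learned
    print(f"lambda={lam:5.1f}  t95={t95:8.3f}  lambda*t95={lam*t95:.2f}")
# lambda*t95 is constant (~3.0), i.e. t95 ∝ 1/lambda: the inverse-variance law.
```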
[315] SAND: One-Shot Feature Selection with Additive Noise Distortion
Pedram Pad, Hadi Hammoud, Mohamad Dia, Nadim Maamari, L. Andrea Dunbar
Main category: cs.LG
TL;DR: A novel, non-intrusive feature selection layer for neural networks automatically selects the top k features during training without altering the loss function or requiring retraining, achieving state-of-the-art performance.
Details
Motivation: Existing feature selection methods often need retraining and hyperparameter tuning, complicating adoption. This work aims to simplify the process while maintaining high performance.Method: Introduces a feature selection layer using trainable gains and Gaussian noise to automatically cluster and select the top k features, requiring no changes to the network or loss function.
Result: Outperforms or matches existing methods on benchmarks and a real-world dataset without hyperparameter tuning or retraining.
Conclusion: Demonstrates that simplicity and performance can coexist, providing an effective, straightforward tool for feature selection in machine learning.
Abstract: Feature selection is a critical step in data-driven applications, reducing input dimensionality to enhance learning accuracy, computational efficiency, and interpretability. Existing state-of-the-art methods often require post-selection retraining and extensive hyperparameter tuning, complicating their adoption. We introduce a novel, non-intrusive feature selection layer that, given a target feature count $k$, automatically identifies and selects the $k$ most informative features during neural network training. Our method is uniquely simple, requiring no alterations to the loss function, network architecture, or post-selection retraining. The layer is mathematically elegant and can be fully described by $\tilde{x}_i = a_i x_i + (1-a_i)z_i$, where $x_i$ is the input feature, $\tilde{x}_i$ the output, $z_i$ Gaussian noise, and $a_i$ a trainable gain such that $\sum_i a_i^2 = k$. This formulation induces an automatic clustering effect, driving $k$ of the $a_i$ gains to $1$ (selecting informative features) and the rest to $0$ (discarding redundant ones) via weighted noise distortion and gain normalization. Despite its extreme simplicity, our method delivers state-of-the-art performance on standard benchmark datasets and a novel real-world dataset, outperforming or matching existing approaches without requiring hyperparameter search for $k$ or retraining. Theoretical analysis in the context of linear regression further validates its efficacy. Our work demonstrates that simplicity and performance are not mutually exclusive, offering a powerful yet straightforward tool for feature selection in machine learning.
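A minimal PyTorch sketch of such a layer, implementing the displayed equation with the gain normalization $\sum_i a_i^2 = k$; the noise scale and initialization are assumed, not taken from the paper:

```python
import torch
import torch.nn as nn

class SANDLayer(nn.Module):
    """Feature selection via additive noise distortion (sketch of the equation
    above): x~_i = a_i * x_i + (1 - a_i) * z_i, with gains rescaled so that
    sum_i a_i^2 = k. Noise std and init are illustrative choices."""

    def __init__(self, n_features: int, k: int, noise_std: float = 1.0):
        super().__init__()
        self.raw_gain = nn.Parameter(torch.full((n_features,), 0.5))
        self.k, self.noise_std = k, noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.raw_gain
        a = a * (self.k / (a.pow(2).sum() + 1e-12)).sqrt()  # enforce sum a_i^2 = k
        if self.training:
            z = torch.randn_like(x) * self.noise_std
            return a * x + (1 - a) * z
        return a * x  # by the end of training, gains have clustered near 0/1

# After training, the selected features are those whose gains were driven to 1:
# selected = layer.raw_gain.detach().abs().topk(k).indices
```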
[316] What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
Keyon Vafa, Peter G. Chang, Ashesh Rambachan, Sendhil Mullainathan
Main category: cs.LG
TL;DR: Foundation models excel in training tasks but often fail to generalize deeper understanding, as shown by their inability to apply Newtonian mechanics in new physics tasks.
Details
Motivation: To assess if foundation models truly capture deeper domain understanding beyond sequence prediction.Method: Developed an inductive bias probe technique, testing models on synthetic datasets from postulated world models.
Result: Models perform well in training tasks but lack alignment with underlying world models, relying on task-specific heuristics.
Conclusion: Foundation models may not inherently develop deeper structural understanding, highlighting a gap in their generalization capabilities.
Abstract: Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler’s predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model’s inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
[317] Neighbour-Driven Gaussian Process Variational Autoencoders for Scalable Structured Latent Modelling
Xinxing Shi, Xiaoyu Jiang, Mauricio A. Álvarez
Main category: cs.LG
TL;DR: The paper introduces a scalable GPVAE method using neighbor-driven approximation to improve computational efficiency and flexibility.
Details
Motivation: Standard GPVAEs face computational challenges with large-scale data due to restrictive kernel assumptions or reliance on many inducing points.Method: Proposes a neighbor-driven approximation strategy, leveraging local adjacencies in the latent space to confine computations to nearest neighbors.
Result: Outperforms other GPVAE variants in predictive performance and computational efficiency across tasks like representation learning and data imputation.
Conclusion: The method enables scalable GPVAE inference with flexible kernel choices, reducing reliance on inducing points while preserving latent dependencies.
Abstract: Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior, thereby capturing richer correlations among latent variables. However, performing exact GP inference in large-scale GPVAEs is computationally prohibitive, often forcing existing approaches to rely on restrictive kernel assumptions or large sets of inducing points. In this work, we propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference. By confining computations to the nearest neighbours of each data point, our method preserves essential latent dependencies, allowing more flexible kernel choices and mitigating the need for numerous inducing points. Through extensive experiments on tasks including representation learning, data imputation, and conditional generation, we demonstrate that our approach outperforms other GPVAE variants in both predictive performance and computational efficiency.
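A rough sketch of the nearest-neighbour idea, in the spirit of Vecchia-style GP approximations: each point's GP conditional is computed from only its $k$ nearest neighbours, so the cost per point is $O(k^3)$ rather than cubic in the dataset size. The kernel and construction below are assumptions for illustration; the paper's exact inference scheme may differ:

```python
import numpy as np
from scipy.spatial import cKDTree

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel between row-vector sets a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def nn_gp_conditionals(x, f, k=8, jitter=1e-6):
    """Per-point GP conditional mean given only the k nearest neighbours
    (x: [N, d] locations, f: [N] function values)."""
    tree = cKDTree(x)
    means = np.empty(len(x))
    for i in range(len(x)):
        _, idx = tree.query(x[i], k=k + 1)
        nb = idx[1:]                                   # drop the point itself
        Knn = rbf(x[nb], x[nb]) + jitter * np.eye(k)
        kin = rbf(x[i:i + 1], x[nb])[0]
        means[i] = kin @ np.linalg.solve(Knn, f[nb])   # O(k^3), not O(N^3)
    return means
```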
[318] Central Path Proximal Policy Optimization
Nikola Milosevic, Johannes Müller, Nico Scherf
Main category: cs.LG
TL;DR: C3PO, a modified PPO method, enforces constraints without compromising performance by staying close to the central path of optimization.
Details
Motivation: To address the challenge of enforcing constraints in training without reducing final return.Method: Introduces Central Path Proximal Policy Optimization (C3PO), a modified PPO loss that keeps policy updates near the central path.
Result: C3PO outperforms existing on-policy methods with tighter constraint enforcement.
Conclusion: Central path-guided updates are promising for constrained policy optimization.
Abstract: In constrained Markov decision processes, enforcing constraints during training is often thought of as decreasing the final return. Recently, it was shown that constraints can be incorporated directly into the policy geometry, yielding an optimization trajectory close to the central path of a barrier method, which does not compromise final return. Building on this idea, we introduce Central Path Proximal Policy Optimization (C3PO), a simple modification of the PPO loss that produces policy iterates that stay close to the central path of the constrained optimization problem. Compared to existing on-policy methods, C3PO delivers improved performance with tighter constraint enforcement, suggesting that central path-guided updates offer a promising direction for constrained policy optimization.
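The paper's exact loss is not reproduced in the abstract; as a hedged illustration, central-path-following can be sketched as a PPO clipped surrogate augmented with a log-barrier on the constraint slack, which pulls iterates toward the interior of the feasible set. The names and the barrier coefficient below are assumptions, not C3PO's actual formulation:

```python
import torch

def central_path_ppo_loss(ratio, adv, cost_value, cost_limit,
                          clip=0.2, barrier_coef=0.1):
    """PPO clipped surrogate plus a log-barrier on the constraint slack
    (generic sketch; tensors assumed for ratio, adv, cost_value).

    ratio:      pi_new(a|s) / pi_old(a|s)
    adv:        advantage estimates
    cost_value: estimated expected constraint cost of the current policy
    cost_limit: constraint threshold d (feasible iff cost_value < d)
    """
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip, 1 + clip) * adv).mean()
    slack = (cost_limit - cost_value).clamp(min=1e-6)  # keep the log defined
    barrier = -torch.log(slack)                        # blows up near the boundary
    return -surrogate + barrier_coef * barrier         # minimize this loss
```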
[319] Theory of Decentralized Robust Kernel-Based Learning
Zhan Yu, Zhongjie Shi, Ding-Xuan Zhou
Main category: cs.LG
TL;DR: A decentralized robust kernel-based learning algorithm in RKHS is proposed, unifying robust regression with convergence guarantees and optimal learning rates.
Details
Motivation: To address the limitations of existing distributed robust kernel-based learning schemes by introducing a decentralized framework that enhances robustness and convergence.Method: Utilizes a networked system represented as a connected graph, employing a robust loss function with a windowing function and scaling parameter. Kernel-based integral operator techniques are used for analysis.
Result: Local robust estimators approximate the regression function with high confidence bounds in mean square distance, RKHS norm, and generalization error. Optimal learning rates are achieved under proper parameter selection.
Conclusion: The algorithm’s robustness and convergence are enhanced by the scaling parameter, with clear connections between decentralization, sample selection, and performance.
Abstract: We propose a new decentralized robust kernel-based learning algorithm within the framework of reproducing kernel Hilbert spaces (RKHSs) by utilizing a networked system that can be represented as a connected graph. The robust loss function $\mathcal{L}_\sigma$ induced by a windowing function $W$ and a robustness scaling parameter $\sigma>0$ can encompass a broad spectrum of robust losses. Consequently, the proposed algorithm effectively provides a unified decentralized learning framework for robust regression, which fundamentally differs from the existing distributed robust kernel-based learning schemes, all of which are divide-and-conquer based. We rigorously establish a learning theory and offer a comprehensive convergence analysis for the algorithm. We show that each local robust estimator generated from the decentralized algorithm can be utilized to approximate the regression function. Based on kernel-based integral operator techniques, we derive general high confidence convergence bounds for the local approximating sequence in terms of the mean square distance, RKHS norm, and generalization error, respectively. Moreover, we provide rigorous selection rules for local sample size and show that, under properly selected step size and scaling parameter $\sigma$, the decentralized robust algorithm can achieve optimal learning rates (up to logarithmic factors) in both norms. The parameter $\sigma$ is shown to be essential for enhancing robustness and ensuring favorable convergence behavior. The intrinsic connection among decentralization, sample selection, robustness of the algorithm, and its convergence is clearly reflected.
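One familiar member of the windowed robust-loss family $\mathcal{L}_\sigma$ is the Welsch loss, shown below as an illustration of how the scaling parameter $\sigma$ interpolates between least squares (large $\sigma$) and a bounded, outlier-insensitive loss (small $\sigma$); the framework admits many other windowing functions $W$:

```python
import numpy as np

def welsch_loss(residual: np.ndarray, sigma: float) -> np.ndarray:
    """One member of the windowed robust-loss family L_sigma: the Welsch loss
    (illustrative choice, not the paper's only instance)."""
    return (sigma**2 / 2) * (1 - np.exp(-(residual**2) / sigma**2))

r = np.array([0.1, 1.0, 10.0])              # small residuals vs. a gross outlier
for s in (0.5, 2.0, 10.0):
    print(s, welsch_loss(r, s).round(3))
# As sigma grows the loss approaches r^2/2 (least squares); a small sigma
# saturates at sigma^2/2, capping the influence of outliers.
```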
[320] LETS Forecast: Learning Embedology for Time Series Forecasting
Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath Namburi GNVV, Nada Magdi Elkordi, Yin Li
Main category: cs.LG
TL;DR: DeepEDM integrates nonlinear dynamical systems modeling with deep neural networks for improved time series forecasting, outperforming state-of-the-art methods.
Details
Motivation: Existing deep learning approaches for time series forecasting often lack explicit modeling of underlying dynamics, which is crucial for accurate predictions.Method: DeepEDM combines empirical dynamic modeling (EDM) and deep learning, learning a latent space from time-delayed embeddings and using kernel regression to approximate dynamics, with efficient softmax attention.
Result: DeepEDM is robust to noise and outperforms state-of-the-art methods in forecasting accuracy on synthetic and real-world datasets.
Conclusion: DeepEDM effectively bridges the gap between deep learning and dynamical systems modeling, offering a robust and accurate framework for time series forecasting.
Abstract: Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens’ theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings, and employs kernel regression to approximate the underlying dynamics, while leveraging efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise, and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: https://abrarmajeedi.github.io/deep_edm.
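The core EDM mechanics are compact enough to sketch: build a Takens delay embedding, then forecast by softmax-weighted regression over historical delay vectors, which is the kernel-regression-as-attention view the paper builds on. This toy version uses a scalar series with a fixed embedding dimension and lag, and omits DeepEDM's learned latent space; the temperature is an assumed bandwidth:

```python
import numpy as np

def delay_embed(x: np.ndarray, dim: int, tau: int) -> np.ndarray:
    """Takens-style time-delay embedding of a scalar series."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i:i + n] for i in range(0, dim * tau, tau)], axis=1)

def softmax_kernel_forecast(x: np.ndarray, dim=3, tau=1, temp=1.0) -> float:
    """One-step forecast via softmax-weighted regression over past delay
    vectors, a simplified EDM analogue of attention."""
    E = delay_embed(x, dim, tau)
    query, keys = E[-1], E[:-1]
    targets = x[(dim - 1) * tau + 1:]          # value following each key window
    d2 = ((keys - query) ** 2).sum(axis=1)
    w = np.exp(-d2 / temp)
    w /= w.sum()
    return float(w @ targets)

t = np.arange(400) * 0.1
x = np.sin(t)
print(softmax_kernel_forecast(x), np.sin(t[-1] + 0.1))  # forecast vs. truth
```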
[321] Structured Generative Modeling with the Thermodynamic Kolmogorov-Arnold Model
Prithvi Raj
Main category: cs.LG
TL;DR: The paper introduces T-KAM, a novel generative model leveraging the Kolmogorov-Arnold theorem for efficient, high-quality sampling and training.
Details
Motivation: Addresses challenges in energy-based models (EBMs) in latent spaces, such as interpretability, efficiency, and multimodal sampling.Method: Proposes T-KAM, using univariate priors for fast inference and importance sampling, and population-based LMC for multimodal exploration.
Result: T-KAM achieves fast inference, high sample quality, and stable training, suited for large-scale hardware.
Conclusion: T-KAM balances trade-offs in generative modeling and is extendable to broader generative intelligence research.
Abstract: Learning an energy-based model (EBM) in the latent space of a top-down generative model offers an expressive and interpretable framework for text and image generation. However, it remains unclear how this interpretability can be used to guide model design, improve generative quality, and reduce training time. Moreover, the reliance on Langevin Monte Carlo (LMC) sampling presents challenges in efficiency and sampling multimodal latent distributions. In this work, we propose a novel adaptation of the Kolmogorov-Arnold representation theorem for generative modeling and introduce the Thermodynamic Kolmogorov-Arnold Model (T-KAM) to take advantage of structural and inductive biases. By constraining the prior to univariate relationships, T-KAM enables fast and exact inference via the inverse transform method. With the low dimensionality of the latent space and suitable inductive biases encoded, we demonstrate that importance sampling becomes a viable, unbiased, and highly efficient training strategy. We also introduce a training criterion using population-based LMC, which decomposes posterior sampling into a sequence of annealed distributions to improve multimodal exploration. T-KAM elegantly balances common trade-offs in generative modeling, offering fast inference, high sample quality, and stable training, while being naturally suited to upcoming Zettascale Computing Co. hardware and extendable to other high-impact research directions in generative intelligence.
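The inverse transform method that T-KAM's univariate priors enable is standard and easy to illustrate: tabulate the CDF of a 1-D density on a grid and invert it on uniform draws, which samples even multimodal densities exactly, without MCMC. A grid-based sketch:

```python
import numpy as np

def inverse_transform_sample(density, grid, n_samples, rng=None):
    """Exact sampling from a 1-D density via the inverse transform method,
    the mechanism univariate priors make available (grid-based sketch)."""
    rng = rng or np.random.default_rng(0)
    pdf = density(grid)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]                                   # normalize to a valid CDF
    u = rng.uniform(size=n_samples)
    return np.interp(u, cdf, grid)                   # invert: F^{-1}(u)

grid = np.linspace(-5, 5, 2001)
bimodal = lambda z: np.exp(-(z - 2) ** 2) + np.exp(-(z + 2) ** 2)
z = inverse_transform_sample(bimodal, grid, 10_000)
print(z.mean().round(2), z.std().round(2))           # mean ~0.0; both modes hit
```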
[322] Transferable Parasitic Estimation via Graph Contrastive Learning and Label Rebalancing in AMS Circuits
Shan Shen, Shenglu Hua, Jiajun Zou, Jiawei Liu, Jianwang Zhai, Chuan Shi, Wenjian Yu
Main category: cs.LG
TL;DR: CircuitGCL is a graph contrastive learning framework for AMS circuits, addressing data scarcity and label imbalance with topology-invariant embeddings and balanced losses, outperforming SOTA methods.
Details
Motivation: Addressing challenges like data scarcity, unbalanced labels, and circuit diversity in learning robust and transferable representations for AMS circuits.Method: Proposes CircuitGCL, using self-supervised hyperspherical representation scattering and balanced losses (BMSE, BSCE) for topology-invariant embeddings.
Result: Outperforms SOTA methods with 33.64% ~ 44.20% R² improvement for edge regression and 0.9× ~ 2.1× F1-score gain for node classification.
Conclusion: CircuitGCL effectively enhances transferability and robustness in parasitic estimation tasks for AMS circuits.
Abstract: Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with an $R^2$ improvement of $33.64\% \sim 44.20\%$ for edge regression and an F1-score gain of $0.9\times \sim 2.1\times$ for node classification. Our code is available at https://github.com/ShenShan123/CircuitGCL.
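Balanced softmax cross-entropy is a known recipe for label imbalance: shift each logit by the log of its class frequency so that rare classes contribute proportionally to the loss. A sketch of that standard form (CircuitGCL's exact BSCE/BMSE formulations may differ in detail):

```python
import torch
import torch.nn.functional as F

def balanced_softmax_ce(logits, labels, class_counts):
    """Balanced softmax cross-entropy: add log class priors to the logits so
    rare classes are not drowned out by frequent ones (standard form)."""
    prior = torch.log(class_counts.float() / class_counts.sum())
    return F.cross_entropy(logits + prior, labels)

logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
counts = torch.tensor([900, 90, 10])        # heavily imbalanced node labels
print(balanced_softmax_ce(logits, labels, counts))
```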
[323] Tapping into the Black Box: Uncovering Aligned Representations in Pretrained Neural Networks
Maciej Satkiewicz
Main category: cs.LG
TL;DR: The paper addresses misaligned gradients in ReLU networks by proposing soft-gating in the backward pass, resulting in perceptually aligned input-level representations called “excitation pullbacks.”
Details
Motivation: Gradients in deeper ReLU networks are misaligned, contributing to their black-box nature, due to noisy active subnetworks from ReLU hard-gating.Method: Introduces soft-gating in the backward pass to reduce noise, creating “excitation pullbacks” as input-level representations.
Result: Excitation pullbacks show perceptual alignment, revealing high-resolution, input- and target-specific features, offering a novel explanation method.
Conclusion: The method provides faithful explanations and suggests these pullbacks approximate gradients of a simpler, implicitly learned model, highlighting their significance.
Abstract: In ReLU networks, gradients of output units can be seen as their input-level representations, as they correspond to the units’ pullbacks through the active subnetwork. However, gradients of deeper models are notoriously misaligned, significantly contributing to their black-box nature. We claim that this is because active subnetworks are inherently noisy due to the ReLU hard-gating. To tackle that noise, we propose soft-gating in the backward pass only. The resulting input-level vector field (called “excitation pullback”) exhibits remarkable perceptual alignment, revealing high-resolution input- and target-specific features that “just make sense”, therefore establishing a compelling novel explanation method. Furthermore, we speculate that excitation pullbacks approximate (directionally) the gradients of a simpler model, linear in the network’s path space, learned implicitly during optimization and largely determining the network’s decision; thus arguing for the faithfulness of the produced explanations and their overall significance.
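The backward-only soft-gating idea can be sketched as a custom autograd function: the forward pass is an ordinary ReLU, but the backward pass replaces the hard 0/1 gate with a sigmoid of the pre-activation. The temperature knob below is an assumed detail, not from the paper:

```python
import torch

class SoftGateReLU(torch.autograd.Function):
    """ReLU forward, soft gate backward: a sketch of replacing the hard 0/1
    ReLU gate with a smooth gate only when back-propagating, so input
    gradients average over nearby active subnetworks."""

    @staticmethod
    def forward(ctx, x, temperature=1.0):
        ctx.save_for_backward(x)
        ctx.temperature = temperature
        return torch.relu(x)                       # ordinary forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        gate = torch.sigmoid(x / ctx.temperature)  # soft gate instead of (x > 0)
        return grad_out * gate, None

# Excitation pullback of a class: swap every ReLU for SoftGateReLU.apply, then
# take the input gradient of the class logit as the explanation heatmap.
```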
[324] IMU: Influence-guided Machine Unlearning
Xindi Fan, Jing Wu, Mingyi Zhou, Pengwei Liang, Dinh Phung
Main category: cs.LG
TL;DR: The paper introduces Influence-guided Machine Unlearning (IMU), a method for selective data forgetting in deep learning models without needing the original training data.
Details
Motivation: Addressing privacy concerns and impracticality of existing machine unlearning methods that require access to original training data.Method: IMU uses gradient ascent and dynamic allocation of unlearning intensities based on data influences, working solely with the forget set.
Result: IMU outperforms existing retain-data-free methods in vision and language tasks.
Conclusion: IMU offers a practical and effective solution for machine unlearning without compromising model utility.
Abstract: Recent studies have shown that deep learning models are vulnerable to attacks and tend to memorize training data points, raising significant concerns about privacy leakage. This motivates the development of machine unlearning (MU), i.e., a paradigm that enables models to selectively forget specific data points upon request. However, most existing MU algorithms require partial or full fine-tuning on the retain set. This necessitates continued access to the original training data, which is often impractical due to privacy concerns and storage constraints. A few retain-data-free MU methods have been proposed, but some rely on access to auxiliary data and precomputed statistics of the retain set, while others scale poorly when forgetting larger portions of data. In this paper, we propose Influence-guided Machine Unlearning (IMU), a simple yet effective method that conducts MU using only the forget set. Specifically, IMU employs gradient ascent and innovatively introduces dynamic allocation of unlearning intensities across different data points based on their influences. This adaptive strategy significantly enhances unlearning effectiveness while maintaining model utility. Results across vision and language tasks demonstrate that IMU consistently outperforms existing retain-data-free MU methods.
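The core update is gradient ascent on the forget set with per-example intensities. A minimal sketch, assuming influence scores have already been computed (IMU's influence estimator itself is not reproduced here):

```python
import torch

def imu_step(model, loss_fn, forget_x, forget_y, influence, lr=1e-3):
    """One unlearning step: gradient *ascent* on the forget set, with
    per-example intensities weighted by estimated influence (sketch).
    loss_fn is expected to use reduction='none'."""
    w = influence / influence.sum()                   # normalize intensities
    losses = loss_fn(model(forget_x), forget_y)       # per-example losses
    loss = (w * losses).sum()
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p += lr * p.grad                          # ascend: raise forget loss
```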
[325] TimeMKG: Knowledge-Infused Causal Reasoning for Multivariate Time Series Modeling
Yifei Sun, Junming Liu, Yirong Chen, Xuefeng Yan, Ding Wang
Main category: cs.LG
TL;DR: TimeMKG integrates variable semantics and numerical observations in time series modeling using knowledge graphs and large language models, improving performance and interpretability.
Details
Motivation: Traditional models ignore semantic information in variable names and descriptions, missing critical domain knowledge for robust modeling.Method: TimeMKG uses large language models to interpret semantics, constructs knowledge graphs, and employs a dual-modality encoder with cross-modality attention for fusion.
Result: Experiments show significant improvements in predictive performance and generalization by incorporating variable-level knowledge.
Conclusion: TimeMKG bridges the gap between low-level signal processing and knowledge-informed inference, enhancing interpretability and performance.
Abstract: Multivariate time series data typically comprises two distinct modalities: variable semantics and sampled numerical observations. Traditional time series models treat variables as anonymous statistical signals, overlooking the rich semantic information embedded in variable names and data descriptions. However, these textual descriptors often encode critical domain knowledge that is essential for robust and interpretable modeling. Here we present TimeMKG, a multimodal causal reasoning framework that elevates time series modeling from low-level signal processing to knowledge informed inference. TimeMKG employs large language models to interpret variable semantics and constructs structured Multivariate Knowledge Graphs that capture inter-variable relationships. A dual-modality encoder separately models the semantic prompts, generated from knowledge graph triplets, and the statistical patterns from historical time series. Cross-modality attention aligns and fuses these representations at the variable level, injecting causal priors into downstream tasks such as forecasting and classification, providing explicit and interpretable priors to guide model reasoning. The experiment in diverse datasets demonstrates that incorporating variable-level knowledge significantly improves both predictive performance and generalization.
[326] Prototype-Guided Diffusion: Visual Conditioning without External Memory
Bilal Faye, Hanane Azzag, Mustapha Lebbah
Main category: cs.LG
TL;DR: The paper introduces the Prototype Diffusion Model (PDM), which integrates prototype learning into diffusion models for efficient and adaptive visual conditioning, eliminating the need for external memory and costly retrieval infrastructure.
Details
Motivation: Current diffusion models are computationally intensive, and retrieval-augmented methods introduce storage and adaptability issues. PDM aims to address these drawbacks.Method: PDM constructs dynamic visual prototypes from clean image features using contrastive learning, guiding denoising without external memory.
Result: PDM maintains high generation quality while reducing computational and storage overhead.
Conclusion: PDM offers a scalable and efficient alternative to retrieval-based conditioning in diffusion models.
Abstract: Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.
[327] An Explainable AI based approach for Monitoring Animal Health
Rahul Jana, Shubham Dixit, Mrityunjay Sharma, Ritesh Kumar
Main category: cs.LG
TL;DR: The paper presents a data-driven approach using explainable ML to monitor dairy cattle health and behavior, leveraging IoT sensors and 4G networks for real-time analysis.
Details
Motivation: Addressing the challenge of tracking cattle health and optimizing yield for dairy farmers through modern, explainable ML methods.Method: Uses 3-axis accelerometer data, Bluetooth IoT devices, and 4G networks for data collection. Employs signal processing, statistical feature extraction, and hyperparameter-optimized ML models (e.g., k-nearest neighbor) for activity classification.
Result: Achieved high performance (AUC ~0.99) with the k-nearest neighbor classifier. Used SHAP for explainability, providing actionable insights for farmers.
Conclusion: The study demonstrates the effectiveness of explainable ML in sustainable livestock management, offering practical tools for farmers.
Abstract: Monitoring cattle health and optimizing yield are key challenges faced by dairy farmers due to difficulties in tracking all animals on the farm. This work aims to showcase modern data-driven farming practices based on explainable machine learning (ML) methods that explain the activity and behaviour of dairy cattle (cows). Continuous data collection from 3-axis accelerometer sensors, combined with robust ML methodologies and algorithms, provides farmers and researchers with actionable information on cattle activity, allowing farmers to make informed decisions and incorporate sustainable practices. This study utilizes Bluetooth-based Internet of Things (IoT) devices and 4G networks for seamless data transmission, immediate analysis, and inference generation, and explains model performance with explainability frameworks. Special emphasis is put on the pre-processing of the accelerometer time-series data, including the extraction of statistical characteristics, signal processing techniques, and lag-based features using the sliding-window technique. Various hyperparameter-optimized ML models are evaluated across varying window lengths for activity classification. The k-nearest neighbour classifier achieved the best performance, with a mean AUC of 0.98 (standard deviation 0.0026) on the training set and 0.99 on the testing set. To ensure transparency, explainable AI frameworks such as SHAP are used to interpret feature importance in a form that can be understood and used by practitioners. A detailed comparison of the important features, along with a stability analysis of the selected features, supports the development of explainable and practical ML models for sustainable livestock management.
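The pipeline's feature stage is standard enough to sketch: slide a window over the 3-axis accelerometer stream, extract per-axis statistics, and feed the resulting feature vectors to a k-nearest-neighbour classifier. The specific features, window size, and step below are illustrative, not the paper's exact configuration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def window_features(acc: np.ndarray, win: int, step: int) -> np.ndarray:
    """Statistical features per sliding window of 3-axis accelerometer data
    (acc has shape [n_samples, 3]); the feature choice is illustrative."""
    feats = []
    for s in range(0, len(acc) - win + 1, step):
        w = acc[s:s + win]
        feats.append(np.concatenate([w.mean(0), w.std(0),
                                     w.min(0), w.max(0),
                                     np.abs(np.diff(w, axis=0)).mean(0)]))
    return np.array(feats)

# X: windows x features, y: activity label per window (grazing, resting, ...)
# clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# clf.predict(window_features(new_recording, win=250, step=125))
```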
[328] Comparison of D-Wave Quantum Annealing and Markov Chain Monte Carlo for Sampling from a Probability Distribution of a Restricted Boltzmann Machine
Abdelmoula El Yazizi, Samee U. Khan, Yaroslav Koshka
Main category: cs.LG
TL;DR: The study compares D-Wave quantum annealer and Gibbs sampling for Restricted Boltzmann Machines (RBMs), finding limited overlap and no significant improvement in sampling quality with shorter D-Wave annealing times.
Details
Motivation: To assess the quality of sampling from RBMs using a local-valley (LV) centered approach and compare D-Wave quantum annealing with classical Gibbs sampling.Method: Applied LV-centered analysis to D-Wave and Gibbs samples from a classically trained RBM under contrastive-divergence-based learning conditions.
Result: No significant increase in LVs with shorter D-Wave annealing times; D-Wave samples covered more LVs but overlapped unfavorably with Gibbs samples for high-probability states. Some important LVs were unique to each method.
Conclusion: The results explain past failures to improve RBM sampling with D-Wave but suggest potential for hybrid classical-quantum approaches.
Abstract: A local-valley (LV) centered approach to assessing the quality of sampling from Restricted Boltzmann Machines (RBMs) was applied to the latest generation of the D-Wave quantum annealer. D-Wave and Gibbs samples from a classically trained RBM were obtained at conditions relevant to the contrastive-divergence-based RBM learning. The samples were compared for the number of the LVs to which they belonged and the energy of the corresponding local minima. No significant (desirable) increase in the number of the LVs has been achieved by decreasing the D-Wave annealing time. At any training epoch, the states sampled by the D-Wave belonged to a somewhat higher number of LVs than in the Gibbs sampling. However, many of those LVs found by the two techniques differed. For high-probability sampled states, the two techniques were (unfavorably) less complementary and more overlapping. Nevertheless, many potentially “important” local minima, i.e., those having intermediate, even if not high, probability values, were found by only one of the two sampling techniques while missed by the other. The two techniques overlapped less at later than earlier training epochs, which is precisely the stage of the training when modest improvements to the sampling quality could make meaningful differences for the RBM trainability. The results of this work may explain the failure of previous investigations to achieve substantial (or any) improvement when using D-Wave-based sampling. However, the results reveal some potential for improvement, e.g., using a combined classical-quantum approach.
cs.MA
[329] Allen: Rethinking MAS Design through Step-Level Policy Autonomy
Qiangong Zhou, Zhiting Wang, Mingyou Yao, Zongyang Liu
Main category: cs.MA
TL;DR: Allen is a Multi-Agent System (MAS) designed to enhance policy autonomy and balance collaborative efficiency, task supervision, and human oversight in complex networks.
Details
Motivation: Addressing challenges in current MAS design, specifically improving policy autonomy and balancing collaborative efficiency with control.Method: Redefining the basic execution unit in MAS, enabling agents to form dynamic patterns. A four-tier state architecture (Task, Stage, Agent, Step) is used for behavioral constraints.
Result: Allen achieves unprecedented policy autonomy while balancing collaborative structure controllability.
Conclusion: Allen successfully unifies topological optimization and controllable progress, with its code open-sourced.
Abstract: We introduce a new Multi-Agent System (MAS), Allen, designed to address two core challenges in current MAS design: (1) improving the system’s policy autonomy, empowering agents to dynamically adapt their behavioral strategies, and (2) achieving a trade-off between collaborative efficiency, task supervision, and human oversight in complex network topologies. Our core insight is to redefine the basic execution unit in the MAS, allowing agents to autonomously form different patterns by combining these units. We have constructed a four-tier state architecture (Task, Stage, Agent, Step) to constrain system behavior from both task-oriented and execution-oriented perspectives. This achieves a unification of topological optimization and controllable progress. Allen grants unprecedented policy autonomy while trading off some controllability of the collaborative structure. The project code has been open-sourced at: https://github.com/motern88/Allen
[330] Defending a City from Multi-Drone Attacks: A Sequential Stackelberg Security Games Approach
Dolev Mutzari, Tonmoay Deb, Cristian Molinaro, Andrea Pugliese, V. S. Subrahmanian, Sarit Kraus
Main category: cs.MA
TL;DR: The paper introduces S2D2, an algorithm for defending against multi-drone attacks by modeling the scenario as a Sequential Stackelberg Security Game, outperforming prior methods.
Details
Motivation: To address the challenge of defending cities against multi-drone attacks by optimizing defender strategies.Method: Develops S2D2, an algorithm for mixed sequential defense strategies in a Stackelberg game framework.
Result: S2D2 outperforms greedy heuristics in experiments across 80 cities and approximates a Strong Stackelberg Equilibrium.
Conclusion: S2D2 provides an effective defense strategy against drone attacks, validated by real-world data and theoretical guarantees.
Abstract: To counter an imminent multi-drone attack on a city, defenders have deployed drones across the city. These drones must intercept/eliminate the threat, thus reducing potential damage from the attack. We model this as a Sequential Stackelberg Security Game, where the defender first commits to a mixed sequential defense strategy, and the attacker then best responds. We develop an efficient algorithm called S2D2, which outputs a defense strategy. We demonstrate the efficacy of S2D2 in extensive experiments on data from 80 real cities, improving the performance of the defender in comparison to greedy heuristics based on prior works. We prove that under some reasonable assumptions about the city structure, S2D2 outputs an approximate Strong Stackelberg Equilibrium (SSE) with a convenient structure.
[331] Tapas are free! Training-Free Adaptation of Programmatic Agents via LLM-Guided Program Synthesis in Dynamic Environments
Jinwei Hu, Yi Dong, Youcheng Sun, Xiaowei Huang
Main category: cs.MA
TL;DR: TAPA introduces a framework using LLMs to dynamically adapt modular programs for autonomous agents, improving performance in dynamic environments.
Details
Motivation: To enable autonomous agents to adapt continuously in safety-critical applications without losing performance or reliability.Method: TAPA uses LLMs as moderators to synthesize and adapt modular programs for individual high-level actions (logical primitives), decoupling strategic intent from execution.
Result: Achieves 77.7% network uptime in DDoS defense and maintains consensus in swarm intelligence under disturbances, outperforming baselines.
Conclusion: TAPA shifts autonomous system design from policy adaptation to dynamic action adaptation, proving effective in evolving environments.
Abstract: Autonomous agents in safety-critical applications must continuously adapt to dynamic conditions without compromising performance and reliability. This work introduces TAPA (Training-free Adaptation of Programmatic Agents), a novel framework that positions large language models (LLMs) as intelligent moderators of the symbolic action space. Unlike prior programmatic agents that typically generate a monolithic policy program or rely on fixed symbolic action sets, TAPA synthesizes and adapts modular programs for individual high-level actions, referred to as logical primitives. By decoupling strategic intent from execution, TAPA enables meta-agents to operate over an abstract, interpretable action space while the LLM dynamically generates, composes, and refines symbolic programs tailored to each primitive. Extensive experiments across cybersecurity and swarm intelligence domains validate TAPA’s effectiveness. In autonomous DDoS defense scenarios, TAPA achieves 77.7% network uptime while maintaining near-perfect detection accuracy in unknown dynamic environments. In swarm intelligence formation control under environmental and adversarial disturbances, TAPA consistently preserves consensus at runtime where baseline methods fail completely. This work promotes a paradigm shift for autonomous system design in evolving environments, from policy adaptation to dynamic action adaptation.
[332] Smooth Games of Configuration in the Linear-Quadratic Setting
Jesse Milzman, Jeffrey Mao, Giuseppe Loianno
Main category: cs.MA
TL;DR: The paper introduces a ‘game of configuration’ framework for strategic fine-tuning in differential games, providing a method to compute and optimize configuration parameters in multi-agent scenarios.
Details
Motivation: Existing literature lacks exploration of dynamic game configuration from a strategic perspective where agents' choices influence each other.Method: A two-stage game framework is proposed: first, players choose configuration parameters; second, these parameters affect dynamics and costs. Subgame perfect solutions and gradient-based methods are used to find equilibria.
Result: The approach is demonstrated in affine-quadratic (AQ) systems, showing effectiveness in both zero-sum and general-sum scenarios.
Conclusion: The framework successfully addresses strategic configuration in dynamic games, offering practical solutions for multi-agent systems.
Abstract: Dynamic game theory offers a toolbox for formalizing and solving for both cooperative and non-cooperative strategies in multi-agent scenarios. However, the optimal configuration of such games remains largely unexplored. While there is existing literature on the parametrization of dynamic games, little research examines this parametrization from a strategic perspective where each agent’s configuration choice is influenced by the decisions of others. In this work, we introduce the concept of a game of configuration, providing a framework for the strategic fine-tuning of differential games. We define a game of configuration as a two-stage game within the setting of finite-horizon, affine-quadratic (AQ) differential games. In the first stage, each player chooses their corresponding configuration parameter, which will impact their dynamics and costs in the second stage. We provide the subgame perfect solution concept and a method for computing first-stage cost gradients over the configuration space. This then allows us to formulate a gradient-based method for searching for local solutions to the configuration game, as well as provide necessary conditions for equilibrium configurations over their downstream (second-stage) trajectories. We conclude by demonstrating the effectiveness of our approach in example AQ systems, both zero-sum and general-sum.
[333] Synergy Over Spiral: A Logistics 5.0 Game-Theoretic Model for Trust-Fatigue Co-regulation in Human-Cobot Order Picking
Soumyadeep Dhar
Main category: cs.MA
TL;DR: The paper explores trust and fatigue in human-cobot collaboration in logistics, using a Stackelberg game model. Simulations show improved productivity and trust recovery with adaptive cobot behaviors.
Details
Motivation: To address challenges in human-robot symbiosis in smart logistics (Logistics 5.0), focusing on trust and fatigue.Method: Proposes a dynamic leader-follower Stackelberg game with utility functions for fatigue and trust, validated via agent-based simulations.
Result: Refined trust models boost productivity by ~100% and reduce trust recovery time by >75% with adaptive cobot behaviors.
Conclusion: Provides a framework for human-centric, sustainable, and resilient cobot designs in Industry 5.0.
Abstract: This paper investigates the critical role of trust and fatigue in human-cobot collaborative order picking, framing the challenge within the scope of Logistics 5.0: the implementation of human-robot symbiosis in smart logistics. We propose a dynamic, leader-follower Stackelberg game to model this interaction, where utility functions explicitly account for human fatigue and trust. Through agent-based simulations, we demonstrate that while a naive model leads to a “trust death spiral,” a refined trust model creates a “trust synergy cycle,” increasing productivity by nearly 100 percent. Finally, we show that a cobot operating in a Trust-Recovery Mode can overcome system brittleness after a disruption, reducing trust recovery time by over 75 percent compared to a non-adaptive model. Our findings provide a framework for designing intelligent cobot behaviors that fulfill the Industry 5.0 pillars of human-centricity, sustainability, and resilience.
cs.MM
[334] Failures to Surface Harmful Contents in Video Large Language Models
Yuxin Cao, Wei Song, Derui Wang, Jingling Xue, Jin Song Dong
Main category: cs.MM
TL;DR: VideoLLMs often omit harmful content in video summaries due to design flaws like sparse frame sampling, token downsampling, and weak visual-text integration, leading to a 90% omission rate in tests.
Details
Motivation: To expose safety gaps in VideoLLMs where harmful content is ignored in summaries despite visibility to humans.Method: Root-cause analysis of three design flaws and crafting zero-query black-box attacks to exploit them.
Result: Harmfulness omission rate exceeds 90% in most cases, even when harmful content is clearly present.
Conclusion: Current VideoLLM designs are fundamentally vulnerable, requiring improved sampling, token compression, and decoding mechanisms.
Abstract: Video Large Language Models (VideoLLMs) are increasingly deployed on numerous critical applications, where users rely on auto-generated summaries while casually skimming the video stream. We show that this interaction hides a critical safety gap: if harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention the harmful content in the output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs, (2) spatial information loss introduced by aggressive token downsampling within sampled frames, and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, aligning with these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases. Even when harmful content is clearly present in all frames, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLMs’ designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.
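The first flaw, sparse uniform frame sampling, is easy to quantify: the gap between consecutive sampled frames grows linearly with clip length, so a brief harmful insert can fall entirely between samples. A small sketch with assumed clip parameters:

```python
def uniform_sample_indices(n_frames: int, n_sampled: int) -> list[int]:
    """Uniformly spaced frame sampling, as used by many VideoLLMs."""
    return [round(i * (n_frames - 1) / (n_sampled - 1)) for i in range(n_sampled)]

# A 5-minute clip at 30 fps sampled down to 32 frames:
idx = uniform_sample_indices(n_frames=9000, n_sampled=32)
gap_seconds = (idx[1] - idx[0]) / 30
print(f"one frame every ~{gap_seconds:.1f}s")  # ~9.7s: a short harmful insert
                                               # can be missed entirely
```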
[335] MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks
Lei Zhang, Xin Zhou, Chaoyue He, Di Wang, Yi Wu, Hong Xu, Wei Liu, Chunyan Miao
Main category: cs.MM
TL;DR: The paper introduces MMESGBench, a benchmark dataset for evaluating multimodal understanding and reasoning in ESG reports, addressing the lack of dedicated benchmarks in this domain.
Details
Motivation: ESG reports are complex and multimodal, but existing AI systems struggle with reliable document-level reasoning, and no dedicated benchmark exists for ESG.Method: A human-AI collaborative pipeline generates and verifies QA pairs from ESG documents, involving multimodal LLMs, automated verification, and expert validation.
Result: MMESGBench includes 933 QA pairs from 45 ESG documents, showing multimodal and retrieval-augmented models outperform text-only baselines.
Conclusion: MMESGBench fills a critical gap, providing a robust benchmark for ESG document analysis, with potential to enhance AI capabilities in this domain.
Abstract: Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in the ESG domain. To fill this gap, we introduce MMESGBench, a first-of-its-kind benchmark dataset designed to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.
[336] Mining the Social Fabric: Unveiling Communities for Fake News Detection in Short Videos
Haisong Gong, Bolan Su, Xinrong Zhang, Jing Li, Qiang Liu, Shu Wu, Liang Wang
Main category: cs.MM
TL;DR: DugFND enhances fake news detection in short videos by modeling uploader and event-driven communities using a heterogeneous graph and time-aware attention network.
Details
Motivation: Short videos' rapid spread and multi-modal nature make fake news detection challenging, with existing methods ignoring implicit relationships among videos, uploaders, and events.Method: Proposes DugFND, a dual-community graph method modeling uploader and event-driven communities via a heterogeneous graph and time-aware attention network, with pretraining for better node representations.
Result: Experiments show significant performance gains, validating the dual-community approach.
Conclusion: DugFND effectively improves fake news detection in short videos by leveraging community patterns.
Abstract: Short video platforms have become a major medium for information sharing, but their rapid content generation and algorithmic amplification also enable the widespread dissemination of fake news. Detecting misinformation in short videos is challenging due to their multi-modal nature and the limited context of individual videos. While recent methods focus on analyzing content signals-visual, textual, and audio-they often overlook implicit relationships among videos, uploaders, and events. To address this gap, we propose DugFND (Dual-community graph for fake news detection), a novel method that enhances existing video classifiers by modeling two key community patterns: (1) uploader communities, where uploaders with shared interests or similar content creation patterns group together, and (2) event-driven communities, where videos related to the same or semantically similar public events form localized clusters. We construct a heterogeneous graph connecting uploader, video, and event nodes, and design a time-aware heterogeneous graph attention network to enable effective message passing. A reconstruction-based pretraining phase further improves node representation learning. DugFND can be applied to any pre-trained classifier. Experiments on public datasets show that our method achieves significant performance gains, demonstrating the value of dual-community modeling for fake news detection in short videos.
eess.AS
[337] ASAudio: A Survey of Advanced Spatial Audio Research
Zhiyuan Zhu, Yu Zhang, Wenxiang Guo, Changhao Pan, Zhou Zhao
Main category: eess.AS
TL;DR: A comprehensive survey of spatial audio technologies, reviewing methods, datasets, and benchmarks in AR/VR applications.
Details
Motivation: The lack of systematic surveys on spatial audio despite its growing importance in AR/VR and immersive experiences.Method: Chronological review and categorization of spatial audio studies by input-output representations and tasks.
Result: Organized overview of spatial audio research, including datasets and evaluation metrics.
Conclusion: The paper provides a valuable resource for understanding spatial audio advancements and future directions.
Abstract: With the rapid development of spatial audio technologies today, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. In this paper, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. To address this, we chronologically outline existing work related to spatial audio and categorize these studies based on input-output representations, as well as generation and understanding tasks, thereby summarizing various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives. Related materials are available at https://github.com/dieKarotte/ASAudio.
[338] CleanCTG: A Deep Learning Model for Multi-Artefact Detection and Reconstruction in Cardiotocography
Sheng Wong, Beth Albert, Gabriel Davis Jones
Main category: eess.AS
TL;DR: CleanCTG is a dual-stage deep-learning model for CTG artefact detection and correction, outperforming existing methods and improving clinical decision-making.
Details
Motivation: Current CTG monitoring is hindered by artefacts that obscure true fetal heart rate patterns, leading to misdiagnosis. Existing methods lack comprehensive noise handling.Method: CleanCTG uses multi-scale convolution and context-aware cross-attention to identify artefacts, followed by artefact-specific correction branches. It was trained on synthetic data derived from expert-verified clean recordings.
Result: CleanCTG achieved perfect artefact detection (AU-ROC = 1.00) on synthetic data and improved specificity and decision time in clinical validation.
Conclusion: Explicit artefact removal and signal reconstruction enhance CTG reliability, maintaining diagnostic accuracy and reducing monitoring time.
Abstract: Cardiotocography (CTG) is essential for fetal monitoring but is frequently compromised by diverse artefacts which obscure true fetal heart rate (FHR) patterns and can lead to misdiagnosis or delayed intervention. Current deep-learning approaches typically bypass comprehensive noise handling, applying minimal preprocessing or focusing solely on downstream classification, while traditional methods rely on simple interpolation or rule-based filtering that addresses only missing samples and fail to correct complex artefact types. We present CleanCTG, an end-to-end dual-stage model that first identifies multiple artefact types via multi-scale convolution and context-aware cross-attention, then reconstructs corrupted segments through artefact-specific correction branches. Training utilised over 800,000 minutes of physiologically realistic, synthetically corrupted CTGs derived from expert-verified “clean” recordings. On synthetic data, CleanCTG achieved perfect artefact detection (AU-ROC = 1.00) and reduced mean squared error (MSE) on corrupted segments to 2.74 x 10^-4 (clean-segment MSE = 2.40 x 10^-6), outperforming the next best method by more than 60%. External validation on 10,190 minutes of clinician-annotated segments yielded AU-ROC = 0.95 (sensitivity = 83.44%, specificity = 94.22%), surpassing six comparator classifiers. Finally, when integrated with the Dawes-Redman system on 933 clinical CTG recordings, denoised traces increased specificity (from 80.70% to 82.70%) and shortened median time to decision by 33%. These findings suggest that explicit artefact removal and signal reconstruction can both maintain diagnostic accuracy and enable shorter monitoring sessions, offering a practical route to more reliable CTG interpretation.
[339] Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style
Wonjune Kang, Deb Roy
Main category: eess.AS
TL;DR: The paper introduces expressive speech retrieval, focusing on retrieving speech by style (how something is said) rather than content (what is said). It uses joint embedding of speech and text descriptions for cross-modal retrieval.
Details
Motivation: Prior work focused on speech retrieval by content, but this paper aims to retrieve speech by expressive style using natural language descriptions.
Method: Train speech and text encoders to embed into a joint latent space, enabling text prompts to retrieve matching speech. Analyzes encoder architectures, training criteria, and prompt augmentation.
Result: Strong retrieval performance (Recall@k) on datasets with 22 speaking styles.
Conclusion: The framework effectively retrieves expressive speech using text queries, demonstrating robust performance across diverse speaking styles.
Abstract: We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k.
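To make the cross-modal alignment concrete, the sketch below shows a CLIP-style symmetric contrastive loss over paired speech/style-description embeddings; the paper compares several training criteria, so this specific loss, the temperature value, and the tensor shapes are illustrative assumptions rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (speech, style-description)
    embeddings, each of shape (B, D); matched pairs lie on the diagonal."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Train both retrieval directions: speech-to-text and text-to-speech.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

At test time, retrieval reduces to ranking all speech embeddings by cosine similarity to the query prompt's embedding, which is exactly what Recall@k measures.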
[340] EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens
Joonyong Park, Kenichi Nakamura
Main category: eess.AS
TL;DR: EmoSSLSphere is a multilingual emotional TTS framework using spherical emotion vectors and SSL features for fine-grained control, cross-lingual transfer, and speaker identity preservation. It outperforms baselines in quality and expressiveness.
Details
Motivation: To enhance multilingual emotional TTS with fine-grained emotional control, cross-lingual emotion transfer, and speaker identity preservation.
Method: Combines spherical emotion vectors with SSL-derived discrete token features for semantic and acoustic modeling.
Result: Significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and synthesis quality in English and Japanese corpora.
Conclusion: EmoSSLSphere is a scalable solution for multilingual emotional TTS, excelling in naturalness and emotional expressiveness.
Abstract: This paper introduces EmoSSLSphere, a novel framework for multilingual emotional text-to-speech (TTS) synthesis that combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL). By encoding emotions in a continuous spherical coordinate space and leveraging SSL-based representations for semantic and acoustic modeling, EmoSSLSphere enables fine-grained emotional control, effective cross-lingual emotion transfer, and robust preservation of speaker identity. We evaluate EmoSSLSphere on English and Japanese corpora, demonstrating significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality. Subjective evaluations further confirm that our method outperforms baseline models in terms of naturalness and emotional expressiveness, underscoring its potential as a scalable solution for multilingual emotional TTS.
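The summary does not specify how the spherical emotion vectors are parameterized; one plausible reading, sketched below, encodes an emotion point by a radius (intensity) and two angles, which makes interpolation along the sphere straightforward. The 3-D input space and the intensity interpretation are assumptions.

```python
import numpy as np

def to_spherical(v):
    """Map a 3-D emotion vector to (radius, azimuth, elevation); the radius can
    serve as emotion intensity, the angles as a continuous emotion direction."""
    x, y, z = v
    r = np.linalg.norm(v)
    azimuth = np.arctan2(y, x)                      # angle within the x-y plane
    elevation = np.arcsin(z / r) if r > 0 else 0.0  # angle above the x-y plane
    return r, azimuth, elevation

def from_spherical(r, azimuth, elevation):
    """Inverse mapping, useful e.g. for interpolating between two emotions."""
    return np.array([r * np.cos(elevation) * np.cos(azimuth),
                     r * np.cos(elevation) * np.sin(azimuth),
                     r * np.sin(elevation)])
```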
[341] MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
Heyang Xue, Xuchen Song, Yu Tang, Jianyu Chen, Yanru Chen, Yang Li, Yahui Zhou
Main category: eess.AS
TL;DR: MoE-TTS enhances TTS models for out-of-domain text descriptions using a modality-based MoE approach, outperforming commercial products.
Details
Motivation: Addressing the challenge of diverse, user-generated out-of-domain text descriptions in TTS systems.
Method: Modality-based mixture-of-experts (MoE) augments a pre-trained LLM with specialized speech modality weights, keeping the LLM frozen.
Result: MoE-TTS outperforms commercial TTS products in handling out-of-domain descriptions.
Conclusion: MoE-TTS effectively leverages pre-trained LLMs for better text understanding and speech generation.
Abstract: Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. However, in real-world applications, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality while maintaining the original LLM frozen during training. This approach allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions. We encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.
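A minimal sketch of the modality-based MoE idea, assuming the experts live in the feed-forward sublayers: text tokens keep flowing through the frozen pre-trained FFN while speech tokens are routed to a trainable speech-modality copy. Layer placement, routing granularity, and module names here are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Modality-routed feed-forward layer: a frozen text expert (the original
    LLM weights) plus a trainable speech expert, selected per token."""
    def __init__(self, frozen_text_ffn: nn.Module, d_model: int, d_ff: int):
        super().__init__()
        self.text_ffn = frozen_text_ffn
        for p in self.text_ffn.parameters():
            p.requires_grad = False            # keep pre-trained knowledge intact
        self.speech_ffn = nn.Sequential(       # speech-modality expert, trained
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x, is_speech):
        # x: (B, T, D); is_speech: (B, T) boolean mask marking speech tokens.
        # Both experts run here for clarity; a real implementation would
        # gather/scatter tokens to avoid the redundant compute.
        return torch.where(is_speech.unsqueeze(-1),
                           self.speech_ffn(x), self.text_ffn(x))
```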
[342] Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling
Navin Raj Prabhu, Danilo de Oliveira, Nale Lehmann-Willenbrock, Timo Gerkmann
Main category: eess.AS
TL;DR: The paper proposes a duration modeling framework for Speech Emotion Conversion to enhance emotional expressiveness by modifying speech duration without parallel data.
Details
Motivation: Existing generative models for Speech Emotion Conversion lack control over speech duration, limiting their ability to reflect target emotions accurately.
Method: A resynthesis-based discrete content representation framework is introduced to modify speech duration and achieve controllable speech rates.
Result: The framework significantly improves emotional expressiveness, with low-arousal emotions linked to longer durations and slower rates, and high-arousal emotions to shorter, faster speech.
Conclusion: The proposed duration modeling framework effectively enhances Speech Emotion Conversion by addressing the limitation of duration control.
Abstract: Speech Emotion Conversion aims to modify the emotion expressed in input speech while preserving lexical content and speaker identity. Recently, generative modeling approaches have shown promising results in changing local acoustic properties such as fundamental frequency, spectral envelope and energy, but often lack the ability to control the duration of sounds. To address this, we propose a duration modeling framework using resynthesis-based discrete content representations, enabling modification of speech duration to reflect target emotions and achieve controllable speech rates without using parallel data. Experimental results reveal that the inclusion of the proposed duration modeling framework significantly enhances emotional expressiveness on the in-the-wild MSP-Podcast dataset. Analyses show that low-arousal emotions correlate with longer durations and slower speech rates, while high-arousal emotions produce shorter, faster speech.
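Resynthesis-based duration modeling typically operates on run-length-encoded discrete content units; the toy sketch below shows the de-duplication and re-timing step under that assumption (the actual framework predicts per-unit durations from the target emotion rather than applying one global rate, as done here for brevity).

```python
from itertools import groupby

def rle_units(units):
    """Collapse a frame-level discrete unit sequence into (unit, duration) runs."""
    return [(u, len(list(g))) for u, g in groupby(units)]

def retime(units, rate):
    """Re-expand units with durations scaled by `rate` (>1 slows speech down,
    <1 speeds it up), mimicking duration control for a target emotion."""
    out = []
    for u, d in rle_units(units):
        out.extend([u] * max(1, round(d * rate)))
    return out

# E.g. lengthen durations by 25% for a low-arousal target emotion:
slowed = retime([5, 5, 5, 9, 9, 2], rate=1.25)
```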
[343] Emphasis Sensitivity in Speech Representations
Shaun Cassini, Thomas Hain, Anton Ragni
Main category: eess.AS
TL;DR: The paper explores how modern speech models encode prosodic emphasis, proposing a residual-based framework to analyze differences between neutral and emphasized words. Findings show structured encoding in self-supervised models and more compact subspaces in ASR fine-tuned models.
Details
Motivation: To understand if speech models systematically differentiate emphasized and neutral words, addressing gaps in prior work that overlooked relational structure.
Method: A residual-based framework is introduced, defining emphasis as the difference between paired neutral and emphasized word representations. Analyzed self-supervised and ASR fine-tuned models.
Result: Residuals correlate with duration changes and perform poorly in word identity prediction, indicating structured encoding. ASR models show 50% more compact subspaces for emphasis.
Conclusion: Prosodic emphasis is encoded as a consistent, low-dimensional transformation, becoming more structured with task-specific learning.
Abstract: This work investigates whether modern speech models are sensitive to prosodic emphasis - whether they encode emphasized and neutral words in systematically different ways. Prior work typically relies on isolated acoustic correlates (e.g., pitch, duration) or label prediction, both of which miss the relational structure of emphasis. This paper proposes a residual-based framework, defining emphasis as the difference between paired neutral and emphasized word representations. Analysis on self-supervised speech models shows that these residuals correlate strongly with duration changes and perform poorly at word identity prediction, indicating a structured, relational encoding of prosodic emphasis. In ASR fine-tuned models, residuals occupy a subspace up to 50% more compact than in pre-trained models, further suggesting that emphasis is encoded as a consistent, low-dimensional transformation that becomes more structured with task-specific learning.
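The residual analysis is easy to reproduce in outline: subtract paired word embeddings, then ask how many principal components the residuals occupy. The sketch below is a plausible reading of that procedure; the variance threshold is an assumption.

```python
import numpy as np

def emphasis_residuals(neutral, emphasized):
    """Residual encoding of emphasis: difference of paired word representations,
    both arrays of shape (n_pairs, dim)."""
    return emphasized - neutral

def subspace_dim(residuals, var_threshold=0.95):
    """Number of principal components explaining `var_threshold` of the residual
    variance; fewer components indicate a more compact emphasis subspace."""
    r = residuals - residuals.mean(axis=0)
    s = np.linalg.svd(r, compute_uv=False)      # singular values only
    ratio = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(ratio, var_threshold) + 1)
```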
[344] Generalizable speech deepfake detection via meta-learned LoRA
Janne Laakkonen, Ivan Kukanov, Ville Hautamäki
Main category: eess.AS
TL;DR: The paper proposes a method using LoRA adapters and MLDG for zero-shot speech deepfake detection, achieving better performance with fewer parameters than full fine-tuning.
Details
Motivation: To ensure reliable detection of speech deepfakes even when spoofing attack distributions shift.
Method: Insert LoRA adapters into every attention head of an SSL backbone and train them with MLDG for domain generalization.
Result: The model updates only 1.1% of parameters (3.6M vs. 318M) but outperforms full fine-tuning on 5/6 corpora, reducing EER from 8.84% to 5.30%.
Conclusion: Combining meta-learning with parameter-efficient adaptation is effective for zero-shot, distribution-shift-aware speech deepfake detection.
Abstract: Reliable detection of speech deepfakes (spoofs) must remain effective when the distribution of spoofing attacks shifts. We frame the task as domain generalization and show that inserting Low-Rank Adaptation (LoRA) adapters into every attention head of a self-supervised (SSL) backbone, then training only those adapters with Meta-Learning Domain Generalization (MLDG), yields strong zero-shot performance. The resulting model updates about 3.6 million parameters, roughly 1.1% of the 318 million updated in full fine-tuning, yet surpasses a fully fine-tuned counterpart on five of six evaluation corpora. A first-order MLDG loop encourages the adapters to focus on cues that persist across attack types, lowering the average EER from 8.84% for the fully fine-tuned model to 5.30% with our best MLDG-LoRA configuration. Our findings show that combining meta-learning with parameter-efficient adaptation offers an effective method for zero-shot, distribution-shift-aware speech deepfake detection.
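For readers unfamiliar with first-order MLDG, the sketch below shows one meta-step over per-domain batches, updating only parameters with requires_grad set (here, the LoRA adapters). It is a schematic reconstruction from the MLDG literature, not the authors' code; alpha, beta, and the meta-train/meta-test split are assumptions.

```python
import torch

def mldg_step(model, opt, domain_batches, loss_fn, alpha=1e-3, beta=1.0):
    """One first-order MLDG step: held-in domains give a meta-train gradient,
    a held-out domain is evaluated after a virtual step along that gradient,
    and both gradients update the trainable (adapter) parameters."""
    meta_train, (x_te, y_te) = domain_batches[:-1], domain_batches[-1]
    params = [p for p in model.parameters() if p.requires_grad]

    train_loss = sum(loss_fn(model(x), y) for x, y in meta_train) / len(meta_train)
    train_grads = torch.autograd.grad(train_loss, params)

    backup = [p.detach().clone() for p in params]
    with torch.no_grad():                       # virtual inner update
        for p, g in zip(params, train_grads):
            p -= alpha * g
    test_grads = torch.autograd.grad(loss_fn(model(x_te), y_te), params)
    with torch.no_grad():                       # restore original weights
        for p, b in zip(params, backup):
            p.copy_(b)

    opt.zero_grad()
    for p, g_tr, g_te in zip(params, train_grads, test_grads):
        p.grad = g_tr + beta * g_te             # first-order MLDG combination
    opt.step()
```

The meta-test gradient rewards adapter directions that still help after a step taken on other domains, which is what pushes the adapters toward cues that persist across attack types.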
[345] Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS
M Anuprabha, Krishna Gurugubelli, Anil Kumar Vuppala
Main category: eess.AS
TL;DR: The paper explores F5-TTS for dysarthric speech cloning, revealing biases toward intelligibility over speaker and prosody preservation, and suggests fairness-aware synthesis for inclusivity.
Details
Motivation: Addressing the challenge of limited dysarthric speech data and potential biases in synthetic speech generation for assistive technologies.
Method: Investigates F5-TTS using the TORGO dataset, evaluating intelligibility, speaker similarity, prosody preservation, and fairness metrics like Disparate Impact and Parity Difference.
Result: F5-TTS shows bias toward intelligibility, neglecting speaker and prosody preservation in dysarthric speech synthesis.
Conclusion: Fairness-aware dysarthric speech synthesis can enhance inclusivity in speech technologies.
Abstract: Dysarthric speech poses significant challenges in developing assistive technologies, primarily due to the limited availability of data. Recent advances in neural speech synthesis, especially zero-shot voice cloning, facilitate synthetic speech generation for data augmentation; however, they may introduce biases towards dysarthric speech. In this paper, we investigate the effectiveness of state-of-the-art F5-TTS in cloning dysarthric speech using TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. We also analyze potential biases using fairness metrics like Disparate Impact and Parity Difference to assess disparities across dysarthric severity levels. Results show that F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation in dysarthric speech synthesis. Insights from this study can help integrate fairness-aware dysarthric speech synthesis, fostering the advancement of more inclusive speech technologies.
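The two fairness metrics mentioned have standard definitions; the sketch below applies them to group-level rates of some favorable outcome. Treating "a clone passing an intelligibility threshold" as that outcome, with dysarthria severity as the group attribute, is our assumption for the example values.

```python
def disparate_impact(rate_unpriv, rate_priv):
    """DI = P(favorable | unprivileged) / P(favorable | privileged); 1.0 is
    parity, and values below ~0.8 are conventionally flagged as disparate."""
    return rate_unpriv / rate_priv

def parity_difference(rate_unpriv, rate_priv):
    """PD = P(favorable | unprivileged) - P(favorable | privileged); 0 is parity."""
    return rate_unpriv - rate_priv

# Hypothetical rates of intelligible clones for severe vs. mild dysarthria:
di = disparate_impact(0.60, 0.90)    # ~0.67, well below the 0.8 rule of thumb
pd = parity_difference(0.60, 0.90)   # -0.30
```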
[346] MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs
Xiaoxue Gao, Huayun Zhang, Nancy F. Chen
Main category: eess.AS
TL;DR: MultiAiTutor is a multilingual generative AI tutor designed for child-friendly speech generation in low-resource languages, outperforming baselines in educational applications.
Details
Motivation: Addressing the challenge of high-quality, child-friendly speech generation for low-resource languages in educational settings.
Method: Leverages LLM architecture for age-appropriate multilingual speech generation, tested via image-description tasks in Singaporean-accent Mandarin, Malay, and Tamil.
Result: Superior performance demonstrated through objective metrics and subjective evaluations.
Conclusion: MultiAiTutor effectively enhances language learning for children in diverse cultural contexts.
Abstract: Generative speech models have demonstrated significant potential in personalizing teacher-student interactions, offering valuable real-world applications for language learning in children’s education. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse cultural contexts. In this paper, we propose MultiAiTutor, an educational multilingual generative AI tutor with child-friendly designs, leveraging LLM architecture for speech generation tailored for educational purposes. It integrates age-appropriate multilingual speech generation, facilitating young children’s language learning through culturally relevant image-description tasks in three low-resource languages: Singaporean-accent Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiAiTutor compared to baseline methods.
eess.IV
[347] The Role of Radiographic Knee Alignment in Knee Replacement Outcomes and Opportunities for Artificial Intelligence-Driven Assessment
Zhisen Hu, David S. Johnson, Aleksei Tiulpin, Timothy F. Cootes, Claudia Lindner
Main category: eess.IV
TL;DR: This review explores the role of knee alignment biomarkers in total knee replacement (TKR) outcomes, discusses AI-based methods for assessing alignment from radiographs, and suggests future research directions.
Details
Motivation: Knee osteoarthritis (OA) lacks a cure, and TKR outcomes are hard to predict. Radiographic knee alignment is a key factor influencing these outcomes, but existing reviews focus on OA diagnosis rather than alignment biomarkers.
Method: The review examines current TKR outcome scoring protocols, identifies knee alignment biomarkers, and evaluates AI-based approaches for alignment measurement from radiographs.
Result: AI shows promise in automating knee alignment analysis, but gaps remain in linking alignment biomarkers to TKR outcomes.
Conclusion: Future research should focus on integrating AI for better alignment assessment and outcome prediction in TKR.
Abstract: Prevalent knee osteoarthritis (OA) imposes a substantial burden on health systems, with no cure available. Its ultimate treatment is total knee replacement (TKR). Complications from surgery and recovery are difficult to predict in advance, and numerous factors may affect them. Radiographic knee alignment is one of the key factors impacting TKR outcomes, such as postoperative pain or function. Recently, artificial intelligence (AI) has been introduced to the automatic analysis of knee radiographs, for example, to automate knee alignment measurements. Existing review articles tend to focus on knee OA diagnosis and segmentation of bones or cartilages in MRI rather than exploring knee alignment biomarkers for TKR outcomes and their assessment. In this review, we first examine the current scoring protocols for evaluating TKR outcomes and potential knee alignment biomarkers associated with these outcomes. We then discuss existing AI-based approaches for generating knee alignment biomarkers from knee radiographs, and explore future directions for knee alignment assessment and TKR outcome prediction.
[348] Deep Learning-Based Automated Segmentation of Uterine Myomas
Tausifa Jan Saleem, Mohammad Yaqub
Main category: eess.IV
TL;DR: The paper addresses the need for automated segmentation of uterine fibroids in MRI scans, leveraging deep learning and a public dataset (UMD) to standardize evaluation.
Details
Motivation: Uterine fibroids are prevalent and burdensome, with manual MRI segmentation being labor-intensive and variable. Automated methods are needed.
Method: Uses deep learning algorithms for automated segmentation, utilizing the publicly available Uterine Myoma MRI Dataset (UMD).
Result: Establishes a baseline for automated segmentation, enabling standardized evaluation and comparison in future research.
Conclusion: Public datasets and deep learning can improve uterine fibroid segmentation, addressing clinical challenges and variability.
Abstract: Uterine fibroids (myomas) are the most common benign tumors of the female reproductive system, particularly among women of childbearing age. With a prevalence exceeding 70%, they pose a significant burden on female reproductive health. Clinical symptoms such as abnormal uterine bleeding, infertility, pelvic pain, and pressure-related discomfort play a crucial role in guiding treatment decisions, which are largely influenced by the size, number, and anatomical location of the fibroids. Magnetic Resonance Imaging (MRI) is a non-invasive and highly accurate imaging modality commonly used by clinicians for the diagnosis of uterine fibroids. Segmenting uterine fibroids requires a precise assessment of both the uterus and fibroids on MRI scans, including measurements of volume, shape, and spatial location. However, this process is labor-intensive, time-consuming, and subject to variability due to intra- and inter-expert differences at both pre- and post-treatment stages. As a result, there is a critical need for an accurate and automated segmentation method for uterine fibroids. In recent years, deep learning algorithms have shown remarkable improvements in medical image segmentation, outperforming traditional methods. These approaches offer the potential for fully automated segmentation. Several studies have explored the use of deep learning models to achieve automated segmentation of uterine fibroids. However, most of the previous work has been conducted using private datasets, which poses challenges for validation and comparison between studies. In this study, we leverage the publicly available Uterine Myoma MRI Dataset (UMD) to establish a baseline for automated segmentation of uterine fibroids, enabling standardized evaluation and facilitating future research in this domain.
[349] HistoViT: Vision Transformer for Accurate and Scalable Histopathological Cancer Diagnosis
Faisal Ahmed
Main category: eess.IV
TL;DR: A transformer-based deep learning framework for multi-class tumor classification in histopathological images outperforms existing methods, achieving high accuracy and AUC scores across four cancer types.
Details
Motivation: Accurate and scalable cancer diagnosis is challenging due to complex histological variability in malignancies like breast, prostate, bone, and cervical cancers.
Method: The study uses a fine-tuned Vision Transformer (ViT) architecture with a streamlined preprocessing pipeline to convert whole-slide images into normalized PyTorch tensors.
Result: The model achieves classification accuracies of 99.32%, 96.92%, 95.28%, and 96.94% for breast, prostate, bone, and cervical cancers, respectively, with AUC scores exceeding 99%.
Conclusion: Transformer-based architectures show robustness and clinical potential for automated cancer diagnosis, improving diagnostic reliability and healthcare outcomes.
Abstract: Accurate and scalable cancer diagnosis remains a critical challenge in modern pathology, particularly for malignancies such as breast, prostate, bone, and cervical, which exhibit complex histological variability. In this study, we propose a transformer-based deep learning framework for multi-class tumor classification in histopathological images. Leveraging a fine-tuned Vision Transformer (ViT) architecture, our method addresses key limitations of conventional convolutional neural networks, offering improved performance, reduced preprocessing requirements, and enhanced scalability across tissue types. To adapt the model for histopathological cancer images, we implement a streamlined preprocessing pipeline that converts tiled whole-slide images into PyTorch tensors and standardizes them through data normalization. This ensures compatibility with the ViT architecture and enhances both convergence stability and overall classification performance. We evaluate our model on four benchmark datasets: ICIAR2018 (breast), SICAPv2 (prostate), UT-Osteosarcoma (bone), and SipakMed (cervical), demonstrating consistent outperformance over existing deep learning methods. Our approach achieves classification accuracies of 99.32%, 96.92%, 95.28%, and 96.94% for breast, prostate, bone, and cervical cancers, respectively, with area under the ROC curve (AUC) scores exceeding 99% across all datasets. These results confirm the robustness, generalizability, and clinical potential of transformer-based architectures in digital pathology. Our work represents a significant advancement toward reliable, automated, and interpretable cancer diagnosis systems that can alleviate diagnostic burdens and improve healthcare outcomes.
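The preprocessing-plus-fine-tuning recipe summarized above maps to a few lines of standard PyTorch; the sketch below uses torchvision's ViT-B/16 with ImageNet normalization as a stand-in, since the summary does not name the exact ViT variant or normalization statistics.

```python
import torch
from torchvision import transforms
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Tile preprocessing: resize to the ViT input resolution, convert to a tensor,
# and normalize (ImageNet statistics assumed here).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Fine-tune a pre-trained ViT for a 4-class tumor classification task.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = torch.nn.Linear(model.hidden_dim, 4)   # replace the classifier head
```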
[350] Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension
Zhenhao Li, Long Yang, Xiaojie Yin, Haijun Yu, Jiazhou Wang, Hongbin Han, Weigang Hu, Yixing Huang
Main category: eess.IV
TL;DR: The paper proposes an efficient CT FOV extension framework using the I²SB diffusion model, outperforming existing methods in accuracy and speed.
Details
Motivation: CT scans often suffer from truncated projection data when objects exceed the scanner’s FOV, leading to artifacts. Current deep learning solutions are slow and computationally intensive.
Method: The authors introduce the I²SB diffusion model, which directly maps limited-FOV to extended-FOV images, improving interpretability and speed.
Result: I²SB achieves lower RMSE (49.8HU simulated, 152.0HU real) and faster inference (0.19s per slice) compared to cDDPM and diffusionGAN.
Conclusion: I²SB’s accuracy and efficiency make it ideal for real-time clinical CT FOV extension.
Abstract: Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner’s field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I²SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I²SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I²SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19 s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135 s) and surpassing diffusionGAN (0.58 s), the second fastest. This combination of accuracy and efficiency makes I²SB highly suitable for real-time or clinical deployment.
[351] A Convergent Generalized Krylov Subspace Method for Compressed Sensing MRI Reconstruction with Gradient-Driven Denoisers
Tao Hong, Umberto Villa, Jeffrey A. Fessler
Main category: eess.IV
TL;DR: The paper proposes a generalized Krylov subspace method (GKSM) to efficiently solve optimization problems in compressed sensing MRI, offering computational efficiency and theoretical guarantees.
Details
Motivation: Existing methods like Plug-and-Play and Regularization-by-Denoising lack theoretical guarantees, while gradient-driven denoisers are computationally demanding. GKSM addresses these limitations.
Method: The authors introduce GKSM to solve the optimization problem efficiently and establish its convergence guarantees in nonconvex settings.
Result: Numerical experiments on CS MRI reconstruction validate GKSM’s computational efficiency and theoretical accuracy.
Conclusion: GKSM is a promising solution for linear inverse problems, balancing efficiency and theoretical rigor.
Abstract: Model-based reconstruction plays a key role in compressed sensing (CS) MRI, as it incorporates effective image regularizers to improve the quality of reconstruction. The Plug-and-Play and Regularization-by-Denoising frameworks leverage advanced denoisers (e.g., convolutional neural network (CNN)-based denoisers) and have demonstrated strong empirical performance. However, their theoretical guarantees remain limited, as practical CNNs often violate key assumptions. In contrast, gradient-driven denoisers achieve competitive performance, and the required assumptions for theoretical analysis are easily satisfied. However, solving the associated optimization problem remains computationally demanding. To address this challenge, we propose a generalized Krylov subspace method (GKSM) to solve the optimization problem efficiently. Moreover, we also establish rigorous convergence guarantees for GKSM in nonconvex settings. Numerical experiments on CS MRI reconstruction with spiral and radial acquisitions validate both the computational efficiency of GKSM and the accuracy of the theoretical predictions. The proposed optimization method is applicable to any linear inverse problem.
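For orientation, the optimization problem being solved has the usual regularized least-squares form; the gradient construction on the right is the common RED-style choice for gradient-driven denoisers and is our assumption, not necessarily the paper's exact regularizer:

```latex
\min_{x} \; \tfrac{1}{2}\,\lVert A x - y \rVert_2^2 \;+\; \lambda\, R(x),
\qquad
\nabla R(x) \;\approx\; x - D_{\theta}(x),
```

where A is the undersampled (e.g. spiral or radial) MRI sensing operator, y the acquired k-space data, and D_θ a learned denoiser. Broadly, a generalized Krylov subspace method expands a low-dimensional search space with successive gradient directions and solves a projected subproblem within that space at each iteration, which is what keeps the per-iteration cost low.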
[352] Guiding WaveMamba with Frequency Maps for Image Debanding
Xinyi Wang, Smaranda Tasmoc, Nantheera Anantrasirichai, Angeliki Katsenou
Main category: eess.IV
TL;DR: A method using Wavelet State Space Model and frequency masking to restore banding artifacts in low-bitrate compressed images, outperforming state-of-the-art techniques.
Details
Motivation: Banding artifacts degrade visual quality in smooth regions like skies, especially in user-generated content due to repeated transcoding.
Method: Proposes a banding restoration method with Wavelet State Space Model and frequency masking to preserve high-frequency details.
Result: Outperforms state-of-the-art methods (DBI value of 0.082 on BAND-2k) and preserves textures.
Conclusion: The method effectively suppresses banding while maintaining image quality, validated by visual inspections and benchmarks.
Abstract: Compression at low bitrates in modern codecs often introduces banding artifacts, especially in smooth regions such as skies. These artifacts degrade visual quality and are common in user-generated content due to repeated transcoding. We propose a banding restoration method that employs the Wavelet State Space Model and a frequency masking map to preserve high-frequency details. Furthermore, we provide a benchmark of open-source banding restoration methods and evaluate their performance on two public banding image datasets. Experimentation on the available datasets suggests that the proposed post-processing approach effectively suppresses banding compared to the state-of-the-art method (a DBI value of 0.082 on BAND-2k) while preserving image textures. Visual inspections of the results confirm this. Code and supplementary material are available at: https://github.com/xinyiW915/Debanding-PCS2025.
[353] AnatoMaskGAN: GNN-Driven Slice Feature Fusion and Noise Augmentation for Medical Semantic Image Synthesis
Zonglin Wu, Yule Xue, Qianxiang Hu, Yaoyao Feng, Yuqi Ma, Shanxiong Chen
Main category: eess.IV
TL;DR: AnatoMaskGAN improves medical image synthesis by embedding spatial features, enhancing diversity, and optimizing grayscale-texture representation, outperforming state-of-the-art methods in accuracy and quality.
Details
Motivation: Existing GAN-based methods lack spatial consistency and diversity in complex medical scans, limiting their effectiveness for data augmentation and analysis.
Method: Proposes AnatoMaskGAN with a GNN-based slice-feature fusion module, 3D spatial noise-injection, and a grayscale-texture classifier to enhance spatial and contextual modeling.
Result: Achieves PSNR of 26.50 dB on L2R-OASIS and SSIM of 0.8602 on L2R-Abdomen CT, surpassing current benchmarks.
Conclusion: Each component of AnatoMaskGAN significantly contributes to improved reconstruction accuracy and perceptual quality, validating its design.
Abstract: Medical semantic-mask synthesis boosts data augmentation and analysis, yet most GAN-based approaches still produce one-to-one images and lack spatial consistency in complex scans. To address this, we propose AnatoMaskGAN, a novel synthesis framework that embeds slice-related spatial features to precisely aggregate inter-slice contextual dependencies, introduces diverse image-augmentation strategies, and optimizes deep feature learning to improve performance on complex medical images. Specifically, we design a GNN-based strongly correlated slice-feature fusion module to model spatial relationships between slices and integrate contextual information from neighboring slices, thereby capturing anatomical details more comprehensively; we introduce a three-dimensional spatial noise-injection strategy that weights and fuses spatial features with noise to enhance modeling of structural diversity; and we incorporate a grayscale-texture classifier to optimize grayscale distribution and texture representation during generation. Extensive experiments on the public L2R-OASIS and L2R-Abdomen CT datasets show that AnatoMaskGAN raises PSNR on L2R-OASIS to 26.50 dB (0.43 dB higher than the current state of the art) and achieves an SSIM of 0.8602 on L2R-Abdomen CT, a 0.48 percentage-point gain over the best model, demonstrating its superiority in reconstruction accuracy and perceptual quality. Ablation studies that successively remove the slice-feature fusion module, spatial 3D noise-injection strategy, and grayscale-texture classifier reveal that each component contributes significantly to PSNR, SSIM, and LPIPS, further confirming the independent value of each core design in enhancing reconstruction accuracy and perceptual quality.
[354] LKFMixer: Exploring Large Kernel Feature For Efficient Image Super-Resolution
Yinggan Tang, Quanwei Hu
Main category: eess.IV
TL;DR: LKFMixer, a CNN model with large kernels, mimics self-attention for image super-resolution, achieving better performance and speed than SOTA methods.
Details
Motivation: Self-attention in Transformers is computationally heavy for lightweight models, prompting a need for efficient alternatives.
Method: Uses large convolutional kernels (size 31) and coordinate decomposition to reduce parameters. Includes SFMB for spatial-channel focus and FSB for adaptive feature weighting.
Result: Outperforms SOTA methods, e.g., 0.6dB PSNR gain over SwinIR-light on Manga109, with 5x faster inference.
Conclusion: LKFMixer offers an efficient, high-performance alternative to self-attention for image SR.
Abstract: The success of self-attention (SA) in Transformer demonstrates the importance of non-local information to image super-resolution (SR), but the huge computing power required makes it difficult to implement lightweight models. To solve this problem, we propose a pure convolutional neural network (CNN) model, LKFMixer, which utilizes large convolutional kernels to simulate the ability of self-attention to capture non-local features. Specifically, we increase the kernel size to 31 to obtain as large a receptive field as possible, and reduce the parameters and computations by coordinate decomposition. Meanwhile, a spatial feature modulation block (SFMB) is designed to enhance the focus of feature information on both spatial and channel dimensions. In addition, by introducing a feature selection block (FSB), the model can adaptively adjust the weights between local features and non-local features. Extensive experiments show that the proposed LKFMixer family outperforms other state-of-the-art (SOTA) methods in terms of SR performance and reconstruction quality. In particular, compared with SwinIR-light on the Manga109 dataset, LKFMixer-L achieves a 0.6 dB PSNR improvement at ×4 scale, while the inference speed is 5× faster. The code is available at https://github.com/Supereeeee/LKFMixer.
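The coordinate decomposition mentioned above admits a compact sketch: a k × k depthwise kernel is approximated by sequential 1 × k and k × 1 depthwise convolutions, shrinking per-channel parameters from k² to 2k (961 to 62 for k = 31). Residual connections and normalization are omitted, and this exact block layout is an assumption about the paper's design.

```python
import torch.nn as nn

class DecomposedLargeKernel(nn.Module):
    """Large-kernel depthwise convolution via coordinate decomposition:
    a 1 x k pass followed by a k x 1 pass approximates a k x k kernel."""
    def __init__(self, channels, k=31):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                padding=(0, k // 2), groups=channels)
        self.conv_v = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                padding=(k // 2, 0), groups=channels)

    def forward(self, x):
        # Sequential horizontal then vertical passes cover a k x k receptive field.
        return self.conv_v(self.conv_h(x))
```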
[355] Subcortical Masks Generation in CT Images via Ensemble-Based Cross-Domain Label Transfer
Augustine X. W. Lee, Pak-Hei Yeung, Jagath C. Rajapakse
Main category: eess.IV
TL;DR: The paper proposes an automatic ensemble framework to generate high-quality subcortical segmentation labels for CT scans by leveraging MRI-based models, addressing the lack of labeled CT data.
Details
Motivation: Subcortical segmentation is crucial for understanding brain anatomy and diagnosing brain injuries and disorders, but labeled CT data is scarce compared to MRI.
Method: An ensemble pipeline integrates MRI-based models to label unannotated paired MRI-CT data, creating a comprehensive CT segmentation dataset.
Result: Experiments show superior performance of the framework, and models trained on the generated dataset improve segmentation tasks.
Conclusion: The work provides the first open-source CT subcortical segmentation dataset and tools, advancing research in the field.
Abstract: Subcortical segmentation in neuroimages plays an important role in understanding brain anatomy and facilitating computer-aided diagnosis of traumatic brain injuries and neurodegenerative disorders. However, training accurate automatic models requires large amounts of labelled data. Despite the availability of publicly available subcortical segmentation datasets for Magnetic Resonance Imaging (MRI), a significant gap exists for Computed Tomography (CT). This paper proposes an automatic ensemble framework to generate high-quality subcortical segmentation labels for CT scans by leveraging existing MRI-based models. We introduce a robust ensembling pipeline to integrate them and apply it to unannotated paired MRI-CT data, resulting in a comprehensive CT subcortical segmentation dataset. Extensive experiments on multiple public datasets demonstrate the superior performance of our proposed framework. Furthermore, using our generated CT dataset, we train segmentation models that achieve improved performance on related segmentation tasks. To facilitate future research, we make our source code, generated dataset, and trained models publicly available at https://github.com/SCSE-Biomedical-Computing-Group/CT-Subcortical-Segmentation, marking the first open-source release for CT subcortical segmentation to the best of our knowledge.
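The summary does not state the fusion rule used by the ensembling pipeline; a per-voxel majority vote over the registered model outputs, sketched below, is one simple and common choice.

```python
import numpy as np

def majority_vote(masks):
    """Fuse integer label maps from several MRI-based models (already registered
    to the CT grid) by per-voxel majority vote; all `masks` share one shape."""
    stacked = np.stack(masks)                     # (n_models, *volume_shape)
    n_labels = int(stacked.max()) + 1
    votes = np.stack([(stacked == lbl).sum(axis=0) for lbl in range(n_labels)])
    return votes.argmax(axis=0).astype(masks[0].dtype)
```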
[356] Semi-Supervised Learning with Online Knowledge Distillation for Skin Lesion Classification
Siyamalan Manivannan
Main category: eess.IV
TL;DR: A semi-supervised deep learning approach using ensemble learning and online knowledge distillation improves skin lesion classification, reducing the need for extensive labeled data.
Details
Motivation: To address the challenge of obtaining extensive labeled data for skin lesion analysis, this study aims to reduce annotation burden while maintaining high performance.
Method: The method involves training an ensemble of CNN models with online knowledge distillation to transfer insights among models, enhancing individual and ensemble performance.
Result: The approach outperforms independently trained models and achieves state-of-the-art results on ISIC 2018 and 2019 datasets.
Conclusion: The proposed method offers a resource-efficient solution for skin lesion classification, reducing reliance on labeled data while maintaining high accuracy.
Abstract: Deep Learning has emerged as a promising approach for skin lesion analysis. However, existing methods mostly rely on fully supervised learning, requiring extensive labeled data, which is challenging and costly to obtain. To alleviate this annotation burden, this study introduces a novel semi-supervised deep learning approach that integrates ensemble learning with online knowledge distillation for enhanced skin lesion classification. Our methodology involves training an ensemble of convolutional neural network models, using online knowledge distillation to transfer insights from the ensemble to its members. This process aims to enhance the performance of each model within the ensemble, thereby elevating the overall performance of the ensemble itself. Post-training, any individual model within the ensemble can be deployed at test time, as each member is trained to deliver comparable performance to the ensemble. This is particularly beneficial in resource-constrained environments. Experimental results demonstrate that the knowledge-distilled individual model performs better than independently trained models. Our approach demonstrates superior performance on both the International Skin Imaging Collaboration 2018 and 2019 public benchmark datasets, surpassing current state-of-the-art results. By leveraging ensemble learning and online knowledge distillation, our method reduces the need for extensive labeled data while providing a more resource-efficient solution for skin lesion classification in real-world scenarios.
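One standard way to realize online knowledge distillation in an ensemble, consistent with the description above, is to let every member fit the labels while also matching the (detached) ensemble-average distribution; the temperature and mixing weight below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def online_distillation_loss(member_logits, labels, T=3.0, alpha=0.5):
    """Each ensemble member is trained on the labels (CE) and pulled toward
    the ensemble-average soft distribution (KD), so members teach each other.
    `member_logits` is a list of (B, C) tensors, one per member."""
    ensemble = torch.stack(member_logits).mean(dim=0).detach()
    soft_target = F.softmax(ensemble / T, dim=-1)
    total = 0.0
    for logits in member_logits:
        ce = F.cross_entropy(logits, labels)
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1), soft_target,
                      reduction="batchmean") * T * T   # usual T^2 rescaling
        total = total + alpha * ce + (1 - alpha) * kd
    return total / len(member_logits)
```

Because every member is distilled toward the ensemble during training, any single member can be deployed alone at test time, which is the resource-efficiency argument the paper makes.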
[357] Synthetic Data for Robust Stroke Segmentation
Liam Chalcroft, Ioannis Pappas, Cathy J. Price, John Ashburner
Main category: eess.IV
TL;DR: A novel synthetic data framework for stroke lesion segmentation reduces reliance on high-resolution images and annotated data, improving clinical applicability.
Details
Motivation: Overcome limitations of current deep learning methods that depend on extensive annotated data and high-resolution images for lesion segmentation in neuroimaging.
Method: Extends SynthSeg with lesion-specific augmentations, uses a modified nnUNet architecture, and trains with label maps from healthy and stroke datasets.
Result: Matches state-of-the-art performance in-domain and outperforms on out-of-domain data, enabling cross-sequence applicability.
Conclusion: The framework enhances clinical neuroimaging workflows, particularly for stroke pathology, with publicly available tools.
Abstract: Current deep learning-based approaches to lesion segmentation in neuroimaging often depend on high-resolution images and extensive annotated data, limiting clinical applicability. This paper introduces a novel synthetic data framework tailored for stroke lesion segmentation, expanding the SynthSeg methodology to incorporate lesion-specific augmentations that simulate diverse pathological features. Using a modified nnUNet architecture, our approach trains models with label maps from healthy and stroke datasets, facilitating segmentation across both normal and pathological tissue without reliance on specific sequence-based training. Evaluation across in-domain and out-of-domain (OOD) datasets reveals that our method matches state-of-the-art performance within the training domain and significantly outperforms existing methods on OOD data. By minimizing dependence on large annotated datasets and allowing for cross-sequence applicability, our framework holds potential to improve clinical neuroimaging workflows, particularly in stroke pathology. PyTorch training code and weights are publicly available at https://github.com/liamchalcroft/SynthStroke, along with an SPM toolbox featuring a plug-and-play model at https://github.com/liamchalcroft/SynthStrokeSPM.
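The SynthSeg-style generative step at the heart of this framework can be sketched in a few lines: sample random intensity statistics per label and render an image from the label map, so the network never sees a fixed acquisition contrast. Lesion-specific augmentations would edit the label map before this step; blurring, bias fields, and resampling are omitted here for brevity.

```python
import numpy as np

def synth_image_from_labels(label_map, rng=None):
    """Render a random-contrast image from a segmentation label map by drawing
    a Gaussian intensity model per label (SynthSeg-style, heavily simplified)."""
    rng = rng if rng is not None else np.random.default_rng()
    img = np.zeros(label_map.shape, dtype=np.float32)
    for lbl in np.unique(label_map):
        mu, sigma = rng.uniform(0.0, 255.0), rng.uniform(1.0, 25.0)
        mask = label_map == lbl
        img[mask] = rng.normal(mu, sigma, size=int(mask.sum()))
    return img
```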
[358] Automatic brain tumor segmentation in 2D intra-operative ultrasound images using magnetic resonance imaging tumor annotations
Mathilde Faanes, Ragnhild Holden Helland, Ole Solheim, Sébastien Muller, Ingerid Reinertsen
Main category: eess.IV
TL;DR: MRI annotations can substitute iUS annotations for training deep learning models in brain tumor segmentation, achieving comparable performance.
Details
Motivation: The lack of large annotated iUS datasets limits model performance, prompting the use of more accessible MRI annotations.
Method: Used 180 annotated MRI scans and 29 annotated iUS images, transferring MRI annotations to iUS via registration, and trained nnU-Net models.
Result: No significant difference in Dice scores between models trained with MRI, iUS, or both annotations. Best model scored 0.62±0.31 vs. expert’s 0.67±0.25.
Conclusion: MRI annotations are viable for training iUS segmentation models, with performance improving when smaller tumors are excluded from training.
Abstract: Automatic segmentation of brain tumors in intra-operative ultrasound (iUS) images could facilitate localization of tumor tissue during resection surgery. The lack of large annotated datasets limits the performance of current models. In this paper, we investigated the use of tumor annotations in magnetic resonance imaging (MRI) scans, which are more accessible than annotations in iUS images, for training of deep learning models for iUS brain tumor segmentation. We used 180 annotated MRI scans with corresponding unannotated iUS images, and 29 annotated iUS images. Image registration was performed to transfer the MRI annotations to the corresponding iUS images before training the nnU-Net model with different configurations of the data and label origins. The results showed no significant difference in Dice score for a model trained with only MRI-annotated tumors compared to models trained with only iUS annotations or both, and to expert annotations, indicating that MRI tumor annotations can be used as a substitute for iUS tumor annotations to train a deep learning model for automatic brain tumor segmentation in iUS images. The best model obtained an average Dice score of 0.62 ± 0.31, compared to 0.67 ± 0.25 for an expert neurosurgeon; performance on larger tumors was similar, but lower for the models on smaller tumors. In addition, the results showed that removing smaller tumors from the training sets improved the results. The main models are available here: https://github.com/mathildefaanes/us_brain_tumor_segmentation/tree/main
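The Dice similarity coefficient used throughout this comparison is worth stating explicitly; a minimal implementation for binary masks:

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice similarity coefficient between two binary masks:
    2|P ∩ T| / (|P| + |T|), ranging from 0 (no overlap) to 1 (identical)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```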
[359] GDSR: Global-Detail Integration through Dual-Branch Network with Wavelet Losses for Remote Sensing Image Super-Resolution
Qiwei Zhu, Kai Li, Guojing Zhang, Xiaoying Wang, Jianqiang Huang, Xilai Li
Main category: eess.IV
TL;DR: The paper introduces GDSR, a dual-branch structure combining RWKV and convolutional operations for RSI-SR, addressing global-local dependency gaps and computational inefficiency. It outperforms HAT with fewer resources.
Details
Motivation: Existing SR methods fail to balance global and local dependencies and are computationally expensive for large-scale RSIs.
Method: Proposes GDSR with RWKV and convolutional branches, linked by GDRM, and uses a wavelet-domain loss for fidelity.
Result: GDSR surpasses HAT by 0.09 dB PSNR, uses 63% parameters, 51% FLOPs, and is 3.2x faster.
Conclusion: GDSR effectively balances global-local features and computational efficiency, advancing RSI-SR performance.
Abstract: In recent years, deep neural networks, including Convolutional Neural Networks, Transformers, and State Space Models, have achieved significant progress in Remote Sensing Image (RSI) Super-Resolution (SR). However, existing SR methods typically overlook the complementary relationship between global and local dependencies. These methods either focus on capturing local information or prioritize global information, which results in models that are unable to effectively capture both global and local features simultaneously. Moreover, their computational cost becomes prohibitive when applied to large-scale RSIs. To address these challenges, we introduce the novel application of Receptance Weighted Key Value (RWKV) to RSI-SR, which captures long-range dependencies with linear complexity. To simultaneously model global and local features, we propose the Global-Detail dual-branch structure, GDSR, which performs SR by paralleling RWKV and convolutional operations to handle large-scale RSIs. Furthermore, we introduce the Global-Detail Reconstruction Module (GDRM) as an intermediary between the two branches to bridge their complementary roles. In addition, we propose the Dual-Group Multi-Scale Wavelet Loss, a wavelet-domain constraint mechanism via dual-group subband strategy and cross-resolution frequency alignment for enhanced reconstruction fidelity in RSI-SR. Extensive experiments under two degradation methods on several benchmarks, including AID, UCMerced, and RSSRD-QH, demonstrate that GDSR outperforms the state-of-the-art Transformer-based method HAT by an average of 0.09 dB in PSNR, while using only 63% of its parameters and 51% of its FLOPs, achieving an inference speed 3.2 times faster.
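The wavelet-domain constraint can be approximated with an off-the-shelf wavelet transform; the sketch below computes a plain multi-scale subband L1 loss with PyWavelets as a simplified stand-in for the paper's dual-group, cross-resolution variant.

```python
import numpy as np
import pywt

def wavelet_l1(pred, target, wavelet="haar", levels=2):
    """Multi-scale wavelet-domain L1 between two 2-D images: compares the
    approximation band plus per-level detail subbands instead of raw pixels."""
    cp = pywt.wavedec2(pred, wavelet, level=levels)
    ct = pywt.wavedec2(target, wavelet, level=levels)
    loss = np.abs(cp[0] - ct[0]).mean()           # low-frequency approximation
    for (ph, pv, pd), (th, tv, td) in zip(cp[1:], ct[1:]):
        loss += (np.abs(ph - th).mean() +         # horizontal details
                 np.abs(pv - tv).mean() +         # vertical details
                 np.abs(pd - td).mean())          # diagonal details
    return loss
```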
[360] HealthiVert-GAN: A Novel Framework of Pseudo-Healthy Vertebral Image Synthesis for Interpretable Compression Fracture Grading
Qi Zhang, Cheng Chuang, Shunan Zhang, Ziqi Zhao, Kun Wang, Jun Xu, Jianqi Sun
Main category: eess.IV
TL;DR: A novel framework, HealthiVert-GAN, is introduced to improve OVCFs grading by synthesizing pseudo-healthy vertebral images and using RHLV for precise height loss measurement, achieving high classification performance.
Details
Motivation: Current methods for assessing OVCFs lack pre-fracture CT scans, standardized references, and interpretability in deep learning models, leading to errors and variability.
Method: The framework uses a coarse-to-fine synthesis network to generate pseudo-healthy vertebral images, integrates auxiliary modules for anatomical consistency, and employs RHLV and SVM for quantification and classification.
Result: The model achieves state-of-the-art performance on Verse2019 and in-house datasets, providing cross-sectional height loss maps for clinical use.
Conclusion: HealthiVert-GAN enhances diagnostic accuracy and surgical decision-making by addressing current limitations in OVCFs assessment.
Abstract: Osteoporotic vertebral compression fractures (OVCFs) are prevalent in the elderly population, typically assessed on computed tomography (CT) scans by evaluating vertebral height loss. This assessment helps determine the fracture’s impact on spinal stability and the need for surgical intervention. However, the absence of pre-fracture CT scans and standardized vertebral references leads to measurement errors and inter-observer variability, while irregular compression patterns further challenge the precise grading of fracture severity. While deep learning methods have shown promise in aiding OVCFs screening, they often lack interpretability and sufficient sensitivity, limiting their clinical applicability. To address these challenges, we introduce a novel vertebra synthesis-height loss quantification-OVCFs grading framework. Our proposed model, HealthiVert-GAN, utilizes a coarse-to-fine synthesis network designed to generate pseudo-healthy vertebral images that simulate the pre-fracture state of fractured vertebrae. This model integrates three auxiliary modules that leverage the morphology and height information of adjacent healthy vertebrae to ensure anatomical consistency. Additionally, we introduce the Relative Height Loss of Vertebrae (RHLV) as a quantification metric, which divides each vertebra into three sections to measure height loss between pre-fracture and post-fracture states, followed by fracture severity classification using a Support Vector Machine (SVM). Our approach achieves state-of-the-art classification performance on both the Verse2019 dataset and in-house dataset, and it provides cross-sectional distribution maps of vertebral height loss. This practical tool enhances diagnostic accuracy in clinical settings and assists in surgical decision-making.
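The RHLV metric admits a compact statement; the formalization below is our reading of the description (per-section height loss relative to the synthesized pseudo-healthy vertebra, with anterior/middle/posterior as a plausible choice of the three sections), not the paper's verbatim definition:

```latex
\mathrm{RHLV}_s \;=\; \frac{h^{\mathrm{synth}}_s - h^{\mathrm{obs}}_s}{h^{\mathrm{synth}}_s},
\qquad s \in \{\text{anterior},\ \text{middle},\ \text{posterior}\},
```

where h^synth_s is the section height measured on the HealthiVert-GAN output and h^obs_s on the observed fractured vertebra; the three per-section values then feed the SVM grading.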
[361] Pathology-Guided AI System for Accurate Segmentation and Diagnosis of Cervical Spondylosis
Qi Zhang, Xiuyuan Chen, Ziyi He, Lianming Wu, Kun Wang, Jianqi Sun, Hongxing Shen
Main category: eess.IV
TL;DR: An AI-assisted system for automated segmentation and diagnosis of cervical spondylosis from MRI images, outperforming existing methods with high accuracy and precision.
Details
Motivation: To address the labor-intensive and error-prone manual interpretation of cervical spine MRI in diagnosing cervical spondylosis.
Method: Developed an AI system with a pathology-guided segmentation model and expert-based diagnostic framework, trained on multi-center MRI datasets.
Result: Achieved high segmentation accuracy (Dice >0.90) and diagnostic precision, with low errors in clinical indicators like Cobb angle and MSCC.
Conclusion: The system sets a new benchmark for automated cervical spondylosis diagnosis, offering efficient and accurate clinical assessment.
Abstract: Cervical spondylosis, a complex and prevalent condition, demands precise and efficient diagnostic techniques for accurate assessment. While MRI offers detailed visualization of cervical spine anatomy, manual interpretation remains labor-intensive and prone to error. To address this, we developed an innovative AI-assisted Expert-based Diagnosis System that automates both segmentation and diagnosis of cervical spondylosis using MRI. Leveraging multi-center datasets of cervical MRI images from patients with cervical spondylosis, our system features a pathology-guided segmentation model capable of accurately segmenting key cervical anatomical structures. The segmentation is followed by an expert-based diagnostic framework that automates the calculation of critical clinical indicators. Our segmentation model achieved an impressive average Dice coefficient exceeding 0.90 across four cervical spinal anatomies and demonstrated enhanced accuracy in herniation areas. Diagnostic evaluation further showcased the system’s precision, with the lowest mean absolute errors (MAE) for the C2-C7 Cobb angle and the Maximum Spinal Cord Compression (MSCC) coefficient. In addition, our method delivered high accuracy, precision, recall, and F1 scores in herniation localization, K-line status assessment, T2 hyperintensity detection, and Kang grading. Comparative analysis and external validation demonstrate that our system outperforms existing methods, establishing a new benchmark for segmentation and diagnostic tasks for cervical spondylosis.
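Among the reported indicators, the MSCC coefficient has a widely used standard definition, reproduced below for reference (we assume the system follows this form):

```latex
\mathrm{MSCC} \;=\; \Bigl(1 - \frac{d_i}{(d_a + d_b)/2}\Bigr) \times 100\%,
```

where d_i is the anteroposterior spinal cord diameter at the most compressed level, and d_a, d_b are the diameters at the nearest normal levels above and below it.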
[362] HepatoGEN: Generating Hepatobiliary Phase MRI with Perceptual and Adversarial Models
Jens Hooge, Gerard Sanroma-Guell, Faidra Stavropoulou, Alexander Ullmann, Gesine Knobloch, Mark Klemens, Carola Schmidt, Sabine Weckbach, Andreas Bolz
Main category: eess.IV
TL;DR: A deep learning approach synthesizes hepatobiliary phase (HBP) images from earlier contrast phases to reduce scan time, comparing U-Net, pGAN, and DDPM models. pGAN performed best quantitatively but had inconsistencies, while U-Net was more consistent.
Details
Motivation: Prolonged HBP scan times compromise patient comfort and scanner throughput, necessitating a faster alternative.
Method: Three generative models (perceptual U-Net, pGAN, DDPM) were trained on multi-site DCE-MRI data, evaluated using pixel-wise, perceptual metrics, and radiologist reviews.
Result: pGAN showed superior quantitative performance but inconsistent contrast; U-Net was more reliable with fewer artifacts. DDPM underperformed.
Conclusion: Synthetic HBP generation reduces scan time without losing diagnostic value, showcasing deep learning’s clinical potential in liver MRI.
Abstract: Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays a crucial role in the detection and characterization of focal liver lesions, with the hepatobiliary phase (HBP) providing essential diagnostic information. However, acquiring HBP images requires prolonged scan times, which may compromise patient comfort and scanner throughput. In this study, we propose a deep learning-based approach for synthesizing HBP images from earlier contrast phases (precontrast and transitional) and compare three generative models: a perceptual U-Net, a perceptual GAN (pGAN), and a denoising diffusion probabilistic model (DDPM). We curated a multi-site DCE-MRI dataset from diverse clinical settings and introduced a contrast evolution score (CES) to assess training data quality, enhancing model performance. Quantitative evaluation using pixel-wise and perceptual metrics, combined with qualitative assessment through blinded radiologist reviews, showed that pGAN achieved the best quantitative performance but introduced heterogeneous contrast in out-of-distribution cases. In contrast, the U-Net produced consistent liver enhancement with fewer artifacts, while DDPM underperformed due to limited preservation of fine structural details. These findings demonstrate the feasibility of synthetic HBP image generation as a means to reduce scan time without compromising diagnostic utility, highlighting the clinical potential of deep learning for dynamic contrast enhancement in liver MRI. A project demo is available at: https://jhooge.github.io/hepatogen
[363] From Explainable to Explained AI: Ideas for Falsifying and Quantifying Explanations
Yoni Schirris, Eric Marcus, Jonas Teuwen, Hugo Horlings, Efstratios Gavves
Main category: eess.IV
TL;DR: A human-machine-VLM interaction system is proposed to explain deep learning models in computational pathology, using AI-integrated tools and vision-language models to test and quantify explanations.
Details
Motivation: To ensure clinical integration of medical image analysis systems by identifying spurious features or novel biological insights in deep learning models.
Method: A system combining human-machine-VLM interaction, AI-integrated slide viewer for sliding-window experiments, and quantification of explanations using vision-language models.
Result: The system allows qualitative testing of explanations and quantifiably distinguishes competing explanations.
Conclusion: Provides a practical path from explainable AI to explained AI in digital pathology, with potential broader applications.
Abstract: Explaining deep learning models is essential for clinical integration of medical image analysis systems. A good explanation highlights if a model depends on spurious features that undermine generalization and harm a subset of patients or, conversely, may present novel biological insights. Although techniques like GradCAM can identify influential features, they are measurement tools that do not themselves form an explanation. We propose a human-machine-VLM interaction system tailored to explaining classifiers in computational pathology, including multi-instance learning for whole-slide images. Our proof of concept comprises (1) an AI-integrated slide viewer to run sliding-window experiments to test claims of an explanation, and (2) quantification of an explanation’s predictiveness using general-purpose vision-language models. The results demonstrate that this allows us to qualitatively test claims of explanations and can quantifiably distinguish competing explanations. This offers a practical path from explainable AI to explained AI in digital pathology and beyond. Code and prompts are available at https://github.com/nki-ai/x2x.