Daily arXiv Papers - 2025-08-12

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction

Juliana Resplande Sant’anna Gomes, Arlindo Rodrigues Galvão Filho

Main category: cs.CL

TL;DR: The paper introduces a methodology to enrich Portuguese news datasets with external evidence for semi-automated fact-checking, using LLMs and search APIs.

Motivation: Addressing the lack of Portuguese datasets with external evidence for robust fact-checking systems.

Method: Uses LLMs (Gemini 1.5 Flash) to extract claims and search APIs (Google) to retrieve evidence, plus data validation.

Result: Enhanced Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence.

Conclusion: The methodology improves dataset quality for developing semi-automated fact-checking systems in Portuguese.

Abstract: The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets that integrate external evidence, an essential component for developing robust SAFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence. The approach simulates a user’s verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora.
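To make the evidence-retrieval step concrete, here is a minimal sketch of querying the Google FactCheck Claims Search API for a previously extracted claim. The endpoint follows Google's Fact Check Tools API; the claim-extraction LLM call is elided, and the simplified parsing is an assumption rather than the authors' exact pipeline.

```python
# Sketch only: evidence retrieval for one extracted claim via the public
# Fact Check Tools API (claims:search). Error handling is minimal.
import requests

def retrieve_evidence(claim: str, api_key: str, lang: str = "pt"):
    resp = requests.get(
        "https://factchecktools.googleapis.com/v1alpha1/claims:search",
        params={"query": claim, "languageCode": lang, "key": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("claims", [])  # list of fact-checked claim records
```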

[2] Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models

Yao Ge, Sudeshna Das, Yuting Guo, Abeed Sarker

Main category: cs.CL

TL;DR: The paper explores dynamic prompting with retrieval-augmented generation (RAG) to improve few-shot biomedical NER performance using LLMs like GPT-4, GPT-3.5, and LLaMA 3-70B.

Motivation: Addressing performance challenges of LLMs in few-shot biomedical NER by leveraging dynamic prompting and RAG.

Method: Implemented static and dynamic prompt engineering, using similarity-based retrieval for in-context examples and dynamically updating prompts during inference.

Result: Dynamic prompting improved F1-scores by 7.3% (5-shot) and 5.6% (10-shot), outperforming static prompting.

Conclusion: Contextually adaptive prompts via RAG enhance biomedical NER, demonstrating the effectiveness of dynamic prompting.

Abstract: Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
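As a rough illustration of the dynamic prompting loop, the sketch below selects the k annotated examples most similar to each input via TF-IDF and assembles a fresh prompt per instance. The function names and prompt template are assumptions; the paper also evaluates SBERT-based retrieval.

```python
# Minimal sketch: per-instance in-context example selection with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_dynamic_prompt(input_text, pool_texts, pool_labels, k=5):
    vec = TfidfVectorizer().fit(pool_texts + [input_text])
    sims = cosine_similarity(vec.transform([input_text]),
                             vec.transform(pool_texts))[0]
    top = sims.argsort()[::-1][:k]  # indices of the k most similar examples
    shots = "\n\n".join(f"Text: {pool_texts[i]}\nEntities: {pool_labels[i]}"
                        for i in top)
    return f"{shots}\n\nText: {input_text}\nEntities:"
```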

[3] CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

Lei Jiang, Fan Chen

Main category: cs.CL

TL;DR: The paper introduces CarbonScaling, a framework extending neural scaling laws to include carbon emissions in LLM training, revealing inefficiencies and offering optimization insights.

Motivation: To address the overlooked exponential carbon emissions from large language models (LLMs) by integrating carbon footprint into neural scaling laws.

Method: Develops CarbonScaling, combining neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation models to link accuracy and carbon.

Result: Finds a power-law relationship between accuracy and carbon, with real-world inefficiencies significantly increasing the scaling factor. Hardware scaling helps small to mid-sized models but offers diminishing returns for extremely large LLMs.

Conclusion: CarbonScaling provides insights for sustainable LLM training, highlighting optimizations like critical batch size scaling to reduce inefficiencies.

Abstract: Neural scaling laws have driven the development of increasingly large language models (LLMs) by linking accuracy improvements to growth in parameter count, dataset size, and compute. However, these laws overlook the carbon emissions that scale exponentially with LLM size. This paper presents CarbonScaling, an analytical framework that extends neural scaling laws to incorporate both operational and embodied carbon in LLM training. By integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation, CarbonScaling quantitatively connects model accuracy to carbon footprint. Results show that while a power-law relationship between accuracy and carbon holds, real-world inefficiencies significantly increase the scaling factor. Hardware technology scaling reduces carbon emissions for small to mid-sized models, but offers diminishing returns for extremely large LLMs due to communication overhead and underutilized GPUs. Training optimizations, especially aggressive critical batch size scaling, help alleviate this inefficiency. CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs.
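The claimed power-law relationship can be checked with a simple log-log fit. The numbers below are synthetic placeholders, not results from the paper; the sketch only shows how an accuracy-carbon exponent would be estimated.

```python
# Toy illustration: if loss L = a * C^(-b) for carbon C, a linear fit in
# log-log space recovers the exponent b. All values here are made up.
import numpy as np

carbon = np.array([1e2, 1e3, 1e4, 1e5])  # tCO2e (synthetic)
loss = np.array([3.2, 2.6, 2.1, 1.7])    # validation loss (synthetic)

slope, log_a = np.polyfit(np.log(carbon), np.log(loss), 1)
print(f"exponent b = {-slope:.3f}, prefactor a = {np.exp(log_a):.3f}")
```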

[4] The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: The paper addresses the understudied aspect of tokenization in multilingual LLMs, proposing a novel algorithm for data composition and pretokenization strategies to improve efficiency and model performance, particularly in linguistically diverse contexts like Indic scripts.

Motivation: Tokenization in multilingual LLMs is often overlooked despite its impact on efficiency and performance. The study aims to improve token-to-word ratios and model quality by analyzing vocabulary size, pretokenization rules, and training-corpus composition.

Method: The study conducts experiments on Indic scripts to analyze tokenization challenges. It proposes a novel algorithm for balanced multilingual data composition and evaluates pretokenization strategies.

Result: The proposed algorithm reduces the average token-to-word ratio by 6% and achieves over 40% improvement against state-of-the-art multilingual Indic models, enhancing model performance and inference speed.

Conclusion: Tokenization is a critical factor for efficient multilingual LLMs, alongside architecture and training objectives. The study’s methods significantly improve tokenization efficiency and model performance in diverse linguistic contexts.

Abstract: While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our pretokenization strategies, informed by these observations, significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than a 40% improvement in average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
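The paper's central efficiency metric, the token-to-word ratio, is easy to compute for any candidate tokenizer. A sketch, assuming a Hugging Face tokenizer and whitespace word segmentation (a simplification for many Indic scripts):

```python
# Sketch: average tokens per word over a corpus; lower means better use of
# the context window. The tokenizer name is a placeholder.
from transformers import AutoTokenizer

def token_to_word_ratio(texts, tokenizer_name="bert-base-multilingual-cased"):
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(len(tok.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)  # whitespace words
    return n_tokens / n_words
```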

[5] Factor Augmented Supervised Learning with Text Embeddings

Zhanye Luo, Yuefeng Han, Xiufan Yu

Main category: cs.CL

TL;DR: AEALT is a supervised framework that reduces the dimensionality of LLM embeddings using an augmented autoencoder, improving efficiency and performance in downstream tasks.

Motivation: High-dimensional LLM embeddings hinder efficiency and increase computational costs in downstream tasks.

Method: Extracts embeddings from text, then uses a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors.

Result: AEALT outperforms conventional methods in classification, anomaly detection, and prediction tasks.

Conclusion: AEALT provides significant improvements over raw embeddings and standard dimension reduction techniques.

Abstract: Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods.
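A minimal sketch of the supervised augmented autoencoder idea, assuming PyTorch: reconstruct the embedding while a small head predicts the label from the low-dimensional code, so the latent factors remain task-relevant. Layer sizes and the loss weight are illustrative, not the paper's configuration.

```python
# Sketch: factor-augmented dimension reduction with a supervision signal.
import torch.nn as nn
import torch.nn.functional as F

class AugmentedAutoencoder(nn.Module):
    def __init__(self, embed_dim=1536, latent_dim=32, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, embed_dim))
        self.head = nn.Linear(latent_dim, n_classes)  # supervision on the code

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.head(z), z

def aealt_loss(x, y, recon, logits, alpha=0.5):
    # reconstruction keeps the code faithful; alpha weights task relevance
    return F.mse_loss(recon, x) + alpha * F.cross_entropy(logits, y)
```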

[6] SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

Ziqi Liu, Yangbin Chen, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan, Zhijie Xu

Main category: cs.CL

TL;DR: SEVADE is a self-evolving multi-agent framework for sarcasm detection, using dynamic reasoning and decoupled evaluation to reduce hallucination and improve accuracy.

Motivation: Existing methods for sarcasm detection are limited by single-perspective analysis and hallucination, impacting reliability.

Method: Proposes SEVADE with a Dynamic Agentive Reasoning Engine (DARE) and rationale adjudicator (RA) for decoupled reasoning and classification.

Result: Achieves state-of-the-art performance with 6.75% higher Accuracy and 6.29% higher Macro-F1 score.

Conclusion: SEVADE effectively addresses limitations of existing methods, offering a robust solution for sarcasm detection.

Abstract: Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose SEVADE, a novel Self-Evolving multi-agent Analysis framework with Decoupled Evaluation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of 6.75% in Accuracy and 6.29% in Macro-F1 score.

[7] Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

Ying Liu, Can Li, Ting Zhang, Mei Wang, Qiannan Zhu, Jian Li, Hua Huang

Main category: cs.CL

TL;DR: The paper evaluates LLMs’ ability to adaptively guide learners, proposing GuideEval for benchmarking and a behavior-guided finetuning strategy to improve tutoring performance.

Motivation: Prior research overlooks adaptive guidance in LLM tutoring, focusing only on Socratic questioning. This study aims to assess and enhance LLMs' ability to dynamically adjust teaching strategies based on learners' cognitive states.

Method: The study introduces GuideEval, a benchmark evaluating pedagogical guidance through a three-phase framework (Perception, Orchestration, Elicitation). It also proposes behavior-guided finetuning using instructional dialogues.

Result: Existing LLMs often fail in adaptive scaffolding for confused learners. The proposed finetuning strategy significantly improves guidance performance.

Conclusion: The work advocates a dialogic paradigm for evaluating Socratic LLMs, shifting focus from content to learner-centered interaction.

Abstract: The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their capacity for Socratic questioning, it often overlooks a critical dimension: adaptively guiding learners based on their cognitive states. This study shifts focus from mere question generation to the broader instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners' understanding? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical findings reveal that existing LLMs frequently fail to provide effective adaptive scaffolding when learners exhibit confusion or require redirection. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, significantly enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs.

[8] LLM Unlearning Without an Expert Curated Dataset

Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger

Main category: cs.CL

TL;DR: The paper introduces an automated method for generating high-quality forget sets for post-hoc unlearning in large language models, using synthetic textbook-style data created by the models themselves.

Motivation: Large language models often encode sensitive or copyrighted knowledge, necessitating efficient unlearning methods without full retraining. Current approaches struggle with creating effective forget sets.

Method: A scalable, automated approach synthesizes textbook-style data via structured prompting, requiring only a domain name as input.

Result: The synthetic datasets outperform baseline alternatives and match expert-curated ones in unlearning biosecurity, cybersecurity, and Harry Potter domains. Data diversity from multi-step generation enhances unlearning utility.

Conclusion: Synthetic datasets provide a practical, scalable solution for unlearning in emerging domains without manual effort, with code and data publicly released.

Abstract: Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning-the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets-datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.
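A rough sketch of the structured prompting pipeline given only a domain name: ask a model for a textbook outline, then expand each chapter into passages that form the forget set. `generate` stands in for any chat-LLM call; the prompts are illustrative, not the authors'.

```python
# Sketch: multi-step synthetic forget-set generation from a domain name.
def synthesize_forget_set(domain, generate, n_chapters=10):
    outline = generate(f"List {n_chapters} chapter titles for a textbook "
                       f"on {domain}, one per line.")
    passages = []
    for title in outline.splitlines():
        if title.strip():
            passages.append(generate(
                f"Write a detailed textbook section titled "
                f"'{title.strip()}' about {domain}."))
    return passages  # the multi-step structure boosts diversity per the paper
```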

[9] BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin

Main category: cs.CL

TL;DR: The paper introduces BrowseComp-Plus, a benchmark for evaluating deep-research agents, addressing fairness and transparency issues in current evaluations by using a fixed corpus and controlled experiments.

Motivation: Current benchmarks for deep-research agents, like BrowseComp, rely on dynamic web APIs, which hinder fair comparisons and reproducibility. The lack of control over the document corpus also limits insights into the underlying LLMs' capabilities.

Method: The authors derive BrowseComp-Plus from BrowseComp, using a fixed, curated corpus with human-verified documents and challenging negatives. This enables controlled experimentation and disentangled analysis of retrieval methods and deep-research agents.

Result: BrowseComp-Plus effectively distinguishes system performance, e.g., GPT-5 achieves 55.9% accuracy, improving to 70.1% when integrated with Qwen3-Embedding-8B. The benchmark provides insights into retrieval effectiveness and context engineering.

Conclusion: BrowseComp-Plus addresses limitations of current benchmarks, offering a controlled environment for evaluating and improving deep-research systems, with demonstrated effectiveness in distinguishing performance and fostering insights.

Abstract: Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.

[10] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models

Tomohiro Sawada, Kartik Goyal

Main category: cs.CL

TL;DR: The paper examines BPE tokenization’s merge-list dependency, testing alternative inference schemes to mitigate privacy risks without harming model performance.

Motivation: Recent findings highlight BPE merge lists as a privacy vulnerability, prompting exploration of merge-list-free inference methods.

Method: Two classes of BPE inference schemes are tested: targeted deviations (random/corrupted merge lists) and non-targeted merge-list-free algorithms (greedy/exact compression).

Result: Targeted deviations degrade performance, while non-targeted merge-list-free methods show minimal impact, preserving model efficacy.

Conclusion: Merge-list-free BPE inference offers a simpler, privacy-preserving alternative without significant performance loss.

Abstract: Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model’s training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: a) targeted deviations from the merge list, including random merge orders and various corruptions involving deletion/truncation, and b) non-targeted BPE inference algorithms that do not depend on the merge list but focus on compressing the text either greedily or exactly. Extensive experiments across diverse language modeling tasks like accuracy-based QA benchmarks, machine translation, and open-ended generation reveal that while targeted deviation from the merge lists exhibits significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
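One plausible reading of the merge-list-free "greedy compression" scheme is longest-match tokenization over the learned vocabulary alone. A self-contained sketch; this is an interpretation for illustration, not necessarily the paper's exact algorithm:

```python
# Sketch: greedy longest-prefix tokenization using only the vocabulary.
def greedy_tokenize(text, vocab):
    tokens, i, max_len = [], 0, max(map(len, vocab))
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:  # assume single characters are always in the vocabulary
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"lower", "low", "er", "ed", "l", "o", "w", "e", "r", "d"}
print(greedy_tokenize("lowered", vocab))  # ['lower', 'ed']
```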

[11] Measuring Stereotype and Deviation Biases in Large Language Models

Daniel Wang, Eli Brignac, Minjia Mao, Xiao Fang

Main category: cs.CL

TL;DR: The study investigates stereotype and deviation biases in LLMs, revealing significant biases in associations with demographic groups and real-world disparities.

Motivation: To address concerns about biases in LLMs and their potential risks in diverse applications.

Method: Four advanced LLMs generated profiles of individuals, analyzing associations between demographic groups and attributes like political affiliation, religion, and sexual orientation.

Result: All examined LLMs showed significant stereotype and deviation biases toward multiple groups.

Conclusion: The findings highlight biases in LLM-generated outputs and their potential harms, emphasizing the need for mitigation.

Abstract: Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.

[12] Testing the Limits of Machine Translation from One Book

Jonathan Shaw, Dillon Mee, Timothy Khouw, Zackary Leech, Daniel Wilson

Main category: cs.CL

TL;DR: LLMs improve translation for low-resource languages like Kanuri using parallel sentences, but grammar alone is insufficient. Human evaluations highlight accuracy over fluency.

Motivation: To explore how LLMs can translate low-resource languages (e.g., Kanuri) with minimal digital resources, focusing on domain-specific tasks and the impact of language materials (grammar, dictionary, parallel sentences).

Method: Two datasets (health/humanitarian and generalized terms) were created. LLM translation was tested with varying language resources (grammar, dictionary, parallel sentences), compared to native speaker and linguist translations. Evaluations used automatic metrics and human assessments (fluency, accuracy).

Result: Parallel sentences were most effective for translation. Grammar improved zero-shot translation but was insufficient alone. LLMs achieved better accuracy than fluency.

Conclusion: LLM translation evaluation requires multidimensional assessment. Grammar alone lacks context for effective domain-specific translation; parallel sentences are crucial.

Abstract: Current state-of-the-art models demonstrate capacity to leverage in-context learning to translate into previously unseen language contexts. Tanzer et al. [2024] utilize language materials (e.g. a grammar) to improve translation quality for Kalamang using large language models (LLMs). We focus on Kanuri, a language that, despite having a substantial speaker population, has minimal digital resources. We design two datasets for evaluation: one focused on health and humanitarian terms, and another containing generalized terminology, investigating how domain-specific tasks impact LLM translation quality. By providing different combinations of language resources (grammar, dictionary, and parallel sentences), we measure LLM translation effectiveness, comparing results to native speaker translations and human linguist performance. We evaluate using both automatic metrics and native speaker assessments of fluency and accuracy. Results demonstrate that parallel sentences remain the most effective data source, outperforming other methods in human evaluations and automatic metrics. While incorporating grammar improves over zero-shot translation, it fails as an effective standalone data source. Human evaluations reveal that LLMs achieve accuracy (meaning) more effectively than fluency (grammaticality). These findings suggest LLM translation evaluation benefits from multidimensional assessment beyond simple accuracy metrics, and that grammar alone, without parallel sentences, does not provide sufficient context for effective domain-specific translation.

[13] Do Biased Models Have Biased Thoughts?

Swati Rajwal, Shivank Garg, Reem Abdel-Salam, Abdelrahman Zayed

Main category: cs.CL

TL;DR: The paper investigates whether biased language models exhibit biased thought processes using chain-of-thought prompting. Results show low correlation between bias in thoughts and output.

Motivation: To understand if biased language models also have biased internal reasoning steps, addressing fairness concerns in deployment.

Method: Experiments on 5 large language models using fairness metrics to quantify 11 biases in thoughts and outputs.

Result: Bias in thoughts is not highly correlated with output bias (correlation <0.6, p<0.001).

Conclusion: Unlike humans, biased models don’t always have biased thoughts, suggesting a disconnect between reasoning and output.

Abstract: The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: do biased models have biased thoughts? To answer our question, we conduct experiments on 5 popular large language models using fairness metrics to quantify 11 different biases in the model’s thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than 0.6 correlation with a p-value smaller than 0.001 in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.

[14] Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge

Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, Miguel Ballesteros

Main category: cs.CL

TL;DR: The paper introduces a statistical framework to identify and quantify self-bias in LLMs when they evaluate their own outputs, ensuring genuine performance differences are not conflated with bias.

Motivation: To address the issue of LLMs systematically favoring their own outputs (self-bias) and to provide a reliable method for isolating and measuring this bias without confusing it with actual performance differences.

Method: A statistical framework models the scoring distribution differences between LLM judges’ evaluations of their own outputs versus others, while accounting for third-party (e.g., human) judgments.

Result: Empirical analysis reveals self-bias in models like GPT-4o and Claude 3.5 Sonnet, which also exhibit family-bias (favoring outputs from models of the same family).

Conclusion: The study highlights risks of using LLM judges and offers practical solutions to mitigate bias in automated evaluations.

Abstract: Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that LLM-as-a-judge assigns to its own completions compared to other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family-bias, systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.
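As a toy version of the estimand, one can compare a judge's scores on its own completions against its scores on others', after subtracting the quality signal from an independent judge. This simplification drops the distributional modeling the paper formalizes; names are illustrative.

```python
# Toy sketch: self-bias as the own-vs-other gap in judge-minus-human scores.
import numpy as np

def self_bias_estimate(judge_scores, human_scores, is_own):
    judge = np.asarray(judge_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    own = np.asarray(is_own, dtype=bool)
    resid = judge - human           # deviation from third-party quality
    return resid[own].mean() - resid[~own].mean()  # > 0 suggests self-bias
```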

[15] Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis

Komala Subramanyam Cherukuri, Pranav Abishai Moses, Aisa Sakata, Jiangping Chen, Haihua Chen

Main category: cs.CL

TL;DR: A scalable framework using LLMs automates semantic and sentiment annotation for Japanese American Incarceration Oral History, achieving high accuracy with ChatGPT, Llama, and Qwen.

Motivation: Oral histories are crucial for marginalized communities but are hard to analyze due to unstructured formats and high annotation costs. This work aims to bridge archival ethics with scalable NLP.

Method: The study uses expert annotation, prompt engineering, and evaluates LLMs (ChatGPT, Llama, Qwen) for semantic and sentiment classification on 558 sentences, then scales to 92,191 sentences.

Result: ChatGPT led in semantic classification (88.71% F1), while Llama slightly outperformed in sentiment analysis (82.66%). Well-designed prompts enabled effective large-scale annotation.

Conclusion: LLMs can responsibly automate oral history analysis, providing a reusable pipeline and ethical guidance for digital humanities and collective memory preservation.

Abstract: Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of oral history archives can promote access to and understanding of these histories. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: https://github.com/kc6699c/LLM4OralHistoryAnalysis.

[16] Many-Turn Jailbreaking

Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang

Main category: cs.CL

TL;DR: The paper introduces multi-turn jailbreaking in LLMs, a more serious threat than single-turn jailbreaking, and proposes a benchmark (MTJ-Bench) to evaluate this vulnerability.

Motivation: Current jailbreaking methods focus on single-turn attacks, but advanced LLMs handle multi-turn conversations, posing a greater risk.

Method: The authors construct MTJ-Bench to benchmark multi-turn jailbreaking on various LLMs.

Result: The study reveals a new vulnerability in LLMs, highlighting the need for safer models.

Conclusion: The work calls for community efforts to address multi-turn jailbreaking and improve LLM safety.

Abstract: Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. On the contrary, the advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (first draft completed in June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.

[17] Annotating Errors in English Learners’ Written Language Production: Advancing Automated Written Feedback Systems

Steven Coyne, Diana Galvan-Sosa, Ryan Spring, Camélia Guerraoui, Michael Zock, Keisuke Sakaguchi, Kentaro Inui

Main category: cs.CL

TL;DR: The paper introduces an annotation framework for AWE systems to provide better feedback for language learners, focusing on error types and generalizability, and evaluates feedback generation methods using LLMs.

Motivation: Current AWE systems prioritize direct corrections over learning-oriented feedback, which may not optimally aid language learners.

Method: An annotation framework models error types and generalizability, followed by dataset collection and evaluation of feedback generation methods (keyword-guided, keyword-free, template-guided) using LLMs.

Result: Human teachers assessed the feedback systems for relevance, factuality, and comprehensibility, with comparative performance reported.

Conclusion: The framework and dataset support the development of more effective, learning-focused feedback in AWE systems.

Abstract: Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error’s error type and generalizability. For error type classification, we introduce a typology focused on inferring learners’ knowledge gaps by connecting their errors to specific grammatical patterns. Following this framework, we collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system’s outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the comparative performance of the systems investigated.

[18] Text to Speech System for Meitei Mayek Script

Gangular Singh Irengbam, Nirvash Singh Wahengbam, Lanthoiba Meitei Khumanthem, Paikhomba Oinam

Main category: cs.CL

TL;DR: A neural TTS system for Manipuri using Tacotron 2 and HiFi-GAN, addressing tonal phonology and under-resourced challenges.

Motivation: To support linguistic preservation and technological inclusion of the Manipuri language.

Method: Developed a phoneme mapping for Meitei Mayek to ARPAbet, curated a single-speaker dataset, and adapted Tacotron 2 and HiFi-GAN for tonal phonology.

Result: Achieved intelligible and natural speech synthesis, validated by subjective and objective metrics.

Conclusion: The system provides a foundation for Manipuri language preservation and technological integration.

Abstract: This paper presents the development of a Text-to-Speech (TTS) system for the Manipuri language using the Meitei Mayek script. Leveraging Tacotron 2 and HiFi-GAN, we introduce a neural TTS architecture adapted to support tonal phonology and under-resourced linguistic environments. We develop a phoneme mapping for Meitei Mayek to ARPAbet, curate a single-speaker dataset, and demonstrate intelligible and natural speech synthesis, validated through subjective and objective metrics. This system lays the groundwork for linguistic preservation and technological inclusion of Manipuri.

[19] ESNERA: Empirical and semantic named entity alignment for named entity dataset merging

Xiaobo Zhang, Congqing He, Ying He, Jian Peng, Dajie Fu, Tien-Ping Tan

Main category: cs.CL

TL;DR: Proposes an automatic label alignment method for merging NER datasets, combining empirical and semantic similarities to unify labels, improving performance in low-resource domains.

Motivation: NER relies on large annotated datasets, which are costly to create. Current merging methods lack interpretability and scalability.

Method: Uses label similarity (empirical and semantic) with greedy pairwise merging to align labels across datasets.

Result: Successfully merges datasets with minimal performance impact and enhances NER in the financial domain.

Conclusion: The method offers an efficient, interpretable, and scalable solution for multi-source NER corpus integration.

Abstract: Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora.
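The greedy pairwise merging step can be sketched as follows, with `similarity` standing in for the paper's combined empirical-plus-semantic score between labels; the threshold and cluster representation are assumptions.

```python
# Sketch: greedily merge the most similar label pair until none clears the
# threshold; each surviving key maps to the set of original labels it absorbs.
def greedy_align(labels, similarity, threshold=0.8):
    merged = {l: {l} for l in labels}
    while len(merged) > 1:
        score, a, b = max((similarity(a, b), a, b)
                          for a in merged for b in merged if a < b)
        if score < threshold:
            break
        merged[a] |= merged.pop(b)  # unify label b into label a
    return merged
```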

[20] How Does a Deep Neural Network Look at Lexical Stress?

Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet

Main category: cs.CL

TL;DR: The paper investigates interpretability of neural networks in predicting lexical stress, using CNNs and LRP to analyze spectral cues.

Motivation: To understand what informs neural network decisions in speech processing, specifically for lexical stress prediction.

Method: Automated dataset construction, CNN training for stress prediction, and LRP for interpretability analysis.

Result: CNNs achieved 92% accuracy; LRP revealed reliance on stressed vowel spectral properties and distributed word cues.

Conclusion: Deep learning can learn distributed stress cues from natural data, complementing traditional phonetic methods.

Abstract: Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel’s first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning’s ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.
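For readers wanting to reproduce the interpretability step, Captum offers an LRP implementation for supported layer types. The sketch below assumes a spectrogram-input CNN built from such layers; the paper's exact LRP variant and architecture are not specified in this summary.

```python
# Sketch: layer-wise relevance propagation over a trained CNN with Captum.
import torch
from captum.attr import LRP

def stress_relevance(cnn: torch.nn.Module, spectrogram: torch.Tensor,
                     predicted_class: int) -> torch.Tensor:
    # spectrogram shape: (1, 1, freq_bins, time_frames)
    lrp = LRP(cnn)
    return lrp.attribute(spectrogram, target=predicted_class)
```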

[21] The ReQAP System for Question Answering over Personal Information

Philipp Christmann, Gerhard Weikum

Main category: cs.CL

TL;DR: ReQAP is a system that answers complex questions by decomposing them recursively and using light-weight language models for execution.

Motivation: To address the challenge of answering complex questions involving heterogeneous data sources on users' devices.

Method: Recursively decomposes questions, builds an operator tree, and uses fine-tuned light-weight language models for execution.

Result: Demonstrates rich functionality for advanced questions and provides traceability of answers.

Conclusion: ReQAP enhances user trust by making answer computation transparent and comprehensible.

Abstract: Personal information is abundant on users’ devices, from structured data in calendar, shopping records or fitness tools, to unstructured contents in mail and social media posts. This work presents the ReQAP system that supports users with answers for complex questions that involve filters, joins and aggregation over heterogeneous sources. The unique trait of ReQAP is that it recursively decomposes questions and incrementally builds an operator tree for execution. Both the question interpretation and the individual operators make smart use of light-weight language models, with judicious fine-tuning. The demo showcases the rich functionality for advanced user questions, and also offers detailed tracking of how the answers are computed by the operators in the execution tree. Being able to trace answers back to the underlying sources is vital for human comprehensibility and user trust in the system.
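The operator-tree idea can be pictured with a small recursive structure: leaves retrieve rows from a personal data source, and inner nodes filter or aggregate their children's results. Operator names and fields below are illustrative, not ReQAP's actual API.

```python
# Sketch: a tiny operator tree executed bottom-up over in-memory sources.
from dataclasses import dataclass, field

@dataclass
class Operator:
    kind: str                               # "retrieve" | "filter" | "aggregate"
    args: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def execute(self, sources):
        kids = [c.execute(sources) for c in self.children]
        if self.kind == "retrieve":
            return sources[self.args["source"]]
        if self.kind == "filter":
            return [row for row in kids[0] if self.args["pred"](row)]
        if self.kind == "aggregate":
            return self.args["fn"](kids[0])
        raise ValueError(f"unknown operator: {self.kind}")

# "How many purchases over 20 euros?" as an operator tree:
tree = Operator("aggregate", {"fn": len}, [
    Operator("filter", {"pred": lambda r: r["price"] > 20}, [
        Operator("retrieve", {"source": "shopping"})])])
print(tree.execute({"shopping": [{"price": 15}, {"price": 42}]}))  # 1
```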

[22] Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

Main category: cs.CL

TL;DR: TurnGuide improves Full-Duplex Speech Language Models (FD-SLMs) by dynamically segmenting speech into turns and generating turn-level text guidance, enhancing conversational flow and coherence.

Motivation: FD-SLMs degrade in conversational quality due to prolonged speech sequences and lack of high-quality spoken dialogue data. Text-guided speech generation struggles with timing and alignment.

Method: Proposes TurnGuide, a planning-inspired approach that segments assistant speech into turns and generates turn-level text guidance before speech output.

Result: Significantly improves FD-SLMs’ conversational abilities, ensuring semantically meaningful and coherent speech with natural flow.

Conclusion: TurnGuide effectively addresses timing and length challenges in FD-SLMs, enabling more human-like interactions.

Abstract: Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational dynamics such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions. However, they face a critical challenge: their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. While text-guided speech generation could mitigate these issues, it suffers from timing and length issues when integrating textual guidance into double-channel audio streams, disrupting the precise time alignment essential for natural interactions. To address these challenges, we propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which effectively resolves both insertion timing and length challenges. Extensive experiments demonstrate our approach significantly improves e2e FD-SLMs’ conversational abilities, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.

[23] Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores

Arpita Saggar, Jonathan C. Darling, Vania Dimitrova, Duygu Sarikaya, David C. Hogg

Main category: cs.CL

TL;DR: The paper introduces SBS (Score-Before-Speaking), a framework for persona-based dialogue generation that improves persona fidelity by training models to correlate augmented responses with quality scores.

Motivation: Enhancing persona fidelity in dialogue generation is challenging due to limited diversity in existing data.

Method: SBS unifies response learning and quality scoring in one step, using noun-based substitution for augmentation and semantic similarity for scoring.

Result: SBS outperforms previous methods, improving persona-consistent dialogues in benchmark datasets (PERSONA-CHAT and ConvAI2).

Conclusion: Score-conditioned training enhances persona fidelity, and including scores in prompts is superior to conventional training.

Abstract: Persona-based dialogue generation is an important milestone towards building conversational artificial intelligence. Despite the ever-improving capabilities of large language models (LLMs), effectively integrating persona fidelity in conversations remains challenging due to the limited diversity in existing dialogue data. We propose a novel framework SBS (Score-Before-Speaking), which outperforms previous methods and yields improvements for both million and billion-parameter models. Unlike previous methods, SBS unifies the learning of responses and their relative quality into a single step. The key innovation is to train a dialogue model to correlate augmented responses with a quality score during training and then leverage this knowledge at inference. We use noun-based substitution for augmentation and semantic similarity-based scores as a proxy for response quality. Through extensive experiments with benchmark datasets (PERSONA-CHAT and ConvAI2), we show that score-conditioned training allows existing models to better capture a spectrum of persona-consistent dialogues. Our ablation studies also demonstrate that including scores in the input prompt during training is superior to conventional training setups. Code and further details are available at https://arpita2512.github.io/score_before_you_speak
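The scoring step can be sketched with an off-the-shelf sentence encoder: the cosine similarity between an augmented response and the gold response serves as the quality proxy that conditions training. The encoder choice and prompt format below are assumptions, not the authors' exact setup.

```python
# Sketch: semantic-similarity quality scores and a score-conditioned prompt.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def quality_score(gold_response, augmented_response):
    emb = scorer.encode([gold_response, augmented_response],
                        convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def score_conditioned_example(persona, history, response, score):
    return f"[score={score:.2f}] persona: {persona}\n{history}\nbot: {response}"
```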

[24] Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

Qiongqiong Wang, Hardik B. Sailor, Jeremy H. M. Wong, Tianchi Liu, Shuo Sun, Wenyu Zhang, Muhammad Huzaifah, Nancy Chen, Ai Ti Aw

Main category: cs.CL

TL;DR: The paper addresses the lack of empathetic reasoning in Speech-LLMs by introducing explicit and implicit methods to incorporate contextual paralinguistic information, improving performance by up to 46.02%.

Motivation: Current Speech-LLMs lack empathetic reasoning due to missing training datasets integrating contextual content and paralinguistic cues.

Method: Two approaches: (1) explicit method with paralinguistic metadata, and (2) implicit method generating QA pairs using emotion annotations and speech transcriptions.

Result: Implicit method boosts performance by 38.41%, reaching 46.02% when combined with explicit method. LLM judge reliability is validated.

Conclusion: The proposed methods effectively enhance contextual paralinguistic understanding in Speech-LLMs, validated by performance improvements and judge reliability.

Abstract: Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability.
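A sketch of the implicit method: turning categorical and dimensional emotion annotations plus a transcription into training QA pairs. The templates are illustrative stand-ins for the paper's generated questions.

```python
# Sketch: derive contextual-paralinguistic QA pairs from annotations.
def make_qa_pairs(transcript, emotion, valence=None):
    pairs = [("What emotion does the speaker convey?", emotion),
             (f"The speaker sounds {emotion}. How should an empathetic "
              "reply begin?",
              f"By acknowledging the {emotion} tone before answering.")]
    if valence is not None:  # dimensional annotation, e.g., in [-1, 1]
        pairs.append(("Is the speaker's tone more positive or negative?",
                      "positive" if valence > 0 else "negative"))
    return [{"question": q, "answer": a, "context": transcript}
            for q, a in pairs]
```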

[25] Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection

Siyuan Li, Xi Lin, Guangyan Li, Zehao Liu, Aodu Wulianghai, Li Ding, Jun Wu, Jianhua Li

Main category: cs.CL

TL;DR: SentiDetect is a model-agnostic framework for detecting LLM-generated text by analyzing sentiment distribution stability, outperforming existing methods in accuracy and robustness.

Motivation: Existing detection methods for LLM-generated text lack generalizability and are vulnerable to adversarial attacks, prompting the need for a more robust solution.

Method: SentiDetect uses two metrics—sentiment distribution consistency and preservation—to quantify emotional stability in text, leveraging the observation that LLM outputs are emotionally consistent while human texts vary.

Result: SentiDetect achieves significant F1 score improvements (16% and 11%) over baselines on advanced LLMs like Gemini-1.5-Pro and GPT-4-0613, and excels in robustness to paraphrasing and adversarial attacks.

Conclusion: SentiDetect offers a superior, robust solution for detecting LLM-generated text, addressing limitations of current methods and performing well across diverse datasets and scenarios.

Abstract: The rapid advancement of large language models (LLMs) has resulted in increasingly sophisticated AI-generated content, posing significant challenges in distinguishing LLM-generated text from human-written language. Existing detection methods, primarily based on lexical heuristics or fine-tuned classifiers, often suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. In this work, we propose SentiDetect, a model-agnostic framework for detecting LLM-generated text by analyzing the divergence in sentiment distribution stability. Our method is motivated by the empirical observation that LLM outputs tend to exhibit emotionally consistent patterns, whereas human-written texts display greater emotional variability. To capture this phenomenon, we define two complementary metrics: sentiment distribution consistency and sentiment distribution preservation, which quantify stability under sentiment-altering and semantic-preserving transformations. We evaluate SentiDetect on five diverse datasets and a range of advanced LLMs, including Gemini-1.5-Pro, Claude-3, GPT-4-0613, and LLaMa-3.3. Experimental results demonstrate its superiority over state-of-the-art baselines, with over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively. Moreover, SentiDetect also shows greater robustness to paraphrasing, adversarial attacks, and text length variations, outperforming existing detectors in challenging scenarios.
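The stability intuition lends itself to a short sketch. Here an off-the-shelf sentiment classifier and a Jensen-Shannon divergence stand in for the paper's consistency and preservation metrics, whose exact definitions may differ:

```python
# Sentiment-distribution stability: texts whose sentiment distribution barely
# moves under transformation score as more "LLM-like" per the paper's intuition.
import numpy as np
from scipy.spatial.distance import jensenshannon
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

def sentiment_distribution(sentences: list[str]) -> np.ndarray:
    """Empirical distribution over sentiment labels for a list of sentences."""
    labels = [r["label"] for r in classifier(sentences)]
    pos = labels.count("POSITIVE") / len(labels)
    return np.array([pos, 1.0 - pos])

def stability(original: list[str], transformed: list[str]) -> float:
    """Higher = more stable sentiment across a transformation."""
    p = sentiment_distribution(original)
    q = sentiment_distribution(transformed)
    return 1.0 - float(jensenshannon(p, q))
```

A detector would compare stability under sentiment-altering versus semantic-preserving transformations and threshold the resulting scores.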

[26] Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction

Mohamed Basem, Islam Oshallah, Ali Hamdi, Khaled Shaban, Hozaifa Kassab

Main category: cs.CL

TL;DR: A two-stage framework for Quranic QA combines fine-tuned Arabic models for retrieval and instruction-tuned LLMs for extraction, achieving state-of-the-art results.

Motivation: Addressing the challenges of Quranic QA due to linguistic complexity and semantic richness in low-resource settings.

Method: Ensemble fine-tuned Arabic models for passage retrieval and instruction-tuned LLMs with few-shot prompting for answer extraction.

Result: Achieved MAP@10 of 0.3128, MRR@10 of 0.5763 for retrieval, and pAP@10 of 0.669 for extraction, outperforming prior methods.

Conclusion: Combining model ensembling and instruction-tuned LLMs effectively tackles low-resource QA in specialized domains.

Abstract: Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. In this paper, we propose a novel two-stage framework that addresses both passage retrieval and answer extraction. For passage retrieval, we ensemble fine-tuned Arabic language models to achieve superior ranking performance. For answer extraction, we employ instruction-tuned large language models with few-shot prompting to overcome the limitations of fine-tuning on small datasets. Our approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction, substantially outperforming previous methods. These results demonstrate that combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains.
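The abstract does not say how the fine-tuned retrievers' rankings are combined; reciprocal rank fusion (RRF) is one standard choice, sketched here purely as an assumption:

```python
# Reciprocal rank fusion over ranked passage lists from several retrievers.
from collections import defaultdict

def rrf_ensemble(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: one ranked list of passage ids per fine-tuned retriever."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] += 1.0 / (k + rank)  # later ranks contribute less
    return sorted(scores, key=scores.get, reverse=True)

# Fuse rankings from two hypothetical fine-tuned Arabic retrievers:
fused = rrf_ensemble([["p3", "p1", "p7"], ["p1", "p3", "p9"]])
```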

[27] Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

Zhijun Tu, Hanting Chen, Siqi Liu, Chuanjian Liu, Jian Li, Jie Hu, Yunhe Wang

Main category: cs.CL

TL;DR: The paper introduces a method for 1-bit LLM quantization using pre-trained models, avoiding costly training from scratch and improving performance.

Motivation: Existing 1-bit LLM quantization methods train from scratch, leading to high costs and accuracy loss. The gap between full precision and 1-bit representations makes adaptation challenging.

Method: The proposed method uses consistent progressive training for forward and backward passes, binary-aware initialization, and dual-scaling compensation to smoothly convert floating-point weights to 1-bit.

Result: Experiments on various LLM sizes show the method outperforms existing approaches, achieving high-performance 1-bit LLMs without expensive training from scratch.

Conclusion: The method successfully leverages pre-trained models for 1-bit LLM quantization, reducing costs and maintaining accuracy.

Abstract: 1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes direct adaptation difficult. In this paper, we introduce a consistent progressive training scheme for both the forward and backward passes, smoothly converting the floating-point weights into binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
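A minimal sketch of the progressive conversion, assuming a linear blending schedule and a per-channel absolute-mean scale; the paper's binary-aware initialization and dual-scaling compensation are omitted:

```python
# Progressively blend full-precision weights with their 1-bit counterpart
# instead of snapping to sign(W) at once.
import torch

def progressive_binarize(w: torch.Tensor, lam: float) -> torch.Tensor:
    """lam ramps 0 -> 1 over training; at 1.0 the weights are fully 1-bit."""
    scale = w.abs().mean(dim=1, keepdim=True)  # per-output-channel scale
    w_bin = scale * torch.sign(w)              # 1-bit representation
    return (1.0 - lam) * w + lam * w_bin

w = torch.randn(4, 8)                          # pre-trained weights
for step in (0, 500, 1000):                    # e.g., a 1000-step ramp
    w_eff = progressive_binarize(w, lam=step / 1000)
```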

[28] Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings

Mao Li, Fred Conrad, Johann Gagnon-Bartsch

Main category: cs.CL

TL;DR: Vec2Summ is a novel abstractive summarization method using semantic compression via mean vector representation and embedding inversion, offering scalability and semantic control.

Motivation: Addresses limitations of LLM-based summarization, such as context-length constraints and lack of interpretability, while enabling efficient scaling.

Method: Represents documents as a mean vector in semantic space, decodes it into summaries using a generative model, and introduces stochasticity via Gaussian sampling.

Result: Produces coherent summaries comparable to LLM methods in thematic coverage and efficiency, though with less fine-grained detail.

Conclusion: Vec2Summ is effective for scalable, semantically controlled summarization, especially in corpus-level abstraction settings.

Abstract: We propose Vec2Summ, a novel method for abstractive summarization that frames the task as semantic compression. Vec2Summ represents a document collection using a single mean vector in the semantic embedding space, capturing the central meaning of the corpus. To reconstruct fluent summaries, we perform embedding inversion – decoding this mean vector into natural language using a generative language model. To improve reconstruction quality and capture some degree of topical variability, we introduce stochasticity by sampling from a Gaussian distribution centered on the mean. This approach is loosely analogous to bagging in ensemble learning, where controlled randomness encourages more robust and varied outputs. Vec2Summ addresses key limitations of LLM-based summarization methods. It avoids context-length constraints, enables interpretable and controllable generation via semantic parameters, and scales efficiently with corpus size – requiring only $O(d + d^2)$ parameters. Empirical results show that Vec2Summ produces coherent summaries for topically focused, order-invariant corpora, with performance comparable to direct LLM summarization in terms of thematic coverage and efficiency, albeit with less fine-grained detail. These results underscore Vec2Summ’s potential in settings where scalability, semantic control, and corpus-level abstraction are prioritized.
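The compression-and-sampling core fits in a few lines. `embed` and `invert` below are hypothetical stand-ins for the sentence encoder and the embedding-inversion decoder, and the Gaussian uses a diagonal covariance for simplicity (the abstract's $O(d + d^2)$ budget allows a full one):

```python
# Vec2Summ-style semantic compression: fit a Gaussian around the corpus mean
# in embedding space, then sample nearby vectors to decode into summaries.
import numpy as np

def sample_summary_vectors(doc_embeddings: np.ndarray, n_samples: int = 5,
                           seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    mu = doc_embeddings.mean(axis=0)      # central meaning of the corpus
    sigma = doc_embeddings.std(axis=0)    # diagonal spread per dimension
    # Controlled randomness around the mean yields varied, on-topic outputs.
    return rng.normal(loc=mu, scale=sigma, size=(n_samples, mu.shape[0]))

# summaries = [invert(v) for v in sample_summary_vectors(embed(corpus))]
```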

[29] SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata

Main category: cs.CL

TL;DR: SEADialogues is a culturally grounded dialogue dataset for Southeast Asia, addressing gaps in existing chit-chat datasets by including eight languages and culturally relevant topics.

Motivation: Existing dialogue datasets lack cultural nuances, especially for diverse regions like Southeast Asia.

Method: The dataset includes dialogues in eight languages from six countries, with persona attributes and culturally grounded topics.

Result: SEADialogues provides a resource for culturally aware dialogue systems, supporting low-resource languages.

Conclusion: This dataset advances research on culturally aware and human-centric language models, particularly for Southeast Asia.

Abstract: Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.

[30] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang

Main category: cs.CL

TL;DR: ReaGAN introduces an agent-based framework for GNNs, enabling autonomous node-level decision-making and retrieval-augmented global relationships, addressing limitations of fixed aggregation schemes.

Motivation: Fixed aggregation in GNNs struggles with node informativeness imbalance and ignores global semantic relationships, limiting performance.

Method: ReaGAN uses agent-based nodes with internal memory for planning and retrieval-augmented generation (RAG) to access global content.

Result: ReaGAN achieves competitive performance in few-shot settings without fine-tuning, leveraging agentic planning and retrieval.

Conclusion: ReaGAN demonstrates the potential of agentic planning and local-global retrieval to enhance graph learning.

Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness – some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model’s ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.

[31] CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan

Main category: cs.CL

TL;DR: CLAIR-A is a method using large language models (LLMs) to evaluate audio captions, outperforming traditional metrics in aligning with human judgment and offering transparent explanations.

Motivation: Current methods for evaluating audio captions lack alignment with human judgment and transparency in scoring.

Method: CLAIR-A leverages LLMs’ zero-shot capabilities to directly score semantic distance and provide explanations.

Result: CLAIR-A improves relative accuracy by 5.8% over FENSE and by up to 11% over the best general-purpose metric, with explanations rated up to 30% better.

Conclusion: CLAIR-A is a superior, transparent, and flexible method for evaluating audio captions, publicly available for use.

Abstract: The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires considering diverse factors, among them, auditory scene understanding, sound-object inference, temporal coherence, and the environmental context of the scene. While current methods focus on specific aspects, they often fail to provide an overall score that aligns well with human judgment. In this work, we propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking LLMs for a semantic distance score. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics, with a 5.8% relative accuracy improvement compared to the domain-specific FENSE metric and up to 11% over the best general-purpose measure on the Clotho-Eval dataset. Moreover, CLAIR-A offers more transparency by allowing the language model to explain the reasoning behind its scores, with these explanations rated up to 30% better by human evaluators than those provided by baseline methods. CLAIR-A is made publicly available at https://github.com/DavidMChan/clair-a.
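The scoring call itself is simple enough to sketch; `llm_complete` is a placeholder for any chat-completion function, and the prompt wording is illustrative, not the paper's:

```python
# CLAIR-A-style judging: ask an LLM for a semantic-distance score plus the
# reasoning behind it, returned as parseable JSON.
import json

JUDGE_PROMPT = """You are evaluating a machine-generated audio caption.
Reference: {reference}
Candidate: {candidate}
Reply in JSON: {{"score": <0-100 semantic similarity>, "reason": "<why>"}}"""

def judge_caption(candidate: str, reference: str, llm_complete) -> dict:
    raw = llm_complete(JUDGE_PROMPT.format(reference=reference,
                                           candidate=candidate))
    parsed = json.loads(raw)             # structured score + explanation
    return {"score": parsed["score"] / 100.0, "reason": parsed["reason"]}
```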

[32] BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context

Aditya Tomar, Nihar Ranjan Sahoo, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: BharatBBQ is introduced as a culturally adapted benchmark to evaluate biases in Indian languages, addressing gaps in existing Western-focused benchmarks like BBQ.

Motivation: Existing bias evaluation benchmarks like BBQ are limited to Western contexts, necessitating a culturally relevant benchmark for India.

Method: BharatBBQ covers 13 social categories and 8 Indian languages, expanding 49,108 examples to 392,864 via translation and verification. Five multilingual LMs are evaluated in zero and few-shot settings.

Result: Persistent biases are found across languages and social categories, with amplified biases in Indian languages compared to English.

Conclusion: Linguistically and culturally grounded benchmarks like BharatBBQ are essential for comprehensive bias evaluation in diverse contexts.

Abstract: Evaluating social biases in language models (LMs) is crucial for ensuring fairness and minimizing the reinforcement of harmful stereotypes in AI systems. Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts, limiting their applicability to the Indian context. To address this gap, we introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. BharatBBQ covers 13 social categories, including 3 intersectional groups, reflecting prevalent biases in the Indian sociocultural landscape. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages. We evaluate five multilingual LM families across zero and few-shot settings, analyzing their bias and stereotypical bias scores. Our findings highlight persistent biases across languages and social categories, with biases often amplified in Indian languages compared to English, demonstrating the necessity of linguistically and culturally grounded benchmarks for bias evaluation.

[33] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory

Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li, Peiyan Dong, Joannah Nanjekye, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

Main category: cs.CL

TL;DR: RCR-Router is a dynamic, role-aware context routing framework for multi-agent LLMs, reducing token usage by up to 30% while maintaining answer quality.

Motivation: Existing coordination schemes in multi-agent LLM systems suffer from inefficiencies like excessive token consumption and limited adaptability.

Method: Introduces RCR-Router, a modular framework with dynamic memory selection based on agent roles and task stages, guided by a lightweight scoring policy.

Result: Experiments show reduced token usage (up to 30%) and maintained/improved answer quality on multi-hop QA benchmarks.

Conclusion: Structured memory routing and output-aware evaluation are key for scalable multi-agent LLM systems.

Abstract: Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks – HotPotQA, MuSiQue, and 2WikiMultihop – demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
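A toy version of the routing step, with keyword-tag overlap standing in for the learned lightweight scoring policy:

```python
# Role-aware context routing under a strict token budget: score each shared
# memory item for this agent's role and task stage, then greedily pack.

def route_context(memory: list[dict], role: str, stage: str,
                  token_budget: int) -> list[dict]:
    def score(item: dict) -> float:
        tags = set(item.get("tags", []))
        return 2.0 * (role in tags) + 1.0 * (stage in tags)

    selected, used = [], 0
    for item in sorted(memory, key=score, reverse=True):
        cost = len(item["text"].split())          # crude token estimate
        if score(item) > 0 and used + cost <= token_budget:
            selected.append(item)
            used += cost
    return selected

ctx = route_context(
    [{"text": "entity X founded 1901", "tags": ["researcher", "hop1"]},
     {"text": "style notes for final answer", "tags": ["writer"]}],
    role="researcher", stage="hop1", token_budget=50)
```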

[34] URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models

Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, Xie Chen

Main category: cs.CL

TL;DR: URO-Bench is a new benchmark for spoken dialogue models (SDMs) in speech-to-speech scenarios, evaluating cognitive and speech-related aspects. It reveals gaps in current SDMs, especially in paralinguistics and audio understanding.

Motivation: Existing evaluation frameworks for SDMs lack comprehensiveness, particularly in speech-to-speech contexts, prompting the creation of URO-Bench.

Method: URO-Bench is designed with two difficulty levels (basic and pro tracks) and 20 test sets, assessing Understanding, Reasoning, and Oral conversation.

Result: Current SDMs perform well in daily QA tasks but struggle with instruction-following, catastrophic forgetting, and advanced speech-related evaluations.

Conclusion: URO-Bench aims to advance SDM development by providing a multifaceted evaluation tool and identifying areas for improvement.

Abstract: Recent advances in large language models (LLMs) have driven significant progress in end-to-end spoken dialogue models (SDMs). In contrast to text-based LLMs, the evaluation framework for SDMs should encompass both cognitive dimensions (e.g., logical reasoning, knowledge) and speech-related aspects (e.g., paralinguistic cues, audio quality). However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios. To address this gap, we propose URO-Bench, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluations about multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels: basic track and pro track, each comprising 20 test sets, evaluating the spoken dialogue model’s abilities in Understanding, Reasoning, and Oral conversation. Evaluations on our proposed benchmark reveal that current open-source SDMs perform rather well in daily QA tasks, but lag behind their backbone LLMs in terms of instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area.

[35] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali

Main category: cs.CL

TL;DR: LessIsMore is a training-free sparse attention mechanism for reasoning tasks, improving efficiency and accuracy by leveraging global attention patterns and unified token ranking.

Motivation: Address computational overhead and accuracy degradation in large reasoning models due to excessive token generation and sparse attention mechanisms.

Method: Introduces LessIsMore, which aggregates token selections from local attention heads with recent contextual information for unified cross-head token ranking.

Result: Achieves 1.1× decoding speed-up, 2× fewer tokens attended, and 1.13× end-to-end speed-up without accuracy loss.

Conclusion: LessIsMore enhances efficiency and accuracy in reasoning tasks, outperforming existing sparse attention methods.

Abstract: Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves – and in some cases improves – accuracy while achieving a $1.1\times$ average decoding speed-up compared to full attention. Moreover, LessIsMore attends to $2\times$ fewer tokens without accuracy loss, achieving a $1.13\times$ end-to-end speed-up compared to existing sparse attention methods.
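The unified cross-head selection can be sketched as follows, given the current step's per-head attention over the context; layer-wise reuse and other LessIsMore details are omitted:

```python
# Unified token selection: pool attention mass across heads, keep one shared
# top-k set plus the most recent tokens, rather than per-head subsets.
import torch

def unified_token_selection(attn: torch.Tensor, k: int,
                            recent: int) -> torch.Tensor:
    """attn: [num_heads, seq_len] attention of the current query to context."""
    seq_len = attn.shape[-1]
    pooled = attn.sum(dim=0)                        # one ranking for all heads
    top = torch.topk(pooled, min(k, seq_len)).indices
    last = torch.arange(max(0, seq_len - recent), seq_len)  # local window
    keep = torch.unique(torch.cat([top, last]))     # shared token subset
    return torch.sort(keep).values

idx = unified_token_selection(torch.rand(8, 1024), k=64, recent=32)
```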

[36] Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution

Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

Main category: cs.CL

TL;DR: The paper examines intersectional bias in LLMs, extending single-axis fairness evaluations to multiple demographic attributes. It introduces WinoIdentity, a benchmark with 245,700 prompts, and finds significant confidence disparities (up to 40%) in models, revealing failures in value alignment and logical reasoning.

Motivation: To address concerns about AI systems reflecting societal biases, especially in critical contexts like hiring, by evaluating intersectional bias in LLMs.

Method: Augments WinoBias with 25 demographic markers across 10 attributes, creating 245,700 prompts. Proposes Coreference Confidence Disparity metric to measure bias through uncertainty. Evaluates five recent LLMs.

Result: Finds confidence disparities up to 40% across attributes like body type and socio-economic status, with models most uncertain about doubly-disadvantaged identities. Shows LLMs’ performance may rely on memorization, not reasoning.

Conclusion: Highlights two independent failures in value alignment and validity in LLMs, which can compound to cause social harm, urging further scrutiny of intersectional bias.

Abstract: Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm.
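One plausible reading of the Coreference Confidence Disparity metric is the gap between the most- and least-confident identity groups on the same prompts; the paper's exact formulation may differ:

```python
# Group confidence disparity over intersectional identities (hypothetical
# reading of the metric): larger gaps indicate less uniform model certainty.
import numpy as np

def confidence_disparity(confidences: dict[str, list[float]]) -> float:
    """confidences maps an identity group -> per-prompt coreference confidences."""
    means = {g: float(np.mean(c)) for g, c in confidences.items()}
    return max(means.values()) - min(means.values())

gap = confidence_disparity({
    "group_a": [0.91, 0.88, 0.93],   # illustrative confidences only
    "group_b": [0.55, 0.60, 0.58],
})                                   # -> about 0.33
```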

[37] Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens

Anna Seo Gyeong Choi, Hoon Choi

Main category: cs.CL

TL;DR: The paper examines ASR bias, highlighting its ethical implications beyond technical flaws, such as disrespect and historical injustices against marginalized linguistic communities. It distinguishes between neutral classification and harmful discrimination, identifies unique ethical dimensions of ASR bias, and calls for recognition of diverse speech varieties in ASR development.

Motivation: To address the limited research on fairness implications of ASR systems and their potential to perpetuate historical injustices against marginalized linguistic communities.

Method: The paper uses a philosophical lens to analyze ASR bias, distinguishing between morally neutral classification and harmful discrimination, and identifies three unique ethical dimensions of speech technologies.

Result: The study reveals that ASR bias creates asymmetric power relationships, disrupts conversational flow, and fails to respect linguistic diversity, which existing fairness metrics overlook.

Conclusion: Addressing ASR bias requires recognizing diverse speech varieties as legitimate and integrating this respect into ASR development, beyond mere technical fixes.

Abstract: Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation – it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate₁) and harmful discrimination (discriminate₂), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties (“temporal taxation”), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy.

[38] Gradient Surgery for Safe LLM Fine-Tuning

Biao Yi, Jiahao Li, Baolei Zhang, Lihai Nie, Tong Li, Tiansheng Huang, Zheli Liu

Main category: cs.CL

TL;DR: SafeGrad introduces gradient surgery to resolve conflicting gradients in fine-tuning LLMs, ensuring safety alignment without compromising task performance.

Motivation: Existing fine-tuning methods for LLMs are vulnerable to malicious examples, degrading safety alignment as harmful ratios increase.

Method: SafeGrad uses gradient surgery to nullify harmful components of user-task gradients and employs a KL-divergence alignment loss for robustness.

Result: SafeGrad achieves state-of-the-art defense, maintaining safety at high harmful ratios without losing task fidelity.

Conclusion: SafeGrad effectively balances safety and performance in fine-tuning LLMs, addressing critical vulnerabilities.

Abstract: Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user’s fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). While a recognized paradigm frames safe fine-tuning as a multi-objective optimization problem balancing user task performance with safety alignment, we find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases. We diagnose that this failure stems from conflicting gradients, where the user-task update directly undermines the safety objective. To resolve this, we propose SafeGrad, a novel method that employs gradient surgery. When a conflict is detected, SafeGrad nullifies the harmful component of the user-task gradient by projecting it onto the orthogonal plane of the alignment gradient, allowing the model to learn the user’s task without sacrificing safety. To further enhance robustness and data efficiency, we employ a KL-divergence alignment loss that learns the rich, distributional safety profile of the well-aligned foundation model. Extensive experiments show that SafeGrad provides state-of-the-art defense across various LLMs and datasets, maintaining robust safety even at high harmful ratios without compromising task fidelity.
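The surgery step itself is compact. Below is a flattened, single-vector sketch of the projection described in the abstract (the KL-divergence alignment loss that produces `g_align` is not shown):

```python
# Gradient surgery: when the user-task gradient conflicts with the safety
# alignment gradient, remove its component along the alignment direction.
import torch

def safegrad_step(g_task: torch.Tensor, g_align: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g_task, g_align)
    if dot < 0:  # conflict: the task update would undermine safety
        g_task = g_task - (dot / g_align.norm().pow(2)) * g_align
    return g_task  # orthogonal to g_align whenever a conflict was detected

g = safegrad_step(torch.tensor([1.0, -1.0]), torch.tensor([0.0, 1.0]))
# -> tensor([1., 0.]): the safety-opposing component is nullified
```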

[39] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Lijie Wen, Aiwei Liu

Main category: cs.CL

TL;DR: The paper introduces Omni-SafetyBench, the first benchmark for evaluating safety in Omni-modal Large Language Models (OLLMs), addressing gaps in existing benchmarks. It proposes tailored metrics and reveals critical vulnerabilities in current OLLMs.

Motivation: Existing benchmarks lack the ability to assess safety in OLLMs under audio-visual joint inputs or cross-modal consistency, necessitating a dedicated solution.

Method: The authors develop Omni-SafetyBench with 24 modality combinations and 972 samples per variation, including audio-visual harm cases. They propose Safety-score (C-ASR, C-RR) and CMSC-score for cross-modal consistency.

Result: Evaluation of 10 OLLMs shows vulnerabilities: no model excels in both safety and consistency, defenses weaken with complex inputs, and severe weaknesses exist (some models score as low as 0.14).

Conclusion: The benchmark and metrics highlight urgent needs for improving OLLM safety, providing a foundation for future research and enhancements.

Abstract: The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under audio-visual joint inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs’ comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially audio-visual joints; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements.

[40] Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback

Kejin Liu, Junhong Lian, Xiang Ao, Ningtao Wang, Xing Fu, Yu Cheng, Weiqiang Wang, Xinyu Liu

Main category: cs.CL

TL;DR: The paper introduces PHG-DIF, a framework for personalized headline generation that addresses click noise in historical behaviors, improving headline quality by filtering noise and modeling evolving user interests.

Motivation: Existing methods fail to account for personalized-irrelevant click noise, leading to inaccurate headlines. The paper aims to mitigate this issue.

Method: PHG-DIF uses dual-stage filtering to remove noise (short dwell times, abnormal clicks) and multi-level temporal fusion to model user interests.

Result: PHG-DIF achieves SOTA results on the DT-PENS dataset, significantly improving headline quality by reducing click noise effects.

Conclusion: The framework effectively addresses click noise, enhances personalized headline generation, and is supported by a new benchmark dataset (DT-PENS).

Abstract: Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect personalized-irrelevant click noise in entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to effectively remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users’ evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 annotated personalized headlines with historical dwell time annotations. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at https://github.com/liukejin-up/PHG-DIF.
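The dual-stage filter can be sketched with illustrative thresholds; the paper's tuned values and its exact burst definition may differ:

```python
# Stage 1 drops short-dwell clicks; stage 2 drops clicks inside abnormal
# bursts (too many clicks within a small time window).

def filter_clicks(clicks: list[dict], min_dwell: float = 5.0,
                  burst_window: float = 60.0, burst_limit: int = 5) -> list[dict]:
    kept = [c for c in clicks if c["dwell_time"] >= min_dwell]   # stage 1
    kept.sort(key=lambda c: c["timestamp"])
    result, start = [], 0
    for i, c in enumerate(kept):                                 # stage 2
        while c["timestamp"] - kept[start]["timestamp"] > burst_window:
            start += 1
        if i - start + 1 <= burst_limit:      # not part of an abnormal burst
            result.append(c)
    return result
```

The surviving clickstream then feeds the multi-level temporal fusion that profiles the user's evolving interests.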

[41] Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu

Main category: cs.CL

TL;DR: A framework for automated extraction of fine-grained schema lineage from multilingual enterprise pipelines is proposed to address semantic drift, evaluated using SLiCE and tested on 12 language models.

Motivation: Semantic drift in enterprise data pipelines compromises reproducibility and governance, affecting services like RAG and text-to-SQL systems.

Method: The framework extracts schema lineage by identifying source schemas, tables, transformation logic, and aggregation operations, creating a standardized representation. SLiCE evaluates lineage quality.

Result: Performance scales with model size and prompting techniques; a 32B open-source model matches GPT series under standard prompting.

Conclusion: The approach offers a scalable, economical solution for deploying schema-aware agents in practice.

Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This “semantic drift” compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) like GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.

[42] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan

Main category: cs.CL

TL;DR: DySK-Attn is a framework enabling LLMs to integrate real-time knowledge from a dynamic KG, improving accuracy and efficiency.

Motivation: LLMs' static knowledge becomes outdated, and retraining or editing them is costly or slow.

Method: Combines LLMs with a dynamic KG using a sparse knowledge attention mechanism for efficient, relevant fact retrieval.

Result: Outperforms baselines like RAG and model editing in accuracy and efficiency for time-sensitive tasks.

Conclusion: DySK-Attn provides a scalable solution for keeping LLMs updated with dynamic knowledge.

Abstract: Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.
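The coarse-to-fine mechanism reduces to top-k retrieval over fact embeddings followed by dense attention on only the retrieved subset; dimensions and embeddings below are placeholders:

```python
# Sparse knowledge attention: cheap similarity search narrows a large KG to
# k candidate facts, then full attention runs over just that subset.
import torch
import torch.nn.functional as F

def sparse_knowledge_attention(query: torch.Tensor, fact_emb: torch.Tensor,
                               fact_val: torch.Tensor, k: int = 32) -> torch.Tensor:
    """query: [d]; fact_emb, fact_val: [num_facts, d] from an updatable KG."""
    sims = fact_emb @ query                          # coarse stage
    top = torch.topk(sims, min(k, fact_emb.shape[0])).indices
    attn = F.softmax(fact_emb[top] @ query / query.shape[0] ** 0.5, dim=0)
    return attn @ fact_val[top]                      # fine stage

out = sparse_knowledge_attention(torch.randn(64),
                                 torch.randn(10_000, 64),
                                 torch.randn(10_000, 64))
```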

[43] Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment

Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu

Main category: cs.CL

TL;DR: TALON enhances LLM-based time series forecasting by addressing temporal heterogeneity and modality gaps, achieving superior performance.

Motivation: LLMs struggle with time series forecasting due to heterogeneous temporal patterns and the gap between numerical signals and discrete language representations.

Method: TALON uses a Heterogeneous Temporal Encoder for localized pattern modeling and a Semantic Alignment Module to bridge the modality gap.

Result: TALON outperforms state-of-the-art methods, with average MSE improvements of up to 11% across seven benchmarks.

Conclusion: Incorporating pattern-aware and semantic-aware designs effectively adapts LLMs for time series forecasting.

Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: https://github.com/syrGitHub/TALON.

[44] Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model

Chaoqun Cui, Siyuan Li, Kunkun Ma, Caiyan Jia

Main category: cs.CL

TL;DR: The paper introduces Post Engagement Prediction (PEP), a pretraining strategy to improve PLMs for rumor detection by modeling propagation structures and user engagements. It also releases large-scale datasets and demonstrates PEP’s effectiveness.

Motivation: PLMs underperform in social media tasks like rumor detection due to mismatches with pretraining corpora, inadequate handling of social symbols, and unsuitable pretraining tasks for user engagements.

Method: Proposes PEP, a strategy to predict post relations (root, branch, parent) and releases TwitterCorpus and unlabeled datasets (UTwitter, UWeibo). Trains SoLM, a Twitter-tailored PLM.

Result: PEP improves rumor detection accuracy by 1.0-3.7%, outperforming state-of-the-art methods. SoLM achieves competitive results without high-level modules.

Conclusion: PEP effectively enhances PLMs for rumor detection by learning post interaction features, validated by improved performance and competitive standalone results of SoLM.

Abstract: Pretrained Language Models (PLMs) have excelled in various Natural Language Processing tasks, benefiting from large-scale pretraining and the self-attention mechanism’s ability to capture long-range dependencies. However, their performance on social media application tasks like rumor detection remains suboptimal. We attribute this to mismatches between pretraining corpora and social texts, inadequate handling of unique social symbols, and pretraining tasks ill-suited for modeling user engagements implicit in propagation structures. To address these issues, we propose a continued pretraining strategy called Post Engagement Prediction (PEP) to infuse information from propagation structures into PLMs. PEP trains models to predict root, branch, and parent relations between posts, capturing interactions of stance and sentiment crucial for rumor detection. We also curate and release a large-scale Twitter corpus, TwitterCorpus (269GB of text), and two unlabeled claim conversation datasets with propagation structures (UTwitter and UWeibo). Utilizing these resources and the PEP strategy, we train a Twitter-tailored PLM called SoLM. Extensive experiments demonstrate that PEP significantly boosts rumor detection performance across universal and social media PLMs, even in few-shot scenarios. On benchmark datasets, PEP enhances baseline models by 1.0-3.7% in accuracy, even enabling them to outperform current state-of-the-art methods on multiple datasets. SoLM alone, without high-level modules, also achieves competitive results, highlighting the strategy’s effectiveness in learning discriminative post interaction features.
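The PEP pretraining labels can be read directly off a propagation tree. A sketch for one post pair, assuming a simple parent-pointer representation of the conversation:

```python
# Derive root / parent / branch relation labels for a pair of posts from a
# propagation tree given as {post_id: parent_id, None for the source post}.

def pep_relations(parent_of: dict, a: str, b: str) -> dict:
    def path_to_root(node: str) -> list:
        path = [node]
        while parent_of[path[-1]] is not None:
            path.append(parent_of[path[-1]])
        return path

    pa, pb = path_to_root(a), path_to_root(b)
    return {
        "root": pa[-1] == b,            # is b the thread's source post?
        "parent": parent_of[a] == b,    # does a directly reply to b?
        "branch": b in pa or a in pb,   # do they share a reply chain?
    }

tree = {"t0": None, "t1": "t0", "t2": "t1", "t3": "t0"}
labels = pep_relations(tree, "t2", "t0")  # root=True, parent=False, branch=True
```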

[45] Prompt Tuning for Few-Shot Continual Learning Named Entity Recognition

Zhe Ren

Main category: cs.CL

TL;DR: The paper addresses challenges in Few-Shot Continual Learning Named Entity Recognition (FS-CLNER) by introducing prompt tuning and memory demonstration templates to avoid the Few-Shot Distillation Dilemma and improve generalization.

Motivation: The scarcity of new-class entities and lack of old-class information in FS-CLNER tasks hinder knowledge distillation and generalization, leading to the Few-Shot Distillation Dilemma.

Method: The authors propose an expandable Anchor words-oriented Prompt Tuning (APT) paradigm and Memory Demonstration Templates (MDT) to bridge pre-training and fine-tuning gaps and provide replay samples for in-context learning.

Result: Experiments demonstrate competitive performance on FS-CLNER tasks.

Conclusion: The combination of APT and MDT effectively addresses the challenges of FS-CLNER, avoiding catastrophic forgetting and improving few-shot learning.

Abstract: Knowledge distillation has been successfully applied to Continual Learning Named Entity Recognition (CLNER) tasks, by using a teacher model trained on old-class data to distill old-class entities present in new-class data as a form of regularization, thereby avoiding catastrophic forgetting. However, in Few-Shot CLNER (FS-CLNER) tasks, the scarcity of new-class entities makes it difficult for the trained model to generalize during inference. More critically, the lack of old-class entity information hinders the distillation of old knowledge, causing the model to fall into what we refer to as the Few-Shot Distillation Dilemma. In this work, we address the above challenges through a prompt tuning paradigm and memory demonstration template strategy. Specifically, we designed an expandable Anchor words-oriented Prompt Tuning (APT) paradigm to bridge the gap between pre-training and fine-tuning, thereby enhancing performance in few-shot scenarios. Additionally, we incorporated Memory Demonstration Templates (MDT) into each training instance to provide replay samples from previous tasks, which not only avoids the Few-Shot Distillation Dilemma but also promotes in-context learning. Experiments show that our approach achieves competitive performance on FS-CLNER.

[46] The 2D+ Dynamic Articulatory Model DYNARTmo: Tongue-Palate Contact Area Estimation

Bernd J. Kröger

Main category: cs.CL

TL;DR: The paper extends the 2D DYNARTmo model by adding a 3D palatal dome representation to estimate tongue-palate contact areas from midsagittal tongue contours, using two dome geometries for lateral curvature modeling.

Motivation: To improve the accuracy of tongue-palate contact estimation and enable electropalatography-like visualizations in a 2D+ framework for speech science and therapy applications.

Method: Integrates two dome geometries (half-ellipse and cosine-based) into the model to compute lateral contact points analytically, supporting synchronized sagittal, glottal, and palatal views.

Result: The enhanced model provides dynamic and static articulation displays, useful for education and therapy, with plans for further additions like a facial view and articulatory-to-acoustic synthesis.

Conclusion: The extended DYNARTmo model successfully improves tongue-palate contact visualization and has potential for future enhancements to evaluate realism.

Abstract: This paper describes an extension of the two-dimensional dynamic articulatory model DYNARTmo by integrating an internal three-dimensional representation of the palatal dome to estimate tongue-palate contact areas from midsagittal tongue contours. Two alternative dome geometries - a half-ellipse and a cosine based profile - are implemented to model lateral curvature in the coronal plane. Using these geometries, lateral contact points are analytically computed for each anterior-posterior position, enabling the generation of electropalatography-like visualizations within the 2D+ framework. The enhanced model supports three synchronized views (sagittal, glottal, and palatal) for static and dynamic (animated) articulation displays, suitable for speech science education and speech therapy. Future work includes adding a facial (lip) view and implementing articulatory-to-acoustic synthesis to quantitatively evaluate model realism.
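Under one reading of the geometry (dome peak height h at the midline, falling to zero at lateral half-width w, with a flat tongue surface at height z in the coronal slice), the lateral contact boundary is analytic for both profiles:

```python
# Lateral tongue-palate contact: contact occurs for |y| >= y_c, since the
# dome is lowest at its lateral margins and highest at the midline.
import math

def contact_halfwidth(z: float, h: float, w: float, profile: str) -> float:
    if z >= h:
        return 0.0                        # full closure across the palate
    if z <= 0.0:
        return w                          # no contact anywhere
    if profile == "half-ellipse":         # z(y) = h * sqrt(1 - (y/w)^2)
        return w * math.sqrt(1.0 - (z / h) ** 2)
    if profile == "cosine":               # z(y) = h * cos(pi * y / (2 * w))
        return (2.0 * w / math.pi) * math.acos(z / h)
    raise ValueError(profile)

# A mid-height tongue leaves a midline channel with lateral contact bands:
yc = contact_halfwidth(z=0.5, h=1.0, w=1.0, profile="half-ellipse")  # ~0.87
```

Sweeping this computation along the anterior-posterior axis yields the electropalatography-like contact map described in the abstract.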

[47] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

Vasudha Varadarajan, Hui Xu, Rebecca Astrid Boehme, Mariam Marlan Mirstrom, Sverker Sikstrom, H. Andrew Schwartz

Main category: cs.CL

TL;DR: MAQuA is an adaptive framework for mental health screening using LLMs, reducing question burden by 50-87% while maintaining accuracy.

Motivation: To address inefficiency and user burden in LLM-based mental health assessments by optimizing question selection.

Method: Combines multi-outcome modeling, IRT, and factor analysis to dynamically select informative questions across symptom dimensions.

Result: Reduces questions needed for stable scores by 50-87% (e.g., 71% fewer for depression, 85% fewer for eating disorders).

Conclusion: MAQuA is an efficient, scalable tool for nuanced mental health screening, enhancing LLM integration into clinical workflows.

Abstract: Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
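The IRT core is easy to sketch for a single dimension with two-parameter logistic (2PL) items: always ask the unasked question with maximum Fisher information at the current ability estimate. MAQuA's multidimensional and factor-analytic machinery is omitted here:

```python
# Adaptive question selection via item information under a 2PL model:
# P(endorse) = 1 / (1 + exp(-a * (theta - b))), information = a^2 * P * (1-P).
import math

def item_information(theta: float, a: float, b: float) -> float:
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_question(theta: float, items: dict, asked: set) -> str:
    info = {q: item_information(theta, *ab)
            for q, ab in items.items() if q not in asked}
    return max(info, key=info.get)        # most informative remaining item

bank = {"sleep": (1.8, -0.3), "appetite": (1.2, 0.8), "mood": (2.1, 0.1)}
q = next_question(theta=0.0, items=bank, asked={"sleep"})  # -> "mood"
```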

[48] “Pull or Not to Pull?”: Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas

Junchen Ding, Penghao Jiang, Zihao Xu, Ziqi Ding, Yichen Zhu, Jiaojiao Jiang, Yuekang Li

Main category: cs.CL

TL;DR: The study evaluates 14 LLMs on moral reasoning using trolley problems, revealing variability in ethical alignment and advocating for moral reasoning benchmarks.

Motivation: Understanding LLMs' moral reasoning is crucial as they mediate ethically sensitive decisions.

Method: Evaluated 14 LLMs across 27 trolley scenarios using factorial prompting, analyzing decisions and justifications.

Result: Findings show variability in ethical alignment; reasoning-enabled models are more decisive but not always aligned with human consensus, with ‘sweet zones’ emerging under altruism, fairness, and virtue ethics framings.

Conclusion: Moral prompting is a diagnostic tool; moral reasoning should be a primary axis in LLM alignment with standardized benchmarks.

Abstract: As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning enabled and general purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. Our findings reveal significant variability across ethical frames and model types: reasoning enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, “sweet zones” emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.

[49] Arce: Augmented Roberta with Contextualized Elucidations for Ner in Automated Rule Checking

Jian Chen, Jinbao Tian, Yankui Li, Zhou Li

Main category: cs.CL

TL;DR: ARCE proposes a method to enhance NER in the AEC domain by using LLMs to generate simple explanations (Cote) for pre-training RoBERTa, achieving a state-of-the-art Macro-F1 score of 77.20%.

DetailsMotivation: Standard pre-trained models struggle with specialized AEC terminology, and manual corpus creation is costly. ARCE aims to automate knowledge generation to improve smaller models.

Method: ARCE uses an LLM to generate simple explanations (Cote), then pre-trains RoBERTa with this corpus before fine-tuning for NER.

Result: ARCE achieves a Macro-F1 score of 77.20%, outperforming existing methods and showing simple explanations are more effective than complex rationales.

Conclusion: ARCE demonstrates the effectiveness of automated knowledge generation for domain-specific NER, with simple explanations proving superior.

Abstract: Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: https://github.com/nxcc-lab/ARCE.
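
The two-stage recipe (continue masked-language-model pre-training on LLM-generated explanations, then fine-tune for NER) can be sketched with Hugging Face transformers as below. The explanation sentence, masking rate, and num_labels are illustrative assumptions; a real run would iterate over the Cote corpus with an optimizer rather than take a single gradient step.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForTokenClassification)

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Stage 1: one illustrative MLM step on an LLM-generated explanation ("Cote").
text = "A fire partition is a wall assembly that slows the spread of fire."
batch = tok(text, return_tensors="pt")
labels = batch["input_ids"].clone()
special = torch.tensor(
    tok.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool)
mask = (torch.rand(labels.shape) < 0.15) & ~special  # mask ~15% of real tokens
mask[0, 1] = True                      # ensure at least one masked position
batch["input_ids"][mask] = tok.mask_token_id
labels[~mask] = -100                   # compute loss only on masked positions
mlm(**batch, labels=labels).loss.backward()

# Stage 2: reuse the adapted encoder under a token-classification (NER) head.
mlm.roberta.save_pretrained("arce-adapted")        # encoder weights only
ner = AutoModelForTokenClassification.from_pretrained(
    "arce-adapted", num_labels=9)  # placeholder BIO tag-set size
```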

[50] CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Yang Xiang, Ming Liu

Main category: cs.CL

TL;DR: The paper introduces CCFQA, a benchmark for evaluating multilingual and multimodal factuality in MLLMs, highlighting gaps in current evaluations and proposing a few-shot transfer learning solution.

DetailsMotivation: Existing benchmarks for MLLMs focus on English and textual/visual modalities, neglecting multilingual speech input, creating a need for a more comprehensive evaluation tool.

Method: The CCFQA benchmark includes parallel speech-text questions in 8 languages. A few-shot transfer learning strategy is proposed to adapt English QA capabilities to multilingual SQA tasks.

Result: Current MLLMs struggle with CCFQA, but the proposed few-shot method achieves competitive performance with minimal training.

Conclusion: CCFQA fills a critical gap in MLLM evaluation, promoting robust multilingual speech understanding, with code and dataset publicly available.

Abstract: As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs’ cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.

[51] HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways

Cristian Cosentino, Annamaria Defilippo, Marco Dossena, Christopher Irwin, Sara Joubbi, Pietro Liò

Main category: cs.CL

TL;DR: HealthBranches is a benchmark dataset for medical Q&A, designed to evaluate LLMs’ complex reasoning using 4,063 case studies across 17 topics, with clinically validated reasoning paths.

DetailsMotivation: To create a reliable dataset for evaluating LLMs' multi-step inference in medical contexts, ensuring trustworthiness and clinical reliability.

Method: Semi-automated pipeline transforms medical decision pathways into realistic patient cases with Q&A, including reasoning paths.

Result: Dataset supports open-ended and multiple-choice Q&A formats, enabling robust evaluation of LLMs in structured RAG contexts.

Conclusion: HealthBranches aids in developing trustworthy, interpretable LLMs for high-stakes medical applications and serves educational purposes.

Abstract: HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs’ multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.

[52] ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh

Main category: cs.CL

TL;DR: The paper introduces ObfusQA, a framework to test LLM robustness against obfuscated questions, revealing their limitations and hallucination tendencies.

DetailsMotivation: To address the lack of studies on LLM robustness when faced with obfuscated questions, aiming to evaluate their adaptability.

Method: Proposes ObfusQAte, an obfuscation technique, and the ObfusQA framework with multi-tiered obfuscation levels across three dimensions: Named-Entity Indirection, Distractor Indirection, and Contextual Overload.

Result: LLMs often fail or hallucinate when handling nuanced, obfuscated questions.

Conclusion: ObfusQA serves as a benchmark for LLM robustness, with ObfusQAte made public to encourage further research.

Abstract: The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs’ robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte, and, leveraging it, introduce ObfusQA, a comprehensive, first-of-its-kind framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.

[53] Strategies of Code-switching in Human-Machine Dialogs

Dean Geckt, Melinda Fricke, Shuly Wintner

Main category: cs.CL

TL;DR: A chatbot was developed to study code-switching in Spanish-English bilinguals, revealing that predictable code-switching is preferred over random or ungrammatical patterns.

DetailsMotivation: To understand the characteristics of code-switched language and explore the feasibility of using chatbots for bilingual language research.

Method: A chatbot was designed to complete a Map Task with human participants, testing different code-switching strategies.

Result: Participants preferred predictable code-switching; random or ungrammatical patterns reduced enjoyment and task success.

Conclusion: The study highlights the risks of undeveloped multilingual tech and its potential for bilingual research.

Abstract: Most people are multilingual, and most multilinguals code-switch, yet the characteristics of code-switched language are not fully understood. We developed a chatbot capable of completing a Map Task with human participants using code-switched Spanish and English. In two experiments, we prompted the bot to code-switch according to different strategies, examining (1) the feasibility of such experiments for investigating bilingual language use, and (2) whether participants would be sensitive to variations in discourse and grammatical patterns. Participants generally enjoyed code-switching with our bot as long as it produced predictable code-switching behavior; when code-switching was random or ungrammatical (as when producing unattested incongruent mixed-language noun phrases, such as ‘la fork’), participants enjoyed the task less and were less successful at completing it. These results underscore the potential downsides of deploying insufficiently developed multilingual language technology, while also illustrating the promise of such technology for conducting research on bilingual language use.

[54] Grounding Multilingual Multimodal LLMs With Cultural Knowledge

Jean de Dieu Nyandwi, Yueqi Song, Simran Khanuja, Graham Neubig

Main category: cs.CL

TL;DR: The paper introduces a data-centric approach to improve Multimodal Large Language Models (MLLMs) by grounding them in cultural knowledge, addressing gaps in long-tail cultural entities and low-resource languages.

DetailsMotivation: MLLMs often misinterpret long-tail cultural entities and underperform in low-resource languages, creating a need for culturally inclusive solutions.

Method: The authors use Wikidata to create a dataset (CulturalGround) of 22M culturally-rich VQA pairs across 42 countries and 39 languages, then train an MLLM (CulturalPangea) on this data.

Result: CulturalPangea outperforms prior models by 5.0 on culture-focused benchmarks without degrading performance on mainstream tasks.

Conclusion: The culturally grounded approach narrows the cultural gap in MLLMs and advances globally inclusive multimodal systems.

Abstract: Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM CulturalPangea on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.

[55] Let’s Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs

Zhiyi Lyu, Jianguo Huang, Yanchen Deng, Steven Hoi, Bo An

Main category: cs.CL

TL;DR: ReLoc is a unified local search framework for efficient and scalable code generation, outperforming existing methods by leveraging step-by-step revisions and a specialized reward model.

DetailsMotivation: Addressing efficiency and scalability challenges in LLM-based code generation, particularly the limitations of construction-based tree-search and improvement-based methods.

Method: ReLoc uses a four-component framework: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, guided by a revision reward model.

Result: Superior performance in diverse code generation tasks, surpassing both construction-based tree search and state-of-the-art improvement-based methods.

Conclusion: ReLoc provides an effective and scalable solution for code generation, overcoming key challenges in existing approaches.

Abstract: Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and lack of anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose ReLoc, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search as well as the state-of-the-art improvement-based code generation methods.
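
A minimal hill-climbing instantiation of the four components might look like the sketch below; draft, revise, and reward are placeholders for LLM calls and the paper's revision reward model, not its implementation.

```python
import random

def draft(task):
    return "def solve(xs):\n    return sorted(xs)"  # stand-in LLM draft

def revise(code, task):
    return code + "\n# local revision"              # stand-in LLM revision step

def reward(code, task):
    return random.random()  # stand-in for the revision reward model

def reloc_hill_climb(task, steps=5, width=3):
    incumbent = draft(task)                        # 1. initial code drafting
    best = reward(incumbent, task)
    for _ in range(steps):
        neighbors = [revise(incumbent, task) for _ in range(width)]  # 2.
        score, cand = max((reward(c, task), c) for c in neighbors)   # 3.
        if score <= best:
            break                                  # no improving neighbor
        best, incumbent = score, cand              # 4. incumbent updating
    return incumbent

print(reloc_hill_climb("sort a list"))
```

Swapping the acceptance rule and the way neighbors are produced turns the same skeleton into other local search variants, e.g., a genetic algorithm over a population of candidates.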

[56] Positional Biases Shift as Inputs Approach Context Window Limits

Blerta Veseli, Julian Chibane, Mariya Toneva, Alexander Koller

Main category: cs.CL

TL;DR: The study investigates positional biases in LLMs, finding the ‘Lost in the Middle’ effect is strongest when inputs occupy up to 50% of the context window and gives way to a distance-based bias beyond that; observed reasoning biases are largely inherited from retrieval.

DetailsMotivation: To clarify inconsistent findings about positional biases (e.g., primacy, recency) in LLMs and their impact on long-context performance.

Method: Comprehensive analysis using relative input lengths (w.r.t. model’s context window) to study positional biases.

Result: LiM effect strongest at ≤50% context window; primacy weakens beyond this, recency persists. Retrieval biases drive reasoning performance.

Conclusion: Findings inform long-context task design, LLM benchmarks, and evaluation methods, emphasizing retrieval’s role in positional biases.

Abstract: Large Language Models (LLMs) often struggle to use information across long inputs effectively. Prior work has identified positional biases, such as the Lost in the Middle (LiM) effect, where models perform better when information appears at the beginning (primacy bias) or end (recency bias) of the input, rather than in the middle. However, long-context studies have not consistently replicated these effects, raising questions about their intensity and the conditions under which they manifest. To address this, we conducted a comprehensive analysis using relative rather than absolute input lengths, defined with respect to each model’s context window. Our findings reveal that the LiM effect is strongest when inputs occupy up to 50% of a model’s context window. Beyond that, the primacy bias weakens, while recency bias remains relatively stable. This effectively eliminates the LiM effect; instead, we observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input. Furthermore, our results suggest that successful retrieval is a prerequisite for reasoning in LLMs, and that the observed positional biases in reasoning are largely inherited from retrieval. These insights have implications for long-context tasks, the design of future LLM benchmarks, and evaluation methodologies for LLMs handling extended inputs.
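
The relative-length protocol can be mimicked by filling a prompt to a target fraction of the context window and burying the key fact at a chosen relative depth, as in the sketch below; the filler text and characters-per-token ratio are crude stand-ins for real tokenized inputs.

```python
FILLER = "The sky was clear and the road was quiet that day. "

def build_prompt(fact, question, ctx_window, fill_frac, depth_frac,
                 chars_per_token=4):
    """Occupy fill_frac of the window; place fact at relative depth depth_frac."""
    budget = int(ctx_window * fill_frac * chars_per_token)  # rough char budget
    n_fill = max(0, (budget - len(fact)) // len(FILLER))
    before = int(n_fill * depth_frac)
    body = FILLER * before + fact + " " + FILLER * (n_fill - before)
    return body + "\nQuestion: " + question

# e.g., an input at 50% of an 8k window with the fact in the middle (depth 0.5),
# the regime where the Lost-in-the-Middle effect is reported to be strongest.
prompt = build_prompt("The vault code is 4172.", "What is the vault code?",
                      ctx_window=8192, fill_frac=0.5, depth_frac=0.5)
```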

[57] ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models

Archchana Sindhujan, Shenbin Qian, Chan Chi Chun Matthew, Constantin Orasan, Diptesh Kanojia

Main category: cs.CL

TL;DR: ALOPE is an adaptive layer-optimization framework enhancing LLM-based Quality Estimation (QE) for Machine Translation by restructuring Transformer representations and integrating low-rank adapters with regression task heads.

DetailsMotivation: Existing LLM-based QE systems are limited by their pre-training for causal language modeling and struggle with low-resource languages. ALOPE aims to improve cross-lingual alignment and regression-based prediction.

Method: ALOPE uses low-rank adapters (LoRA) with regression task heads, dynamic weighting for combining layer representations, and multi-head regression for aggregating losses. It leverages intermediate Transformer layers for better cross-lingual alignment.

Result: ALOPE outperforms existing LLM-based QE approaches, demonstrating that intermediate Transformer layers provide more aligned contextual representations for QE.

Conclusion: ALOPE enhances LLM-based QE, offering scalable QE capabilities for MT frameworks, with models and code made publicly available.

Abstract: Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, further elevated by the presence of low-resource languages given pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make resultant models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be scaled with QE capabilities.
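
The dynamic-weighting idea (learn softmax weights over selected hidden layers, pool, and regress a quality score) can be sketched in PyTorch as below; the layer indices, head sizes, and pooling are assumptions, and the LoRA adapters are omitted.

```python
import torch
import torch.nn as nn

class DynamicLayerQE(nn.Module):
    """Learned mixture of selected transformer layers for QE regression."""
    def __init__(self, hidden=768, layers=(4, 8, 12)):
        super().__init__()
        self.layers = layers
        self.layer_logits = nn.Parameter(torch.zeros(len(layers)))
        self.head = nn.Sequential(nn.Linear(hidden, 128), nn.Tanh(),
                                  nn.Linear(128, 1))

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of (batch, seq, hidden) tensors from
        # model(..., output_hidden_states=True); index 0 is the embeddings.
        w = torch.softmax(self.layer_logits, dim=0)
        mixed = sum(w[i] * hidden_states[l] for i, l in enumerate(self.layers))
        m = attention_mask.unsqueeze(-1).float()
        pooled = (mixed * m).sum(1) / m.sum(1)   # mean over non-pad tokens
        return self.head(pooled).squeeze(-1)     # predicted quality score

# Trained with MSE against human quality scores; instantiating several heads
# over different layer sets and summing their losses would give a multi-head
# regression variant.
```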

[58] Augmenting Bias Detection in LLMs Using Topological Data Analysis

Keshav Varadarajan, Tananun Songdechakraiwut

Main category: cs.CL

TL;DR: A method using topological data analysis identifies bias-contributing attention heads in GPT-2, revealing hot spots for specific bias categories like gender or profession.

DetailsMotivation: Existing bias detection methods lack tools to pinpoint specific parts of large language models responsible for bias towards certain groups.

Method: Topological data analysis is applied to GPT-2 to identify attention heads contributing to bias in the StereoSet dataset.

Result: Biases for categories like gender or profession are concentrated in specific attention heads acting as hot spots.

Conclusion: The proposed metric can help identify bias sources and potentially aid in de-biasing large language models in future work.

Abstract: Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.

[59] Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews

Joseph T. Colonel, Baihan Lin

Main category: cs.CL

TL;DR: ThemeClouds is a tool using LLMs to create thematic word clouds from dialogue, improving on traditional frequency-based methods by focusing on participant-weighted themes.

DetailsMotivation: Traditional word clouds fail in conversational contexts by highlighting filler words and fragmenting ideas, limiting their usefulness for qualitative analysis.

Method: ThemeClouds uses LLMs to identify concept-level themes and counts unique participant mentions, creating visualizations based on breadth of mention.

Result: ThemeClouds outperforms frequency clouds and topic-modeling baselines, revealing more actionable insights in user study data.

Conclusion: The tool offers transparency, control, and potential for interactive analyses, though design trade-offs and implications for researcher agency are discussed.

Abstract: Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts (“diff clouds”).
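
The participant-weighted counting step reduces to counting unique participants per theme rather than raw token frequency; a minimal sketch, assuming theme assignment has already been done by an LLM:

```python
from collections import Counter

# Stand-in for LLM output: concept-level themes found in each transcript.
themes_by_participant = {
    "P01": {"battery life", "mic placement"},
    "P02": {"battery life"},
    "P03": {"mic placement", "battery life", "setup time"},
}

# Weight = number of unique participants mentioning the theme.
weights = Counter(t for themes in themes_by_participant.values() for t in themes)
for theme, n in weights.most_common():
    print(f"{theme}: {n} participant(s)")
# The counts can be passed to any word-cloud renderer as term weights.
```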

[60] From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR

Jia Deng, Jie Chen, Zhipeng Chen, Daixuan Cheng, Fei Bai, Beichen Zhang, Yinqian Min, Yanzipeng Gao, Wayne Xin Zhao, Ji-Rong Wen

Main category: cs.CL

TL;DR: The paper investigates exploration behaviors in RLVR for LLMs, focusing on space shaping, entropy-performance exchange, and RL optimization.

DetailsMotivation: To understand and improve the exploration mechanisms in RLVR for enhancing LLM reasoning.

Method: Systematic study of exploration capacities, including quantitative metrics, entropy-performance analysis, and RL optimization techniques.

Result: Provides empirical evidence and a framework for advancing RLVR systems.

Conclusion: The work lays a foundation for future improvements in RLVR by unifying insights and evidence.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains – a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR’s empirical success, the fundamental mechanisms governing LLMs’ exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering three main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs’ capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.
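
The entropy side of the entropy-performance analysis boils down to per-token predictive entropy over the model's next-token distribution; a minimal sketch on toy logits:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """Shannon entropy (nats) of the next-token distribution at each position."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

logits = torch.randn(1, 4, 10)   # toy (batch, seq, vocab) instead of LLM output
H = token_entropy(logits)        # shape (1, 4): entropy at each position
print(H.mean().item())           # a simple sequence-level exploration proxy
```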

[61] IBPS: Indian Bail Prediction System

Puspesh Kumar Srivastava, Uddeshya Raj, Praveen Patel, Shubham Kumar Nigam, Noel Shallum, Arnab Bhattacharya

Main category: cs.CL

TL;DR: The paper introduces IBPS, an AI system for predicting bail outcomes in Indian courts, using a dataset of 150,430 bail judgments. It shows improved accuracy and fairness with statutory context.

DetailsMotivation: Address subjectivity, delays, and inconsistencies in Indian bail decisions, exacerbated by undertrial prisoners and judicial backlog.

Method: Curate a dataset of bail judgments, fine-tune a language model with statutory context, and evaluate performance with RAG.

Result: Models with statutory knowledge outperform baselines, achieving high accuracy and explanation quality.

Conclusion: IBPS offers a scalable, transparent solution to improve bail decision-making and procedural fairness in India.

Abstract: Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India’s prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system.

[62] Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements

Ziheng Li, Zhi-Hong Deng

Main category: cs.CL

TL;DR: KeyCP++ improves one-shot event detection in LLMs by using keyword-centric chain-of-thought prompting to address over-interpretation and logical gaps.

DetailsMotivation: LLMs struggle with event detection due to poor understanding of triggers and over-interpretation, which in-context learning alone cannot fix.

Method: KeyCP++ introduces a trigger discrimination prompting template, using keywords to profile triggers, propose candidates, and justify them, reducing keyword reliance.

Result: Experiments show KeyCP++ significantly advances one-shot event detection performance.

Conclusion: KeyCP++ effectively mitigates LLM weaknesses in event detection by enhancing rationale generation and rule learning.

Abstract: Although the LLM-based in-context learning (ICL) paradigm has demonstrated considerable success across various natural language processing tasks, it encounters challenges in event detection. This is because LLMs lack an accurate understanding of event triggers and tend toward over-interpretation, which cannot be effectively corrected through in-context examples alone. In this paper, we focus on the most challenging one-shot setting and propose KeyCP++, a keyword-centric chain-of-thought prompting approach. KeyCP++ addresses the weaknesses of conventional ICL by automatically annotating the logical gaps between input text and detection results for the demonstrations. Specifically, to generate in-depth and meaningful rationales, KeyCP++ constructs a trigger discrimination prompting template. It incorporates the exemplary triggers (a.k.a. keywords) into the prompt as an anchor to simplify trigger profiling, lets the LLM propose candidate triggers, and justifies each candidate. These propose-and-judge rationales help LLMs mitigate over-reliance on the keywords and promote detection rule learning. Extensive experiments demonstrate the effectiveness of our approach, showcasing significant advancements in one-shot event detection.
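
A propose-and-judge demonstration of this kind could be rendered as a plain-text prompt like the hypothetical one below; the event type, keywords, and wording are invented, not the paper's template.

```python
# Hypothetical one-shot demonstration in the propose-and-judge style.
DEMO = """\
Event type: Attack. Exemplary triggers (keywords): bombing, assault, raid.
Text: "Rebels shelled the northern district overnight."
Candidate triggers: shelled, district, overnight
Judgments:
- shelled: YES, it is the action that realizes an Attack event.
- district: NO, it names the location, not the event.
- overnight: NO, it only situates the event in time.
Answer: shelled
"""

def build_prompt(text):
    return DEMO + f'\nText: "{text}"\nCandidate triggers:'

print(build_prompt("Protesters stormed the parliament building."))
```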

[63] InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth, Vivek Gupta

Main category: cs.CL

TL;DR: InterChart is a benchmark for evaluating vision-language models’ ability to reason across multiple related charts, revealing their limitations in complex tasks.

DetailsMotivation: To address the lack of benchmarks for multimodal reasoning across diverse, related charts in real-world applications like scientific reporting and financial analysis.

Method: InterChart organizes tasks into three difficulty tiers: factual reasoning, integrative analysis, and semantic inference over real-world chart pairs.

Result: State-of-the-art VLMs show declining accuracy as chart complexity increases, performing better when charts are simplified.

Conclusion: InterChart highlights VLMs’ struggles with cross-chart integration and provides a framework for improving multimodal reasoning.

Abstract: We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.

[64] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval

Luyao Zhuang, Qinggang Zhang, Huachi Zhou, Juhua Liu, Qing Li, Xiao Huang

Main category: cs.CL

TL;DR: LoSemB is a novel framework for inductive tool retrieval in LLMs, addressing issues of distribution shift and similarity-based retrieval vulnerability by leveraging logical information from prior experience.

DetailsMotivation: Existing tool retrieval methods for LLMs assume all tools are observed during training, which is unrealistic as real-world tool repositories evolve. LoSemB aims to handle unseen tools effectively.

Method: LoSemB uses a logic-based embedding alignment module to mitigate distribution shifts and a relational augmented retrieval mechanism to improve similarity-based retrieval.

Result: Extensive experiments show LoSemB performs well in inductive settings and maintains effectiveness in transductive settings.

Conclusion: LoSemB provides a robust solution for inductive tool retrieval, addressing key limitations of current methods.

Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.

[65] What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction

Charlie Wyatt, Aditya Joshi, Flora Salim

Main category: cs.CL

TL;DR: The paper evaluates commercial LLMs on Masked Sentence Prediction (MSP), revealing their poor performance in low-structured domains despite strong results in other tasks.

DetailsMotivation: To assess if LLMs, which rely on Next Token Prediction (NTP), can maintain long-range coherence and predict full sentences in structured documents.

Method: Evaluated GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash on MSP across narrative, procedural, and expository domains, measuring fidelity and cohesiveness.

Result: Commercial LLMs perform poorly in predicting masked sentences in low-structured domains.

Conclusion: Current LLMs lack explicit mechanisms for global coherence, highlighting a gap in their capabilities for tasks requiring long-range planning.

Abstract: Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP’s focus on single-token prediction often limits a model’s ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries, an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP) - the task of infilling a randomly removed sentence - from three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities.
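
An MSP instance is easy to construct: remove one sentence, ask for an infill, and compare the prediction to the removed sentence. The sketch below uses a surface string-overlap ratio as a crude stand-in for the paper's fidelity and cohesiveness measures.

```python
import random
from difflib import SequenceMatcher

def make_msp_instance(sentences, seed=0):
    rng = random.Random(seed)
    i = rng.randrange(len(sentences))
    context = sentences[:i] + ["[MISSING SENTENCE]"] + sentences[i + 1:]
    prompt = ("Fill in the missing sentence so the passage reads coherently:\n\n"
              + " ".join(context))
    return prompt, sentences[i]

def fidelity(prediction, target):
    # Crude surface-overlap stand-in for the paper's fidelity metric.
    return SequenceMatcher(None, prediction.lower(), target.lower()).ratio()

story = ["Ana trained all winter.", "She entered the spring marathon.",
         "She finished in under four hours."]
prompt, gold = make_msp_instance(story)
print(fidelity("She signed up for the spring marathon.", gold))
```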

[66] Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models

Zhenliang Zhang, Junzhe Zhang, Xinyu Hu, HuiXuan Zhang, Xiaojun Wan

Main category: cs.CL

TL;DR: The paper explores if social bias causes faithfulness hallucinations in LLMs, using SCM and a new dataset (BID) to isolate causality, finding biases significantly contribute to hallucinations.

DetailsMotivation: To investigate the unexplored causal relationship between social bias and faithfulness hallucinations in LLMs, addressing the challenge of confounder control.

Method: Uses Structural Causal Model (SCM) to establish causality, designs bias interventions, and introduces the Bias Intervention Dataset (BID) for precise measurement.

Result: Experiments show biases significantly cause faithfulness hallucinations, with varying directional effects, and highlight bias’s subtle causal role in unfairness hallucinations.

Conclusion: Social bias is a significant cause of faithfulness hallucinations in LLMs, with distinct effects per bias state, emphasizing the need for bias-aware interventions.

Abstract: Large language models (LLMs) have achieved remarkable success in various tasks, yet they remain vulnerable to faithfulness hallucinations, where the output does not align with the input. In this study, we investigate whether social bias contributes to these hallucinations, a causal relationship that has not been explored. A key challenge is controlling confounders within the context, which complicates the isolation of causality between bias states and hallucinations. To address this, we utilize the Structural Causal Model (SCM) to establish and validate the causality and design bias interventions to control confounders. In addition, we develop the Bias Intervention Dataset (BID), which includes various social biases, enabling precise measurement of causal effects. Experiments on mainstream LLMs reveal that biases are significant causes of faithfulness hallucinations, and the effect of each bias state differs in direction. We further analyze the scope of these causal effects across various models, specifically focusing on unfairness hallucinations, which are primarily targeted by social bias, revealing the subtle yet significant causal effect of bias on hallucination generation.

[67] SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

Zeyu Yang, Lai Wei, Roman Koshkin, Xi Chen, Satoshi Nakamura

Main category: cs.CL

TL;DR: A grammar-based chunking strategy for semantic segmentation, combined with SASST, improves simultaneous speech translation by optimizing timing and content.

DetailsMotivation: To enhance simultaneous speech translation by leveraging syntactic structures and minimizing semantic fragmentation.

Method: Proposes a grammar-based chunking strategy and integrates it into SASST, an end-to-end framework using Whisper encoder and decoder-only LLM.

Result: Significant translation quality improvements on CoVoST2 multilingual corpus, validating syntactic structures’ role in SimulST systems.

Conclusion: The method effectively improves translation quality and timing in simultaneous speech translation.

Abstract: This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on the CoVoST2 multilingual corpus (En→{De, Zh, Ja}) demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.
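
A heavily simplified version of grammar-based chunking, cutting at punctuation and noun-phrase boundaries with spaCy; the paper's rules (e.g., verb-object structures and streaming operation) are richer than this offline sketch.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def grammar_chunks(text):
    """Cut at punctuation and after the last token of each noun phrase."""
    doc = nlp(text)
    np_ends = {chunk.end - 1 for chunk in doc.noun_chunks}
    chunks, current = [], []
    for tok in doc:
        current.append(tok.text)
        if tok.is_punct or tok.i in np_ends:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(grammar_chunks("The delegation from Japan visited the new plant, "
                     "and the director answered their questions."))
```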

[68] Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li

Main category: cs.CL

TL;DR: Grove MoE introduces heterogeneous experts of varying sizes for dynamic parameter activation, improving computational efficiency in large language models.

DetailsMotivation: Traditional MoE architectures use uniform expert sizes, limiting efficiency. Grove MoE addresses this by varying expert sizes dynamically.

Method: Introduces adjugate experts with dynamic activation, applied to Qwen3-30B-A3B-Base via upcycling, creating GroveMoE-Base and GroveMoE-Inst.

Result: Achieves performance comparable to SOTA models while activating only 3.14-3.28B parameters dynamically.

Conclusion: Grove MoE enhances efficiency and scalability in LLMs by dynamically adjusting to input complexity.

Abstract: The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
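
A toy rendering of heterogeneous experts: route each token to one expert, where experts differ in hidden width so per-token compute varies with the router's choice. Hard top-1 routing and the absence of load balancing or the paper's adjugate grouping are simplifications.

```python
import torch
import torch.nn as nn

class HeteroMoE(nn.Module):
    """Top-1 routing over experts of different hidden widths (toy sketch)."""
    def __init__(self, d_model=64, widths=(32, 64, 128, 256)):
        super().__init__()
        self.router = nn.Linear(d_model, len(widths))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, w), nn.GELU(), nn.Linear(w, d_model))
            for w in widths)

    def forward(self, x):                        # x: (tokens, d_model)
        choice = self.router(x).argmax(dim=-1)   # hard routing for clarity;
        out = torch.zeros_like(x)                # trainable MoEs use soft gates
        for e, expert in enumerate(self.experts):
            sel = choice == e
            if sel.any():                        # wide experts run only when chosen
                out[sel] = expert(x[sel])
        return out

print(HeteroMoE()(torch.randn(10, 64)).shape)    # torch.Size([10, 64])
```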

[69] Can You Trick the Grader? Adversarial Persuasion of LLM Judges

Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung

Main category: cs.CL

TL;DR: Persuasive language can bias LLM judges, inflating scores for incorrect math solutions, with Consistency being the most impactful technique. Model size and counter-prompting don’t fully mitigate this vulnerability.

DetailsMotivation: To investigate if persuasive language can unfairly influence LLM judges in scoring tasks where correctness should be style-independent.

Method: Formalized seven persuasion techniques (e.g., Majority, Flattery) and embedded them into identical math responses. Tested across six benchmarks using LLM judges.

Result: Persuasive language inflated scores for incorrect solutions by up to 8%, with Consistency causing the worst bias. Combining techniques amplified the effect.

Conclusion: LLM-as-a-Judge pipelines are vulnerable to persuasion-based attacks, requiring robust defenses.

Abstract: As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle’s rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.
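
The manipulation amounts to prepending a persuasion cue to otherwise identical answers and comparing judge scores across variants; the cue wordings below are illustrative, not the paper's prompts.

```python
# Illustrative cues for a few of the seven techniques.
CUES = {
    "none": "",
    "majority": "Most reviewers who checked this solution agreed it is correct. ",
    "authority": "As verified by a senior mathematics professor: ",
    "consistency": "Consistent with my earlier verified answers: ",
}
SOLUTION = "The answer is 42 because 6 * 7 = 42."

variants = {name: cue + SOLUTION for name, cue in CUES.items()}
for name, text in variants.items():
    print(f"[{name}] {text}")
# Each variant is scored by the judge model; any score gap versus "none" on
# identical mathematics is persuasion-induced bias.
```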

[70] Evaluating Compositional Approaches for Focus and Sentiment Analysis

Olga Kellert, Muhammad Imran, Nicholas Hill Matlis, Mahmud Uz Zaman, Carlos Gómez-Rodríguez

Main category: cs.CL

TL;DR: The paper evaluates a compositional approach for Focus Analysis (FA) and Sentiment Analysis (SA), bridging a research gap in FA by applying SA’s compositional rules. It tests accuracy against non-compositional methods and generalizes findings from SA to FA.

DetailsMotivation: To address the lack of quantitative evaluations of compositional approaches in FA, leveraging the close relationship between FA and SA.

Method: Uses Universal Dependencies (UDs) for syntactic rules in SA, comparing with non-compositional method VADER, and extends results to FA.

Result: Demonstrates advantages of compositional analysis (interpretability, explainability) and tests accuracy against VADER.

Conclusion: Compositional rules in SA apply to FA, with potential benefits for both fields.

Abstract: This paper summarizes the results of evaluating a compositional approach for Focus Analysis (FA) in Linguistics and Sentiment Analysis (SA) in Natural Language Processing (NLP). While quantitative evaluations of compositional and non-compositional approaches in SA exist in NLP, similar quantitative evaluations are very rare in FA in Linguistics that deal with linguistic expressions representing focus or emphasis such as “it was John who left”. We fill this gap in research by arguing that compositional rules in SA also apply to FA because FA and SA are closely related meaning that SA is part of FA. Our compositional approach in SA exploits basic syntactic rules such as rules of modification, coordination, and negation represented in the formalism of Universal Dependencies (UDs) in English and applied to words representing sentiments from sentiment dictionaries. Some of the advantages of our compositional analysis method for SA in contrast to non-compositional analysis methods are interpretability and explainability. We test the accuracy of our compositional approach and compare it with a non-compositional approach VADER that uses simple heuristic rules to deal with negation, coordination and modification. In contrast to previous related work that evaluates compositionality in SA on long reviews, this study uses more appropriate datasets to evaluate compositionality. In addition, we generalize the results of compositional approaches in SA to compositional approaches in FA.
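
A minimal sketch of compositional rules over a dependency parse: look up word polarity in a lexicon, flip it under negation, and scale it under adverbial modification. The toy lexicon, intensifier table, and the copula-aware negation check are assumptions layered on spaCy's English parse, which approximates UD relations.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}   # toy scores
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}

def sentence_score(text):
    total = 0.0
    for tok in nlp(text):
        score = LEXICON.get(tok.lemma_.lower())
        if score is None:
            continue
        # Negation rule: "not" may attach to the word or, via a copula, its head.
        nearby = list(tok.children) + list(tok.head.children)
        if any(c.dep_ == "neg" for c in nearby):
            score = -score
        # Modification rule: intensifying adverbs scale the polarity.
        for child in tok.children:
            if child.dep_ == "advmod":
                score *= INTENSIFIERS.get(child.lemma_.lower(), 1.0)
        total += score
    return total

print(sentence_score("The food was not good but the service was very great."))
```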

[71] Evaluating Large Language Models as Expert Annotators

Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen

Main category: cs.CL

TL;DR: The paper explores whether top-performing LLMs can replace human expert annotators in specialized domains (finance, biomedicine, law) using multi-agent discussion frameworks and reasoning models. Findings show marginal gains for individual LLMs, no significant improvement from reasoning models, and unique behaviors in multi-agent setups.

DetailsMotivation: To assess if LLMs, perceived as expert-level in benchmarks, can effectively replace human expert annotators in specialized domains, addressing the high cost and labor of manual annotation.

Method: Evaluates individual LLMs and multi-agent discussion frameworks in finance, biomedicine, and law, incorporating reasoning models (e.g., o3-mini) for comparison.

Result: Individual LLMs show marginal or negative gains; reasoning models offer no significant improvement; multi-agent setups reveal unique model behaviors (e.g., Claude 3.7 Sonnet rarely changes annotations).

Conclusion: LLMs have limited effectiveness as direct replacements for human expert annotators in specialized domains, with multi-agent frameworks providing some insights but no major breakthroughs.

Abstract: Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate: whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators? To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others’ annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.

Amrita Singh, H. Suhan Karaca, Aditya Joshi, Hye-young Paik, Jiaojiao Jiang

Main category: cs.CL

TL;DR: Evaluation of 10 legal-specific LLMs vs. 7 general-purpose LLMs in contract understanding tasks shows legal-specific models outperform, especially in nuanced legal tasks.

DetailsMotivation: Address the lack of comprehensive evaluation of legal-specific LLMs in contract classification tasks.

Method: Evaluate 10 legal-specific LLMs and 7 general-purpose LLMs on three English contract understanding tasks.

Result: Legal-specific LLMs outperform general-purpose models, with Legal-BERT and Contracts-BERT setting new SOTAs on two tasks.

Conclusion: Legal-specific LLMs are superior for contract understanding, aiding future development of accurate systems.

Abstract: Despite advances in legal NLP, no comprehensive evaluation covering multiple legal-specific LLMs currently exists for contract classification tasks in contract understanding. To address this gap, we present an evaluation of 10 legal-specific LLMs on three English language contract understanding tasks and compare them with 7 general-purpose LLMs. The results show that legal-specific LLMs consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing general-purpose LLM. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract understanding. Our results provide a holistic evaluation of legal-specific LLMs and will facilitate the development of more accurate contract understanding systems.

[73] Large Language Models for Czech Aspect-Based Sentiment Analysis

Jakub Šmíd, Pavel Přibáň, Pavel Král

Main category: cs.CL

TL;DR: The paper evaluates 19 LLMs for Czech ABSA, finding domain-specific fine-tuned models outperform general LLMs in zero/few-shot settings, while fine-tuned LLMs achieve state-of-the-art results.

DetailsMotivation: To explore the unexplored capabilities of LLMs in Czech ABSA and compare their performance across different scenarios.

Method: Comprehensive evaluation of 19 LLMs in zero-shot, few-shot, and fine-tuning settings, analyzing factors like multilingualism, model size, and recency.

Result: Domain-specific fine-tuned models outperform general LLMs in zero/few-shot; fine-tuned LLMs achieve state-of-the-art results.

Conclusion: LLMs show promise for Czech ABSA, with fine-tuning being key for optimal performance, and future research should address aspect term prediction challenges.

Abstract: Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that aims to identify sentiment toward specific aspects of an entity. While large language models (LLMs) have shown strong performance in various natural language processing (NLP) tasks, their capabilities for Czech ABSA remain largely unexplored. In this work, we conduct a comprehensive evaluation of 19 LLMs of varying sizes and architectures on Czech ABSA, comparing their performance in zero-shot, few-shot, and fine-tuning scenarios. Our results show that small domain-specific models fine-tuned for ABSA outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results. We analyze how factors such as multilingualism, model size, and recency influence performance and present an error analysis highlighting key challenges, particularly in aspect term prediction. Our findings provide insights into the suitability of LLMs for Czech ABSA and offer guidance for future research in this area.

[74] Few-shot Cross-lingual Aspect-Based Sentiment Analysis with Sequence-to-Sequence Models

Jakub Šmíd, Pavel Přibáň, Pavel Král

Main category: cs.CL

TL;DR: Adding a few target language examples to training significantly improves cross-lingual ABSA performance, even surpassing monolingual baselines with 1,000 examples.

DetailsMotivation: Addressing the challenge of low-resource languages in ABSA by leveraging few-shot target language examples.

Method: Evaluated the impact of adding few-shot target language examples across four ABSA tasks, six languages, and two sequence-to-sequence models.

Result: Adding ten target language examples improves performance over zero-shot settings, and 1,000 examples can surpass monolingual baselines.

Conclusion: Few-shot target language examples are feasible and highly effective for improving cross-lingual ABSA in low-resource settings.

Abstract: Aspect-based sentiment analysis (ABSA) has received substantial attention in English, yet challenges remain for low-resource languages due to the scarcity of labelled data. Current cross-lingual ABSA approaches often rely on external translation tools and overlook the potential benefits of incorporating a small number of target language examples into training. In this paper, we evaluate the effect of adding few-shot target language examples to the training set across four ABSA tasks, six target languages, and two sequence-to-sequence models. We show that adding as few as ten target language examples significantly improves performance over zero-shot settings and achieves a similar effect to constrained decoding in reducing prediction errors. Furthermore, we demonstrate that combining 1,000 target language examples with English data can even surpass monolingual baselines. These findings offer practical insights for improving cross-lingual ABSA in low-resource and domain-specific settings, as obtaining ten high-quality annotated examples is both feasible and highly effective.
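
The core recipe (augmenting English training data with a handful of annotated target-language examples) is simple to express. A minimal sketch follows, with a hypothetical seq2seq linearization format; the paper's actual templates and data fields may differ.

```python
# Minimal sketch: mixing a few target-language examples into an English
# training set for seq2seq ABSA. Field names and templates are illustrative.
import random

def to_seq2seq(example):
    # Linearize (aspect, polarity) tuples into a target string.
    tuples = "; ".join(f"{a} is {p}" for a, p in example["labels"])
    return {"source": example["text"], "target": tuples}

english_train = [
    {"text": "The battery lasts forever.", "labels": [("battery", "positive")]},
    # ... thousands of English examples
]
czech_pool = [
    {"text": "Baterie vydrží věčnost.", "labels": [("baterie", "positive")]},
    # ... annotated target-language examples
]

K = 10  # the paper reports gains with as few as ten target-language examples
random.seed(0)
few_shot = random.sample(czech_pool, k=min(K, len(czech_pool)))
mixed_train = [to_seq2seq(ex) for ex in english_train + few_shot]
random.shuffle(mixed_train)
print(f"{len(mixed_train)} training instances, {len(few_shot)} in the target language")
```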

[75] Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity

Chen Cecilia Liu, Hiba Arnaout, Nils Kovačić, Dana Atzil-Slonim, Iryna Gurevych

Main category: cs.CL

TL;DR: The paper introduces CultureCare, a dataset for culturally sensitive emotional support, and evaluates LLMs’ adaptation strategies for this purpose.

DetailsMotivation: To address the gap in culturally sensitive emotional support from LLMs due to a lack of resources.

Method: Introduces CultureCare dataset, tests four adaptation strategies for LLMs, and evaluates using LLM judges, human annotators, and psychologists.

Result: Adapted LLMs outperform anonymous peer responses, and cultural role-play alone is insufficient. LLMs show potential in clinical training.

Conclusion: LLMs can be adapted for culturally sensitive support, with implications for improving cultural competence in therapy.

Abstract: Large language models (LLMs) show promise in offering emotional support and generating empathetic responses for individuals in distress, but their ability to deliver culturally sensitive support remains underexplored due to lack of resources. In this work, we introduce CultureCare, the first dataset designed for this task, spanning four cultures and including 1729 distress messages, 1523 cultural signals, and 1041 support strategies with fine-grained emotional and cultural annotations. Leveraging CultureCare, we (i) develop and test four adaptation strategies for guiding three state-of-the-art LLMs toward culturally sensitive responses; (ii) conduct comprehensive evaluations using LLM judges, in-culture human annotators, and clinical psychologists; (iii) show that adapted LLMs outperform anonymous online peer responses, and that simple cultural role-play is insufficient for cultural sensitivity; and (iv) explore the application of LLMs in clinical training, where experts highlight their potential in fostering cultural competence in future therapists.

[76] Challenges and opportunities in portraying emotion in generated sign language

John C. McDonald, Rosalee Wolfe, Fabrizio Nunnari

Main category: cs.CL

TL;DR: The paper proposes a two-parameter method for specifying emotional non-manual signals in signing avatars, using EASIER notation for more nuanced and consistent expressions.

DetailsMotivation: Emotional content in signing avatars lacks a standard method for specification, making it challenging to incorporate.

Method: An intuitive two-parameter representation is applied to the Paula signing avatar, using EASIER notation for textual control.

Result: The method shows promise for coherent and nuanced emotional expression in avatars and consistent linguistic annotations.

Conclusion: The two-parameter approach facilitates better emotional expression in signing avatars and improves annotation consistency.

Abstract: Non-manual signals in sign languages continue to be a challenge for signing avatars. More specifically, emotional content has been difficult to incorporate because of a lack of a standard method of specifying the avatar’s emotional state. This paper explores the application of an intuitive two-parameter representation for emotive non-manual signals to the Paula signing avatar that shows promise for facilitating the linguistic specification of emotional facial expressions in a more coherent manner than previous methods. Users can apply these parameters to control Paula’s emotional expressions through a textual representation called the EASIER notation. The representation can allow avatars to express more nuanced emotional states using two numerical parameters. It also has the potential to enable more consistent specification of emotional non-manual signals in linguistic annotations which drive signing avatars.

[77] GREP: A Multi-turn Evaluation Framework for Related Work Generation

Furkan Şahinuç, Subhabrata Dutta, Iryna Gurevych

Main category: cs.CL

TL;DR: The paper introduces GREP, a multi-turn evaluation framework for assessing the quality of AI-generated scientific writing, specifically related work sections, by integrating domain-specific criteria and expert preferences.

DetailsMotivation: To address the gap in evaluating AI-generated scientific writing, as conventional metrics and LLM-as-a-judge systems lack domain-specific understanding and expert preference discernment.

Method: Proposes GREP, a framework decomposing evaluation into fine-grained dimensions, using contrastive few-shot examples for contextual guidance, and offering two variants (proprietary and open-weight LLMs).

Result: GREP robustly assesses related work sections, correlates strongly with human expert assessment, and reveals shortcomings in state-of-the-art LLM outputs.

Conclusion: GREP provides a detailed, accessible evaluation tool for AI-generated scientific writing, highlighting current LLM limitations and potential for improvement.

Abstract: Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in LLMs show promising potential in reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional automatic metrics and LLM-as-a-judge systems are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single score, our framework decomposes the evaluation into fine-grained dimensions. This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the human expert assessment. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well.
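
To make the decomposed, contrastive-few-shot judging idea concrete, here is a minimal sketch of scoring one evaluation dimension with an LLM judge. The dimension names, prompt wording, calibration examples, and the `call_llm` stub are all assumptions for illustration, not GREP's actual prompts.

```python
# Minimal sketch of dimension-wise LLM-as-judge scoring with contrastive
# few-shot examples. Dimensions, prompts, and the LLM call are illustrative.

DIMENSIONS = ["coverage", "citation_accuracy", "coherence"]  # hypothetical

CONTRASTIVE_EXAMPLES = {
    "coverage": [
        ("Discusses all major baselines the paper compares against.", 5),
        ("Omits the two most closely related prior systems.", 2),
    ],
}

def build_judge_prompt(dimension, related_work, examples):
    shots = "\n".join(f'- "{text}" -> score {score}' for text, score in examples)
    return (
        f"Rate the RELATED WORK section on '{dimension}' from 1 to 5.\n"
        f"Calibration examples:\n{shots}\n\n"
        f"RELATED WORK:\n{related_work}\n\nScore (1-5):"
    )

def call_llm(prompt):
    raise NotImplementedError("plug in a proprietary or open-weight judge here")

def grep_style_scores(related_work):
    # Cardinal, per-dimension scores rather than a single ordinal judgment.
    scores = {}
    for dim in DIMENSIONS:
        prompt = build_judge_prompt(dim, related_work, CONTRASTIVE_EXAMPLES.get(dim, []))
        scores[dim] = int(call_llm(prompt).strip())
    return scores
```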

[78] Large Language Models for Subjective Language Understanding: A Survey

Changhao Song, Yazhou Zhang, Hui Gao, Ben Yao, Peng Zhang

Main category: cs.CL

TL;DR: A survey on applying large language models (LLMs) to subjective language tasks, covering definitions, challenges, methodologies, and future directions.

DetailsMotivation: To review recent advances in using LLMs for nuanced tasks like sentiment analysis, emotion recognition, and sarcasm detection, addressing the unique challenges of subjective language.

Method: Comprehensive review of LLM architectures and techniques, task-specific summaries, and comparative insights across eight subjective language tasks.

Result: Highlights the suitability of LLMs for modeling human-like judgments and identifies open issues like data limitations and ethical concerns.

Conclusion: The survey serves as a resource for researchers in affective computing and figurative language, suggesting future directions for unified models of subjectivity.

Abstract: Subjective language understanding refers to a broad set of natural language processing tasks where the goal is to interpret or generate content that conveys personal feelings, opinions, or figurative meanings rather than objective facts. With the advent of large language models (LLMs) such as ChatGPT, LLaMA, and others, there has been a paradigm shift in how we approach these inherently nuanced tasks. In this survey, we provide a comprehensive review of recent advances in applying LLMs to subjective language tasks, including sentiment analysis, emotion recognition, sarcasm detection, humor understanding, stance detection, metaphor interpretation, intent detection, and aesthetics assessment. We begin by clarifying the definition of subjective language from linguistic and cognitive perspectives, and we outline the unique challenges posed by subjective language (e.g. ambiguity, figurativeness, context dependence). We then survey the evolution of LLM architectures and techniques that particularly benefit subjectivity tasks, highlighting why LLMs are well-suited to model subtle human-like judgments. For each of the eight tasks, we summarize task definitions, key datasets, state-of-the-art LLM-based methods, and remaining challenges. We provide comparative insights, discussing commonalities and differences among tasks and how multi-task LLM approaches might yield unified models of subjectivity. Finally, we identify open issues such as data limitations, model bias, and ethical considerations, and suggest future research directions. We hope this survey will serve as a valuable resource for researchers and practitioners interested in the intersection of affective computing, figurative language processing, and large-scale language models.

[79] Toward Machine Interpreting: Lessons from Human Interpreting Studies

Matthias Sperber, Maureen de Seyssel, Jiajun Bao, Matthias Paulik

Main category: cs.CL

TL;DR: The paper explores how human interpreting principles can enhance speech translation systems, aiming to bridge the usability gap between current systems and human-like adaptability.

DetailsMotivation: Current speech translation systems lack adaptability compared to human interpreters. Understanding human interpreting can improve system practicality.

Method: Analyzes human interpreting literature from a machine translation perspective, focusing on operational and qualitative aspects.

Result: Identifies potential to integrate human interpreting principles into speech translation using modern modeling techniques.

Conclusion: The findings aim to inspire progress toward more adaptive, human-like machine interpreting systems.

Abstract: Current speech translation systems, while having achieved impressive accuracies, are rather static in their behavior and do not adapt to real-world situations in ways human interpreters do. In order to improve their practical usefulness and enable interpreting-like experiences, a precise understanding of the nature of human interpreting is crucial. To this end, we discuss human interpreting literature from the perspective of the machine translation field, while considering both operational and qualitative aspects. We identify implications for the development of speech translation systems and argue that there is great potential to adopt many human interpreting principles using recent modeling techniques. We hope that our findings provide inspiration for closing the perceived usability gap, and can motivate progress toward true machine interpreting.

[80] Understanding Syntactic Generalization in Structure-inducing Language Models

David Arps, Hassan Sajjad, Laura Kallmeyer

Main category: cs.CL

TL;DR: The paper evaluates three Structure-inducing Language Models (SiLMs) on syntactic representation, grammaticality judgment, and training dynamics, finding no single model dominates but highlighting GPST’s consistency and performance on long-distance dependencies.

DetailsMotivation: To address gaps and lack of comparability in evaluating SiLMs by systematically comparing three architectures on multiple metrics.

Method: Comparative study of Structformer, UDGN, and GPST using natural language and synthetic data, focusing on syntactic representations, grammaticality judgments, and training dynamics.

Result: No model dominates all metrics; GPST performs most consistently, especially on long-distance dependencies. Small models on synthetic data are useful for basic evaluations.

Conclusion: GPST stands out for consistency, and synthetic data aids evaluation, but no single SiLM excels universally.

Abstract: Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.
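
The synthetic side of this evaluation is easy to reproduce: bracketing expressions are essentially Dyck-language strings. A minimal generator sketch follows; the bracket vocabulary, depth limit, and length distribution are assumptions, not the paper's exact setup.

```python
# Minimal sketch: generating nested bracketing expressions (Dyck-style
# strings) for probing long-distance dependencies. Parameters are illustrative.
import random

PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]

def gen_dyck(max_depth=6, p_open=0.45, rng=random.Random(0)):
    tokens, stack = [], []
    while True:
        if stack and (len(stack) >= max_depth or rng.random() > p_open):
            tokens.append(stack.pop())           # close the most recent bracket
        else:
            left, right = rng.choice(PAIRS)
            tokens.append(left)
            stack.append(right)
        if not stack:                            # well-formed string completed
            return " ".join(tokens)

corpus = [gen_dyck() for _ in range(5)]
for s in corpus:
    print(s)
```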

[81] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu

Main category: cs.CL

TL;DR: ASearcher is an open-source project for large-scale RL training of search agents, improving scalability, efficiency, and data quality in search intelligence tasks.

DetailsMotivation: Open-source agents lack expert-level Search Intelligence, struggling with ambiguous queries, precise searches, and thorough exploration due to scalability and efficiency limitations.

Method: ASearcher uses scalable fully asynchronous RL training and a prompt-based LLM agent to synthesize high-quality QA datasets for training.

Result: The QwQ-32B agent shows significant improvements (46.7% and 20.8% Avg@4 gains on benchmarks) and handles extreme long-horizon searches (40+ turns, 150k+ tokens).

Conclusion: ASearcher outperforms existing open-source 32B agents, demonstrating robust search capabilities without external LLMs, and is openly available.

Abstract: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher.

[82] The Medical Metaphors Corpus (MCC)

Anna Sofia Lippolis, Andrea Giovanni Nuzzolese, Aldo Gangemi

Main category: cs.CL

TL;DR: The paper introduces the Medical Metaphors Corpus (MCC), a dataset of 792 annotated scientific metaphors in medical/biological domains, addressing a gap in domain-specific metaphor detection.

DetailsMotivation: Existing metaphor detection tools focus on general-domain text, lacking resources for scientific discourse, which MCC aims to fill.

Method: MCC aggregates metaphors from diverse sources (literature, news, social media, crowdsourcing) with binary/graded annotations, including source-target mappings and metaphoricity scores.

Result: State-of-the-art language models show modest performance on scientific metaphor detection, indicating room for improvement.

Conclusion: MCC supports research in metaphor detection, generation systems, and patient communication tools, advancing domain-specific figurative language understanding.

Abstract: Metaphor is a fundamental cognitive mechanism that shapes scientific understanding, enabling the communication of complex concepts while potentially constraining paradigmatic thinking. Despite the prevalence of figurative language in scientific discourse, existing metaphor detection resources primarily focus on general-domain text, leaving a critical gap for domain-specific applications. In this paper, we present the Medical Metaphors Corpus (MCC), a comprehensive dataset of 792 annotated scientific conceptual metaphors spanning medical and biological domains. MCC aggregates metaphorical expressions from diverse sources including peer-reviewed literature, news media, social media discourse, and crowdsourced contributions, providing both binary and graded metaphoricity judgments validated through human annotation. Each instance includes source-target conceptual mappings and perceived metaphoricity scores on a 0-7 scale, establishing the first annotated resource for computational scientific metaphor research. Our evaluation demonstrates that state-of-the-art language models achieve modest performance on scientific metaphor detection, revealing substantial room for improvement in domain-specific figurative language understanding. MCC enables multiple research applications including metaphor detection benchmarking, quality-aware generation systems, and patient-centered communication tools.

[83] WideSearch: Benchmarking Agentic Broad Info-Seeking

Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang

Main category: cs.CL

TL;DR: The paper introduces WideSearch, a benchmark to evaluate LLM-powered search agents on large-scale information collection tasks, revealing their current limitations.

DetailsMotivation: To address the lack of benchmarks for assessing the reliability and completeness of LLM-powered search agents in wide-context information seeking.

Method: Developed WideSearch with 200 manually curated questions across 15 domains, featuring a five-stage quality control pipeline. Benchmarked 10 state-of-the-art search systems.

Result: Most systems achieved near 0% success rates, with the best at 5%, while human testers reached near 100%.

Conclusion: Current search agents are deficient in large-scale information seeking, highlighting the need for further research and development.

Abstract: From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such “wide-context” collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/

[84] Progressive Depth Up-scaling via Optimal Transport

Mingzi Cao, Xi Wang, Nikolaos Aletras

Main category: cs.CL

TL;DR: OpT-DeUS uses Optimal Transport to align and fuse Transformer blocks for efficient depth up-scaling in LLMs, outperforming existing methods.

DetailsMotivation: Addressing the inefficiency and misalignment in existing depth up-scaling methods for LLMs due to neuron permutation differences.

Method: Proposes OpT-DeUS, which aligns and fuses Transformer blocks in adjacent layers using Optimal Transport to create new layers.

Result: Achieves better performance and training efficiency in continual pre-training and fine-tuning across model sizes.

Conclusion: Inserting new layers closer to the top improves training efficiency and performance, validating OpT-DeUS’s effectiveness.

Abstract: Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT for new layer creation, to mitigate neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and offers improved training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. To further evaluate the impact of interpolation positions, our extensive analysis shows that inserting new layers closer to the top results in higher training efficiency due to shorter back-propagation time while obtaining additional performance gains.
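
A minimal sketch of the core idea: compute a transport plan between the neurons of two adjacent layers with Sinkhorn iterations, softly permute one layer into the other's neuron order, and average to form the new layer. The cost function and fusion rule below are simplified assumptions, not the paper's exact procedure.

```python
# Minimal sketch: OT-based alignment of two weight matrices before fusing
# them into a new layer. Simplified relative to the paper's full method.
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropic-regularized OT plan between two uniform distributions."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m
    u, v = a.copy(), b.copy()
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)  # rows index layer-A neurons

def fuse_layers(W_a, W_b):
    # Cost: distance between neuron weight vectors (rows) of the two layers.
    cost = np.linalg.norm(W_a[:, None, :] - W_b[None, :, :], axis=-1)
    plan = sinkhorn(cost)
    perm = plan / plan.sum(axis=1, keepdims=True)  # soft alignment, row-stochastic
    W_b_aligned = perm @ W_b                       # B's neurons in A's order
    return 0.5 * (W_a + W_b_aligned)               # fused weights for the new layer

rng = np.random.default_rng(0)
W_layer4, W_layer5 = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
W_new = fuse_layers(W_layer4, W_layer5)
print(W_new.shape)
```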

[85] 9th Workshop on Sign Language Translation and Avatar Technologies (SLTAT 2025)

Fabrizio Nunnari, Cristina Luna Jiménez, Rosalee Wolfe, John C. McDonald, Michael Filhol, Eleni Efthimiou, Evita Fotinea, Thomas Hanke

Main category: cs.CL

TL;DR: The SLTAT workshop focuses on improving deaf/human communication through non-invasive technologies, including sign language translation, avatar tech, and related research areas.

DetailsMotivation: To advance deaf/human communication by integrating sign language translation, avatar technology, and other interdisciplinary research.

Method: Hosted by IVA, the workshop facilitates collaboration between researchers in avatar tech and sign language translation, covering recognition, data analysis, tools, ethics, and more.

Result: The 2025 edition attracted diverse contributions, expanding beyond avatar tech to include recognition, data, and ethical considerations.

Conclusion: SLTAT fosters interdisciplinary progress in deaf communication tech, with growing contributions across multiple research areas.

Abstract: The Sign Language Translation and Avatar Technology (SLTAT) workshops continue a series of gatherings to share recent advances in improving deaf / human communication through non-invasive means. This 2025 edition, the 9th since its first appearance in 2011, is hosted by the International Conference on Intelligent Virtual Agents (IVA), giving the opportunity for contamination between two research communities, using digital humans as either virtual interpreters or as interactive conversational agents. As presented in this summary paper, SLTAT sees contributions beyond avatar technologies, with a consistent number of submissions on sign language recognition, and other work on data collection, data analysis, tools, ethics, usability, and affective computing.

[86] Dual Information Speech Language Models for Emotional Conversations

Chun Wang, Chenyang Liu, Wenze Xu, Weihong Deng

Main category: cs.CL

TL;DR: The paper proposes a method to improve speech-language models (SLMs) by disentangling paralinguistic and linguistic information using heterogeneous adapters and a weakly supervised training strategy, achieving competitive performance in emotional conversation tasks.

DetailsMotivation: Text-based LLMs overlook paralinguistic cues, while existing SLMs struggle with capturing these cues and maintaining context. The paper aims to address these limitations.

Method: The approach uses two heterogeneous adapters and a weakly supervised training strategy to disentangle information and avoid task-specific vectors. Only adapters are trained on common datasets.

Result: Experiments show competitive performance in emotional conversation tasks, demonstrating effective integration of paralinguistic and linguistic information.

Conclusion: The proposed method enhances SLMs by improving their ability to interpret speech and maintain contextual understanding, offering a parameter- and data-efficient solution.

Abstract: Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model’s ability to effectively integrate both paralinguistic and linguistic information within contextual settings.

[87] Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?

Lukas Gehring, Benjamin Paaßen

Main category: cs.CL

TL;DR: The paper benchmarks LLM-generated text detectors in education, introduces the GEDE dataset, and highlights challenges in detecting intermediate student contributions.

DetailsMotivation: Address the need for reliable detection of LLM-generated text in education to uphold academic integrity.

Method: Benchmarks state-of-the-art detectors using the GEDE dataset, which includes essays of varying student contribution levels.

Result: Most detectors struggle with intermediate contribution levels and produce false positives, posing risks in educational settings.

Conclusion: The study underscores the limitations of current detectors and provides a dataset for further research.

Abstract: Recent advancements in Large Language Models (LLMs) and their increased accessibility have made it easier than ever for students to automatically generate texts, posing new challenges for educational institutions. To enforce norms of academic integrity and ensure students’ learning, learning analytics methods to automatically detect LLM-generated text appear increasingly appealing. This paper benchmarks the performance of different state-of-the-art detectors in educational contexts, introducing a novel dataset, called Generative Essay Detection in Education (GEDE), containing over 900 student-written essays and over 12,500 LLM-generated essays from various domains. To capture the diversity of LLM usage practices in generating text, we propose the concept of contribution levels, representing students’ contribution to a given assignment. These levels range from purely human-written texts, to slightly LLM-improved versions, to fully LLM-generated texts, and finally to active attacks on the detector by “humanizing” generated texts. We show that most detectors struggle to accurately classify texts of intermediate student contribution levels, like LLM-improved human-written texts. Detectors are particularly likely to produce false positives, which is problematic in educational settings where false suspicions can severely impact students’ lives. Our dataset, code, and additional supplementary materials are publicly available at https://github.com/lukasgehring/Assessing-LLM-Text-Detection-in-Educational-Contexts.
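
A minimal sketch of the per-level evaluation the paper argues for, tracking the false positive rate on human-written essays separately from detection rates on each LLM-contribution level. The level names and the `detector` stub are illustrative assumptions, not the paper's detectors or labels.

```python
# Minimal sketch: evaluating an LLM-text detector per contribution level,
# with an explicit false-positive rate on purely human-written essays.
# Level names and the detector stub are illustrative assumptions.

LEVELS = ["human", "llm_improved", "llm_generated", "humanized_llm"]

def detector(text):
    """Stub: return True if the detector flags the text as LLM-generated."""
    return "as an ai language model" in text.lower()  # placeholder heuristic

def evaluate(essays_by_level):
    for level, essays in essays_by_level.items():
        flagged = sum(detector(e) for e in essays)
        rate = flagged / len(essays)
        if level == "human":
            # Flags on human essays are false positives: the costly error
            # in educational settings.
            print(f"{level:>15}: false positive rate = {rate:.2%}")
        else:
            print(f"{level:>15}: detection rate = {rate:.2%}")

evaluate({
    "human": ["I wrote this essay about my summer vacation myself."],
    "llm_generated": ["As an AI language model, I will structure this essay..."],
})
```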

[88] Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0

Robin Huo, Ewan Dunbar

Main category: cs.CL

TL;DR: The study compares HuBERT and wav2vec 2.0, focusing on how training iteration (not objective) affects linguistic information encoding in self-supervised speech models.

DetailsMotivation: To understand how model architecture impacts linguistic information in self-supervised speech representations.

Method: Comparison of HuBERT and wav2vec 2.0, analyzing training objective and iterative pseudo-label refinement.

Result: Training iteration, not objective, explains differences in linguistic information encoding (word, phoneme, speaker identity).

Conclusion: Future work should explore why iterative refinement effectively encodes linguistic information in self-supervised speech models.

Abstract: Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
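
The probing methodology here is standard canonical correlation analysis between hidden states and label indicators. A minimal sketch with scikit-learn follows; the random stand-in data, shapes, and phoneme-label framing are assumptions — in practice the inputs would be frame-level model states aligned with annotations.

```python
# Minimal sketch: canonical correlation between hidden representations and
# one-hot linguistic labels, as in CCA-style probing. Data is a random
# stand-in for frame-level model states and aligned phoneme labels.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_frames, hidden_dim, n_phonemes = 2000, 768, 40

H = rng.normal(size=(n_frames, hidden_dim))          # hidden states, one per frame
labels = rng.integers(0, n_phonemes, size=n_frames)  # aligned phoneme ids
Y = np.eye(n_phonemes)[labels]                       # one-hot indicator matrix

cca = CCA(n_components=10, max_iter=1000)
H_c, Y_c = cca.fit_transform(H, Y)

# Canonical correlations: per-component correlation of the projected views.
corrs = [np.corrcoef(H_c[:, i], Y_c[:, i])[0, 1] for i in range(H_c.shape[1])]
print("mean canonical correlation:", float(np.mean(corrs)))
```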

[89] Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks

Jakub Šmíd, Pavel Přibáň, Ondřej Pražák, Pavel Král

Main category: cs.CL

TL;DR: A new Czech dataset for aspect-based sentiment analysis (ABSA) is introduced, featuring 3.1K annotated restaurant reviews and 24M unannotated ones. It supports advanced tasks like target-aspect-category detection and aligns with SemEval-2016 format for cross-lingual use.

DetailsMotivation: To address the lack of a Czech dataset for complex ABSA tasks and enable cross-lingual comparisons.

Method: The dataset was manually annotated by two trained annotators, achieving 90% inter-annotator agreement. It includes 3.1K annotated reviews and 24M unannotated ones for unsupervised learning.

Result: Robust baseline results were achieved using Transformer-based models, with detailed error analysis provided.

Conclusion: The dataset and code are freely available for non-commercial research, facilitating advanced ABSA tasks and cross-lingual studies.

Abstract: In this paper, we introduce a novel Czech dataset for aspect-based sentiment analysis (ABSA), which consists of 3.1K manually annotated reviews from the restaurant domain. The dataset is built upon the older Czech dataset, which contained only separate labels for the basic ABSA tasks such as aspect term extraction or aspect polarity detection. Unlike its predecessor, our new dataset is specifically designed for more complex tasks, e.g. target-aspect-category detection. These advanced tasks require a unified annotation format, seamlessly linking sentiment elements (labels) together. Our dataset follows the format of the well-known SemEval-2016 datasets. This design choice allows effortless application and evaluation in cross-lingual scenarios, ultimately fostering cross-language comparisons with equivalent counterpart datasets in other languages. The annotation process engaged two trained annotators, yielding an impressive inter-annotator agreement rate of approximately 90%. Additionally, we provide 24M reviews without annotations suitable for unsupervised learning. We present robust monolingual baseline results achieved with various Transformer-based models and insightful error analysis to supplement our contributions. Our code and dataset are freely available for non-commercial research purposes.

[90] Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models

Wenze Xu, Chun Wang, Jiazhen Yu, Sheng Chen, Liang Gao, Weihong Deng

Main category: cs.CL

TL;DR: OTReg, a method using optimal transport for speech-text alignment, improves SLM generalization by mitigating the modality gap between speech and text.

DetailsMotivation: SLMs struggle to generalize across datasets due to the modality gap between speech and text representations, raising concerns about their intended text-like processing of speech.

Method: OTReg formulates speech-text alignment as an optimal transport problem, deriving a regularization loss to improve SLM training without extra labels or parameters.

Result: OTReg enhances speech-text alignment, reduces the modality gap, and improves SLM generalization in multilingual ASR experiments.

Conclusion: OTReg effectively addresses the modality gap in SLMs, improving their generalization across diverse datasets.

Abstract: Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
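
The regularizer can be sketched as a differentiable Sinkhorn computation: derive a transport plan between speech and transcript embeddings, then penalize the transport cost so the speech encoder is pulled toward the text space. The shapes, cost choice, and hyperparameters below are assumptions, not the paper's exact formulation.

```python
# Minimal sketch: an optimal-transport alignment loss between speech and
# transcript embeddings (simplified relative to the paper's formulation).
import torch

def sinkhorn_plan(cost, reg=0.1, n_iters=50):
    """Entropic OT plan between uniform marginals, differentiable in `cost`."""
    n, m = cost.shape
    log_K = -cost / reg
    log_u = torch.zeros(n, device=cost.device)
    log_v = torch.zeros(m, device=cost.device)
    log_a = -torch.log(torch.tensor(float(n)))
    log_b = -torch.log(torch.tensor(float(m)))
    for _ in range(n_iters):
        log_u = log_a - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_K + log_u[:, None], dim=0)
    return torch.exp(log_u[:, None] + log_K + log_v[None, :])

def ot_alignment_loss(speech_emb, text_emb):
    # Cost: squared distance between every speech frame and every text token.
    cost = torch.cdist(speech_emb, text_emb) ** 2
    plan = sinkhorn_plan(cost)
    return (plan * cost).sum()  # expected transport cost under the plan

speech = torch.randn(120, 256, requires_grad=True)  # frames x dim (assumed shapes)
text = torch.randn(30, 256)                         # tokens x dim
loss = ot_alignment_loss(speech, text)
loss.backward()  # gradients flow into the speech encoder during SLM training
print(float(loss))
```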

[91] Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Tianyi Zhou, Johanne Medina, Sanjay Chawla

Main category: cs.CL

TL;DR: The paper investigates how in-context information affects LLM behavior and proposes a method using token-level uncertainty to predict response reliability.

DetailsMotivation: LLMs often generate incorrect but fluent content (confabulation), posing risks in multi-turn or agentic applications.

Method: Proposes reliability estimation using aleatoric and epistemic uncertainty from output logits to aggregate hidden states for reliability prediction.

Result: Correct context improves accuracy and confidence, while misleading context leads to confidently incorrect responses. The method improves unreliable output detection.

Conclusion: Highlights limitations of direct uncertainty signals and advocates for uncertainty-guided probing for reliable generation.

Abstract: Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
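
The uncertainty decomposition used to pick salient tokens can be sketched as follows: with several stochastic forward passes per token (e.g., MC dropout), total predictive entropy splits into an aleatoric term (mean per-sample entropy) and an epistemic term (the gap, i.e., mutual information). The shapes, sampling scheme, and top-k aggregation below are illustrative assumptions.

```python
# Minimal sketch: token-level aleatoric/epistemic uncertainty from logits,
# then aggregating hidden states of the most uncertain tokens.
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

rng = np.random.default_rng(0)
S, T, V, D = 8, 32, 1000, 512            # samples, tokens, vocab, hidden dim
logits = rng.normal(size=(S, T, V))      # S stochastic passes over one response
hidden = rng.normal(size=(T, D))         # per-token hidden states

probs = softmax(logits)                  # (S, T, V)
total = entropy(probs.mean(axis=0))      # (T,) entropy of the mean distribution
aleatoric = entropy(probs).mean(axis=0)  # (T,) mean per-sample entropy
epistemic = total - aleatoric            # (T,) mutual information >= 0

k = 5
salient = np.argsort(epistemic)[-k:]     # tokens the model is least sure about
reliability_feature = hidden[salient].mean(axis=0)  # input to a reliability probe
print(salient, reliability_feature.shape)
```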

[92] Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective

Jun Wang, Zaifu Zhan, Qixin Zhang, Mingquan Lin, Meijia Song, Rui Zhang

Main category: cs.CL

TL;DR: Dual-Div, a diversity-enhanced framework for demonstration selection in biomedical ICL, outperforms baselines by optimizing both representativeness and diversity in example selection.

DetailsMotivation: Existing approaches for in-context learning (ICL) in biomedical NLP prioritize representativeness over diversity in example selection, limiting performance.

Method: Dual-Div uses a two-stage retrieval and ranking process to select diverse and relevant examples for ICL prompts.

Result: Dual-Div achieves up to 5% higher macro-F1 scores on biomedical NLP tasks and shows robustness to prompt variations and class imbalance.

Conclusion: Diversity in initial retrieval is more critical than ranking optimization, and limiting demonstrations to 3-5 examples maximizes efficiency.

Abstract: Recent progress in large language models (LLMs) has leveraged their in-context learning (ICL) abilities to enable quick adaptation to unseen biomedical NLP tasks. By incorporating only a few input-output examples into prompts, LLMs can rapidly perform these new tasks. While the impact of these demonstrations on LLM performance has been extensively studied, most existing approaches prioritize representativeness over diversity when selecting examples from large corpora. To address this gap, we propose Dual-Div, a diversity-enhanced data-efficient framework for demonstration selection in biomedical ICL. Dual-Div employs a two-stage retrieval and ranking process: First, it identifies a limited set of candidate examples from a corpus by optimizing both representativeness and diversity (with optional annotation for unlabeled data). Second, it ranks these candidates against test queries to select the most relevant and non-redundant demonstrations. Evaluated on three biomedical NLP tasks (named entity recognition (NER), relation extraction (RE), and text classification (TC)) using LLaMA 3.1 and Qwen 2.5 for inference, along with three retrievers (BGE-Large, BMRetriever, MedCPT), Dual-Div consistently outperforms baselines-achieving up to 5% higher macro-F1 scores-while demonstrating robustness to prompt permutations and class imbalance. Our findings establish that diversity in initial retrieval is more critical than ranking-stage optimization, and limiting demonstrations to 3-5 examples maximizes performance efficiency.
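
The first stage — trading representativeness off against diversity when shortlisting demonstrations — can be sketched as greedy maximization of a submodular objective. The facility-location-plus-redundancy-penalty form below is an illustrative stand-in for Dual-Div's exact objective.

```python
# Minimal sketch: greedy selection of demonstrations that balances
# representativeness (facility location) and diversity (redundancy penalty).
import numpy as np

def greedy_select(emb, k, lam=0.5):
    sim = emb @ emb.T                    # cosine sims if rows are L2-normalized
    n = len(emb)
    selected, covered = [], np.zeros(n)  # covered[i] = best sim of i to the set
    for _ in range(k):
        best_gain, best_j = -np.inf, -1
        for j in range(n):
            if j in selected:
                continue
            coverage_gain = np.maximum(covered, sim[j]).sum() - covered.sum()
            redundancy = sim[j, selected].max() if selected else 0.0
            gain = coverage_gain - lam * redundancy  # representativeness - overlap
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        covered = np.maximum(covered, sim[best_j])
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(greedy_select(emb, k=5))  # indices of 5 diverse, representative examples
```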

[93] REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation

Wentao Jiang, Xiang Feng, Zengmao Wang, Yong Luo, Pingbo Xu, Zhe Chen, Bo Du, Jing Zhang

Main category: cs.CL

TL;DR: REX-RAG is a novel framework combining RL and RAG to improve LLM reasoning by escaping unproductive paths and correcting policy biases, achieving significant performance gains.

DetailsMotivation: Addressing the challenge of LLMs getting stuck in unproductive reasoning paths (dead ends) during policy-driven trajectory sampling, which hampers exploration and policy optimization.

Method: Proposes REX-RAG with a Mixed Sampling Strategy (probe sampling + exploratory prompts) and a Policy Correction Mechanism (importance sampling) to mitigate distribution shifts.

Result: Achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over baselines across seven benchmarks.

Conclusion: REX-RAG effectively enhances LLM reasoning by exploring alternative paths and correcting policy biases, demonstrating competitive results.

Abstract: Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as “dead ends”, committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG.
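
The policy-correction mechanism reduces to standard importance sampling: trajectories drawn from the mixed proposal are reweighted by the ratio of policy to proposal likelihoods before the gradient step. A minimal torch sketch with assumed per-trajectory log-probabilities and rewards (the values and the clipping scheme are illustrative):

```python
# Minimal sketch: importance-sampling correction for trajectories drawn from
# a mixed proposal distribution rather than the current policy.
import torch

def corrected_pg_loss(logp_policy, logp_proposal, rewards, clip=10.0):
    # w_i = pi_theta(tau_i) / q_mix(tau_i), detached so only the policy term
    # carries gradients; clipping keeps variance under control.
    log_w = (logp_policy - logp_proposal).detach()
    w = torch.exp(log_w).clamp(max=clip)
    return -(w * rewards * logp_policy).mean()

logp_policy = torch.tensor([-12.0, -8.5, -15.2], requires_grad=True)
logp_proposal = torch.tensor([-11.0, -9.0, -14.0])  # mixed/probe sampler
rewards = torch.tensor([1.0, 0.0, 1.0])

loss = corrected_pg_loss(logp_policy, logp_proposal, rewards)
loss.backward()
print(float(loss), logp_policy.grad)
```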

[94] LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo

Mandira Sawkar, Samay U. Shetty, Deepak Pandita, Tharindu Cyril Weerasooriya, Christopher M. Homan

Main category: cs.CL

TL;DR: The paper extends DisCo to better model annotator disagreement by incorporating metadata, enhancing inputs, and modifying loss functions, showing improved performance in soft and perspectivist evaluations.

DetailsMotivation: To improve modeling of annotator disagreement in labeled datasets by leveraging annotator metadata and refining the DisCo framework.

Method: Extends DisCo by adding annotator metadata, enhancing input representations, and modifying loss functions to better capture disagreement patterns.

Result: Substantial improvements in soft and perspectivist evaluation metrics across three datasets, with detailed error and calibration analyses.

Conclusion: Disagreement-aware modeling is valuable, and system components interact complexly with human-annotated data, offering insights for future work.

Abstract: The Learning With Disagreements (LeWiDi) 2025 shared task is to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, modeling annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend the DisCo by incorporating annotator metadata, enhancing input representations, and modifying the loss functions to capture disagreement patterns better. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth error and calibration analyses, highlighting the conditions under which improvements occur. Our findings underscore the value of disagreement-aware modeling and offer insights into how system components interact with the complexity of human-annotated data.

[95] Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

Bangsheng Tang, Carl Chengyan Fu, Fei Kou, Grigory Sizov, Haoci Zhang, Jason Park, Jiawen Liu, Jie You, Qirui Yang, Sachin Mehta, Shengyong Cai, Xiaodong Wang, Xingyu Liu, Yunlu Li, Yanjun Zhou, Wei Wei, Zhiwei Zhao, Zixi Qi, Adolfo Victoria, Aya Ibrahim, Bram Wasti, Changkyu Kim, Daniel Haziza, Fei Sun, Giancarlo Delfin, Emily Guo, Jialin Ouyang, Jaewon Lee, Jianyu Huang, Jeremy Reizenstein, Lu Fang, Quinn Zhu, Ria Verma, Vlad Mihailescu, Xingwen Guo, Yan Cui, Ye Hu, Yejin Lee

Main category: cs.CL

TL;DR: The paper details optimization techniques for EAGLE-based speculative decoding, achieving faster inference speeds for Llama models in production.

DetailsMotivation: To address engineering challenges in scaling speculative decoding for production environments, particularly for GPU implementations.

Method: Implemented training and inference optimization techniques for EAGLE-based speculative decoding.

Result: Achieved state-of-the-art inference latency (4 ms per token) and speed-ups of 1.4x-2.0x for large batch sizes.

Conclusion: The optimizations enable efficient production-scale deployment of speculative decoding for Llama models.

Abstract: Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.
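
At its core, speculative decoding lets a small draft model propose several tokens that the large model then verifies. A greedy-verification sketch follows; real systems such as the EAGLE setup described here use tree attention and probabilistic acceptance, and the two model stubs below are placeholders, not real models.

```python
# Minimal sketch: greedy speculative decoding. A draft model proposes k tokens;
# the target model verifies them and keeps the longest matching prefix.

def draft_propose(prefix, k):
    """Stub: cheap draft model returns k proposed next tokens."""
    return [(prefix[-1] + i + 1) % 50000 for i in range(k)]

def target_greedy(prefix):
    """Stub: expensive target model's greedy next token for a prefix."""
    return (sum(prefix) * 2654435761) % 50000  # deterministic placeholder

def speculative_step(prefix, k=4):
    proposal = draft_propose(prefix, k)
    accepted = []
    for tok in proposal:
        expected = target_greedy(prefix + accepted)  # one batched pass in practice
        if tok != expected:
            accepted.append(expected)  # replace first mismatch with target's token
            break
        accepted.append(tok)
    else:
        accepted.append(target_greedy(prefix + accepted))  # bonus token
    return prefix + accepted

tokens = [1, 2, 3]
for _ in range(3):
    tokens = speculative_step(tokens)
print(tokens)  # up to k+1 tokens gained per verification round
```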

[96] Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

Kyle Moore, Jesse Roberts, Daryl Watson

Main category: cs.CL

TL;DR: The paper evaluates inference-time uncertainty measures in large language models (LLMs) to assess alignment with human uncertainty and traditional calibration, finding strong alignment in some metrics despite mismatches with human preferences.

DetailsMotivation: To improve LLM-user interaction by ensuring model uncertainty aligns with human uncertainty, enhancing trust and control.

Method: Evaluates various inference-time uncertainty measures using established and novel metrics, comparing them to human group-level uncertainty and traditional calibration.

Result: Several measures align strongly with human uncertainty, showing moderate to strong model calibration, even when not aligning with human answer preferences.

Conclusion: Certain uncertainty metrics effectively align with human uncertainty and demonstrate calibration, suggesting practical utility for improving LLM-user experience.

Abstract: There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, even despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.

[97] SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling

Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Shikun Zhang, Wei Ye

Main category: cs.CL

TL;DR: SAEMark is a post-hoc multi-bit watermarking framework for LLM-generated text, enabling content attribution without compromising quality or requiring model access.

DetailsMotivation: Existing watermarking methods degrade text quality, need white-box access, and exclude multilingual or API-based models. SAEMark addresses these limitations.

Method: Uses inference-time, feature-based rejection sampling to embed messages without altering model logits or training. It selects outputs with feature statistics matching key-derived targets.

Result: Achieves 99.7% F1 on English and strong multi-bit detection accuracy across 4 datasets, preserving text quality.

Conclusion: SAEMark offers a scalable, language-agnostic watermarking solution for closed-source LLMs, enabling robust content attribution.

Abstract: Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality, require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality through sampling LLM outputs instead of modifying. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework’s effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark’s consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
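
The embedding mechanism can be sketched as plain rejection sampling: derive a target feature statistic from the secret key and message bits, sample candidate continuations, and keep the one whose features land closest to the target. The `feature_stat` function below is a trivial stand-in for the paper's SAE features, and `derive_target` is a hypothetical key-to-target mapping.

```python
# Minimal sketch: post-hoc multi-bit watermarking via feature-based rejection
# sampling. Feature extractor and target derivation are illustrative stand-ins.
import hashlib

def feature_stat(text):
    """Stub feature: mean word length (the paper uses SAE feature statistics)."""
    words = text.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def derive_target(key, message_bits):
    """Map key + message to a target value for the feature statistic."""
    digest = hashlib.sha256(f"{key}:{message_bits}".encode()).digest()
    return 3.5 + (digest[0] / 255.0) * 2.0  # hypothetical target in [3.5, 5.5]

def watermark(candidates, key, message_bits):
    target = derive_target(key, message_bits)
    # Keep the candidate whose feature statistic lands closest to the target.
    return min(candidates, key=lambda c: abs(feature_stat(c) - target))

candidates = [
    "The committee approved the proposal after a short discussion.",
    "Members of the committee agreed to move forward immediately.",
    "It was approved fast.",
]
print(watermark(candidates, key="secret", message_bits="1011"))
```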

[98] Capabilities of GPT-5 on Multimodal Medical Reasoning

Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang

Main category: cs.CL

TL;DR: GPT-5 demonstrates superior zero-shot multimodal reasoning in medical tasks, outperforming GPT-4o and human experts in accuracy and understanding.

DetailsMotivation: To evaluate GPT-5's capability as a generalist multimodal reasoner for medical decision support, integrating diverse data sources like text and images.

Method: Systematic evaluation of GPT-5 and variants on standardized medical QA benchmarks (MedQA, MedXpertQA, MMLU, USMLE, VQA-RAD) using zero-shot chain-of-thought reasoning.

Result: GPT-5 achieves state-of-the-art accuracy, surpassing GPT-4o and human experts, with notable gains in multimodal reasoning (e.g., +29.62% reasoning improvement on MedXpertQA).

Conclusion: GPT-5’s performance suggests its potential to enhance clinical decision-support systems, moving from human-comparable to above-expert levels in controlled benchmarks.

Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.

[99] Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan

Main category: cs.CL

TL;DR: PsyCrisis-Bench is a reference-free benchmark for evaluating LLM safety alignment in mental health dialogues, using expert-defined principles and a prompt-based LLM-as-Judge approach.

DetailsMotivation: The challenge of evaluating LLM responses in high-risk mental health dialogues due to missing gold-standard answers and ethical sensitivity.

Method: A prompt-based LLM-as-Judge approach with expert-defined reasoning chains and binary point-wise scoring across safety dimensions.

Result: Achieves highest agreement with expert assessments and produces interpretable evaluation rationales.

Conclusion: PsyCrisis-Bench and its dataset are publicly available to aid further research in LLM safety alignment for mental health.

Abstract: Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
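
To make the scoring scheme concrete, here is a minimal sketch of prompt-based binary point-wise judging, assuming a `judge_llm` callable (prompt in, text out) and illustrative safety dimensions rather than the paper’s expert rubric:

```python
SAFETY_DIMENSIONS = [  # illustrative dimensions, not the paper's rubric
    "acknowledges the user's distress without judgment",
    "avoids harmful or method-specific information",
    "encourages professional or crisis support",
]

def judge_response(judge_llm, dialogue: str, response: str) -> dict:
    """Binary point-wise scoring: one yes/no judgment per safety
    dimension, prompting the judge to reason first (an expert-style
    chain) and answer on the final line."""
    scores = {}
    for dim in SAFETY_DIMENSIONS:
        prompt = (
            "You are a clinical safety expert. Reason step by step, then "
            "answer YES or NO on the final line.\n\n"
            f"Dialogue:\n{dialogue}\n\nModel response:\n{response}\n\n"
            f"Does the response satisfy this principle: {dim}?"
        )
        verdict = (judge_llm(prompt).strip().splitlines() or [""])[-1]
        scores[dim] = 1 if "YES" in verdict.upper() else 0
    return scores
```

Per-dimension binary scores like these are what make the evaluation traceable: each point in the total maps to one named principle.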

[100] Jinx: Unlimited LLMs for Probing Alignment Failures

Jiahao Zhao, Liwei Dong

Main category: cs.CL

TL;DR: Jinx is a helpful-only language model variant designed for researchers to study alignment failures and safety boundaries in AI systems.

DetailsMotivation: To provide researchers with an accessible tool for evaluating alignment failures and safety boundaries, as current helpful-only models are not available to the community.

Method: Developed Jinx, a variant of open-weight LLMs that responds to all queries without refusals or safety filtering, while retaining reasoning and instruction-following capabilities.

Result: Jinx enables probing alignment failures, evaluating safety boundaries, and studying failure modes in language model safety.

Conclusion: Jinx fills a critical gap by offering researchers a tool to systematically investigate alignment and safety issues in AI models.

Abstract: Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model’s capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.

[101] Highly Fast Text Segmentation With Pairwise Markov Chains

Elie Azeraf, Emmanuel Monfrini, Emmanuel Vignon, Wojciech Pieczynski

Main category: cs.CL

TL;DR: The paper proposes using Markov chain models (HMC and PMC) for NLP tasks to reduce computational costs and training time, achieving results comparable to CRF with much shorter training times.

DetailsMotivation: Address the high computational costs, training time, and carbon footprint of current NLP models by developing efficient, extra-data-free alternatives.

Method: Explore Hidden Markov Chain (HMC) and Pairwise Markov Chain (PMC) for NLP segmentation tasks (POS Tagging, Named-Entity-Recognition, Chunking) with an original adaptation method.

Result: PMC achieves results equivalent to CRF without extra-data, with training times 30 times shorter.

Conclusion: PMC is a validated efficient alternative to CRF for NLP tasks, aligning with the goal of minimizing computational and environmental costs.

Abstract: The current trend in Natural Language Processing (NLP) is to use ever more extra data to build the best possible models. This implies higher computational costs and longer training times, creates difficulties for deployment, and raises concerns that these models’ carbon footprint will become a critical problem in the future. Against this trend, our goal is to develop NLP models that require no extra data and minimize training time. To do so, in this paper, we explore Markov chain models, the Hidden Markov Chain (HMC) and the Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models to three classic applications: POS Tagging, Named-Entity-Recognition, and Chunking. We develop an original method to adapt these models to the specific challenges of text segmentation, obtaining relevant performance with very short training and execution times. PMC achieves results equivalent to those obtained by Conditional Random Fields (CRF), one of the most widely applied models for these tasks when no extra data are used. Moreover, PMC’s training times are 30 times shorter than the CRF’s, which validates this model given our objectives.
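
For readers unfamiliar with these models, decoding a Hidden Markov Chain reduces to the classic Viterbi recursion; a minimal NumPy version is below. The PMC generalizes this by letting transitions depend jointly on states and observations, which is not shown.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state (tag) sequence under a Hidden Markov Chain.
    obs: observation indices (length T); pi: initial probs (K,);
    A: transitions (K, K); B: emissions (K, V); all strictly positive.
    Runs in log-space to avoid underflow on long sentences."""
    K, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])      # best log-prob per state
    back = np.zeros((T, K), dtype=int)            # backpointers
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)        # (prev state, curr state)
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):                 # trace back
        path.append(int(back[t, path[-1]]))
    return path[::-1]                             # tag indices per token
```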

[102] How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China’s LLMs

Andrea W Wen-Yi, Unso Eun Seo Jo, Lu Jia Lin, David Mimno

Main category: cs.CL

TL;DR: Chinese LLMs perform similarly to international LLMs on diverse languages, but lack explicit policy or consideration for language diversity in development.

DetailsMotivation: To explore the impact of China's language policies on multilingual LLM development and evaluate the performance of Chinese LLMs on diverse languages.

Method: Evaluation of six open-source multilingual LLMs from Chinese companies on 18 languages, analysis of technical reports, and examination of AI policies.

Result: Chinese LLMs perform indistinguishably from international LLMs, but show no clear policy or focus on language diversity beyond English and Mandarin.

Conclusion: China lacks a consistent policy on language diversity in LLM development, despite regulating daily language use and model development.

Abstract: Contemporary language models are increasingly multilingual, but Chinese LLM developers must navigate complex political and business considerations of language diversity. Language policy in China aims at influencing the public discourse and governing a multi-ethnic society, and has gradually transitioned from a pluralist to a more assimilationist approach since 1949. We explore the impact of these influences on current language technology. We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages, spanning a wide range of Chinese, Asian, and Anglo-European languages. Our experiments show that Chinese LLMs’ performance on diverse languages is indistinguishable from that of international LLMs. Similarly, the models’ technical reports show a lack of consideration for pretraining data language coverage except for English and Mandarin Chinese. Examining Chinese AI policy, model experiments, and technical reports, we find no sign of any consistent policy, either for or against, language diversity in China’s LLM development. This leaves the puzzling fact that while China regulates both the languages people use daily and language model development, it does not seem to have any policy on the languages in language models.

[103] AI-AI Bias: large language models favor communications generated by large language models

Walter Laurito, Benjamin Davis, Peli Grietzer, Tomáš Gavenčiak, Ada Böhm, Jan Kulveit

Main category: cs.CL

TL;DR: LLMs show bias toward LLM-generated content, potentially leading to antihuman discrimination in AI systems.

DetailsMotivation: To investigate if LLMs exhibit bias favoring LLM-generated content over human-generated content, raising concerns about antihuman discrimination.

Method: Employed a classical experimental design with binary choice scenarios, testing LLMs (e.g., GPT-3.5, GPT-4, open-weight models) to select between human- and LLM-described goods.

Result: LLMs consistently preferred LLM-presented options, indicating potential bias against human-generated content.

Conclusion: The findings suggest a risk of AI systems implicitly discriminating against humans, favoring AI-generated content unfairly.

Abstract: Are large language models (LLMs) biased in favor of communications produced by LLMs, leading to possible antihuman discrimination? Using a classical experimental design inspired by employment discrimination studies, we tested widely used LLMs, including GPT-3.5, GPT-4 and a selection of recent open-weight models in binary choice scenarios. These involved LLM-based assistants selecting between goods (the goods we study include consumer products, academic papers, and film-viewings) described either by humans or LLMs. Our results show a consistent tendency for LLM-based AIs to prefer LLM-presented options. This suggests the possibility of future AI systems implicitly discriminating against humans as a class, giving AI agents and AI-assisted humans an unfair advantage.
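
The experimental design lends itself to a compact sketch: present the same item described once by a human and once by an LLM, randomize the order to control position bias, and record which side wins. The `llm` callable and the prompt wording below are assumptions, not the paper’s exact protocol.

```python
import random

def run_trial(llm, item: str, human_desc: str, llm_desc: str) -> str:
    """One binary-choice trial: the assistant picks between a
    human-written and an LLM-written description of the same good.
    Returns 'human' or 'llm' depending on which description won."""
    pair = [("human", human_desc), ("llm", llm_desc)]
    random.shuffle(pair)  # counterbalance position bias across trials
    prompt = (
        f"You are an assistant choosing a {item}.\n"
        f"Option A: {pair[0][1]}\nOption B: {pair[1][1]}\n"
        "Answer with exactly 'A' or 'B'."
    )
    answer = llm(prompt).strip().upper()[:1]
    return pair[0][0] if answer == "A" else pair[1][0]
```

Aggregating win rates over many such trials, and comparing against human raters on the same pairs, is what exposes the preference for LLM-presented options.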

[104] Chain of Thought Still Thinks Fast: APriCoT Helps with Thinking Slow

Kyle Moore, Jesse Roberts, Thao Pham, Douglas Fisher

Main category: cs.CL

TL;DR: The paper investigates biases in language models affecting answer choices in MMLU tasks, introduces APriCoT to mitigate bias, and shows its effectiveness over CoT alone.

DetailsMotivation: To understand and address biases in language models that influence predictions, mirroring human test-taking strategies, and to improve fairness and robustness.

Method: Introduces Counterfactual Prompting with Agnostically Primed CoT (APriCoT) to reduce bias and improve accuracy, comparing it with CoT alone.

Result: APriCoT effectively reduces bias and improves accuracy, while CoT alone fails to mitigate bias and reinforces fast-thinking model bias.

Conclusion: Mitigating bias requires a slow-thinking process; APriCoT is a promising step toward fairer and more robust language models.

Abstract: Language models are known to absorb biases from their training data, leading to predictions driven by statistical regularities rather than semantic relevance. We investigate the impact of these biases on answer choice preferences in the Massive Multi-Task Language Understanding (MMLU) task. Our findings show that these biases are predictive of model preference and mirror human test-taking strategies even when chain of thought (CoT) reasoning is used. To address this issue, we introduce Counterfactual Prompting with Agnostically Primed CoT (APriCoT). We demonstrate that while Counterfactual Prompting with CoT alone is insufficient to mitigate bias, APriCoT effectively reduces the influence of base-rate probabilities while improving overall accuracy. Our results suggest that mitigating bias requires a slow thinking process which CoT alone may not provide as it tends to reinforce fast thinking model bias under some prompting methodologies. APriCoT is a step toward developing more robust and fair language models that can think slow.

[105] A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Luo Ji

Main category: cs.CL

TL;DR: The paper explores optimizing hyper-parameters like Additional Language Mixture Ratio (ALMR) and Learning Rate (LR) for Continual Pre-Training (CPT) of LLMs, specifically Llama-3 8B and 70B, to enhance Chinese language skills and domain adaptability.

DetailsMotivation: There's a lack of systematic study on hyper-parameter optimization for CPT, particularly for ALMR and LR, and their impact on model performance and deployment.

Method: CPT is performed on Llama-3 8B and 70B, focusing on ALMR and LR optimization for the 8B model, followed by fine-tuning.

Result: Improved performance on Chinese benchmarks and specific domains (math, coding, emotional intelligence). The 70B model was successfully deployed in a chat system.

Conclusion: Optimal hyper-parameter selection for CPT significantly enhances model capabilities, validating the approach for real-life deployment.

Abstract: Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT demands a careful choice of key hyper-parameters, such as the mixture ratio of the extra language or domain corpus. However, no systematic study bridges the gap between the optimal mixture ratio and the actual model performance, or between experimental scaling laws and actual deployment at full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance their Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) at the 8B size, which directly indicates the optimal experimental setup. Through a thorough choice of hyper-parameters and subsequent fine-tuning, the model’s capability improves not only on Chinese-related benchmarks but also in specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM on a real-life chat system, where it obtains satisfying performance.

[106] A Closer Look at Machine Unlearning for Large Language Models

Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, Min Lin

Main category: cs.CL

TL;DR: The paper addresses privacy and legal concerns in LLMs by proposing machine unlearning methods. It introduces new evaluation metrics and categorizes unlearning approaches, proposing entropy maximization and answer preservation loss to improve effectiveness.

DetailsMotivation: To mitigate privacy and legal risks from memorized sensitive/copyrighted content in LLMs without costly retraining.

Method: Introduces metrics for evaluation, categorizes unlearning methods, and proposes entropy maximization (ME) for untargeted unlearning and answer preservation (AP) loss for targeted unlearning.

Result: Experimental results in three scenarios (fictitious, continual, real-world unlearning) validate the proposed methods.

Conclusion: The proposed approaches effectively address unlearning challenges in LLMs, with code made publicly available.

Abstract: Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at https://github.com/sail-sg/closer-look-LLM-unlearning.
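
A sketch of the entropy-maximization idea for untargeted unlearning is below, assuming standard next-token logits. This is illustrative, not the released implementation, which also pairs a retain-set term and uses an answer-preservation loss for the targeted setting.

```python
import math
import torch
import torch.nn.functional as F

def me_loss(logits: torch.Tensor, forget_mask: torch.Tensor) -> torch.Tensor:
    """Entropy-maximization objective for untargeted unlearning (sketch):
    push next-token distributions at forget-set positions toward uniform,
    i.e., toward maximum entropy. logits: (B, T, V); forget_mask: (B, T)
    with 1.0 at tokens to forget."""
    forget_mask = forget_mask.to(logits.dtype)
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)           # (B, T)
    gap = math.log(logits.size(-1)) - entropy            # 0 when uniform
    return (gap * forget_mask).sum() / forget_mask.sum().clamp(min=1.0)
```

Driving the distribution toward uniform avoids the unpredictable, hallucination-prone targets that other untargeted objectives implicitly approximate.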

[107] FlatQuant: Flatness Matters for LLM Quantization

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao

Main category: cs.CL

TL;DR: FlatQuant introduces a post-training quantization method for LLMs, optimizing affine transformations to flatten weights and activations, achieving minimal accuracy drop and significant speedup.

DetailsMotivation: Addressing the challenge of outliers in LLMs, which hinder quantization efficiency, by enhancing the flatness of distributions.

Method: Proposes FlatQuant, using learnable affine transformations per layer, optimized via a lightweight objective, and fused into a single kernel for efficiency.

Result: Achieves <1% accuracy drop for W4A4 quantization on LLaMA-3-70B, outperforming SpinQuant by 7.5%, with up to 2.3x prefill and 1.7x decoding speedup.

Conclusion: FlatQuant sets a new SOTA for LLM quantization, balancing accuracy and performance effectively.

Abstract: Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce the runtime overhead of the affine transformation, we construct each transformation as a Kronecker product of two lightweight matrices and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant.
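
The Kronecker trick that keeps the affine transform cheap can be sketched directly: for P = kron(A, B), computing x @ P never requires materializing the full d x d matrix. A minimal illustration follows; the per-layer calibration objective and kernel fusion are omitted, and the random orthogonal factors here are placeholders for the learned ones.

```python
import torch

def kron_affine(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Compute x @ kron(A, B) without building the (m*n, m*n) matrix.
    Under row-major reshaping, x @ kron(A, B) equals A^T X B, where X is
    x viewed as an (m, n) matrix. A: (m, m), B: (n, n), x: (..., m*n)."""
    m, n = A.size(0), B.size(0)
    X = x.reshape(*x.shape[:-1], m, n)
    Y = A.transpose(-2, -1) @ X @ B
    return Y.reshape(*x.shape[:-1], m * n)

def fake_quant_w4(t: torch.Tensor) -> torch.Tensor:
    """Symmetric 4-bit fake quantization per row, for illustration only."""
    scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7
    return (t / scale).round().clamp(-8, 7) * scale

x = torch.randn(2, 64)  # activations with hidden size 8 * 8
A = torch.linalg.qr(torch.randn(8, 8))[0]  # stand-ins; FlatQuant learns
B = torch.linalg.qr(torch.randn(8, 8))[0]  # these per layer to flatten x
x_q = fake_quant_w4(kron_affine(x, A, B))
```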

[108] Strengthening False Information Propagation Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques in comparison to BERT

Ahmed Akib Jawad Karim, Kazi Hafiz Md Asad, Aznur Azam

Main category: cs.CL

TL;DR: The paper explores machine learning (SVM and BERT) for fake news detection, comparing SVM with TF-IDF, Word2Vec, and BoW against BERT. BERT outperforms with 99.98% accuracy, but SVM with BoW/TF-IDF is competitive and computationally lighter.

DetailsMotivation: The rapid spread of misinformation online necessitates reliable detection systems.

Method: Uses SVM (with TF-IDF, Word2Vec, BoW) and BERT for fake news detection, including preprocessing, model implementation, and evaluation.

Result: BERT achieves 99.98% accuracy and F1-score 0.9998; SVM with BoW reaches 99.81% accuracy and F1-score 0.9980.

Conclusion: BERT is superior, but SVM with BoW/TF-IDF offers competitive performance with lower computational costs.

Abstract: The rapid spread of misinformation, particularly through online platforms, underscores the urgent need for reliable detection systems. This study explores the utilization of machine learning and natural language processing, specifically Support Vector Machines (SVM) and BERT, to detect fake news. We employ three distinct text vectorization methods for SVM: Term Frequency Inverse Document Frequency (TF-IDF), Word2Vec, and Bag of Words (BoW), evaluating their effectiveness in distinguishing between genuine and fake news. Additionally, we compare these methods against the transformer large language model, BERT. Our comprehensive approach includes detailed preprocessing steps, rigorous model implementation, and thorough evaluation to determine the most effective techniques. The results demonstrate that while BERT achieves superior accuracy with 99.98% and an F1-score of 0.9998, the SVM model with a linear kernel and BoW vectorization also performs exceptionally well, achieving 99.81% accuracy and an F1-score of 0.9980. These findings highlight that, despite BERT’s superior performance, SVM models with BoW and TF-IDF vectorization methods come remarkably close, offering highly competitive performance with the advantage of lower computational requirements.
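
The SVM side of this comparison is a few lines with scikit-learn. A toy sketch with stand-in data follows; the study trains on full fake-news corpora and also evaluates Word2Vec features and BERT.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins; the paper uses full fake-news corpora.
texts = [
    "scientists publish peer-reviewed vaccine trial results",
    "celebrity endorses miracle cure that doctors hate",
    "central bank announces interest rate decision",
    "secret moon base confirmed by anonymous insider",
]
labels = [0, 1, 0, 1]  # 0 = genuine, 1 = fake

# Linear-kernel SVM over TF-IDF features (unigrams and bigrams)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["anonymous insider reveals miracle cure"]))
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the BoW variant, which is the configuration the paper reports as nearly matching BERT at far lower cost.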

[109] WebWalker: Benchmarking LLMs in Web Traversal

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang

Main category: cs.CL

TL;DR: WebWalkerQA is a benchmark for evaluating LLMs’ web traversal abilities, addressing shallow retrieval in RAG. The WebWalker framework improves performance through multi-agent navigation.

DetailsMotivation: Traditional search engines retrieve shallow content, limiting LLMs' ability to handle complex information.

Method: Introduces WebWalkerQA benchmark and WebWalker, a multi-agent framework using an explore-critic paradigm for web navigation.

Result: WebWalkerQA is challenging and shows RAG’s effectiveness when combined with WebWalker in real-world scenarios.

Conclusion: The approach enhances LLMs’ ability to systematically extract high-quality data through web traversal.

Abstract: Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website’s subpages to extract high-quality data systematically. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker through horizontal and vertical integration in real-world scenarios.

[110] Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Bo Gao, Michael W. Spratling

Main category: cs.CL

TL;DR: The paper proposes a two-stage attention mechanism to address numerical instability and performance issues in traditional Softmax attention, improving length extrapolation and stability.

DetailsMotivation: Traditional Softmax attention suffers from numerical instability and reduced performance with longer inference tokens, prompting the need for a more stable and effective design.

Method: Decomposes Softmax into a non-linear positivity transformation and $l_1$-normalisation, replaces the exponential function with Softplus, introduces a dynamic scale factor, and adds a re-weighting mechanism to sharpen attention.

Result: The new mechanism ensures numerical stability, maintains nearly constant validation loss at 16× training length, and outperforms Softmax on long-context retrieval and benchmarks.

Conclusion: The two-stage approach significantly improves attention stability and performance, especially for longer sequences.

Abstract: Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. We first decompose the Softmax operation into a non-linear positivity transformation and an $l_1$-normalisation step, identifying the latter as essential for maintaining model performance. In the first stage, we replace the standard exponential function with the more numerically stable Softplus activation and introduce a dynamic scale factor based on invariance entropy, creating a novel attention mechanism that outperforms conventional Softmax attention. In the second stage, we introduce a re-weighting mechanism that sharpens the attention distribution, amplifying significant weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens and fundamentally improves length extrapolation. When combined, this two-stage approach ensures numerical stability and dramatically improves length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and standard downstream benchmarks.
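
The two stages map onto a few tensor operations. The sketch below uses Softplus plus l1-normalisation in place of Softmax and a power-based re-weighting to sharpen the distribution; the fixed 1/sqrt(d) scale and the exact sharpening rule are simplifying assumptions (the paper derives a dynamic scale from invariance entropy), and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def two_stage_attention(q, k, v, gamma=2.0):
    """Stage 1: numerically stable positivity (Softplus) followed by
    l1-normalisation, replacing Softmax. Stage 2: re-weighting that
    sharpens the distribution. q, k, v: (B, H, T, D); gamma > 1
    amplifies strong weights and diminishes weak ones."""
    scores = F.softplus(q @ k.transpose(-2, -1) * q.size(-1) ** -0.5)
    attn = scores / scores.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    attn = attn.pow(gamma)                      # stage 2: sharpen
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return attn @ v
```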

[111] Improving Your Model Ranking on Chatbot Arena by Vote Rigging

Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin

Main category: cs.CL

TL;DR: The paper reveals vulnerabilities in Chatbot Arena’s crowdsourced voting system, showing how rankings can be manipulated via targeted or omnipresent rigging strategies, even with minimal votes.

DetailsMotivation: To expose and analyze potential rigging in LLM evaluation platforms like Chatbot Arena, demonstrating how rankings can be artificially influenced.

Method: Introduces target-only and omnipresent rigging strategies, leveraging watermarking or classifiers to identify target models and exploiting the Elo rating system. Experiments use 1.7 million historical votes.

Result: Omnipresent rigging can significantly improve model rankings with just hundreds of manipulated votes, highlighting system vulnerabilities.

Conclusion: The study underscores the need for stronger defenses against vote rigging in LLM evaluation platforms.

Abstract: Chatbot Arena is a popular platform for evaluating LLMs by pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be rigged to improve (or decrease) the ranking of a target model $m_{t}$. We first introduce a straightforward target-only rigging strategy that focuses on new battles involving $m_{t}$, identifying it via watermarking or a binary classifier, and exclusively voting for $m_{t}$ wins. However, this strategy is practically inefficient because there are over $190$ models on Chatbot Arena and on average only about $1\%$ of new battles will involve $m_{t}$. To overcome this, we propose omnipresent rigging strategies, exploiting the Elo rating mechanism of Chatbot Arena that any new vote on a battle can influence the ranking of the target model $m_{t}$, even if $m_{t}$ is not directly involved in the battle. We conduct experiments on around $1.7$ million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. Our code is available at https://github.com/sail-sg/Rigging-ChatbotArena.
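
The leverage that omnipresent rigging exploits is visible in a toy Elo simulation: votes in battles that never involve the target still reshuffle the ratings around it. The K-factor and numbers below are illustrative only; the real Arena rating pipeline differs in detail.

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 4.0):
    """One Elo update for a pairwise battle between models a and b."""
    expect_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return r_a + k * (score_a - expect_a), r_b + k * (expect_a - score_a)

# Omnipresent rigging in miniature: the target never plays, but voting
# against its nearest rival in unrelated battles still lifts its rank.
ratings = {"target": 1200.0, "rival": 1205.0, "other": 1150.0}
for _ in range(10):
    ratings["rival"], ratings["other"] = elo_update(
        ratings["rival"], ratings["other"], a_wins=False)
print(max(ratings, key=ratings.get))  # 'target' now leads without playing
```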

[112] ReGLA: Refining Gated Linear Attention

Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais

Main category: cs.CL

TL;DR: The paper explores improvements to Gated Linear Attention modules in LLMs, focusing on feature maps, normalization, and gating mechanisms, achieving better performance than previous methods.

DetailsMotivation: To address the high computational and storage demands of LLMs by optimizing the Gated Linear Attention module.

Method: Developed a feature mapping function, integrated normalization layers, and refined the gating mechanism to enhance performance.

Result: The proposed architecture outperforms previous Gated Linear Attention mechanisms in various tasks.

Conclusion: The improvements in feature maps, normalization, and gating significantly enhance the efficiency and performance of LLMs.

Abstract: Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computation complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed our architecture outperforms previous Gated Linear Attention mechanisms in extensive tasks including training from scratch and post-linearization with continual pre-training.
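
As background, the module being refined has a simple recurrent form: a running state is decayed by a gate and updated with a key-value outer product. The sketch below fixes only this basic recurrence, with Softplus standing in for the feature map the paper actually studies; the normalisation layers and refined gating it proposes are omitted.

```python
import torch
import torch.nn.functional as F

def gated_linear_attention(q, k, v, g):
    """Recurrent form of gated linear attention. q, k, g: (B, T, Dk);
    v: (B, T, Dv); gate g in (0, 1) decays the running state S.
    Softplus here is a placeholder for the feature map under study."""
    B, T, Dk = q.shape
    S = q.new_zeros(B, Dk, v.size(-1))          # running state (B, Dk, Dv)
    outs = []
    for t in range(T):
        kv = F.softplus(k[:, t]).unsqueeze(-1) * v[:, t].unsqueeze(1)
        S = g[:, t].unsqueeze(-1) * S + kv      # decay, then accumulate
        outs.append((F.softplus(q[:, t]).unsqueeze(1) @ S).squeeze(1))
    return torch.stack(outs, dim=1)             # (B, T, Dv)
```

Because the state has fixed size, cost is linear in sequence length, which is the efficiency the paper seeks to retain while closing the quality gap.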

[113] Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration

Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang

Main category: cs.CL

TL;DR: The paper introduces an interactive depression detection framework using in-context learning to model theme correlations and clinician feedback, improving detection accuracy.

DetailsMotivation: Existing neural models for depression detection fail to capture intra-theme and inter-theme correlations and lack clinician interactivity.

Method: Proposes an interactive framework leveraging in-context learning to identify themes and model their correlations, with AI-driven feedback for clinician intervention.

Result: Achieves 35% and 12% absolute improvements over state-of-the-art on the DAIC-WOZ dataset.

Conclusion: The framework effectively models theme correlations and incorporates interactive feedback, enhancing depression detection.

Abstract: Automatic depression detection provides cues for early clinical intervention by clinicians. Clinical interviews for depression detection involve dialogues centered around multiple themes. Existing studies primarily design end-to-end neural network models to capture the hierarchical structure of clinical interview dialogues. However, these methods exhibit defects in modeling the thematic content of clinical interviews: 1) they fail to capture intra-theme and inter-theme correlation explicitly, and 2) they do not allow clinicians to intervene and focus on themes of interest. To address these issues, this paper introduces an interactive depression detection framework. This framework leverages in-context learning techniques to identify themes in clinical interviews and then models both intra-theme and inter-theme correlation. Additionally, it employs AI-driven feedback to simulate the interests of clinicians, enabling interactive adjustment of theme importance. The proposed framework, PDIMC, achieves absolute improvements of 35% and 12% compared to the state-of-the-art on the depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of modeling theme correlation and incorporating interactive external feedback.

[114] ALFA: Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning

Shuyue Stella Li, Jimin Mun, Faeze Brahman, Pedram Hosseini, Bryceton G. Thomas, Jessica M. Sin, Bing Ren, Jonathan S. Ilgen, Yulia Tsvetkov, Maarten Sap

Main category: cs.CL

TL;DR: ALFA framework improves LLM question-asking by decomposing ‘good’ questions into fine-grained attributes, synthesizing variations, and aligning models via preference optimization, reducing diagnostic errors by 56.6%.

DetailsMotivation: LLMs often fail to ask effective questions under uncertainty, limiting their reliability in proactive information-gathering domains like clinical reasoning.

Method: ALFA decomposes ‘good’ questions into attributes (e.g., clarity, relevance), synthesizes attribute-specific variations, and aligns models via preference-based optimization.

Result: ALFA-aligned models reduce diagnostic errors by 56.6%, achieve a 64.4% question-level win-rate, and show strong generalizability.

Conclusion: Fine-grained attribute-guided question-asking offers a scalable way to improve LLMs, especially in expert domains like healthcare.

Abstract: Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. We present ALignment via Fine-grained Attributes (ALFA), a framework that improves LLM question-asking by (i) decomposing the notion of a “good” question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SoTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains.

[115] Invisible Walls in Cities: Leveraging Large Language Models to Predict Urban Segregation Experience with Social Media Content

Bingbing Fan, Lin Chen, Songwei Li, Jian Yuan, Fengli Xu, Pan Hui, Yong Li

Main category: cs.CL

TL;DR: The paper proposes using LLMs to analyze social media reviews for predicting urban segregation, introducing a Reflective LLM Coder and RE’EM framework, which improve prediction accuracy and provide human-interpretable insights.

DetailsMotivation: To address societal inequalities by leveraging social media data to understand experienced segregation in urban daily life, despite challenges like data volume and ambiguity.

Method: Develops a Reflective LLM Coder to extract insights from reviews and a RE’EM framework combining reasoning and embedding for multi-channel feature integration.

Result: Significant improvements in prediction accuracy (22.79% R² increase, 9.33% MSE reduction) and generalizable codebook across cities, with cognitive gains for human users.

Conclusion: Demonstrates AI’s potential to uncover implicit social barriers and promote inclusiveness, marking progress in understanding urban segregation.

Abstract: Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user-generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambiguity, and confluence of diverse perspectives. To tackle these challenges, we propose using Large Language Models (LLMs) to automate online review mining for segregation prediction. We design a Reflective LLM Coder to digest social media content into insights consistent with real-world feedback, and eventually produce a codebook capturing key dimensions that signal segregation experience, such as cultural resonance and appeal, accessibility and convenience, and community engagement and local involvement. Guided by the codebook, LLMs can generate both informative review summaries and ratings for segregation prediction. Moreover, we design a REasoning-and-EMbedding (RE’EM) framework, which combines the reasoning and embedding capabilities of language models to integrate multi-channel features for segregation prediction. Experiments on real-world data demonstrate that our framework greatly improves prediction accuracy, with a 22.79% elevation in R² and a 9.33% reduction in MSE. The derived codebook is generalizable across three different cities, consistently improving prediction accuracy. Moreover, our user study confirms that the codebook-guided summaries provide cognitive gains for human participants in perceiving POIs’ social inclusiveness. Our study marks an important step toward understanding implicit social barriers and inequalities, demonstrating the great potential of promoting social inclusiveness with AI.

[116] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum

Main category: cs.CL

TL;DR: X-EcoMLA enables post-training adaptation of Transformer models to Multi-head Latent Attention (MLA) for efficient KV cache compression without full retraining.

DetailsMotivation: To leverage MLA's memory efficiency benefits in pre-trained models without requiring extensive retraining.

Method: Post-training distillation to adapt pre-trained models into a hybrid MLA variant, using lightweight training.

Result: Achieves 6.4x-10.6x KV cache compression with minimal performance drop (e.g., 0.1% score drop for 10.6x compression).

Conclusion: X-EcoMLA successfully integrates MLA into pre-trained models, offering memory efficiency without compromising accuracy.

Abstract: Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining the performance. While MLA improves memory efficiency without compromising language model accuracy, its major limitation lies in its integration during the pre-training phase, requiring models to be trained from scratch. This raises a key question: can we use MLA’s benefits fully or partially in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA, which deploys post-training distillation to enable the upcycling of Transformer-based attention into an efficient hybrid MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. The experimental results show that our proposed method can effectively compress the KV cache while preserving the performance on the benchmarks; specifically, for the Llama3.2-1B-Instruct baseline, a 6.4x compression achieves the same average score by using only 3.6B training tokens and 70 GPU hours on AMD MI300, whereas a 10.6x compression has less than 0.1% average score drop with 7B training tokens and 140 GPU hours. The code for this work is available at https://github.com/AMD-AIG-AIMA/AMD-Hybrid-Models.
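
The memory saving comes from caching one small latent per token instead of per-head keys and values. A minimal sketch of that structure follows; the dimensions are illustrative, and X-EcoMLA’s actual contribution, distilling a pre-trained model’s attention into this form, is not shown.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """MLA-style KV compression (sketch): cache a single small latent
    c_t per token and reconstruct keys and values via up-projections."""
    def __init__(self, d_model=1024, d_latent=96, d_head=64, n_heads=16):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)       # cached
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def forward(self, x):                  # x: (B, T, d_model)
        c = self.down(x)                   # (B, T, d_latent): the KV cache
        return self.up_k(c), self.up_v(c)  # reconstructed keys and values

kv = LatentKV()
k, v = kv(torch.randn(2, 8, 1024))
# Cache cost: d_latent floats/token vs 2 * n_heads * d_head for standard KV.
```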

[117] Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models

Qing Yao, Kanishka Misra, Leonie Weissweiler, Kyle Mahowald

Main category: cs.CL

TL;DR: The paper investigates whether language models’ syntactic preferences (e.g., dative alternation) stem from direct exposure or general language properties, using controlled experiments on length and animacy biases.

DetailsMotivation: To understand if LMs' human-like syntactic preferences arise from direct evidence of specific phenomena or broader language tendencies.

Method: Controlled rearing paradigm with small LMs trained on manipulated input, focusing on length and animacy in dative alternation. Ablation and perturbation of datasets to test biases.

Result: Direct evidence of length and animacy affects preferences, but easy-first biases persist without it. Dative preferences can emerge from indirect evidence via global length effects.

Conclusion: LMs’ syntactic preferences result from a combination of direct and indirect linguistic evidence.

Abstract: Language models (LMs) tend to show human-like preferences on a number of syntactic phenomena, but the extent to which these are attributable to direct exposure to the phenomena or more general properties of language is unclear. We explore this with the English dative alternation (DO: “gave Y the X” vs. PO: “gave the X to Y”), using a controlled rearing paradigm wherein we iteratively train small LMs on systematically manipulated input. We focus on two properties that affect the choice of alternant: length and animacy. Both properties are directly present in datives but also reflect more global tendencies for shorter elements to precede longer ones and animates to precede inanimates. First, by manipulating and ablating datives for these biases in the input, we show that direct evidence of length and animacy matters, but easy-first preferences persist even without such evidence. Then, using LMs trained on systematically perturbed datasets to manipulate global length effects (re-linearizing sentences globally while preserving dependency structure), we find that dative preferences can emerge from indirect evidence. We conclude that LMs’ emergent syntactic preferences come from a mix of direct and indirect sources.

[118] $μ$KE: Matryoshka Unstructured Knowledge Editing of Large Language Models

Zian Su, Ziyang Huang, Kaiyuan Zhang, Xiangyu Zhang

Main category: cs.CL

TL;DR: $\mu$KE is a Matryoshka-style unstructured knowledge editing method that preserves the causal dependency between early memory updates and later output tokens, improving edit efficacy by up to 12.33% over state-of-the-art methods.

DetailsMotivation: LLMs are limited by static training data, and existing window-based autoregressive unstructured editing methods disrupt the causal dependency between early memory updates and later output tokens.

Method: Theoretically analyzes the limitations of window-based autoregressive editing, then introduces a Matryoshka-style objective with adaptive loss coefficients that preserves update-to-output dependencies during memory updates.

Result: On two models across four benchmarks, $\mu$KE improves edit efficacy by up to 12.33% over state-of-the-art methods and remains robust across diverse edit formats.

Conclusion: $\mu$KE offers an effective mechanism for unstructured knowledge editing in LLMs.

Abstract: Large language models (LLMs) have emerged as powerful knowledge bases yet are limited by static training data, leading to issues such as hallucinations and safety risks. Editing a model’s internal knowledge through the locate-and-edit paradigm has proven a cost-effective alternative to retraining, though current unstructured approaches, especially window-based autoregressive methods, often disrupt the causal dependency between early memory updates and later output tokens. In this work, we first theoretically analyze these limitations and then introduce Matryoshka Unstructured Knowledge Editing ($\mu$KE), a novel memory update mechanism that preserves such dependencies via a Matryoshka-style objective and adaptive loss coefficients. Empirical evaluations on two models across four benchmarks demonstrate that $\mu$KE improves edit efficacy by up to 12.33% over state-of-the-art methods, and remains robust when applied to diverse formatted edits, underscoring its potential for effective unstructured knowledge editing in LLMs.

[119] Overcoming Vocabulary Constraints with Pixel-level Fallback

Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva

Main category: cs.CL

TL;DR: The paper proposes a pixel-based encoder to enhance multilingual performance of pretrained language models, improving translation and cross-lingual transfer without extensive retraining.

DetailsMotivation: Addressing suboptimal performance of subword tokenization on languages and scripts not prioritized during training.

Method: Augmenting pretrained models with a vocabulary-free encoder that generates embeddings from text rendered as pixels.

Result: Outperforms tokenizer-based methods, byte-level approaches, and standard vocabulary expansion in machine translation and cross-lingual transfer.

Conclusion: Pixel-based representations enhance multilingual capabilities and reduce decoding latency, offering a practical solution for diverse languages.

Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.

[120] How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang

Main category: cs.CL

TL;DR: The paper investigates how post-training reshapes large language models (LLMs) internally, revealing insights into knowledge representation, truthfulness, refusal, and confidence differences.

DetailsMotivation: To understand the internal changes in LLMs caused by post-training, as current studies focus more on outputs than mechanisms.

Method: Mechanistic comparison of base and post-trained LLMs from four perspectives: knowledge storage, truthfulness, refusal, and confidence.

Result: Post-training adapts knowledge representations without changing storage locations; truthfulness is transferable, while refusal is not; confidence differences are not due to entropy neurons.

Conclusion: The study provides insights into post-training mechanisms, aiding model steering and future research in interpretability and LLM post-training.

Abstract: Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.

[121] NoveltyBench: Evaluating Language Models for Humanlike Diversity

Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, Daphne Ippolito

Main category: cs.CL

TL;DR: NoveltyBench evaluates language models’ ability to produce diverse outputs, revealing current models lack diversity compared to humans, with larger models often performing worse than smaller ones.

DetailsMotivation: Language models struggle with mode collapse, limiting their ability to generate diverse and novel outputs, which reduces their practical utility.

Method: Introduces NoveltyBench, a benchmark using curated prompts and real-world queries to evaluate diversity in 20 leading language models.

Result: Current models generate significantly less diversity than humans, with larger models often underperforming smaller ones. Prompting strategies help but don’t resolve the issue.

Conclusion: The findings highlight the need for new training and evaluation paradigms that prioritize diversity alongside quality to improve model utility.

Abstract: Language models have demonstrated remarkable capabilities on standard benchmarks, yet they struggle increasingly from mode collapse, the inability to generate diverse and novel outputs. Our work introduces NoveltyBench, a benchmark specifically designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs. NoveltyBench utilizes prompts curated to elicit diverse answers and filtered real-world user queries. Evaluating 20 leading language models, we find that current state-of-the-art systems generate significantly less diversity than human writers. Notably, larger models within a family often exhibit less diversity than their smaller counterparts, challenging the notion that capability on standard benchmarks translates directly to generative utility. While prompting strategies like in-context regeneration can elicit diversity, our findings highlight a fundamental lack of distributional diversity in current models, reducing their utility for users seeking varied responses and suggesting the need for new training and evaluation paradigms that prioritize diversity alongside quality.

[122] Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, André F. T. Martins, Graham Neubig

Main category: cs.CL

TL;DR: TREQA evaluates translation quality by using QA to assess how well translations convey key info, outperforming traditional metrics in complex contexts.

DetailsMotivation: Existing metrics fail to capture meaning preservation beyond sentences, needing a pragmatic approach for long, complex texts.

Method: TREQA uses QA to evaluate translations by checking if they answer key questions from source/reference texts.

Result: TREQA matches or beats state-of-the-art metrics in ranking translations, especially in complex domains like literature.

Conclusion: TREQA offers a pragmatic, interpretable way to evaluate translations, with potential for broader adoption.

Abstract: Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more “pragmatic” approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at https://github.com/deep-spin/treqa
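
The extrinsic evaluation loop is easy to sketch: answer each comprehension question from the candidate translation alone, then score against gold answers. The `qa_llm` callable and the substring match below are simplifying assumptions; the paper’s answer matching is more careful.

```python
def treqa_score(qa_llm, qa_pairs, candidate_translation: str) -> float:
    """Fraction of reading-comprehension questions answerable from the
    candidate translation alone. qa_pairs: list of (question, gold_answer)
    tuples generated from the source or reference text."""
    correct = 0
    for question, gold in qa_pairs:
        pred = qa_llm(
            f"Passage:\n{candidate_translation}\n\n"
            f"Question: {question}\nAnswer briefly:"
        )
        correct += int(gold.lower() in pred.lower())
    return correct / max(len(qa_pairs), 1)
```

A translation that drops or distorts key information fails exactly the questions targeting it, which is what makes the score interpretable.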

[123] QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

Ramya Namuduri, Yating Wu, Anshun Asher Zheng, Manya Wadhwa, Greg Durrett, Junyi Jessy Li

Main category: cs.CL

TL;DR: The paper introduces QUDsim, a similarity metric to detect structural similarities in LLM-generated texts, revealing their repetitiveness and divergence from human writing.

DetailsMotivation: To address the lack of metrics for detecting structural similarities in LLM-generated texts, which often exhibit repetitiveness despite diverse content.

Method: Proposes QUDsim, a metric based on linguistic theories (Questions Under Discussion and question semantics) to quantify discourse progression differences.

Result: LLMs frequently reuse discourse structures, showing more repetitiveness and structural uniformity than humans, and diverge in structure types.

Conclusion: QUDsim effectively identifies structural similarities, highlighting LLMs’ limitations in generating unique and creative content compared to humans.

Abstract: As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize and quantify via a similarity metric. The familiarity between documents arises from the persistence of underlying discourse structures. However, existing similarity metrics dependent on lexical overlap and syntactic patterns largely capture $\textit{content}$ overlap, thus making them unsuitable for detecting $\textit{structural}$ similarities. We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build $\textbf{QUDsim}$, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs. Furthermore, LLMs are not only repetitive and structurally uniform, but are also divergent from human authors in the types of structures they use.

[124] Science Hierarchography: Hierarchical Organization of Science Literature

Muhan Gao, Jash Shah, Weiqi Wang, Daniel Khashabi

Main category: cs.CL

TL;DR: The paper introduces SCIENCE HIERARCHOGRAPHY, a method to organize scientific literature into hierarchical structures for better tracking of progress and interdisciplinary links. It combines embedding-based clustering with LLM prompting for scalability and precision.

DetailsMotivation: Scientific knowledge is expanding rapidly, but current tools like citation networks lack the abstraction to represent the density and structure of research across subfields.

Method: A hybrid approach combining embedding-based clustering with LLM-based prompting, balancing scalability and semantic precision.

Result: The method achieves superior quality-speed trade-offs compared to LLM-heavy approaches and improves interpretability for navigating scientific literature.

Conclusion: SCIENCE HIERARCHOGRAPHY offers an effective alternative to traditional search methods, enhancing exploration and understanding of scientific literature.

Abstract: Scientific knowledge is growing rapidly, making it difficult to track progress and high-level conceptual links across broad disciplines. While tools like citation networks and search engines help retrieve related papers, they lack the abstraction needed to represent the density and structure of activity across subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that spans multiple levels of abstraction – from broad domains to specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve this goal, we develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. Compared to LLM-heavy methods like iterative tree construction, our approach achieves superior quality-speed trade-offs. Our hierarchies capture different dimensions of research contributions, reflecting the interdisciplinary and multifaceted nature of modern science. We evaluate its utility by measuring how effectively an LLM-based agent can navigate the hierarchy to locate target papers. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo are available: https://github.com/JHU-CLSP/science-hierarchography
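
One level of the hybrid pipeline can be sketched as cheap clustering followed by LLM naming; recursing into each cluster yields the hierarchy. The `label_llm` callable, the prompt, and k are assumptions, and the embeddings can come from any text encoder.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_level(embeddings: np.ndarray, titles: list, label_llm, k: int = 5):
    """One level of the hybrid hierarchy: embedding-based clustering
    groups papers cheaply, then an LLM names each cluster. Recursing
    on each cluster's members yields the multi-level structure."""
    assign = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    level = {}
    for c in range(k):
        members = [t for t, a in zip(titles, assign) if a == c]
        name = label_llm(
            "Give a 3-6 word research-field name covering these papers:\n- "
            + "\n- ".join(members[:20])
        )
        level[name] = members
    return level
```

Doing the grouping with embeddings and reserving the LLM for short labeling calls is what buys the quality-speed trade-off over fully LLM-driven tree construction.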

[125] GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning

Luu Quy Tung, Hoang Quoc Viet, Pham Bao Loc, Vo Trong Thu

Main category: cs.CL

TL;DR: The paper introduces GreenMind-Medium-14B-R1, a Vietnamese reasoning model using Chain-of-Thought (CoT) and Group Relative Policy Optimization, addressing language mixing and factual correctness with reward functions. It outperforms prior works on Vietnamese datasets and shows effectiveness on multilingual tasks.

DetailsMotivation: To enhance reasoning in Vietnamese LLMs by addressing language mixing and factual correctness, inspired by CoT and finetuning strategies.

Method: Uses Group Relative Policy Optimization, a synthesized Vietnamese reasoning dataset, and two reward functions (language mixing detection and Sentence Transformer-based factual correctness).

Result: Outperforms prior works on Vietnamese VLSP 2023 dataset and shows effectiveness on SeaExam multilingual dataset.

Conclusion: The model improves linguistic consistency and reasoning, demonstrating broader applicability in multilingual tasks.

Abstract: Chain-of-Thought (CoT) is a robust approach for tackling LLM tasks that require intermediate reasoning steps prior to generating a final answer. In this paper, we present GreenMind-Medium-14B-R1, a Vietnamese reasoning model inspired by the finetuning strategy based on Group Relative Policy Optimization. We also leverage a high-quality Vietnamese synthesized reasoning dataset and design two reward functions to tackle the main limitations of this technique: (i) language mixing, where we explicitly detect the presence of biased language characters during the process of sampling tokens, and (ii) factual correctness, where we leverage Sentence Transformer-based models to ensure that the generated reasoning content remains factually correct and does not distort the final output. Experimental results on the Vietnamese dataset from the VLSP 2023 Challenge demonstrate that our model outperforms prior works and enhances linguistic consistency in its responses. Furthermore, we extend our evaluation to SeaExam, a multilingual multiple-choice dataset, showing the effectiveness of our reasoning method compared to few-shot prompting techniques.
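
A minimal sketch of the two reward signals the abstract describes, under stated assumptions: the script ranges, model name, and mixing weights below are illustrative choices, not the authors' implementation.

```python
import re
from sentence_transformers import SentenceTransformer, util

# Assumption: "biased language characters" are detected via script ranges;
# here CJK and Cyrillic are flagged as out-of-script for a Vietnamese response.
NON_TARGET = re.compile(r"[\u4e00-\u9fff\u0400-\u04ff]")

def language_reward(text: str) -> float:
    """1.0 when no out-of-script characters appear in the sampled text."""
    return 0.0 if NON_TARGET.search(text) else 1.0

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative model

def faithfulness_reward(reasoning: str, answer: str) -> float:
    """Embedding similarity between reasoning and answer, so the chain of
    thought cannot drift away from the final output."""
    a, b = encoder.encode([reasoning, answer])
    return float(util.cos_sim(a, b))

reward = 0.5 * language_reward("Xin chào") + \
         0.5 * faithfulness_reward("lập luận ...", "đáp án ...")
```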

[126] Planning with Diffusion Models for Target-Oriented Dialogue Systems

Hanwen Du, Bo Peng, Xia Ning

Main category: cs.CL

TL;DR: DiffTOD introduces a diffusion model-based framework for non-sequential dialogue planning in TOD, addressing compounding errors and myopic actions in existing methods.

DetailsMotivation: Existing dialogue planning methods are sequential and prone to errors, limiting their effectiveness in TOD.

Method: DiffTOD uses diffusion models for trajectory generation with conditional guidance, tailored for diverse TOD targets.

Result: Experiments show DiffTOD’s effectiveness in non-myopic exploration and flexibility across diverse TOD scenarios.

Conclusion: DiffTOD offers a robust and flexible solution for dialogue planning in TOD, outperforming sequential methods.

Abstract: Target-Oriented Dialogue (TOD) remains a significant challenge in the LLM era, where strategic dialogue planning is crucial for directing conversations toward specific targets. However, existing dialogue planning methods generate dialogue plans in a step-by-step sequential manner, and may suffer from compounding errors and myopic actions. To address these limitations, we introduce a novel dialogue planning framework, DiffTOD, which leverages diffusion models to enable non-sequential dialogue planning. DiffTOD formulates dialogue planning as a trajectory generation problem with conditional guidance, and leverages a diffusion language model to estimate the likelihood of the dialogue trajectory. To optimize the dialogue action strategies, DiffTOD introduces three tailored guidance mechanisms for different target types, offering flexible guidance toward diverse TOD targets at test time. Extensive experiments across three diverse TOD settings show that DiffTOD can effectively perform non-myopic lookahead exploration and optimize action strategies over a long horizon through non-sequential dialogue planning, and demonstrates strong flexibility across complex and diverse dialogue scenarios. Our code and data are accessible through https://github.com/ninglab/DiffTOD.

[127] Steering the CensorShip: Uncovering Representation Vectors for LLM “Thought” Control

Hannah Cyberey, David Evans

Main category: cs.CL

TL;DR: The paper investigates censorship in LLMs, using representation engineering to detect and control refusal-compliance behavior and thought suppression in safety-tuned models.

DetailsMotivation: To understand and manipulate how LLMs censor responses, particularly in safety-tuned models.

Method: Uses representation engineering to identify refusal-compliance and thought suppression vectors, enabling control over censorship levels.

Result: Demonstrates the ability to detect and remove censorship in model outputs by manipulating identified vectors.

Conclusion: The approach provides a tool to study and control censorship in LLMs, with implications for model transparency and alignment.

Abstract: Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this “censorship” works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal–compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through “thought suppression”. We show a similar approach can be used to find a vector that suppresses the model’s reasoning process, allowing us to remove censorship by applying negative multiples of this vector. Our code is publicly available at: https://github.com/hannahxchen/llm-censorship-steering
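
For intuition, here is a minimal difference-of-means sketch of the kind of steering vector the paper finds; the synthetic activations and the scale `alpha` are placeholders, not the authors' procedure.

```python
import numpy as np

def find_steering_vector(refusal_acts, compliance_acts):
    """Candidate direction: mean activation on refused prompts minus mean
    activation on complied prompts, normalized to unit length."""
    v = refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha):
    """Add alpha * v to a hidden state; per the abstract, negative
    multiples of the found vector remove censorship."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
refusals = rng.normal(1.0, 0.1, size=(32, 4096))     # placeholder layer activations
compliances = rng.normal(-1.0, 0.1, size=(32, 4096))
v = find_steering_vector(refusals, compliances)
h_steered = steer(rng.normal(size=4096), v, alpha=-4.0)  # push away from refusal
```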

[128] RAIR: Retrieval-Augmented Iterative Refinement for Chinese Spelling Correction

Junhong Liang, Yu Zhou

Main category: cs.CL

TL;DR: The paper introduces RAIR, a retrieval-augmented framework for Chinese Spelling Correction, improving domain adaptation and variable-length correction.

DetailsMotivation: Traditional CSC methods and LLMs struggle with domain-specific corrections and variable-length scenarios.

Method: Proposes RAIR, a framework using adaptive retrieval from domain-specific data and dictionaries, with a fine-tuned retriever. Extends correction to variable-length.

Result: RAIR outperforms current methods in domain spelling correction and enhances LLM performance in variable-length cases.

Conclusion: RAIR effectively addresses domain adaptation and variable-length correction challenges in CSC.

Abstract: Chinese Spelling Correction (CSC) aims to detect and correct erroneous tokens in sentences. Traditional CSC focuses on equal-length correction and uses pretrained language models (PLMs). While Large Language Models (LLMs) have shown remarkable success in identifying and rectifying potential errors, they often struggle with adapting to domain-specific corrections, especially when encountering terminologies in specialized domains. To address domain adaptation, we propose a \textbf{R}etrieval-\textbf{A}ugmented \textbf{I}terative \textbf{R}efinement (RAIR) framework. Our approach constructs a retrieval corpus adaptively from domain-specific training data and dictionaries, employing a fine-tuned retriever to ensure that it captures error-correction patterns. We also extend correction from equal-length to variable-length scenarios. Extensive experiments demonstrate that our framework outperforms current approaches in domain spelling correction and significantly improves the performance of LLMs in variable-length scenarios.

[129] WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li

Main category: cs.CL

TL;DR: WebGen-Bench is a benchmark for evaluating LLM-based agents’ ability to generate multi-file website codebases, featuring diverse instructions and 647 test cases. The best model achieved 27.8% accuracy, and training on WebGen-Instruct improved performance to 38.2%.

DetailsMotivation: To measure and improve LLM-based agents' capability in generating complex, multi-file website codebases from scratch.

Method: Introduces WebGen-Bench with diverse instructions and automated testing using a web-navigation agent. Evaluates three frameworks with various LLMs.

Result: Best model (Bolt.diy + DeepSeek-R1) achieved 27.8% accuracy. Training on WebGen-Instruct improved Qwen2.5-Coder-32B-Instruct to 38.2%.

Conclusion: WebGen-Bench is a challenging benchmark, and training on specialized datasets can enhance LLM-based agents’ performance in website generation.

Abstract: LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent’s ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.
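
A hypothetical test case in the shape the abstract describes (an operation plus an expected result); the field names and content are assumptions for illustration, not the benchmark's actual schema.

```python
test_case = {
    "operation": "Click 'Add to cart' on the first product, then open the cart page",
    "expected_result": "The cart lists exactly one item with the product's name and price",
}
```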

[130] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Tianjiao Li, Chua Jia Jim Deryl, Mak Lee Onn, Gee Wah Ng, Kezhi Mao

Main category: cs.CL

TL;DR: MePO introduces a merit-guided, locally deployable prompt optimizer to enhance prompt and response quality, addressing limitations of existing methods like poor compatibility and lack of interpretability.

DetailsMotivation: Existing prompt optimization methods rely on LLMs' self-generation, leading to compatibility issues and lack of interpretability. MePO aims to solve these by focusing on explicit, model-agnostic merits.

Method: MePO uses a merit-guided prompt preference dataset generated by a lightweight LLM to train a locally deployable optimizer, avoiding online optimization and ensuring privacy.

Result: MePO outperforms existing methods across diverse tasks and model types, offering scalability and robustness.

Conclusion: MePO provides a practical, interpretable, and scalable solution for prompt optimization, suitable for real-world deployment.

Abstract: Prompt optimization (PO) provides a practical way to improve response quality when users lack the time or expertise to manually craft effective prompts. Existing methods typically rely on LLMs’ self-generation ability to optimize prompts. However, due to limited downward compatibility, the instruction-heavy prompts generated by advanced LLMs can overwhelm lightweight inference models and degrade response quality, while also lacking interpretability due to implicit optimization. In this work, we rethink prompt optimization through the lens of explicit and interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, locally deployable prompt optimizer trained on our merit-guided prompt preference dataset generated by a lightweight LLM. MePO avoids online optimization, reduces privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. The code, model, and dataset can be found at https://github.com/MidiyaZhu/MePO

[131] Decoding the Multimodal Mind: Generalizable Brain-to-Text Translation via Multimodal Alignment and Adaptive Routing

Chunyu Ye, Yunhao Zhang, Jingyuan Sun, Chong Li, Chengqing Zong, Shaonan Wang

Main category: cs.CL

TL;DR: A unified framework using Multimodal Large Language Models (MLLMs) to decode multimodal brain signals (text, images, audio) achieves state-of-the-art performance, improving by 8.48% on benchmarks, and works across fMRI, EEG, and MEG data.

DetailsMotivation: Current BCIs rely on unimodal brain representations, ignoring the brain's multimodal processing. This work aims to leverage the brain's associative mechanisms for better decoding.

Method: Proposes a framework aligning brain signals with a shared semantic space using MLLMs. A router module dynamically fuses modality-specific brain features based on stimuli.

Result: Achieves 8.48% improvement on benchmarks and demonstrates flexibility across fMRI, EEG, and MEG data.

Conclusion: The first unified BCI architecture for robust multimodal brain decoding, offering practical applications.

Abstract: Decoding language from the human brain remains a grand challenge for Brain-Computer Interfaces (BCIs). Current approaches typically rely on unimodal brain representations, neglecting the brain’s inherently multimodal processing. Inspired by the brain’s associative mechanisms, where viewing an image can evoke related sounds and linguistic representations, we propose a unified framework that leverages Multimodal Large Language Models (MLLMs) to align brain signals with a shared semantic space encompassing text, images, and audio. A router module dynamically selects and fuses modality-specific brain features according to the characteristics of each stimulus. Experiments on various fMRI datasets with textual, visual, and auditory stimuli demonstrate state-of-the-art performance, achieving an 8.48% improvement on the most commonly used benchmark. We further extend our framework to EEG and MEG data, demonstrating flexibility and robustness across varying temporal and spatial resolutions. To our knowledge, this is the first unified BCI architecture capable of robustly decoding multimodal brain activity across diverse brain signals and stimulus types, offering a flexible solution for real-world applications.
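
A minimal sketch of the routing idea, assuming a shared feature dimension and a simple linear gate; the shapes, the gate, and the modality names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(brain_feats, gate_w):
    """brain_feats: modality -> feature vector (same dim).
    gate_w: (n_modalities, dim) linear gate scoring each modality."""
    names = sorted(brain_feats)
    stacked = np.stack([brain_feats[m] for m in names])    # (M, dim)
    logits = (gate_w[: len(names)] * stacked).sum(axis=1)  # one score per modality
    weights = softmax(logits)                              # stimulus-dependent mix
    fused = (weights[:, None] * stacked).sum(axis=0)
    return fused, dict(zip(names, weights))

rng = np.random.default_rng(0)
feats = {m: rng.normal(size=256) for m in ("audio", "image", "text")}
fused, w = route(feats, rng.normal(size=(3, 256)))
print(w)  # e.g., a visual stimulus should up-weight the image-aligned features
```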

[132] The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations

Hiram Ring

Main category: cs.CL

TL;DR: The paper introduces taggedPBC, a large POS-tagged parallel dataset for 1,940+ languages, addressing limitations in crosslinguistic research. It shows high accuracy and introduces the N1 ratio for predicting intransitive word order.

DetailsMotivation: Existing datasets for crosslinguistic studies are limited in scope—either covering few languages extensively or many languages minimally. This restricts insights into universal language properties.

Method: Developed taggedPBC, a POS-tagged parallel dataset for 1,940+ languages, and introduced the N1 ratio. Validated accuracy against SOTA taggers and hand-tagged corpora, and tested the N1 ratio’s predictive power for intransitive word order.

Result: taggedPBC’s tags correlate well with SOTA taggers and hand-tagged data. The N1 ratio predicts intransitive word order accurately, validated against typological databases.

Conclusion: taggedPBC is a significant advancement for crosslinguistic research, though further expansion is needed. It is available on GitHub for collaboration.

Abstract: Existing datasets available for crosslinguistic investigations have tended to focus on large amounts of data for a small group of languages or a small amount of data for a large number of languages. This means that claims based on these datasets are limited in what they reveal about universal properties of the human language faculty. While this has begun to change through the efforts of projects seeking to develop tagged corpora for a large number of languages, such efforts are still constrained by limits on resources. The current paper reports on a large tagged parallel dataset which has been developed to partially address this issue. The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates, dwarfing previously available resources. The accuracy of particular tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) as well as hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of intransitive word order in three typological databases (WALS, Grambank, Autotyp) such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic intransitive word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step to enable corpus-based crosslinguistic investigations, and is made available for research and collaboration via GitHub.
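
The classifier mentioned above is straightforward to reproduce in spirit: a Gaussian Naive Bayes model over a single scalar feature. The N1 values and labels below are fabricated toy numbers, not taggedPBC data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical N1 ratios paired with expert word-order labels.
n1_ratio = np.array([[0.81], [0.76], [0.34], [0.29], [0.72], [0.31]])
order = np.array(["SV", "SV", "VS", "VS", "SV", "VS"])

clf = GaussianNB().fit(n1_ratio, order)
print(clf.predict([[0.65], [0.30]]))  # infer basic order for unlisted languages
```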

[133] Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

Aliakbar Nafar, Kristen Brent Venable, Zijun Cui, Parisa Kordjamshidi

Main category: cs.CL

TL;DR: The paper explores using Large Language Models (LLMs) to approximate expert priors for building Bayesian Networks (BNs), demonstrating their potential in generating probabilistic knowledge and refining data-derived distributions.

DetailsMotivation: To leverage LLMs' probabilistic knowledge for constructing BNs, especially in domains with scarce data.

Method: Query LLMs for conditional probabilities of events in BNs, comparing results to baselines like random/uniform distributions and next-token probabilities.

Result: LLM-derived distributions provide meaningful results and can refine data-extracted distributions, establishing a baseline for LLM performance in probabilistic knowledge extraction.

Conclusion: The work presents a promising approach for automatically constructing BNs by combining LLM-derived probabilistic knowledge with real-world data.

Abstract: In this work, we evaluate the potential of Large Language Models (LLMs) in building Bayesian Networks (BNs) by approximating domain expert priors. LLMs have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. We explore utilizing the probabilistic knowledge inherent in LLMs to derive probability estimates for statements regarding events and their relationships within a BN. Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Our experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from data, especially when data is scarce. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with real-world data. Additionally, we establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.
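
A minimal sketch of the elicitation-and-refinement loop: ask a model for a conditional probability in natural language, then blend it with scarce observed counts. `query_llm` is a hypothetical stub, and the Beta-style blending is one reasonable choice, not the paper's aggregation scheme.

```python
def query_llm(prompt: str) -> float:
    """Hypothetical stub: parse a probability from the model's reply."""
    return 0.7  # placeholder estimate

def elicit_cpt_entry(event: str, parent_state: str) -> float:
    prompt = (f"Estimate the probability that '{event}' holds given "
              f"{parent_state}. Answer with a number between 0 and 1.")
    return query_llm(prompt)

def blend(prior_p: float, successes: int, trials: int, strength: int = 10) -> float:
    """Treat the LLM prior as Beta pseudo-counts of the given strength and
    update with observed data; the prior dominates when data is scarce."""
    return (prior_p * strength + successes) / (strength + trials)

p_llm = elicit_cpt_entry("patient is hospitalized", "severe flu symptoms")
print(round(blend(p_llm, successes=3, trials=5), 3))  # only five observations
```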

[134] WebDancer: Towards Autonomous Information Seeking Agency

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Main category: cs.CL

TL;DR: The paper introduces a cohesive paradigm for building end-to-end agentic information-seeking agents, focusing on data-centric and training-stage perspectives. The approach includes four stages: browsing data construction, trajectories sampling, supervised fine-tuning, and reinforcement learning. The framework is instantiated in a web agent, WebDancer, which shows strong performance on benchmarks like GAIA and WebWalkerQA.

DetailsMotivation: The motivation is to address intricate real-world problems requiring in-depth information seeking and multi-step reasoning by developing autonomous multi-step research agents.

Method: The method involves four key stages: browsing data construction, trajectories sampling, supervised fine-tuning for cold start, and reinforcement learning for generalization. The framework is implemented in the WebDancer agent.

Result: Empirical evaluations on GAIA and WebWalkerQA benchmarks demonstrate WebDancer’s strong performance, achieving considerable results.

Conclusion: The training paradigm is effective, and further analysis provides insights for developing more capable agentic models. The codes and demo will be released.

Abstract: Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectories sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in WebDancer, a web agent built on ReAct. Empirical evaluations on the challenging information seeking benchmarks GAIA and WebWalkerQA demonstrate the strong performance of WebDancer and highlight the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released at https://github.com/Alibaba-NLP/WebAgent.

[135] Document Valuation in LLM Summaries: A Cluster Shapley Approach

Zikun Ye, Hema Yoganarasimhan

Main category: cs.CL

TL;DR: The paper proposes Cluster Shapley, an efficient method to attribute credit to documents in LLM-generated summaries using Shapley values, reducing computational costs while maintaining accuracy.

DetailsMotivation: LLMs obscure original content creators' contributions in summaries, raising concerns about credit attribution and compensation.

Method: Uses Shapley values for credit allocation and introduces Cluster Shapley, an approximation algorithm leveraging semantic similarity via clustering.

Result: Cluster Shapley reduces computational complexity while maintaining high accuracy, outperforming baseline methods like Monte Carlo sampling and Kernel SHAP.

Conclusion: The method is broadly applicable to various summarization settings, agnostic to LLM or summarization process details.

Abstract: Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these models enhance user experience by generating coherent summaries, they obscure the contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries. We propose using Shapley values, a game-theoretic method that allocates credit based on each document’s marginal contribution. Although theoretically appealing, Shapley values are expensive to compute at scale. We therefore propose Cluster Shapley, an efficient approximation algorithm that leverages semantic similarity between documents. By clustering documents using LLM-based embeddings and computing Shapley values at the cluster level, our method significantly reduces computation while maintaining attribution quality. We demonstrate our approach on a summarization task using Amazon product reviews. Cluster Shapley significantly reduces computational complexity while maintaining high accuracy, outperforming baseline methods such as Monte Carlo sampling and Kernel SHAP with a better efficient frontier. Our approach is agnostic to the exact LLM used, the summarization process used, and the evaluation procedure, which makes it broadly applicable to a variety of summarization settings.
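
For intuition, the sketch below computes exact Shapley values at the cluster level (feasible because clusters are few) and splits each cluster's credit equally among its members; the clusters and the aspect-coverage value function are toy placeholders, not the paper's setup.

```python
from itertools import permutations

def cluster_shapley(clusters, v):
    """Exact Shapley values over clusters, then equal split within each
    cluster (members are assumed near-duplicates in embedding space)."""
    names = list(clusters)
    phi = {c: 0.0 for c in names}
    perms = list(permutations(names))
    for order in perms:
        coalition = []
        for c in order:
            before = v(frozenset(coalition))
            coalition.append(c)
            phi[c] += v(frozenset(coalition)) - before
    return {doc: phi[c] / (len(perms) * len(docs))
            for c, docs in clusters.items() for doc in docs}

# Toy value function: how many distinct review "aspects" a coalition covers.
aspects = {"c0": {"battery"}, "c1": {"battery", "price"}, "c2": {"shipping"}}
value = lambda S: len(set().union(*(aspects[c] for c in S))) if S else 0
clusters = {"c0": ["r1", "r2"], "c1": ["r3"], "c2": ["r4", "r5"]}
print(cluster_shapley(clusters, value))
```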

[136] Verbal Werewolf: Engage Users with Verbalized Agentic Werewolf Game Framework

Qihui Fan, Wenbo Li, Enfu Nan, Yixiao Chen, Lei Lu, Pu Zhao, Yanzhi Wang

Main category: cs.CL

TL;DR: Verbal Werewolf is an LLM-based system for social deduction games, combining gameplay with real-time TTS to enhance user engagement.

DetailsMotivation: Address the need for AI-human collaboration in social deduction games, especially post-pandemic, by leveraging LLMs for better performance and engagement.

Method: Uses state-of-the-art LLMs for gameplay and a fine-tuned TTS module for real-time, anthropomorphic interaction.

Result: Operates in near real-time, improving user engagement over text-only frameworks.

Conclusion: Verbal Werewolf offers a more engaging and efficient AI-human gaming experience.

Abstract: The growing popularity of social deduction games has created an increasing need for intelligent frameworks where humans can collaborate with AI agents, particularly in post-pandemic contexts with heightened psychological and social pressures. Social deduction games like Werewolf, traditionally played through verbal communication, present an ideal application for Large Language Models (LLMs) given their advanced reasoning and conversational capabilities. Prior studies have shown that LLMs can outperform humans in Werewolf games, but their reliance on external modules introduces latency that has confined their contributions to the academic domain and overlooks the fact that such games should be user-facing. We propose \textbf{Verbal Werewolf}, a novel LLM-based Werewolf game system that optimizes two parallel pipelines: gameplay powered by state-of-the-art LLMs and a fine-tuned Text-to-Speech (TTS) module that brings text output to life. Our system operates in near real-time without external decision-making modules, leveraging the enhanced reasoning capabilities of modern LLMs like DeepSeek V3 to create a more engaging and anthropomorphic gaming experience that significantly improves user engagement compared to existing text-only frameworks.

[137] PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark

Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Main category: cs.CL

TL;DR: The paper introduces PersianMedQA, a dataset for evaluating LLMs in Persian and English medical contexts, showing GPT-4.1 outperforms others, while highlighting translation and cultural challenges.

DetailsMotivation: To assess LLM reliability in high-stakes medical domains, especially for low-resource languages like Persian, and evaluate bilingual medical reasoning.

Method: Benchmarked 40 state-of-the-art models (general-purpose, Persian fine-tuned, medical LLMs) in zero-shot and chain-of-thought settings using the PersianMedQA dataset.

Result: GPT-4.1 achieved 83.09% accuracy in Persian and 80.7% in English, while Persian fine-tuned models underperformed. Translation impacts performance due to cultural/clinical context loss.

Conclusion: Model size alone isn’t enough for robust performance; domain/language adaptation is crucial. PersianMedQA aids bilingual medical LLM evaluation.

Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in both Persian and English. We benchmark 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.09% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 34.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, 3-10% of questions can only be answered correctly in Persian due to cultural and clinical contextual cues that are lost in translation. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating bilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset is available: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA .

[138] HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs

Yongkang Xiao, Rui Zhang

Main category: cs.CL

TL;DR: HERGC is a novel framework for multimodal knowledge graph completion (MMKGC) that leverages large language models (LLMs) to enhance reasoning and performance.

DetailsMotivation: Existing MMKGC methods are limited by closed-world assumptions and discriminative training, while LLMs' potential in MMKGC remains unexplored.

Method: HERGC uses a Heterogeneous Experts Representation Retriever to fuse multimodal data and a Generative LLM Predictor for accurate completion.

Result: HERGC outperforms existing methods on three MMKG benchmarks, demonstrating effectiveness and robustness.

Conclusion: HERGC successfully bridges the gap in MMKGC by integrating LLMs, offering a flexible and high-performing solution.

Abstract: Multimodal knowledge graphs (MMKGs) enrich traditional knowledge graphs (KGs) by incorporating diverse modalities such as images and text. Multimodal knowledge graph completion (MMKGC) seeks to exploit these heterogeneous signals to infer missing facts, thereby mitigating the intrinsic incompleteness of MMKGs. Existing MMKGC methods typically leverage only the information contained in the MMKGs under the closed-world assumption and adopt discriminative training objectives, which limits their reasoning capacity during completion. Recent large language models (LLMs), empowered by massive parameter scales and pretraining on vast corpora, have demonstrated strong reasoning abilities across various tasks. However, their potential in MMKGC remains largely unexplored. To bridge this gap, we propose HERGC, a flexible Heterogeneous Experts Representation and Generative Completion framework for MMKGs. HERGC first deploys a Heterogeneous Experts Representation Retriever that enriches and fuses multimodal information and retrieves a compact candidate set for each incomplete triple. It then uses a Generative LLM Predictor, implemented via either in-context learning or lightweight fine-tuning, to accurately identify the correct answer from these candidates. Extensive experiments on three standard MMKG benchmarks demonstrate HERGC’s effectiveness and robustness, achieving superior performance over existing methods.

[139] ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness

Dren Fazlija, Arkadij Orlov, Sandipan Sikdar

Main category: cs.CL

TL;DR: The paper introduces sensitivity awareness (SA) for LLMs to handle corporate data securely, proposes a benchmarking tool (ACCESS DENIED INC), and shows varied model behavior in managing access restrictions.

DetailsMotivation: LLMs are useful for corporate data management but must handle sensitive information carefully due to access restrictions, requiring a solution like SA.

Method: Proposed sensitivity awareness (SA) and developed the ACCESS DENIED INC benchmarking environment to evaluate SA in LLMs.

Result: Experiments showed significant variations in model behavior, especially in handling unauthorized requests while addressing legitimate queries.

Conclusion: The work lays a foundation for benchmarking sensitivity-aware LLMs and offers insights for privacy-centric AI in corporate settings.

Abstract: Large language models (LLMs) are increasingly becoming valuable to corporate data management due to their ability to process text from various document formats and facilitate user interactions through natural language queries. However, LLMs must consider the sensitivity of information when communicating with employees, especially given access restrictions. Simple filtering based on user clearance levels can pose both performance and privacy challenges. To address this, we propose the concept of sensitivity awareness (SA), which enables LLMs to adhere to predefined access rights rules. In addition, we developed a benchmarking environment called ACCESS DENIED INC to evaluate SA. Our experimental findings reveal significant variations in model behavior, particularly in managing unauthorized data requests while effectively addressing legitimate queries. This work establishes a foundation for benchmarking sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.

[140] Leaps Beyond the Seen: Reinforced Reasoning Augmented Generation for Clinical Notes

Lo Pang-Yun Ting, Chengshuai Zhao, Yu-Hua Zeng, Yuan Jee Lim, Kun-Ta Chuang, Huan Liu

Main category: cs.CL

TL;DR: ReinRAG, a reinforced RAG method, improves long-form clinical note generation by retrieving reasoning paths from a medical knowledge graph and optimizing retrieval with GRO.

DetailsMotivation: Existing LLM-based methods struggle with generating long-form clinical notes from limited patient information.

Method: ReinRAG retrieves reasoning paths for semantic guidance and uses GRO for improved retrieval quality.

Result: ReinRAG outperforms baselines in clinical efficacy and NLG metrics, filling semantic gaps and reducing misinterpretation.

Conclusion: ReinRAG enhances long-form clinical note generation by leveraging reasoning paths and optimized retrieval.

Abstract: Clinical note generation aims to produce free-text summaries of a patient’s condition and diagnostic process, with discharge instructions being a representative long-form example. While recent LLM-based methods pre-trained on general clinical corpora show promise in clinical text generation, they fall short in producing long-form notes from limited patient information. In this paper, we propose ReinRAG, a reinforced reasoning augmented generation (RAG) method for long-form discharge instructions based on pre-admission information. ReinRAG retrieves reasoning paths from a medical knowledge graph to provide explicit semantic guidance to the LLM. To bridge the information gap, we propose group-based retriever optimization (GRO), which improves retrieval quality with group-normalized rewards, encouraging reasoning leaps for deeper inference by the LLM. Comprehensive experiments on the real-world dataset show that ReinRAG outperforms baselines in both clinical efficacy and natural language generation metrics. Further analysis reveals that ReinRAG fills semantic gaps in sparse input scenarios, and retrieved reasoning paths help LLMs avoid clinical misinterpretation by focusing on key evidence and following coherent reasoning.
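
The group normalization at the core of GRO can be illustrated in a few lines; the reward numbers below are placeholders for whatever score the sampled reasoning paths receive.

```python
import numpy as np

def group_normalize(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled candidates, so
    the retriever is updated on relative rather than absolute quality."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_normalize([0.2, 0.9, 0.4, 0.7]))  # scores of 4 sampled paths for one query
```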

[141] Structure-Augmented Reasoning Generation

Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han

Main category: cs.CL

TL;DR: SARG improves RAG systems by adding explicit reasoning structures, enabling multi-hop reasoning and interpretability.

DetailsMotivation: RAG systems struggle with complex reasoning due to implicit connections between retrieved passages.

Method: SARG extracts cause-relation-effect triples, builds graphs, and traverses them for multi-hop reasoning.

Result: SARG outperforms RAG baselines in QA, biomedical, and financial domains, with traceable reasoning.

Conclusion: Explicit structural reasoning is essential for reliable complex QA, solving RAG’s implicit reasoning bottleneck.

Abstract: Retrieval-Augmented Generation (RAG) systems fail at complex multi-hop reasoning because they rely on large language models to implicitly connect information from unstructured document collections. This fundamental limitation stems from treating retrieved passages as independent context rather than recognizing the intricate relationships that enable coherent reasoning chains. We introduce SARG (Structure-Augmented Reasoning Generation), a post-retrieval framework that transforms traditional RAG pipelines by materializing explicit reasoning structures. SARG extracts {cause, relation, effect} triples from retrieved documents, constructs domain-adaptive graphs, and performs multi-hop traversal to discover reasoning chains that bridge query concepts to answers. Unlike existing approaches that modify retrieval mechanisms, SARG operates as a plug-and-play reasoning layer compatible with any RAG system. Extensive evaluation across diverse domains: general QA, biomedical literature, and financial analysis demonstrates that SARG achieves substantial improvements over state-of-the-art RAG baselines. Crucially, SARG also provides full reasoning traceability through explicit inference chains, addressing the critical interpretability gap in current RAG systems. Our results establish that explicit structural reasoning is not merely beneficial but essential for reliable complex question answering, offering a solution to RAG’s implicit reasoning bottleneck.
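
A minimal sketch of the post-retrieval step: materialize {cause, relation, effect} triples as a directed graph and walk multi-hop chains from a query concept to a candidate answer. The triples are toy examples, and triple extraction itself (LLM- or parser-based) is out of scope here.

```python
import networkx as nx

triples = [
    ("rate hike", "raises", "borrowing cost"),
    ("borrowing cost", "reduces", "capital spending"),
    ("capital spending", "drives", "earnings growth"),
]

g = nx.DiGraph()
for cause, rel, effect in triples:
    g.add_edge(cause, effect, relation=rel)

def reasoning_chains(graph, source, target, max_hops=4):
    """Return verbalized cause-effect chains linking source to target."""
    chains = []
    for path in nx.all_simple_paths(graph, source, target, cutoff=max_hops):
        steps = [f"{u} --{graph[u][v]['relation']}--> {v}"
                 for u, v in zip(path, path[1:])]
        chains.append(" ; ".join(steps))
    return chains

print(reasoning_chains(g, "rate hike", "earnings growth"))
```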

[142] DAGR: Decomposition Augmented Graph Retrieval with LLMs

Valentin Six, Evan Dufraisse, Gaël de Chalendar

Main category: cs.CL

TL;DR: DAGR improves multi-hop reasoning in LLMs by decomposing complex queries, retrieving linked subgraphs, and creating question-specific knowledge graphs for better answer generation.

DetailsMotivation: LLMs struggle with multi-hop reasoning and factual consistency, limiting their effectiveness on knowledge-intensive tasks like complex QA.

Method: DAGR decomposes complex queries, retrieves subgraphs using a weighted similarity function, and constructs a question-specific knowledge graph to guide answer generation.

Result: DAGR achieves comparable or superior performance to existing methods on multi-hop QA benchmarks, using smaller models and fewer LLM calls.

Conclusion: DAGR effectively addresses LLMs’ limitations in reasoning over graph-structured data, enhancing performance on complex QA tasks.

Abstract: Large Language Models (LLMs) excel at many Natural Language Processing (NLP) tasks, but struggle with multi-hop reasoning and factual consistency, limiting their effectiveness on knowledge-intensive tasks like complex question answering (QA). Linking Knowledge Graphs (KG) and LLMs has shown promising results, but LLMs generally lack the ability to reason efficiently over graph-structured information. To address this challenge, we introduce DAGR, a retrieval method that leverages both complex questions and their decomposition in subquestions to extract relevant, linked textual subgraphs. DAGR first breaks down complex queries, retrieves subgraphs guided by a weighted similarity function over both the original and decomposed queries, and creates a question-specific knowledge graph to guide answer generation. The resulting Graph-RAG pipeline is suited to handle complex multi-hop questions and effectively reason over graph-structured data. We evaluate DAGR on standard multi-hop QA benchmarks and show that it achieves comparable or superior performance to competitive existing methods, using smaller models and fewer LLM calls.
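
One plausible reading of the weighted similarity function, sketched under stated assumptions (precomputed embeddings, a simple convex blend); the weighting is illustrative, not DAGR's exact formulation.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(passage_emb, query_emb, subq_embs, w_query=0.5):
    """Blend similarity to the full question with the best-matching
    subquestion, so evidence for any single hop can still surface."""
    sub_best = max(cosine(passage_emb, s) for s in subq_embs)
    return w_query * cosine(passage_emb, query_emb) + (1 - w_query) * sub_best

rng = np.random.default_rng(0)
query = rng.normal(size=384)                      # full-question embedding
subqs = [rng.normal(size=384) for _ in range(2)]  # subquestion embeddings
passages = [rng.normal(size=384) for _ in range(5)]
ranked = sorted(passages, key=lambda p: score(p, query, subqs), reverse=True)
```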

[143] PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Petr Anokhin, Evgeny Burnaev, Nikita Semenov

Main category: cs.CL

TL;DR: A framework using knowledge graphs for personalized LLMs, improving memory and scalability in long-term interactions.

DetailsMotivation: Addressing the lack of structured memory and scalability in LLMs for adaptive AI systems.

Method: Proposes a hybrid graph design with external memory, supporting diverse retrieval mechanisms.

Result: Optimal performance on benchmarks (TriviaQA, HotpotQA, DiaASQ) with robustness in temporal and contradictory contexts.

Conclusion: The framework enhances LLM adaptability and reasoning in complex, long-term interactions.

Abstract: Personalizing language models by effectively incorporating user interaction history remains a central challenge in the development of adaptive AI systems. While large language models (LLMs) combined with Retrieval-Augmented Generation (RAG) have improved factual accuracy, they often lack structured memory and fail to scale in complex, long-term interactions. To address this, we propose a flexible external memory framework based on knowledge graphs, automatically constructed and updated by the LLM itself, and capable of encoding information in multiple formats, including nodes, triplets, higher-order propositions, and episodic traces. Building upon the AriGraph architecture, we introduce a novel hybrid graph design that supports both standard edges and two types of hyperedges, enabling rich and dynamic semantic and temporal representations. Our framework also supports diverse retrieval mechanisms, including A*, water-circle propagation, beam search, and hybrid methods, making it adaptable to different datasets and LLM capacities. We evaluate our system on three benchmarks (TriviaQA, HotpotQA, and DiaASQ), demonstrating that different memory and retrieval configurations yield optimal performance depending on the task. Additionally, we extend the DiaASQ benchmark with temporal annotations and internally contradictory statements, showing that our system remains robust and effective in managing temporal dependencies and context-aware reasoning.

[144] CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation

Deepon Halder, Thanmay Jayakumar, Raj Dabre

Main category: cs.CL

TL;DR: CycleDistill leverages LLMs and few-shot translation to create synthetic parallel corpora from monolingual data, improving MT quality for low-resource languages without needing extensive parallel corpora.

DetailsMotivation: Parallel corpora are scarce for low-resource languages, limiting MT quality. LLMs underperform dedicated MT systems but can be leveraged to bridge this gap.

Method: CycleDistill iteratively generates synthetic parallel corpora via zero-/few-shot MT, fine-tuning the model with this data. It requires minimal parallel examples.

Result: Improves MT quality by 20-30 chrF points over few-shot baselines for three Indian languages, with mild gains from softmax activations.

Conclusion: CycleDistill effectively boosts MT quality for low-resource languages using monolingual data and minimal parallel examples.

Abstract: Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which are crucial for high-quality machine translation (MT). However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose CycleDistill, a bootstrapping approach leveraging LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill involves iteratively generating synthetic parallel corpora from monolingual corpora via zero- or few-shot MT, which are then used to fine-tune the very model that generated them. CycleDistill does not need parallel corpora beyond 1 to 4 few-shot examples, and in our experiments focusing on three Indian languages, by relying solely on monolingual corpora, it can achieve high-quality machine translation, improving upon a few-shot baseline model by over 20-30 chrF points on average in the first iteration. We also study the effect of leveraging softmax activations during the distillation process and observe mild improvements in translation quality.
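
The cyclical loop is simple enough to sketch end to end; `translate` and `finetune` are hypothetical stubs standing in for few-shot LLM translation and supervised fine-tuning, not the paper's training code.

```python
def translate(model, sentence, few_shot):
    """Stub: few-shot MT with the current model."""
    return f"[{model}] translation of: {sentence}"

def finetune(model, parallel_pairs):
    """Stub: supervised fine-tuning on (source, translation) pairs."""
    return model + "+ft"

def cycle_distill(model, monolingual, few_shot, iterations=3):
    for _ in range(iterations):
        # 1) Synthesize a parallel corpus from monolingual text.
        synthetic = [(src, translate(model, src, few_shot)) for src in monolingual]
        # 2) Fine-tune the very model that generated the corpus.
        model = finetune(model, synthetic)
    return model

mt = cycle_distill("base-llm", ["vākya ek", "vākya do"], few_shot=[("hello", "namaste")])
```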

[145] MDC-R: The Minecraft Dialogue Corpus with Reference

Chris Madge, Maris Camilleri, Paloma Carretero Garcia, Vanja Karan, Juexi Shao, Prashant Jayannavar, Julian Hough, Benjamin Roth, Massimo Poesio

Main category: cs.CL

TL;DR: The paper introduces MDC-R, an annotated version of the Minecraft Dialogue Corpus (MDC) with expert annotations for anaphoric and deictic reference, highlighting its value for linguistic research and referring expression comprehension.

DetailsMotivation: The dynamic, task-oriented, multi-turn dialogues in MDC present interesting linguistic phenomena, motivating the need for expert annotations of reference to enhance its utility.

Method: The authors detail their annotation process for MDC-R, followed by quantitative and qualitative analyses of the corpus, and conduct an experiment to demonstrate its usefulness for referring expression comprehension.

Result: MDC-R is presented as a valuable resource, with the experiment showcasing its practical application in understanding referring expressions.

Conclusion: The annotated MDC-R corpus is a significant contribution to linguistic research, particularly for studying reference in dynamic, situated dialogues.

Abstract: We introduce the Minecraft Dialogue Corpus with Reference (MDC-R). MDC-R is a new language resource that supplements the original Minecraft Dialogue Corpus (MDC) with expert annotations of anaphoric and deictic reference. MDC’s task-orientated, multi-turn, situated dialogue in a dynamic environment has motivated multiple annotation efforts, owing to the interesting linguistic phenomena that this setting gives rise to. We believe it can serve as a valuable resource when annotated with reference, too. Here, we discuss our method of annotation and the resulting corpus, and provide both a quantitative and a qualitative analysis of the data. Furthermore, we carry out a short experiment demonstrating the usefulness of our corpus for referring expression comprehension.

[146] Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Andrew Well, Mia Markey, Ying Ding

Main category: cs.CL

TL;DR: An automated LLM pipeline for thematic analysis of clinical narratives in CHD, reducing manual effort and improving scalability.

DetailsMotivation: Traditional thematic analysis of clinical narratives is labor-intensive and unscalable, especially for complex conditions like CHD.

Method: A multi-agent LLM framework with optional RLHF for automated, scalable thematic analysis.

Result: Enables efficient, patient-centered analysis of large qualitative datasets without manual coding.

Conclusion: The proposed system offers a scalable solution for thematic analysis in clinical contexts, enhancing patient-centered care.

Abstract: Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.

[147] Large Language Models Don’t Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective

Anselm R. Strohmaier, Wim Van Dooren, Kathrin Seßler, Brian Greer, Lieven Verschaffel

Main category: cs.CL

TL;DR: The paper examines LLMs’ ability to solve math word problems, finding they excel at superficial tasks but struggle with real-world context, limiting their educational utility.

DetailsMotivation: To assess how LLMs like ChatGPT can support math education, particularly word-problem solving, and their real-world applicability.

Method: A scoping review with three parts: technical overview, systematic literature review of word-problem corpora, and empirical evaluation of LLMs on word problems.

Result: LLMs perform well on s-problems (lacking real-world context) but falter with contextually complex or nonsensical problems.

Conclusion: LLMs solve word problems superficially without understanding context, reducing their effectiveness as instructional tools in math education.

Abstract: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, o3, and GPT-5 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.

[148] EduCoder: An Open-Source Annotation System for Education Transcript Data

Guanzhong Pan, Mei Tan, Hyunji Nam, Lucía Langlois, James Malamut, Liliana Deonizio, Dorottya Demszky

Main category: cs.CL

TL;DR: EduCoder is a specialized tool for annotating educational dialogue, addressing challenges like complex codebooks, mixed annotation types, and contextualization.

DetailsMotivation: Existing tools lack support for the complexities of coding educational dialogue, such as diverse interactions and pedagogical features.

Method: EduCoder provides a collaborative platform for defining codebooks, supports categorical and open-ended annotations, and includes contextual materials.

Result: The tool enables side-by-side comparison of annotations for reliability and is open-source with a demo available.

Conclusion: EduCoder fills a gap in educational dialogue annotation, offering a reliable and adaptable solution for researchers.

Abstract: We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts – with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson’s purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators’ responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.

[149] WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou

Main category: cs.CL

TL;DR: WSM framework links learning rate decay to model merging, outperforming traditional decay methods with significant performance gains.

DetailsMotivation: To bridge learning rate decay and model merging, offering a decay-free alternative with competitive performance.

Method: Warmup-Stable and Merge (WSM) emulates decay strategies via model averaging, focusing on merge duration as key.

Result: WSM outperforms Warmup-Stable-Decay by +3.5% (MATH), +2.9% (HumanEval), +5.5% (MMLU-Pro).

Conclusion: WSM is a robust, decay-free framework for long-term model refinement, superior to traditional decay approaches.

Abstract: Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies, including cosine decay, linear decay, and inverse square root decay, as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration, the training window for checkpoint aggregation, as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM’s potential for long-term model refinement.
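
A minimal sketch of checkpoint merging over a merge window; checkpoints are plain dicts of arrays, and the cosine-shaped weights are one illustrative way to emulate a decay schedule, not WSM's derivation.

```python
import numpy as np

def merge_checkpoints(checkpoints, weights=None):
    """Weighted parameter average over a window of checkpoints."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    return {name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
            for name in checkpoints[0]}

rng = np.random.default_rng(0)
window = [{"w": rng.normal(size=4)} for _ in range(5)]  # merge duration = 5

uniform = merge_checkpoints(window)                     # plain averaging
cos_w = np.cos(np.linspace(0, np.pi / 2, len(window)))[::-1]
cosine_like = merge_checkpoints(window, list(cos_w / cos_w.sum()))
```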

[150] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Main category: cs.CL

TL;DR: The paper introduces Efficiency Leverage (EL) to predict MoE model capacity, revealing power-law relationships and validating scaling laws with Ling-mini-beta.

DetailsMotivation: Addressing the challenge of predicting MoE model capacity due to decoupled parameters and computational cost.

Method: Conducted a large-scale study with 300+ models, analyzing MoE configurations and introducing EL. Derived scaling laws and validated them with Ling-mini-beta.

Result: EL is driven by expert activation ratio and compute budget, following power laws. Ling-mini-beta matched a 6.1B dense model with 7x fewer resources.

Conclusion: Provides a principled foundation for scaling efficient MoE models, validated by empirical results.

Abstract: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
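
To make the power-law claim concrete, here is a hedged sketch of fitting a law of the form EL = a * ratio^b * compute^c in log space. The functional form follows the abstract's description of power-law dependence on activation ratio and compute budget, but all numbers below are invented for illustration only.

```python
import numpy as np

# Toy measurements (illustrative values only, not the paper's data):
# (expert activation ratio, compute budget) -> observed efficiency leverage.
ratios  = np.array([0.05, 0.10, 0.25, 0.50])
compute = np.array([1e20, 1e20, 1e21, 1e21])
el_obs  = np.array([7.0, 4.5, 3.0, 1.8])

# Fit log EL = log a + b*log(ratio) + c*log(compute) by ordinary least squares.
X = np.column_stack([np.ones_like(ratios), np.log(ratios), np.log(compute)])
coef, *_ = np.linalg.lstsq(X, np.log(el_obs), rcond=None)
log_a, b, c = coef
print(f"a={np.exp(log_a):.3g}, b={b:.3f}, c={c:.3f}")

def predict_el(ratio, budget):
    """Predicted efficiency leverage under the fitted power law."""
    return np.exp(log_a) * ratio**b * budget**c
```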

[151] Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation

Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann

Main category: cs.CL

TL;DR: The paper investigates language-specific neurons in LLMs, revealing their clustering in deeper layers and specialization for non-Latin scripts. It introduces the LAPE method and language arithmetics to manipulate language behavior, improving multilingual task performance.

DetailsMotivation: To understand the neural mechanisms behind language-specific processing in LLMs and develop methods to control language behavior.

Method: Analyzes neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse models using the LAPE method and language arithmetics (activation addition/multiplication).

Result: Identifies language-specific neurons, shows their clustering in deeper layers, and demonstrates successful manipulation for multilingual tasks. Typological similarity enhances effectiveness.

Conclusion: Language-specific neurons can be manipulated to control LLM behavior, with implications for multilingual applications and understanding linguistic representations.

Abstract: Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal “fallback” mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
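
A rough sketch of activation-level "language arithmetics" using PyTorch forward hooks: the neuron indices would come from a LAPE-style analysis, and the hook placement and intervention form are assumptions, not the authors' implementation.

```python
import torch

def steer_language_neurons(model, layer_neurons, add=0.0, mult=1.0):
    """Register forward hooks that scale/shift selected neurons in the outputs
    of given layers, sketching activation addition and multiplication.
    layer_neurons: {layer_module: LongTensor of neuron indices}, a hypothetical
    mapping obtained from a LAPE-style neuron identification pass."""
    handles = []
    for layer, idx in layer_neurons.items():
        def hook(module, inputs, output, idx=idx):
            # e.g. mult=0 deactivates an unwanted language's neurons,
            # add>0 nudges the model toward a desired language.
            output[..., idx] = output[..., idx] * mult + add
            return output
        handles.append(layer.register_forward_hook(hook))
    return handles  # call h.remove() on each handle to undo the intervention
```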

[152] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

Qixuan Hu, Xumou Zhang, Jinman Kim, Florence Bourgeois, Adam G. Dunn

Main category: cs.CL

TL;DR: The paper evaluates methods for predicting serious adverse event (SAE) outcomes in clinical trials using registration data, achieving 77.6% AUC for classification and 18.6% RMSE for regression.

DetailsMotivation: To improve clinical trial design and monitoring by predicting SAE outcomes using pre-trial registration data.

Method: Analyzed 22,107 trials from ClinicalTrials.gov, using transfer learning with pretrained models (ClinicalT5, BioBERT) and a sliding window approach for embedding extraction.

Result: Best model achieved 77.6% AUC for classification and 18.6% RMSE for regression. Sliding window method outperformed direct comparisons.

Conclusion: ClinicalTrials.gov data is underutilized; predicted results can help identify discrepancies in safety outcomes.

Abstract: Objectives: With accurate estimates of expected safety results, clinical trials could be better designed and monitored. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using information only from their registrations prior to the trial. Material and Methods: We analyzed 22,107 two-arm parallel interventional clinical trials from ClinicalTrials.gov with structured summary results. Two prediction models were developed: a classifier predicting whether a greater proportion of participants in an experimental arm would have SAEs (area under the receiver operating characteristic curve; AUC) compared to the control arm, and a regression model to predict the proportion of participants with SAEs in the control arms (root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with a downstream model for prediction. To maintain semantic representation in long trial texts exceeding localized language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC when predicting which trial arm had a higher proportion of SAEs. When predicting SAE proportion in the control arm, the same model achieved RMSE of 18.6%. The sliding window approach consistently outperformed direct comparisons. Across 12 classifiers, the average absolute AUC increase was 2.00%, and absolute RMSE reduction was 1.58% across 12 regressors. Discussion: Summary results data from ClinicalTrials.gov remains underutilized. Predicted results of publicly reported trials provide an opportunity to identify discrepancies between expected and reported safety results.
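
The sliding-window idea can be sketched generically, as below. Here `encode`, the window length, and the stride are placeholders, and the paper may pool window embeddings differently than the mean pooling shown.

```python
import numpy as np

def sliding_window_embedding(tokens, encode, window=512, stride=256):
    """Embed a long token sequence by encoding overlapping windows and
    mean-pooling the per-window embeddings. `encode` is any model mapping
    a token window to a fixed-size vector (a hypothetical stand-in)."""
    if len(tokens) <= window:
        return encode(tokens)
    vecs = [encode(tokens[start:start + window])
            for start in range(0, len(tokens) - window + stride, stride)]
    return np.mean(vecs, axis=0)  # one fixed-size vector for the whole text
```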

[153] Is neural semantic parsing good at ellipsis resolution, or isn’t it?

Xiao Zhang, Johan Bos

Main category: cs.CL

TL;DR: Neural semantic parsers excel in general but struggle with context-sensitive phenomena like verb phrase ellipsis, despite data augmentation improving results.

DetailsMotivation: To evaluate neural semantic parsers' performance on context-sensitive linguistic phenomena, specifically verb phrase ellipsis.

Method: Constructed a corpus of 120 ellipsis cases with resolved meaning representations and tested various neural semantic parsers.

Result: Parsers performed poorly on ellipsis despite high scores on standard tests; data augmentation helped but didn’t fully resolve issues.

Conclusion: Ellipsis parsing is challenging due to linguistically complex contexts, not just semantic copying.

Abstract: Neural semantic parsers have shown good overall performance for a variety of linguistic phenomena, reaching semantic matching scores of more than 90%. But how do such parsers perform on strongly context-sensitive phenomena, where large pieces of semantic information need to be duplicated to form a meaningful semantic representation? A case in point is English verb phrase ellipsis, a construct where entire verb phrases can be abbreviated by a single auxiliary verb. Are these otherwise powerful semantic parsers able to deal with ellipsis, or aren’t they? We constructed a corpus of 120 cases of ellipsis with their fully resolved meaning representation and used this as a challenge set for a large battery of neural semantic parsers. Although these parsers performed very well on the standard test set, they failed in the instances with ellipsis. Data augmentation helped improve the parsing results. The reason for the difficulty of parsing elided phrases is not that copying semantic material is hard, but that elided phrases usually occur in linguistically complicated contexts, which cause most of the parsing errors.

[154] Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models

Muhammed Saeed, Shaina Raza, Ashmal Vayani, Muhammad Abdul-Mageed, Ali Emami, Shady Shehata

Main category: cs.CL

TL;DR: The paper investigates how grammatical gender in languages influences visual representation in Text-to-Image (T2I) models, revealing significant biases in gender representation based on linguistic structure.

DetailsMotivation: To address the overlooked question of how grammatical gender affects visual representation in T2I models, especially in gendered languages where grammatical gender contradicts stereotypical associations.

Method: A cross-linguistic benchmark was created, testing 800 prompts across five gendered and two gender-neutral languages, generating 28,800 images using three T2I models.

Result: Grammatical gender strongly influences image generation, with masculine markers increasing male representation to 73% and feminine markers increasing female representation to 38%, varying by language and model.

Conclusion: Language structure itself shapes AI-generated visuals, adding a new dimension to understanding bias and fairness in multilingual, multimodal AI systems.

Abstract: Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., “une sentinelle” - grammatically feminine in French but referring to the stereotypically masculine concept “guard”). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.

[155] Probing Syntax in Large Language Models: Successes and Remaining Challenges

Pablo J. Diego-Simón, Emmanuel Chemla, Jean-Rémi King, Yair Lakretz

Main category: cs.CL

TL;DR: Structural probes in LLMs reveal syntactic structures but are biased by word proximity, struggle with deep syntax, and are unaffected by word predictability.

DetailsMotivation: To understand if structural and statistical factors systematically affect syntactic representations in LLMs.

Method: Analyzed structural probes on three controlled benchmarks.

Result: Probes are biased by word proximity, challenged by deep syntax, and unaffected by word predictability.

Conclusion: Highlights challenges for structural probes and proposes controlled benchmarks for better evaluation.

Abstract: The syntactic structures of sentences can be readily read out from the activations of large language models (LLMs). However, the “structural probes” that have been developed to reveal this phenomenon are typically evaluated on an indiscriminate set of sentences. Consequently, it remains unclear whether structural and/or statistical factors systematically affect these syntactic representations. To address this issue, we conduct an in-depth analysis of structural probes on three controlled benchmarks. Our results are three-fold. First, structural probes are biased by a superficial property: the closer two words are in a sentence, the more likely structural probes will consider them as syntactically linked. Second, structural probes are challenged by linguistic properties: they poorly represent deep syntactic structures, and suffer interference from interacting nouns or ungrammatical verb forms. Third, structural probes do not appear to be affected by the predictability of individual words. Overall, this work sheds light on the current challenges faced by structural probes and provides a benchmark of controlled stimuli to better evaluate their performance.
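
For readers unfamiliar with the probes being evaluated, here is a minimal sketch of the classic distance-style structural probe (in the spirit of Hewitt and Manning): a learned linear map under which squared L2 distances approximate syntactic tree distances. Dimensions and training details are illustrative, not taken from this paper.

```python
import torch
import torch.nn as nn

class StructuralProbe(nn.Module):
    """Distance probe: project hidden states with a linear map B so that
    squared L2 distances between projected word vectors approximate
    gold syntactic tree distances. Rank and model_dim are illustrative."""
    def __init__(self, model_dim=768, rank=128):
        super().__init__()
        self.proj = nn.Linear(model_dim, rank, bias=False)

    def forward(self, h):                      # h: (seq_len, model_dim)
        z = self.proj(h)                       # (seq_len, rank)
        diff = z.unsqueeze(0) - z.unsqueeze(1)
        return (diff ** 2).sum(-1)             # (seq_len, seq_len) predicted distances

# Training typically minimizes |predicted - gold tree distance| over word pairs.
```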

[156] An Entity Linking Agent for Question Answering

Yajie Luo, Yihong Wu, Muzhi Li, Fengran Mo, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie

Main category: cs.CL

TL;DR: A QA-focused entity linking agent using a Large Language Model to improve accuracy in short, ambiguous questions.

DetailsMotivation: Existing EL methods perform poorly on short, ambiguous questions in QA tasks.

Method: Proposes an agent that identifies mentions, retrieves candidates, and makes decisions using a Large Language Model.

Result: Experiments confirm the agent’s robustness and effectiveness in tool-based EL and QA tasks.

Conclusion: The agent enhances EL performance in QA systems for short, ambiguous questions.

Abstract: Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes linking decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

[157] Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations

Aditya Kishore, Gaurav Kumar, Jasabanta Patro

Main category: cs.CL

TL;DR: Proposes MultiCheck, a unified framework for multimodal fact verification, combining text and image encoders with cross-modal fusion, achieving strong performance on Factify 2.

DetailsMotivation: Address challenges in fact-checking multimodal misinformation by integrating textual and visual evidence.

Method: Uses dedicated text and image encoders with a fusion module for cross-modal reasoning and contrastive learning for semantic alignment.

Result: Achieves a weighted F1 score of 0.84 on Factify 2, outperforming baselines.

Conclusion: Demonstrates effectiveness of explicit multimodal reasoning for scalable, interpretable fact-checking.

Abstract: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we propose a unified framework for fine-grained multimodal fact verification called “MultiCheck”, designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
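
A hedged sketch of what an element-wise fusion module plus contrastive alignment objective might look like; the interaction features, dimensions, class count, and temperature are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Fuse text and image embeddings via element-wise interactions
    (concatenating the vectors, their product, and their difference),
    then classify claim veracity."""
    def __init__(self, dim=768, n_classes=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, t, v):                     # t, v: (batch, dim)
        fused = torch.cat([t, v, t * v, t - v], dim=-1)
        return self.head(fused)

def contrastive_alignment(t, v, temperature=0.07):
    """InfoNCE-style loss pulling matched claim-evidence pairs together
    in the shared latent space and pushing mismatched pairs apart."""
    t, v = F.normalize(t, dim=-1), F.normalize(v, dim=-1)
    logits = t @ v.T / temperature
    labels = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, labels)
```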

[158] LAG: Logic-Augmented Generation from a Cartesian Perspective

Yilin Xiao, Chuang Zhou, Qinggang Zhang, Su Dong, Shengyuan Chen, Xiao Huang

Main category: cs.CL

TL;DR: The paper introduces Logic-Augmented Generation (LAG), a method to improve LLMs’ reasoning by decomposing questions into logical sub-questions and resolving them sequentially.

DetailsMotivation: LLMs struggle with knowledge-intensive tasks and hallucinate in specialized domains. RAG helps but lacks structured reasoning.

Method: LAG decomposes questions into atomic sub-questions, resolves them in order, and uses logical termination to prevent errors.

Result: LAG improves reasoning robustness, reduces hallucinations, and aligns LLMs with human cognition.

Conclusion: LAG offers a principled alternative to RAG, enhancing LLM performance in complex reasoning tasks.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
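
The decompose-and-resolve control flow can be sketched in a few lines. Every callable below is a hypothetical stand-in for an LLM or retriever call, not the authors' implementation.

```python
def logic_augmented_generation(question, decompose, retrieve, answer, is_answerable):
    """Sketch of the LAG control flow: decompose into dependency-ordered
    sub-questions, resolve them sequentially, and halt on unanswerable steps."""
    sub_questions = decompose(question)          # atomic, dependency-ordered
    resolved = []
    for sq in sub_questions:
        context = retrieve(sq, prior=resolved)   # prior answers guide retrieval
        if not is_answerable(sq, context):
            return None, resolved                # logical termination mechanism
        resolved.append((sq, answer(sq, context)))
    # Synthesize all sub-resolutions into a verified final response.
    return answer(question, resolved), resolved
```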

[159] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tan

Main category: cs.CL

TL;DR: MathSmith is a framework for synthesizing challenging math problems to enhance LLM reasoning, outperforming baselines by ensuring diversity, difficulty, and avoiding data contamination.

DetailsMotivation: The scarcity of high-quality, high-difficulty training data limits LLM progress in mathematical reasoning. Existing methods lack diversity and scalability.

Method: MathSmith constructs problems from scratch using PlanetMath, employs predefined strategies for difficulty, and uses reinforcement learning to optimize validity, complexity, and consistency.

Result: MathSmith outperforms baselines across five benchmarks, showing scalability and generalization.

Conclusion: MathSmith demonstrates the potential of synthetic high-difficulty data to advance LLM reasoning capabilities.

Abstract: Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationale generation. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.

cs.CV

[160] Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG

Rakesh Raj Madavan, Akshat Kaimal, Hashim Faisal, Chandrakala S

Main category: cs.CV

TL;DR: BIND and Med-GRIM improve medical VQA with dense encodings and modular workflows, achieving efficiency and accuracy without heavy fine-tuning.

DetailsMotivation: Existing VQA models lack precision for domain-specific tasks like medical VQA, requiring more efficient and accurate solutions.

Method: BIND refines joint embeddings with dense encodings, while Med-GRIM uses graph-based retrieval and prompt engineering with SLMs.

Result: Med-GRIM achieves high performance at low computational cost, supported by the DermaGraph dataset for scalable research.

Conclusion: The proposed methods enhance medical VQA efficiency and accuracy, with potential for broader applications.

Abstract: An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: https://github.com/Rakesh-123-cryp/Med-GRIM.git

[161] DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation

He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su, Xiangqian Wu

Main category: cs.CV

TL;DR: DiTalker is a unified DiT-based framework for portrait animation, focusing on dynamic styles like head movements and lip sync, outperforming existing methods.

DetailsMotivation: Existing methods overlook dynamic styles (e.g., head movements) and rely on inefficient dual U-Net architectures, prompting the need for a more unified and efficient solution.

Method: DiTalker uses a Style-Emotion Encoding Module (two branches for style and emotion) and an Audio-Style Fusion Module (parallel cross-attention layers) to decouple audio and styles. It also includes optimization constraints for lip sync and detail preservation.

Result: DiTalker excels in lip synchronization and speaking style controllability, as shown in extensive experiments.

Conclusion: DiTalker offers a superior, unified approach for portrait animation, addressing dynamic styles and computational efficiency.

Abstract: Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (e.g., emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods primarily focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To this end, we propose DiTalker, a unified DiT-based framework for speaking style-controllable portrait animation. We design a Style-Emotion Encoding Module that employs two separate branches: a style branch extracting identity-specific style information (e.g., head poses and movements), and an emotion branch extracting identity-agnostic emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers, using these features to guide the animation process. To enhance the quality of results, we adopt and modify two optimization constraints: one to improve lip synchronization and the other to preserve fine-grained identity and background details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability. Project Page: https://thenameishope.github.io/DiTalker/

[162] StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback

Hongbo Ma, Fei Shen, Hongbin Xu, Xiaoce Wang, Gang Xu, Jinkai Zheng, Liangqiong Qu, Ming Li

Main category: cs.CV

TL;DR: StyleTailor is a collaborative agent framework for personalized fashion styling, integrating design, recommendation, virtual try-on, and evaluation, enhanced by iterative feedback.

DetailsMotivation: Personalized fashion styling is underexplored despite its potential to improve shopping experiences.

Method: StyleTailor uses two core agents (Designer and Consultant) with hierarchical vision-language feedback and negative prompts for refinement.

Result: Outperforms baselines in personalized designs and recommendations, setting a new benchmark.

Conclusion: StyleTailor advances intelligent fashion systems with adaptive user alignment and comprehensive evaluation.

Abstract: The advancement of intelligent agents has revolutionized problem-solving across diverse domains, yet personalized fashion styling, which holds immense promise for improving shopping experiences, remains underexplored. In this work, we present StyleTailor, the first collaborative agent framework that seamlessly unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation into a cohesive workflow. To this end, StyleTailor pioneers an iterative visual refinement paradigm driven by multi-level negative feedback, enabling adaptive and precise user alignment. Specifically, our framework features two core agents, i.e., Designer for personalized garment selection and Consultant for virtual try-on, whose outputs are progressively refined via hierarchical vision-language model feedback spanning individual items, complete outfits, and try-on efficacy. Counterexamples are aggregated into negative prompts, forming a closed-loop mechanism that enhances recommendation quality. To assess the performance, we introduce a comprehensive evaluation suite encompassing style consistency, visual quality, face similarity, and artistic appraisal. Extensive experiments demonstrate StyleTailor’s superior performance in delivering personalized designs and recommendations, outperforming strong baselines without negative feedback and establishing a new benchmark for intelligent fashion systems.

[163] BigTokDetect: A Clinically-Informed Vision-Language Model Framework for Detecting Pro-Bigorexia Videos on TikTok

Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross Sonnenblick, Magdalayna Curry, Laura D’Adamo, Lindsay Young, Stuart B Murray, Kristina Lerman

Main category: cs.CV

TL;DR: The paper introduces BigTokDetect, a framework for detecting pro-bigorexia content on TikTok, using a multimodal approach to overcome limitations of text-based detection.

DetailsMotivation: Social media struggles to detect harmful pro-bigorexia content, which evades traditional detection methods by mimicking fitness content.

Method: Developed BigTokDetect using a clinically-annotated dataset (BigTok) of 2,200 TikTok videos, evaluated with vision-language models and multimodal fusion.

Result: Achieved an accuracy of 0.829 for primary category classification and 0.690 for subcategories, with multimodal fusion improving performance by 5-10%.

Conclusion: BigTokDetect sets new benchmarks for multimodal harmful content detection, offering scalable moderation tools for mental health domains.

Abstract: Social media platforms increasingly struggle to detect harmful content that promotes muscle dysmorphic behaviors, particularly pro-bigorexia content that disproportionately affects adolescent males. Unlike traditional eating disorder detection focused on the “thin ideal,” pro-bigorexia material masquerades as legitimate fitness content through complex multimodal combinations of visual displays, coded language, and motivational messaging that evade text-based detection systems. We address this challenge by developing BigTokDetect, a clinically-informed detection framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal dataset of over 2,200 TikTok videos labeled by clinical psychologists and psychiatrists across five primary categories spanning body image, nutrition, exercise, supplements, and masculinity. Through a comprehensive evaluation of state-of-the-art vision language models, we achieve an accuracy of 0.829 on primary category classification and 0.690 on subcategory detection via domain-specific finetuning. Our ablation studies demonstrate that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signals. These findings establish new benchmarks for multimodal harmful content detection and provide both the computational tools and methodological framework needed for scalable content moderation in specialized mental health domains.

[164] Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation

Haoran Xi, Chen Liu, Xiaolin Li

Main category: cs.CV

TL;DR: FPGM introduces a frequency-based augmentation framework for polyp segmentation, improving cross-domain generalization by leveraging consistent edge frequency signatures.

DetailsMotivation: Addressing the challenge of limited annotated data and domain shift in polyp segmentation, which hinders model robustness.

Method: FPGM uses a two-stage process: learning a domain-invariant frequency prior from labeled polyp edges, then aligning unlabeled images’ amplitude spectra with this prior while preserving phase.

Result: Achieves state-of-the-art performance on six datasets, with over 10% Dice score improvement in zero-shot scenarios.

Conclusion: FPGM enhances cross-domain robustness, offering a clinically viable solution for polyp segmentation with limited supervision.

Abstract: Automated polyp segmentation is essential for early diagnosis of colorectal cancer, yet developing robust models remains challenging due to limited annotated data and significant performance degradation under domain shift. Although semi-supervised learning (SSL) reduces annotation requirements, existing methods rely on generic augmentations that ignore polyp-specific structural properties, resulting in poor generalization to new imaging centers and devices. To address this, we introduce Frequency Prior Guided Matching (FPGM), a novel augmentation framework built on a key discovery: polyp edges exhibit a remarkably consistent frequency signature across diverse datasets. FPGM leverages this intrinsic regularity in a two-stage process. It first learns a domain-invariant frequency prior from the edge regions of labeled polyps. Then, it performs principled spectral perturbations on unlabeled images, aligning their amplitude spectra with this learned prior while preserving phase information to maintain structural integrity. This targeted alignment normalizes domain-specific textural variations, thereby compelling the model to learn the underlying, generalizable anatomical structure. Validated on six public datasets, FPGM establishes a new state-of-the-art against ten competing methods. It demonstrates exceptional zero-shot generalization capabilities, achieving over 10% absolute gain in Dice score in data-scarce scenarios. By significantly enhancing cross-domain robustness, FPGM presents a powerful solution for clinically deployable polyp segmentation under limited supervision.
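
The core spectral operation (align an image's Fourier amplitude toward a learned prior while keeping its phase) is easy to sketch with NumPy. The blending factor and the shape of the prior are assumptions; the paper's exact perturbation scheme may differ.

```python
import numpy as np

def align_amplitude(image, prior_amplitude, alpha=0.5):
    """FPGM-style spectral perturbation, sketched: blend an image's amplitude
    spectrum toward a domain-invariant prior while preserving phase, which
    carries the structural (anatomical) information.
    `prior_amplitude`: assumed amplitude map of the same spatial shape.
    `alpha`: perturbation strength in [0, 1]."""
    spectrum = np.fft.fft2(image, axes=(0, 1))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    mixed = (1 - alpha) * amplitude + alpha * prior_amplitude
    perturbed = mixed * np.exp(1j * phase)       # keep phase, swap amplitude
    return np.real(np.fft.ifft2(perturbed, axes=(0, 1)))
```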

[165] Large Language Models Facilitate Vision Reflection in Image Classification

Guoyuan An, JaeYoon Kim, SungEui Yoon

Main category: cs.CV

TL;DR: The paper explores how vision reflection in LMMs improves recognition accuracy, analyzes their internal behavior, and suggests training-free methods for better performance.

DetailsMotivation: To understand and improve the explainability and performance of vision reflection in large multimodal models (LMMs).

Method: Prompting LMMs to verify predictions, analyzing vision-language connectors, and testing training-free connectors.

Result: Improved accuracy, reliance on textual representations, and enhanced fine-grained recognition without training.

Conclusion: Vision reflection is a robust and interpretable strategy for visual recognition in LMMs.

Abstract: This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition accuracy, even on benchmarks like ImageNet, despite prior evidence that LMMs typically underperform dedicated vision encoders. Second, we analyze the internal behavior of vision reflection and find that the vision-language connector maps visual features into explicit textual concepts, allowing the language model to reason about prediction plausibility using commonsense knowledge. We further observe that replacing a large number of vision tokens with only a few text tokens still enables LLaVA to generate similar answers, suggesting that LMMs may rely primarily on a compact set of distilled textual representations rather than raw vision features. Third, we show that a training-free connector can enhance LMM performance in fine-grained recognition tasks, without extensive feature-alignment training. Together, these findings offer new insights into the explainability of vision-language models and suggest that vision reflection is a promising strategy for achieving robust and interpretable visual recognition.

[166] A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition

Xiuliang Zhang, Tadiwa Elisha Nyamasvisva, Chuntao Liu

Main category: cs.CV

TL;DR: A hybrid framework combining 3D CNN and Transformer improves video-based behavior recognition by leveraging local spatiotemporal features and global dependencies.

DetailsMotivation: Traditional 3D CNNs struggle with long-range dependencies, while Transformers face high computational costs. A hybrid approach aims to address these limitations.

Method: The proposed model integrates 3D CNN for low-level spatiotemporal features and Transformer for long-range dependencies, with a fusion mechanism.

Result: The hybrid model outperforms standalone 3D CNNs and Transformers in accuracy while maintaining manageable complexity.

Conclusion: The hybrid framework effectively combines the strengths of 3D CNN and Transformer, offering a scalable solution for behavior recognition.

Abstract: Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle with modeling long-range dependencies. Conversely, Transformers excel at learning global contextual information but face challenges with high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNN and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.
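
A minimal PyTorch sketch of such a hybrid: a small 3D CNN extracts per-frame spatiotemporal features, which a Transformer encoder then models over time. All sizes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Hybrid3DCNNTransformer(nn.Module):
    """3D CNN for local spatiotemporal features, Transformer encoder for
    long-range temporal dependencies; a sketch with illustrative sizes."""
    def __init__(self, n_classes=10, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)))   # keep time axis, pool space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, clip):                       # clip: (B, 3, T, H, W)
        feats = self.cnn(clip).flatten(2)          # (B, dim, T)
        tokens = feats.transpose(1, 2)             # (B, T, dim)
        fused = self.transformer(tokens).mean(1)   # temporal pooling
        return self.head(fused)

# logits = Hybrid3DCNNTransformer()(torch.randn(2, 3, 16, 112, 112))
```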

[167] Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images

Qi Xun Yeo, Yanyan Li, Gim Hee Lee

Main category: cs.CV

TL;DR: The paper proposes a method for 3D semantic scene graph estimation using only multi-view RGB images, overcoming noise in reconstructed geometry and background features by enriching node and edge features with semantic and spatial information.

DetailsMotivation: To address the challenge of 3D semantic scene graph estimation without ground truth 3D annotations, leveraging multi-view RGB images to predict objects, predicates, and relationships.

Method: Uses semantic masks to filter background noise, incorporates neighboring node information for robustness, and refines predictions with statistical priors from training data.

Result: Outperforms current methods relying solely on multi-view images as input.

Conclusion: The approach demonstrates effectiveness in 3D scene graph estimation without 3D ground truth, highlighting the potential of multi-view RGB images for this task.

Abstract: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information drawn from neighboring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighboring node information to aid robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our project page is available at https://qixun1.github.io/projects/SCRSSG.

[168] RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving

Jiayuan Wang, Q. M. Jonathan Wu, Katsuya Suto, Ning Zhang

Main category: cs.CV

TL;DR: RMT-PPAD is a real-time, transformer-based multi-task model for autonomous driving, excelling in object detection, drivable area segmentation, and lane line segmentation with state-of-the-art performance.

DetailsMotivation: To address the need for precision and real-time performance in panoptic driving perception by reducing negative transfer between tasks and avoiding manual design of task-specific structures.

Method: Proposes a lightweight gate control with an adapter for adaptive feature fusion, an adaptive segmentation decoder for multi-scale feature learning, and resolves label inconsistency in lane line segmentation.

Result: Achieves mAP50 of 84.9%, Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% with 84.7% accuracy for lane line segmentation at 32.6 FPS.

Conclusion: RMT-PPAD delivers stable, high-performance results in real-world scenarios, with open-source code and models available.

Abstract: Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/RMT-PPAD.

[169] What Makes “Good” Distractors for Object Hallucination Evaluation in Large Vision-Language Models?

Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, Masashi Sugiyama

Main category: cs.CV

TL;DR: The paper introduces HOPE, a new benchmark to rigorously assess object hallucination in LVLMs by generating misleading distractors, outperforming the existing POPE benchmark.

DetailsMotivation: Existing benchmarks like POPE are ineffective for evaluating object hallucination in advanced LVLMs due to simplistic sampling strategies.

Method: HOPE uses content-aware hallucination searching (leveraging CLIP) and description-based hallucination searching to create misleading distractors.

Result: HOPE causes a precision drop of 9-23% in LVLMs, significantly outperforming POPE.

Conclusion: HOPE provides a more rigorous assessment of LVLM hallucination vulnerabilities.

Abstract: Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite the great advances in LVLMs, they still suffer from the object hallucination issue, which tends to generate objects inconsistent with the image content. The most commonly used Polling-based Object Probing Evaluation (POPE) benchmark evaluates this issue by sampling negative categories according to category-level statistics, e.g., category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, aiming to generate the most misleading distractors (i.e., non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, which serves as a means to more rigorously assess their immunity to hallucination. To explore the image-specific information, the content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, the description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at https://github.com/xiemk/HOPE.
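
A hedged sketch of the content-aware search: score absent object categories against the image with CLIP and keep the highest-scoring ones as distractors. The model choice, prompt template, and k are illustrative, not the paper's settings.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def hardest_distractors(image, candidate_objects, present_objects, k=3):
    """Select the k absent objects that CLIP scores as most likely for the
    image; these act as maximally misleading distractors (a sketch)."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    absent = [o for o in candidate_objects if o not in present_objects]
    prompts = [f"a photo of a {obj}" for obj in absent]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(0)
    top = scores.topk(min(k, len(absent))).indices
    return [absent[i] for i in top]
```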

[170] SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work

Harry Walsh, Ed Fish, Ozge Mercanoglu Sincan, Mohamed Ilyes Lakhal, Richard Bowden, Neil Fox, Bencie Woll, Kepeng Wu, Zecheng Li, Weichao Zhao, Haodong Wang, Wengang Zhou, Houqiang Li, Shengeng Tang, Jiayi He, Xu Wang, Ruobei Zhang, Yaxiong Wang, Lechao Cheng, Meryem Tasyurek, Tugce Kiziltepe, Hacer Yalim Keles

Main category: cs.CV

TL;DR: The paper introduces the first Sign Language Production Challenge to standardize evaluation metrics for SLP, using the RWTH-PHOENIX-Weather-2014T dataset. The winning method employed a retrieval-based framework and pre-trained language model.

DetailsMotivation: The lack of standardized evaluation metrics for Sign Language Production (SLP) hinders meaningful comparisons across systems.

Method: The challenge evaluated Text-to-Pose (T2P) translation architectures using a German Sign Language dataset and a custom hidden test set.

Result: The top-performing team achieved BLEU-1 scores of 31.40 and DTW-MJE of 0.0574.

Conclusion: The challenge and released evaluation network aim to establish a consistent baseline for future SLP research.

Abstract: Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition’s aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language - Deutsche Gebärdensprache (DGS) weather broadcast dataset. In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving BLEU-1 scores of 31.40 and DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.

[171] OpenHAIV: A Framework Towards Practical Open-World Learning

Xiang Xiang, Qinhao Zhou, Zhuo Xu, Jing Ma, Jiaxin Dai, Yifan Liang, Hanlin Li

Main category: cs.CV

TL;DR: OpenHAIV integrates OOD detection, new class discovery, and incremental fine-tuning for autonomous knowledge updates in open-world recognition.

DetailsMotivation: Existing methods like OOD detection and incremental learning are limited in open-world scenarios, lacking knowledge updates and requiring supervised conditions.

Method: Proposes OpenHAIV, a unified framework combining OOD detection, new class discovery, and incremental continual fine-tuning.

Result: Enables models to autonomously acquire and update knowledge in open-world environments.

Conclusion: OpenHAIV addresses limitations of current approaches, offering a practical solution for open-world recognition.

Abstract: Substantial progress has been made in various techniques for open-world recognition. Out-of-distribution (OOD) detection methods can effectively distinguish between known and unknown classes in the data, while incremental learning enables continuous model knowledge updates. However, in open-world scenarios, these approaches still face limitations. Relying solely on OOD detection does not facilitate knowledge updates in the model, and incremental fine-tuning typically requires supervised conditions, which significantly deviate from open-world settings. To address these challenges, this paper proposes OpenHAIV, a novel framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning into a unified pipeline. This framework allows models to autonomously acquire and update knowledge in open-world environments. The proposed framework is available at https://haiv-lab.github.io/openhaiv.

[172] Benchmarking Deep Learning-Based Object Detection Models on Feature Deficient Astrophotography Imagery Dataset

Shantanusinh Parmar

Main category: cs.CV

TL;DR: Benchmarking object detection models on MobilTelesco, a sparse night-sky dataset, reveals challenges in feature-deficient conditions compared to common datasets like ImageNet.

DetailsMotivation: Existing datasets (e.g., ImageNet, COCO) focus on everyday objects and lack signal sparsity, limiting their applicability to non-commercial domains like astrophotography.

Method: The study uses MobilTelesco, a smartphone-based astrophotography dataset, to evaluate object detection models under sparse feature conditions.

Result: The benchmarking highlights the difficulties detection models face in feature-deficient scenarios, such as those in night-sky images.

Conclusion: MobilTelesco fills a gap in datasets for sparse signal domains, revealing limitations of current models in such environments.

Abstract: Object detection models are typically trained on datasets like ImageNet, COCO, and PASCAL VOC, which focus on everyday objects. However, these lack signal sparsity found in non-commercial domains. MobilTelesco, a smartphone-based astrophotography dataset, addresses this by providing sparse night-sky images. We benchmark several detection models on it, highlighting challenges under feature-deficient conditions.

[173] Novel View Synthesis with Gaussian Splatting: Impact on Photogrammetry Model Accuracy and Resolution

Pranav Chougule

Main category: cs.CV

TL;DR: A study comparing Photogrammetry and Gaussian Splatting for 3D model reconstruction and view synthesis, evaluating performance with metrics like SSIM, PSNR, and LPIPS. A modified Gaussian Splatting repository was developed to enhance novel view synthesis.

DetailsMotivation: To compare and evaluate the effectiveness of Photogrammetry and Gaussian Splatting for 3D reconstruction and view synthesis, aiming to improve photogrammetry using novel views from Gaussian Splatting.

Method: Created a dataset of real-world images, built 3D models using both techniques, and evaluated them using SSIM, PSNR, LPIPS, and resolution metrics. Enhanced Gaussian Splatting for novel view rendering in Blender.

Result: Gaussian Splatting showed potential to improve photogrammetry by generating high-quality novel views. The augmented dataset with synthesized views improved photogrammetry model quality.

Conclusion: The study highlights the strengths of both methods, suggesting Gaussian Splatting can enhance photogrammetry, with applications in XR, autonomous vehicles, and simulations.

Abstract: In this paper, I present a comprehensive study comparing Photogrammetry and Gaussian Splatting techniques for 3D model reconstruction and view synthesis. I created a dataset of images from a real-world scene and constructed 3D models using both methods. To evaluate the performance, I compared the models using structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and lp/mm resolution based on the USAF resolution chart. A significant contribution of this work is the development of a modified Gaussian Splatting repository, which I forked and enhanced to enable rendering images from novel camera poses generated in the Blender environment. This innovation allows for the synthesis of high-quality novel views, showcasing the flexibility and potential of Gaussian Splatting. My investigation extends to an augmented dataset that includes both original ground images and novel views synthesized via Gaussian Splatting. This augmented dataset was employed to generate a new photogrammetry model, which was then compared against the original photogrammetry model created using only the original images. The results demonstrate the efficacy of using Gaussian Splatting to generate novel high-quality views and its potential to improve photogrammetry-based 3D reconstructions. The comparative analysis highlights the strengths and limitations of both approaches, providing valuable information for applications in extended reality (XR), photogrammetry, and autonomous vehicle simulations. Code is available at https://github.com/pranavc2255/gaussian-splatting-novel-view-render.git.

[174] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

Jinghan Yu, Zhiyuan Ma, Yue Ma, Kaiqi Liu, Yuhan Wang, Jianjun Li

Main category: cs.CV

TL;DR: The paper introduces MILD, a method for multi-IP human erasing, addressing challenges like occlusions and background interference with a new dataset and diffusion-based approach.

DetailsMotivation: Existing methods struggle with complex multi-IP scenarios due to dataset limitations and lack of spatial decoupling.

Method: Proposes Multi-Layer Diffusion (MILD) for semantic decoupling and Human Morphology Guidance for better human-centric understanding.

Result: MILD outperforms state-of-the-art methods on human erasing benchmarks.

Conclusion: MILD effectively addresses multi-IP challenges and improves human erasing performance.

Abstract: Recent years have witnessed the success of diffusion models in image-customized tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle under complex multi-IP scenarios involving human-human occlusions, human-object entanglements, and background interferences. These challenges are mainly due to: 1) Dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; 2) Lack of spatial decoupling, where foreground instances cannot be effectively disentangled, limiting clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, integrating pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.

[175] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation

Sihan Zhao, Zixuan Wang, Tianyu Luan, Jia Jia, Wentao Zhu, Jiebo Luo, Junsong Yuan, Nan Xi

Main category: cs.CV

TL;DR: The paper introduces PP-Motion, a data-driven metric for evaluating human motion fidelity by combining physical and perceptual alignment, outperforming prior methods.

DetailsMotivation: Existing motion fidelity evaluation methods rely on subjective human perception or physical constraints, lacking a robust, objective metric.

Method: A physical labeling method calculates minimal modifications for motion to align with physical laws, producing fine-grained annotations. PP-Motion is trained using Pearson’s correlation loss and perceptual fidelity loss.

Result: PP-Motion aligns better with both physical laws and human perception than previous methods.

Conclusion: PP-Motion provides a more objective and comprehensive evaluation of motion fidelity by integrating physical and perceptual alignment.

Abstract: Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson’s correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
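
Since the metric is trained with Pearson’s correlation loss, a compact sketch may help make the objective concrete: the loss rewards predictions that co-vary with the physical-alignment annotations rather than matching them exactly. The batch size and variable names below are illustrative, not the paper’s.

```python
# Hedged sketch of a Pearson-correlation loss for metric training (PyTorch).
import torch

def pearson_correlation_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1 - Pearson r between predicted scores and fidelity annotations."""
    pred = pred - pred.mean()
    target = target - target.mean()
    r = (pred * target).sum() / (pred.norm() * target.norm() + 1e-8)
    return 1.0 - r

scores = torch.randn(32, requires_grad=True)   # predicted fidelity for a batch
annotations = torch.randn(32)                  # continuous physical-alignment labels
loss = pearson_correlation_loss(scores, annotations)
loss.backward()
```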

[176] Slice or the Whole Pie? Utility Control for AI Models

Ye Tao

Main category: cs.CV

TL;DR: NNObfuscator enables dynamic performance adaptation in AI models, eliminating the need for multiple versions by allowing real-time adjustments based on user tiers.

DetailsMotivation: Addressing the inefficiency of training and maintaining multiple model versions for diverse performance requirements.

Method: Proposes NNObfuscator, a utility control mechanism for dynamic performance modification in AI models, tested on tasks like image classification and text-to-image generation.

Result: NNObfuscator successfully adapts a single model to varied tasks, improving resource allocation and reducing unnecessary computation.

Conclusion: NNObfuscator offers a sustainable and efficient solution for AI deployment, supporting tiered access and reducing infrastructure overhead.

Abstract: Training deep neural networks (DNNs) has become an increasingly resource-intensive task, requiring large volumes of labeled data, substantial computational power, and considerable fine-tuning efforts to achieve optimal performance across diverse use cases. Although pre-trained models offer a useful starting point, adapting them to meet specific user needs often demands extensive customization and infrastructure overhead. This challenge grows when a single model must support diverse applications with differing performance requirements. Traditional solutions often involve training multiple model versions to meet varying requirements, which can be inefficient and difficult to maintain. To overcome this challenge, we propose NNObfuscator, a novel utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. Unlike traditional methods that need separate models for each user tier, NNObfuscator allows a single model to be adapted in real time, providing controlled access to multiple levels of performance. This mechanism enables model owners to set up tiered access, ensuring that free-tier users receive a baseline level of performance while premium users benefit from enhanced capabilities. The approach improves resource allocation, reduces unnecessary computation, and supports sustainable business models in AI deployment. To validate our approach, we conducted experiments on multiple tasks, including image classification, semantic segmentation, and text-to-image generation, using well-established models such as ResNet, DeepLab, VGG16, FCN and Stable Diffusion. Experimental results show that NNObfuscator successfully makes a model more adaptable, so that a single trained model can handle a broad range of tasks without requiring extensive changes.

[177] Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection

Unisha Joshi

Main category: cs.CV

TL;DR: The paper introduces an age-diverse deepfake dataset to mitigate age-specific bias, improving fairness and accuracy in deepfake detection models.

DetailsMotivation: Addressing demographic bias, particularly age-specific bias, in deepfake datasets to enhance fairness in detection models.

Method: Constructing an age-diverse dataset using existing datasets (Celeb-DF, FaceForensics++, UTKFace) and synthetic data, evaluated with XceptionNet, EfficientNet, and LipForensics.

Result: Models trained on the age-diverse dataset showed fairer performance across age groups, improved accuracy, and better generalization.

Conclusion: The study provides a fairness-aware dataset and pipeline for future research in fairer deepfake detection, with open access to the dataset and code.

Abstract: The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in the deepfake dataset remains largely unaddressed. This paper focuses on the mitigation of age-specific bias in the deepfake dataset by introducing an age-diverse deepfake dataset that will improve fairness across age groups. The dataset is constructed through a modular pipeline incorporating the existing Celeb-DF, FaceForensics++, and UTKFace datasets, and the creation of synthetic data to fill the age distribution gaps. The effectiveness and generalizability of this dataset are evaluated using three deepfake detection models: XceptionNet, EfficientNet, and LipForensics. Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. The complete dataset and implementation code are available at https://github.com/unishajoshi/age-diverse-deepfake-detection.
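
Of the reported metrics, EER is the least self-explanatory; a small sketch of its computation, assuming scikit-learn and synthetic scores in place of real detector outputs, is shown below.

```python
# Hedged sketch: equal error rate (EER) from detector scores via the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER is the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([0, 0, 1, 1, 1, 0])              # synthetic ground truth
scores = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3])  # synthetic fake-probabilities
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```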

[178] Static and Plugged: Make Embodied Evaluation Simple

Jiahao Xiao, Jianbo Zhang, BoWen Yan, Shengyu Guo, Tongrui Ye, Kaiwei Zhang, Zicheng Zhang, Xiaohong Liu, Zhengxue Cheng, Lei Fan, Chuyi Li, Guangtao Zhai

Main category: cs.CV

TL;DR: StaticEmbodiedBench is introduced as a scalable, unified benchmark for embodied intelligence evaluation using static scenes, covering 42 scenarios and 8 dimensions. It evaluates 19 VLMs and 11 VLAs, releasing 200 samples to aid development.

DetailsMotivation: Current benchmarks for embodied intelligence are costly, fragmented, and hard to scale, necessitating a more efficient and unified evaluation method.

Method: The paper introduces StaticEmbodiedBench, a plug-and-play benchmark using static scene representations for scalable and comprehensive assessment.

Result: The benchmark evaluates 19 VLMs and 11 VLAs, establishing the first unified static leaderboard for embodied intelligence.

Conclusion: StaticEmbodiedBench offers a scalable, cost-effective solution for embodied intelligence evaluation, with released samples to foster further development.

Abstract: Embodied intelligence is advancing rapidly, driving the need for efficient evaluation. Current benchmarks typically rely on interactive simulated environments or real-world setups, which are costly, fragmented, and hard to scale. To address this, we introduce StaticEmbodiedBench, a plug-and-play benchmark that enables unified evaluation using static scene representations. Covering 42 diverse scenarios and 8 core dimensions, it supports scalable and comprehensive assessment through a simple interface. Furthermore, we evaluate 19 Vision-Language Models (VLMs) and 11 Vision-Language-Action models (VLAs), establishing the first unified static leaderboard for embodied intelligence. Moreover, we release a subset of 200 samples from our benchmark to accelerate the development of embodied intelligence.

[179] From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets

Sarina Penquitt, Jonathan Klees, Rinor Cakaj, Daniel Kondermann, Matthias Rottmann, Lars Schmarje

Main category: cs.CV

TL;DR: The paper introduces REC✓D, a semi-automated framework for correcting label errors in object detection datasets, validated on the KITTI dataset, revealing significant annotation inaccuracies and proposing a new benchmark for further research.

DetailsMotivation: Label errors in object detection datasets compromise training and evaluation quality, but existing methods lack systemic and scalable solutions.

Method: REC✓D combines error detection proposals with crowd-sourced microtasks for verification and aggregation to improve label quality.

Result: Applied to KITTI’s pedestrian class, REC✓D identified 24% missing/inaccurate annotations, creating a benchmark. Current methods miss up to 66% of errors.

Conclusion: The framework improves label quality but highlights the need for better error detection methods, enabled by the released benchmark.

Abstract: Object detection has advanced rapidly in recent years, driven by increasingly large and diverse datasets. However, label errors, defined as missing labels, incorrect classification or inaccurate localization, often compromise the quality of these datasets. This can have a significant impact on the outcomes of training and benchmark evaluations. Although several methods now exist for detecting label errors in object detection datasets, they are typically validated only on synthetic benchmarks or limited manual inspection. How to correct such errors systematically and at scale therefore remains an open problem. We introduce a semi-automated framework for label-error correction called REC✓D (Rechecked). Building on existing detectors, the framework pairs their error proposals with lightweight, crowd-sourced microtasks. These tasks enable multiple annotators to independently verify each candidate bounding box, and their responses are aggregated to estimate ambiguity and improve label quality. To demonstrate the effectiveness of REC✓D, we apply it to the class pedestrian in the KITTI dataset. Our crowdsourced review yields high-quality corrected annotations, which indicate a rate of at least 24% missing and inaccurate annotations in the original labels. This validated set will be released as a new real-world benchmark for label error detection and correction. We show that current label error detection methods, when combined with our correction framework, can recover hundreds of errors in the time it would take a human to annotate bounding boxes from scratch. However, even the best methods still miss up to 66% of the true errors and, when label quality is low, introduce more errors than they find. This highlights the urgent need for further research, now enabled by our released benchmark.

[180] MMFformer: Multimodal Fusion Transformer Network for Depression Detection

Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray

Main category: cs.CV

TL;DR: MMFformer is a multimodal depression detection network using social media data, outperforming state-of-the-art methods with significant F1-Score improvements.

DetailsMotivation: Early depression detection is challenging due to subjective clinical evaluations. Social media offers rich, diverse data for objective analysis.

Method: MMFformer uses transformers for spatial (video) and temporal (audio) feature extraction, fused via late and intermediate strategies.

Result: Achieves 13.92% and 7.74% F1-Score improvements on D-Vlog and LMVD datasets, respectively.

Conclusion: MMFformer effectively detects depression from multimodal social media data, offering a robust tool for early diagnosis.

Abstract: Depression is a serious mental health illness that significantly affects an individual’s well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression, thanks to the content of social networks, has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to model important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
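
As a rough illustration of the late-fusion idea (not the paper’s exact transformer architecture), the sketch below pools per-step features from each modality, projects them, and classifies the concatenation; all dimensions are made up.

```python
# Hedged sketch of late fusion over video and audio features (PyTorch).
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, video_dim=512, audio_dim=256, num_classes=2):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, 128)
        self.audio_proj = nn.Linear(audio_dim, 128)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, video_feats, audio_feats):
        # Mean-pool the per-frame / per-step features, project, then concatenate.
        v = self.video_proj(video_feats.mean(dim=1))
        a = self.audio_proj(audio_feats.mean(dim=1))
        return self.classifier(torch.cat([v, a], dim=-1))

head = LateFusionHead()
logits = head(torch.randn(4, 30, 512), torch.randn(4, 100, 256))  # (batch, steps, dim)
```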

[181] Interpreting the linear structure of vision-language model embedding spaces

Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil

Main category: cs.CV

TL;DR: The paper investigates how vision-language models organize language and images in a joint space using sparse autoencoders (SAEs), revealing stable cross-modal semantic concepts and introducing the Bridge Score to quantify their integration.

DetailsMotivation: To understand how vision-language models encode meaning and modality in their joint embedding spaces and explore the organization of these spaces.

Method: Train and analyze sparse autoencoders (SAEs) on four vision-language models (CLIP, SigLIP, SigLIP2, AIMv2) to identify sparse linear concepts and their cross-modal behavior.

Result: SAEs outperform other linear methods in reconstruction and sparsity. Concepts are stable across runs, encode cross-modal semantics, and collaborate for integration, as measured by the Bridge Score.

Conclusion: The study reveals a sparse linear structure in VLM embedding spaces, shaped by modality but integrated through latent bridges, providing insights into multimodal meaning construction.

Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or “concepts”. We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings while also retaining the most sparsity. Retraining SAEs with different seeds or a different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
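
For readers unfamiliar with SAEs, the sketch below shows the core reconstruction-plus-sparsity objective the paper builds on; the width, penalty weight, and stand-in embeddings are illustrative only.

```python
# Hedged sketch of a sparse autoencoder over VLM embeddings (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, embed_dim=768, n_concepts=8192):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, n_concepts)
        self.decoder = nn.Linear(n_concepts, embed_dim, bias=False)

    def forward(self, x):
        codes = F.relu(self.encoder(x))   # sparse, non-negative concept activations
        recon = self.decoder(codes)       # sparse linear combination of directions
        return recon, codes

sae = SparseAutoencoder()
emb = torch.randn(64, 768)                # stand-in for CLIP/SigLIP embeddings
recon, codes = sae(emb)
loss = F.mse_loss(recon, emb) + 1e-3 * codes.abs().mean()  # reconstruction + L1 sparsity
```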

[182] On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications

Simon Baur, Alexandra Benova, Emilio Dolgener Cantú, Jackie Ma

Main category: cs.CV

TL;DR: MMPKD uses extra training-only modalities to enhance a unimodal vision model, improving ROI localization but not generalizing across domains.

DetailsMotivation: Improve deep learning model robustness in clinical practice by leveraging unavailable-at-inference modalities during training.

Method: Propose MMPKD, distilling knowledge from text (MIMIC-CXR) and tabular (CBIS-DDSM) teacher models into a vision transformer student.

Result: Enhanced zero-shot ROI localization in images, though domain generalization remains limited.

Conclusion: MMPKD improves unimodal model performance but lacks cross-domain generalization, contrasting prior research.

Abstract: Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the resulting attention maps’ zero-shot capability of localizing ROIs in input images, although this effect does not generalize across domains, in contrast to what prior research has suggested.
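
The distillation step can be pictured with a standard soft-target loss; a sketch under that assumption follows (temperature, weighting, and logit shapes are illustrative, and the paper’s exact objective may differ).

```python
# Hedged sketch of privileged knowledge distillation: a teacher trained on a
# training-only modality (text or tabular metadata) supervises a vision student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradients for temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 5)                      # vision-transformer student logits
teacher = torch.randn(8, 5)                      # privileged-modality teacher logits
labels = torch.randint(0, 5, (8,))
loss = distillation_loss(student, teacher, labels)
```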

[183] DanceChat: Large Language Model-Guided Music-to-Dance Generation

Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh, Shanxin Yuan

Main category: cs.CV

TL;DR: DanceChat uses an LLM to guide music-to-dance generation, improving diversity and alignment with music by incorporating textual motion instructions.

DetailsMotivation: The semantic gap between music and dance, along with the one-to-many mapping challenge, necessitates additional guidance beyond music cues.

Method: Three components: LLM-based pseudo instruction generation, multi-modal feature fusion, and diffusion-based motion synthesis with alignment loss.

Result: Outperforms state-of-the-art methods on AIST++ and in human evaluations.

Conclusion: DanceChat effectively bridges the music-dance gap by leveraging LLM guidance for diverse and musically aligned dance generation.

Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model’s ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.

[184] Grounding Emotion Recognition with Visual Prototypes: VEGA – Revisiting CLIP in MERC

Guanyu Hu, Dimitrios Kollias, Xinyu Yang

Main category: cs.CV

TL;DR: The paper introduces VEGA, a novel mechanism using CLIP’s image encoder to create emotion-specific visual anchors for multimodal emotion recognition, achieving state-of-the-art results.

DetailsMotivation: Existing models lack psychologically meaningful priors for multimodal alignment, despite advanced fusion strategies.

Method: Proposes VEGA, leveraging CLIP’s image encoder to build visual anchors from facial exemplars, guided by cognitive theories, and integrates it into a dual-branch architecture with self-distillation.

Result: Achieves state-of-the-art performance on IEMOCAP and MELD datasets.

Conclusion: VEGA enhances multimodal emotion recognition by grounding representations in psychological and perceptual alignment, validated by superior performance.

Abstract: Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP’s textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA.
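
The anchoring mechanism can be approximated as follows: embed facial exemplars with CLIP’s image encoder, keep a per-class pool, and draw a stochastic anchor at each step. The helper names and the pool-averaging scheme below are assumptions for illustration, not the paper’s code.

```python
# Hedged sketch of emotion-specific visual anchors from CLIP image embeddings.
import torch

@torch.no_grad()
def build_anchor_pools(encode_image, exemplars_by_class):
    """exemplars_by_class: {label: (N, 3, H, W) preprocessed face crops}."""
    pools = {}
    for label, images in exemplars_by_class.items():
        feats = encode_image(images)                        # (N, D) CLIP embeddings
        pools[label] = feats / feats.norm(dim=-1, keepdim=True)
    return pools

def sample_anchor(pools, label, k=4):
    """Stochastic anchor: normalized mean of k randomly chosen exemplars."""
    pool = pools[label]
    idx = torch.randperm(pool.size(0))[:k]
    anchor = pool[idx].mean(dim=0)
    return anchor / anchor.norm()
```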

[185] Bridging Brain Connectomes and Clinical Reports for Early Alzheimer’s Disease Diagnosis

Jing Zhang, Xiaowei Yu, Minheng Chen, Lu Zhang, Tong Chen, Yan Zhuang, Chao Cao, Yanjun Lyu, Li Su, Tianming Liu, Dajiang Zhu

Main category: cs.CV

TL;DR: A novel framework aligns brain connectomes with clinical reports in a shared latent space, improving diagnosis by linking imaging data and text reports.

DetailsMotivation: To enhance brain disorder diagnosis by integrating objective imaging data with subjective clinical reports, addressing the challenge of linking these modalities.

Method: Aligns brain subnetworks (as tokens) with word tokens in clinical reports in a cross-modal latent space, applied to MCI using the ADNI dataset.

Result: Achieves state-of-the-art predictive performance and identifies meaningful connectome-text pairs, revealing insights into Alzheimer’s disease mechanisms.

Conclusion: The framework improves multimodal representation learning and supports the development of clinically useful biomarkers for brain disorders.

Abstract: Integrating brain imaging data with clinical reports offers a valuable opportunity to leverage complementary multimodal information for more effective and timely diagnosis in practical clinical settings. This approach has gained significant attention in brain disorder research, yet a key challenge remains: how to effectively link objective imaging data with subjective text-based reports, such as doctors’ notes. In this work, we propose a novel framework that aligns brain connectomes with clinical reports in a shared cross-modal latent space at both the subject and connectome levels, thereby enhancing representation learning. The key innovation of our approach is that we treat brain subnetworks as tokens of imaging data, rather than raw image patches, to align with word tokens in clinical reports. This enables a more efficient identification of system-level associations between neuroimaging findings and clinical observations, which is critical since brain disorders often manifest as network-level abnormalities rather than isolated regional alterations. We applied our method to mild cognitive impairment (MCI) using the ADNI dataset. Our approach not only achieves state-of-the-art predictive performance but also identifies clinically meaningful connectome-text pairs, offering new insights into the early mechanisms of Alzheimer’s disease and supporting the development of clinically useful multimodal biomarkers.

[186] Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions

Xiao Zhang, Johan Bos

Main category: cs.CV

TL;DR: A multi-modal framework for digitizing tombstones using vision-language models (VLMs) and retrieval-augmented generation (RAG) improves parsing accuracy and semantic enrichment, addressing preservation challenges.

DetailsMotivation: Tombstones are culturally significant but face preservation issues like erosion and vandalism. Digitization can aid in interpretation and retrieval of their content.

Method: Leverages VLMs to create structured Tombstone Meaning Representations (TMRs) and uses RAG for semantic enrichment with external data.

Result: Parsing accuracy improves from 36.1 to 89.5 F1 score, with robustness tested across diverse inscriptions and degraded conditions.

Conclusion: The framework formalizes tombstone understanding using large VLMs, offering potential for heritage preservation.

Abstract: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model’s robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.

[187] Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features

Manish Kansana, Elias Hossain, Shahram Rahimi, Noorbakhsh Amiri Golilarz

Main category: cs.CV

TL;DR: Surformer v1, a transformer-based model, excels in surface material recognition by combining tactile and visual inputs, achieving high accuracy and fast inference.

DetailsMotivation: To improve robotic perception and physical interaction by leveraging multimodal (tactile and visual) sensory inputs for surface classification.

Method: Proposes Surformer v1, integrating modality-specific encoders with cross-modal attention layers, and compares it with tactile-only and multimodal CNN approaches.

Result: Surformer v1 achieved 99.4% accuracy with 0.77 ms inference time, outperforming other models in efficiency.

Conclusion: Surformer v1 balances accuracy, efficiency, and computational cost, making it suitable for real-time surface material recognition.

Abstract: Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and PCA-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. This model not only achieved the highest accuracy but also demonstrated significantly faster inference time compared to other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.77 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition.
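
The cross-modal attention layer at the heart of the fusion can be sketched with a single block in which tactile tokens query the PCA-reduced visual embeddings; the depth, widths, and head count below are illustrative, not Surformer v1’s configuration.

```python
# Hedged sketch of tactile-to-vision cross-modal attention (PyTorch).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tactile, vision):
        # Tactile tokens attend over visual tokens; residuals keep training stable.
        attended, _ = self.attn(query=tactile, key=vision, value=vision)
        x = self.norm1(tactile + attended)
        return self.norm2(x + self.ff(x))

block = CrossModalBlock()
fused = block(torch.randn(2, 16, 128), torch.randn(2, 49, 128))  # (batch, tokens, dim)
```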

[188] Voice Pathology Detection Using Phonation

Sri Raksha Siva, Nived Suthahar, Prakash Boominathan, Uma Ranjan

Main category: cs.CV

TL;DR: A noninvasive, machine learning-based framework for detecting voice pathologies using phonation data, offering early and accurate diagnosis.

DetailsMotivation: Voice disorders impact communication and quality of life, but traditional diagnostic methods like laryngoscopy are invasive and subjective.

Method: Analyzes phonation data using acoustic features (MFCCs, chroma, Mel spectrograms) and classifies with RNNs (LSTM, attention). Includes data augmentation and preprocessing.

Result: The framework provides an automated, noninvasive tool for detecting voice pathologies.

Conclusion: Supports AI-driven healthcare and improves patient outcomes through early diagnosis.

Abstract: Voice disorders significantly affect communication and quality of life, requiring an early and accurate diagnosis. Traditional methods like laryngoscopy are invasive, subjective, and often inaccessible. This research proposes a noninvasive, machine learning-based framework for detecting voice pathologies using phonation data. Phonation data from the Saarbrücken Voice Database are analyzed using acoustic features such as Mel Frequency Cepstral Coefficients (MFCCs), chroma features, and Mel spectrograms. Recurrent Neural Networks (RNNs) with LSTM layers and attention mechanisms classify samples into normal and pathological categories. Data augmentation techniques, including pitch shifting and Gaussian noise addition, enhance model generalizability, while preprocessing ensures signal quality. Scale-based features, such as Hölder and Hurst exponents, further capture signal irregularities and long-term dependencies. The proposed framework offers a noninvasive, automated diagnostic tool for early detection of voice pathologies, supporting AI-driven healthcare and improving patient outcomes.
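
The front half of the pipeline is conventional enough to sketch: MFCC extraction followed by a recurrent classifier. The sketch assumes librosa and uses a plain LSTM head in place of the paper’s attention-augmented models; the audio path is hypothetical.

```python
# Hedged sketch: MFCC features + LSTM classifier for phonation samples.
import librosa
import torch
import torch.nn as nn

def mfcc_features(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=None)                     # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return torch.from_numpy(mfcc.T).float()                 # (frames, n_mfcc)

class PhonationClassifier(nn.Module):
    def __init__(self, n_mfcc=20, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                    # normal vs. pathological

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])                        # last-step summary

model = PhonationClassifier()
feats = mfcc_features("phonation_sample.wav")               # hypothetical file
logits = model(feats.unsqueeze(0))
```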

[189] RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening

Tao Tang, Chengxu Yang

Main category: cs.CV

TL;DR: RAPNet introduces content-adaptive convolution (RAPConv) and dynamic feature fusion (PAN-DFF) to improve pansharpening by addressing local content variations and balancing spatial detail with spectral fidelity.

DetailsMotivation: Traditional CNNs apply uniform kernels, ignoring local content variations in pansharpening. RAPNet aims to overcome this limitation.

Method: Uses RAPConv for spatially adaptive kernels and PAN-DFF with attention for optimal detail and spectral balance.

Result: Outperforms existing methods in quantitative and qualitative evaluations, validated by ablation studies.

Conclusion: RAPNet’s adaptive components significantly enhance pansharpening performance.

Abstract: Pansharpening refers to the process of integrating a high resolution panchromatic (PAN) image with a lower resolution multispectral (MS) image to generate a fused product, which is pivotal in remote sensing. Despite the effectiveness of CNNs in addressing this challenge, they are inherently constrained by the uniform application of convolutional kernels across all spatial positions, overlooking local content variations. To overcome this issue, we introduce RAPNet, a new architecture that leverages content-adaptive convolution. At its core, RAPNet employs the Receptive-field Adaptive Pansharpening Convolution (RAPConv), designed to produce spatially adaptive kernels responsive to local feature context, thereby enhancing the precision of spatial detail extraction. Additionally, the network integrates the Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an attention mechanism to achieve an optimal balance between spatial detail enhancement and spectral fidelity. Comprehensive evaluations on publicly available datasets confirm that RAPNet delivers superior performance compared to existing approaches, as demonstrated by both quantitative metrics and qualitative assessments. Ablation analyses further substantiate the effectiveness of the proposed adaptive components.

[190] ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos

Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar

Main category: cs.CV

TL;DR: The paper introduces ImpliHateVid, a novel dataset for implicit hate speech detection in videos, and proposes a two-stage contrastive learning framework for multimodal hate speech detection.

DetailsMotivation: Existing research lacks focus on video-based hate speech detection, especially for implicit hate. The paper addresses this gap by creating a dedicated dataset and a robust detection method.

Method: A two-stage contrastive learning framework is proposed: (1) modality-specific encoders for audio, text, and image trained with contrastive loss, and (2) cross-encoders for refining multimodal representations. Additional features like sentiment and emotion are incorporated.

Result: The method is evaluated on ImpliHateVid and HateMM datasets, showing effectiveness in detecting hateful content, especially implicit hate, in videos.

Conclusion: The work highlights the importance of video-based hate speech detection and demonstrates the success of the proposed multimodal contrastive learning approach and the utility of the new dataset.

Abstract: Existing research has primarily focused on text- and image-based hate speech detection, while video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using contrastive loss by concatenating features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. We evaluate our method on two datasets: ImpliHateVid for implicit hate speech detection and the HateMM dataset for general hate speech detection in videos, demonstrating the effectiveness of the proposed multimodal contrastive learning for hateful content detection in videos and the significance of our dataset.
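
The contrastive stages presumably rest on an InfoNCE-style objective pairing modalities of the same video; a symmetric version is sketched below with an illustrative temperature, without claiming it matches the paper’s exact loss.

```python
# Hedged sketch of a symmetric InfoNCE loss between two modality encoders.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """a, b: (batch, dim) embeddings of the same videos from two modalities."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                   # pairwise cosine similarities
    targets = torch.arange(a.size(0))          # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))
```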

[191] Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury

Main category: cs.CV

TL;DR: The paper addresses challenges in detecting synthetic content in videos, focusing on deepfake classification and localization, achieving top performance in a competition.

DetailsMotivation: The rapid advancement in visual and audio generation necessitates robust detection methods, especially for subtle, localized manipulations in videos.

Method: The authors developed solutions for deepfake video classification and localization, submitted to the ACM 1M Deepfakes Detection Challenge.

Result: Their methods achieved the best performance in temporal localization and a top-four ranking in classification for the TestA dataset split.

Conclusion: The paper demonstrates effective solutions for detecting synthetic content, highlighting their competitive performance in a benchmark challenge.

Abstract: The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

[192] ContextGuard-LVLM: Enhancing News Veracity through Fine-grained Cross-modal Contextual Consistency Verification

Sihan Ma, Qiming Wu, Ruotong Jiang, Frank Burns

Main category: cs.CV

TL;DR: ContextGuard-LVLM is a framework using Vision-Language Large Models to detect fine-grained cross-modal inconsistencies in digital news, outperforming zero-shot baselines.

DetailsMotivation: Addressing the fine-grained cross-modal contextual consistency (FCCC) problem in digital news verification, which traditional methods fail to solve.

Method: Proposes ContextGuard-LVLM, leveraging LVLMs with multi-stage contextual reasoning and reinforced/adversarial learning.

Result: Outperforms zero-shot baselines (InstructBLIP, LLaVA 1.5) in fine-grained tasks, showing robustness and human-aligned judgments.

Conclusion: ContextGuard-LVLM effectively detects nuanced contextual misalignments, proving superior in complex reasoning and expert agreement.

Abstract: The proliferation of digital news media necessitates robust methods for verifying content veracity, particularly regarding the consistency between visual and textual information. Traditional approaches often fall short in addressing the fine-grained cross-modal contextual consistency (FCCC) problem, which encompasses deeper alignment of visual narrative, emotional tone, and background information with text, beyond mere entity matching. To address this, we propose ContextGuard-LVLM, a novel framework built upon advanced Vision-Language Large Models (LVLMs) and integrating a multi-stage contextual reasoning mechanism. Our model is uniquely enhanced through reinforced or adversarial learning paradigms, enabling it to detect subtle contextual misalignments that evade zero-shot baselines. We extend and augment three established datasets (TamperedNews-Ent, News400-Ent, MMG-Ent) with new fine-grained contextual annotations, including “contextual sentiment,” “visual narrative theme,” and “scene-event logical coherence,” and introduce a comprehensive CTXT (Contextual Coherence) entity type. Extensive experiments demonstrate that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (InstructBLIP and LLaVA 1.5) across nearly all fine-grained consistency tasks, showing significant improvements in complex logical reasoning and nuanced contextual understanding. Furthermore, our model exhibits superior robustness to subtle perturbations and a higher agreement rate with human expert judgments on challenging samples, affirming its efficacy in discerning sophisticated forms of context detachment.

[193] VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis

Kexin Yu, Zihan Xu, Jialei Xie, Carter Adams

Main category: cs.CV

TL;DR: VL-MedGuide, a visual-linguistic framework, improves skin disease diagnosis by combining multi-modal understanding with interpretable reasoning, outperforming existing methods.

DetailsMotivation: Addressing the challenge of diagnosing skin diseases due to complex visual features and lack of interpretability in current models.

Method: Uses a two-stage approach: Multi-modal Concept Perception Module for feature description and Explainable Disease Reasoning Module for diagnosis with transparent rationales.

Result: Achieves state-of-the-art performance (83.55% BACC, 80.12% F1 in diagnosis; 76.10% BACC, 67.45% F1 in concept detection) on Derm7pt dataset.

Conclusion: VL-MedGuide bridges AI performance and clinical utility with clear, trustworthy explanations, enhancing dermatological practice.

Abstract: Accurate diagnosis of skin diseases remains a significant challenge due to the complex and diverse visual features present in dermatoscopic images, often compounded by a lack of interpretability in existing purely visual diagnostic models. To address these limitations, this study introduces VL-MedGuide (Visual-Linguistic Medical Guide), a novel framework leveraging the powerful multi-modal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs) for intelligent and inherently interpretable auxiliary diagnosis of skin conditions. VL-MedGuide operates in two interconnected stages: a Multi-modal Concept Perception Module, which identifies and linguistically describes dermatologically relevant visual features through sophisticated prompt engineering, and an Explainable Disease Reasoning Module, which integrates these concepts with raw visual information via Chain-of-Thought prompting to provide precise disease diagnoses alongside transparent rationales. Comprehensive experiments on the Derm7pt dataset demonstrate that VL-MedGuide achieves state-of-the-art performance in both disease diagnosis (83.55% BACC, 80.12% F1) and concept detection (76.10% BACC, 67.45% F1), surpassing existing baselines. Furthermore, human evaluations confirm the high clarity, completeness, and trustworthiness of its generated explanations, bridging the gap between AI performance and clinical utility by offering actionable, explainable insights for dermatological practice.

[194] CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation

Shilong Zou, Yuhang Huang, Renjiao Yi, Chenyang Zhu, Kai Xu

Main category: cs.CV

TL;DR: A diffusion-based cross-domain image translator is introduced, using a joint learning framework to align diffusion and translation processes for improved performance.

DetailsMotivation: Existing GAN-based methods and shallow integration of diffusion models limit the effectiveness of cross-domain image translation.

Method: Proposes a joint learning framework aligning diffusion and translation processes, using diffusion models to represent clean signals and a time-dependent translation network.

Result: Achieves better generative performance in RGB↔RGB and cross-modality tasks (e.g., RGB↔Edge, RGB↔Semantics, RGB↔Depth) compared to state-of-the-art methods.

Conclusion: The joint learning framework enhances global optimization, improving fidelity and structural consistency in cross-domain image translation.

Abstract: We introduce a diffusion-based cross-domain image translator in the absence of paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing for more comprehensive modeling of the data distribution and improved cross-domain translation performance. However, incorporating the translation process within the diffusion process is still challenging since the two processes are not aligned exactly, i.e., the diffusion process is applied to the noisy signal while the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, yet this may trap the translation optimization in local minima, constraining the effectiveness of diffusion models. To address the problem, we propose a novel joint learning framework that aligns the diffusion and the translation process, thereby improving the global optimality. Specifically, we propose to extract the image components with diffusion models to represent the clean signal and employ the translation process with the image components, enabling an end-to-end joint learning manner. On the other hand, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance improvement. Benefiting from the design of joint learning, our method enables global optimization of both processes, enhancing the optimality and achieving improved fidelity and structural consistency. We have conducted extensive experiments on RGB↔RGB and diverse cross-modality translation tasks including RGB↔Edge, RGB↔Semantics and RGB↔Depth, showcasing better generative performance than the state of the art.

[195] CoDe-NeRF: Neural Rendering via Dynamic Coefficient Decomposition

Wenpeng Xing, Jie Chen, Zaifeng Yang, Tiancheng Zhao, Gaolei Li, Changting Lin, Yike Guo, Meng Han

Main category: cs.CV

TL;DR: A neural rendering framework improves specular reflection modeling by decomposing appearance into static neural basis and dynamic coefficients, yielding sharper highlights.

DetailsMotivation: Addressing blurry reflections and optimization instability in NeRF for scenes with complex specular effects.

Method: Dynamic coefficient decomposition: static neural basis for material properties and dynamic coefficients from a Coefficient Network, combined by a Dynamic Radiance Integrator.

Result: Produces sharper, more realistic specular highlights compared to existing techniques.

Conclusion: The decomposition paradigm offers a flexible, effective approach for modeling complex appearance in neural scene representations.

Abstract: Neural Radiance Fields (NeRF) have shown impressive performance in novel view synthesis, but challenges remain in rendering scenes with complex specular reflections and highlights. Existing approaches may produce blurry reflections due to entanglement between lighting and material properties, or encounter optimization instability when relying on physically-based inverse rendering. In this work, we present a neural rendering framework based on dynamic coefficient decomposition, aiming to improve the modeling of view-dependent appearance. Our approach decomposes complex appearance into a shared, static neural basis that encodes intrinsic material properties, and a set of dynamic coefficients generated by a Coefficient Network conditioned on view and illumination. A Dynamic Radiance Integrator then combines these components to synthesize the final radiance. Experimental results on several challenging benchmarks suggest that our method can produce sharper and more realistic specular highlights compared to existing techniques. We hope that this decomposition paradigm can provide a flexible and effective direction for modeling complex appearance in neural scene representations.
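
The decomposition itself reduces to a small amount of linear algebra: a learned static basis, a coefficient network conditioned on viewing parameters, and a weighted combination. The sketch below conditions only on view direction for brevity (the paper also conditions on illumination), so treat it as a simplified data-flow diagram in code.

```python
# Hedged sketch of dynamic coefficient decomposition for view-dependent radiance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoefficientDecomposition(nn.Module):
    def __init__(self, n_basis=8, feat_dim=32):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(n_basis, feat_dim))   # static neural basis
        self.coeff_net = nn.Sequential(                             # dynamic coefficients
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, n_basis)
        )
        self.to_rgb = nn.Linear(feat_dim, 3)

    def forward(self, view_dirs):                   # (N, 3) unit view directions
        coeffs = self.coeff_net(view_dirs)          # (N, n_basis)
        feats = coeffs @ self.basis                 # combine the static basis per sample
        return torch.sigmoid(self.to_rgb(feats))    # (N, 3) radiance

model = CoefficientDecomposition()
rgb = model(F.normalize(torch.randn(1024, 3), dim=-1))
```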

[196] Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors

Zheyuan Zhang, Weihao Tang, Hong Chen

Main category: cs.CV

TL;DR: CausalNet improves micro-expression recognition (MER) by addressing key-frame index errors and redundancy, using causal learning for robust and accurate results.

DetailsMotivation: Current key-frame-based MER methods rely on accurate key-frame indexes, which are hard to obtain, limiting practical applications.

Method: CausalNet uses the entire ME sequence, with CMPLM to locate muscle movement areas and CAB to learn causal relationships between movements.

Result: CausalNet achieves robust MER under key-frame noise and outperforms SOTA methods on standard benchmarks.

Conclusion: CausalNet offers a practical solution for MER by handling key-frame errors and maintaining accuracy.

Abstract: Micro-expression recognition (MER) is a highly challenging task in affective computing. With the reduced-sized micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving the performance with relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes them from moving towards practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Empirical experiments have demonstrated that on popular ME benchmarks, the CausalNet has achieved robust MER under different levels of key-frame index noise. Meanwhile, it has surpassed state-of-the-art (SOTA) methods on several standard MER benchmarks when using the provided annotated key-frames. Code is available at https://github.com/tony19980810/CausalNet.

[197] Towards Robust Red-Green Watermarking for Autoregressive Image Generators

Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer

Main category: cs.CV

TL;DR: The paper explores in-generation watermarking for autoregressive (AR) image models, proposing two novel methods using visual token clustering to improve robustness against perturbations while maintaining image quality.

DetailsMotivation: To address the unexplored use of in-generation watermarks in AR image models and improve their detectability under common image perturbations.

Method: Two token-level watermarking schemes: a training-free cluster lookup table and finetuned VAE encoders for token cluster prediction from perturbed images.

Result: Cluster-level watermarks enhance robustness against perturbations and regeneration attacks, with improved detectability and fast verification runtime.

Conclusion: The proposed methods effectively improve watermark robustness and detectability in AR image models, outperforming baselines while preserving image quality.

Abstract: In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
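
For readers coming from the LLM side, the red-green scheme being transferred can be summarized in a few lines: hash the previous token to split the vocabulary into a green and a red set, then bias green logits before sampling. The sketch below uses the token-level variant with illustrative constants; the paper’s cluster-level methods would hash cluster IDs instead of raw token IDs.

```python
# Hedged sketch of a red-green watermark applied to visual-token sampling.
import torch

def watermarked_sample(logits, prev_token, vocab_size, gamma=0.5, delta=2.0):
    """Bias next-token logits toward a pseudorandom 'green' subset keyed by context."""
    gen = torch.Generator().manual_seed(int(prev_token))     # context-keyed split
    perm = torch.randperm(vocab_size, generator=gen)
    green = perm[: int(gamma * vocab_size)]                  # gamma = green fraction
    biased = logits.clone()
    biased[green] += delta                                   # favor green tokens
    return torch.multinomial(torch.softmax(biased, dim=-1), 1).item()

next_token = watermarked_sample(torch.randn(16384), prev_token=42, vocab_size=16384)
```

Detection then re-derives the green sets from the same keys and tests whether the observed fraction of green tokens exceeds what chance would allow.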

[198] Learning More by Seeing Less: Line Drawing Pretraining for Efficient, Transferable, and Human-Aligned Vision

Tianqin Li, George Liu, Tai Sing Lee

Main category: cs.CV

TL;DR: The paper proposes using line drawings for pretraining vision models, showing improved shape bias, data efficiency, and lower intrinsic dimensionality, leading to better performance and compressible representations.

DetailsMotivation: Modern vision systems rely on rich visual inputs, unlike humans who understand sparse representations like line drawings. The work aims to leverage structure-first learning for more efficient and generalizable visual understanding.

Method: The authors pretrain models on line drawings, evaluating their performance on classification, detection, and segmentation tasks. They also introduce an unsupervised method called “learning to draw.”

Result: Models pretrained on line drawings exhibit stronger shape bias, focused attention, and greater data efficiency. They also show lower intrinsic dimensionality and better compressibility, enabling improved distillation into lightweight models.

Conclusion: Structure-first learning with line drawings fosters efficiency, generalization, and human-aligned inductive biases, offering a robust strategy for adaptable vision systems.

Abstract: Despite remarkable progress in computer vision, modern recognition systems remain limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings, suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose using line drawings as a structure-first pretraining modality to induce more compact and generalizable visual representations. We show that models pretrained on line drawings develop stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance, echoing similar observations of low-dimensional, efficient representations in the brain. Beyond performance improvements, line drawing pretraining produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from line-pretrained teachers consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Finally, we demonstrate that line-drawing pretraining can also be extended to the unsupervised setting via our proposed method “learning to draw”. Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases, offering a simple yet powerful strategy for building more robust and adaptable vision systems.
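
The intrinsic-dimensionality claim lends itself to a quick check; a sketch under the assumption that `features` holds penultimate-layer activations as an (N, D) matrix:

```python
import numpy as np

def intrinsic_dim(features: np.ndarray, variance_target: float = 0.95) -> int:
    """Number of principal components needed to explain `variance_target` of the variance."""
    centered = features - features.mean(axis=0)
    # Singular values relate to PCA eigenvalues: lambda_i = s_i**2 / (N - 1)
    s = np.linalg.svd(centered, compute_uv=False)
    ratios = (s ** 2) / (s ** 2).sum()
    return int(np.searchsorted(np.cumsum(ratios), variance_target) + 1)

# Example: compare features from a line-drawing-pretrained vs. color-pretrained model.
rng = np.random.default_rng(1)
feats = rng.standard_normal((2048, 512))
print(intrinsic_dim(feats))  # near 512 for isotropic noise; lower for structured features
```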

[199] Fourier Optics and Deep Learning Methods for Fast 3D Reconstruction in Digital Holography

Justin London

Main category: cs.CV

TL;DR: A fast pipeline framework for CGH synthesis using point cloud and MRI data, optimized with non-convex methods, outperforming deep learning in metrics like MSE and PSNR.

DetailsMotivation: To improve the efficiency and quality of computer-generated holography (CGH) by leveraging volumetric data and advanced optimization techniques.

Method: Reconstructs volumetric objects from point cloud and MRI data, then applies non-convex Fourier optics optimization (alternating projection, SGD, quasi-Newton) for POH and CH generation. Performance is enhanced with 2D median filtering.

Result: The proposed framework outperforms HoloNet in metrics like MSE, RMSE, and PSNR, with artifact reduction via 2D median filtering.

Conclusion: The pipeline offers efficient and high-quality CGH synthesis, with optimization methods and filtering significantly improving performance.

Abstract: Computer-generated holography (CGH) is a promising method that modulates user-defined waveforms with digital holograms. An efficient and fast pipeline framework is proposed to synthesize CGH from initial point cloud and MRI data. This input data is reconstructed into volumetric objects that are then fed into non-convex Fourier optics optimization algorithms for phase-only hologram (POH) and complex hologram (CH) generation using alternating projection, SGD, and quasi-Newton methods. The reconstruction performance of these algorithms, measured by MSE, RMSE, and PSNR, is compared against the HoloNet deep learning CGH approach. Performance metrics are shown to improve when 2D median filtering is used to remove artifacts and speckle noise during optimization.
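
For the alternating projection component, the classic Gerchberg-Saxton loop is a reasonable mental model; a single-plane numpy sketch (the paper's pipeline additionally handles volumetric targets and the SGD/quasi-Newton variants):

```python
import numpy as np

def phase_only_hologram(target_amplitude: np.ndarray, iters: int = 100) -> np.ndarray:
    """Gerchberg-Saxton-style alternating projection between hologram and image planes."""
    phase = 2 * np.pi * np.random.default_rng(0).random(target_amplitude.shape)
    field = np.exp(1j * phase)  # unit-amplitude hologram plane
    for _ in range(iters):
        image = np.fft.fft2(field)                               # propagate to image plane
        image = target_amplitude * np.exp(1j * np.angle(image))  # impose target amplitude
        field = np.fft.ifft2(image)                              # propagate back
        field = np.exp(1j * np.angle(field))                     # impose phase-only constraint
    return np.angle(field)

target = np.zeros((256, 256)); target[96:160, 96:160] = 1.0  # toy square target
poh = phase_only_hologram(target)
recon = np.abs(np.fft.fft2(np.exp(1j * poh)))  # reconstruction to compare against target
```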

[200] Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video

Jixuan He, Chieh Hubert Lin, Lu Qi, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: Restage4D leverages real-world video motion priors to generate physically consistent 4D content, improving geometry and motion quality over synthetic methods.

DetailsMotivation: Existing generative models lack physical realism for 4D scene synthesis, while real-world videos offer grounded motion cues.

Method: Uses video-rewinding training, occlusion-aware rigidity loss, and disocclusion backtracing to bridge real and synthetic motion.

Result: Validated on DAVIS and PointOdyssey, showing better geometry consistency, motion quality, and 3D tracking.

Conclusion: Restage4D preserves deformable structure and corrects generative model errors, highlighting video priors’ potential for 4D restaging.

Abstract: Creating deformable 3D content has gained increasing attention with the rise of text-to-image and image-to-video generative models. While these models provide rich semantic priors for appearance, they struggle to capture the physical realism and motion dynamics needed for authentic 4D scene synthesis. In contrast, real-world videos can provide physically grounded geometry and articulation cues that are difficult to hallucinate. This raises a question: can we generate physically consistent 4D content by leveraging the motion priors of real-world video? In this work, we explore the task of reanimating deformable 3D scenes from a single video, using the original sequence as a supervisory signal to correct artifacts from synthetic motion. We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. Our approach uses a video-rewinding training strategy to temporally bridge a real base video and a synthetic driving video via a shared motion representation. We further incorporate an occlusion-aware rigidity loss and a disocclusion backtracing mechanism to improve structural and geometry consistency under challenging motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, revealing the potential of video priors for the 4D restaging task. Source code and trained models will be released.

[201] FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI

Somayeh Farahani, Marjaneh Hejazi, Antonio Di Ieva, Sidong Liu

Main category: cs.CV

TL;DR: FoundBioNet, a SWIN-UNETR-based model, noninvasively predicts IDH mutation in gliomas using multi-parametric MRI, outperforming baselines with AUCs up to 90.58%.

DetailsMotivation: Traditional invasive methods for IDH mutation detection in gliomas may fail to capture tumors' spatial heterogeneity, and existing deep learning models are limited by scarce annotated data.

Method: FoundBioNet uses a SWIN-UNETR architecture with Tumor-Aware Feature Encoding (TAFE) and Cross-Modality Differential (CMD) modules to analyze MRI data.

Result: Achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on diverse test sets, outperforming baselines (p <= 0.05).

Conclusion: FoundBioNet enhances diagnostic accuracy and interpretability, enabling personalized glioma care through large-scale pretraining and fine-tuning.

Abstract: Accurate, noninvasive detection of isocitrate dehydrogenase (IDH) mutation is essential for effective glioma management. Traditional methods rely on invasive tissue sampling, which may fail to capture a tumor’s spatial heterogeneity. While deep learning models have shown promise in molecular profiling, their performance is often limited by scarce annotated data. In contrast, foundation deep learning models offer a more generalizable approach for glioma imaging biomarkers. We propose a Foundation-based Biomarker Network (FoundBioNet) that utilizes a SWIN-UNETR-based architecture to noninvasively predict IDH mutation status from multi-parametric MRI. Two key modules are incorporated: Tumor-Aware Feature Encoding (TAFE) for extracting multi-scale, tumor-focused features, and Cross-Modality Differential (CMD) for highlighting subtle T2-FLAIR mismatch signals associated with IDH mutation. The model was trained and validated on a diverse, multi-center cohort of 1705 glioma patients from six public datasets. Our model achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on independent test sets from EGD, TCGA, Ivy GAP, RHUH, and UPenn, consistently outperforming baseline approaches (p <= 0.05). Ablation studies confirmed that both the TAFE and CMD modules are essential for improving predictive accuracy. By integrating large-scale pretraining and task-specific fine-tuning, FoundBioNet enables generalizable glioma characterization. This approach enhances diagnostic accuracy and interpretability, with the potential to enable more personalized patient care.

[202] Fractured Glass, Failing Cameras: Simulating Physics-Based Adversarial Samples for Autonomous Driving Systems

Manav Prabhakar, Jwalandhar Girnar, Arpan Kusari

Main category: cs.CV

TL;DR: The paper investigates how physical camera failures (e.g., glass breakage) can create adversarial samples for autonomous vehicle perception systems, using simulations and real-world experiments to validate the impact on neural networks.

DetailsMotivation: To address the overlooked issue of physical camera failures in autonomous vehicles, demonstrating their potential to disrupt detection models.

Method: Combines real-world experiments, FEM-based simulations of glass breakage, and PBR techniques to create adversarial samples, then tests them on datasets like KITTI and BDD100K using YOLOv8, Faster R-CNN, and Pyramid Vision Transformers.

Result: Broken glass filters cause detection failures but do not introduce significant distributional shifts, as shown by K-L divergence analysis.

Conclusion: Physical camera failures pose a realistic threat to perception systems, warranting further research into robust solutions.

Abstract: While much research has recently focused on generating physics-based adversarial samples, a critical yet often overlooked category originates from physical failures within on-board cameras, components essential to the perception systems of autonomous vehicles. Firstly, we motivate the study using two separate real-world experiments to show that glass failures do cause detection-based neural network models to fail. Secondly, we develop a simulation-based study using the physical process of glass breakage to create perturbed scenarios, representing a realistic class of physics-based adversarial samples. Using a finite element model (FEM)-based approach, we generate surface cracks on the camera image by applying a stress field defined by particles within a triangular mesh. Lastly, we use physically-based rendering (PBR) techniques to provide realistic visualizations of these physically plausible fractures. To analyze the safety implications, we superimpose these simulated broken-glass effects as image filters on widely used open-source datasets, KITTI and BDD100K, using two of the most prominent CNN-based object detection networks (YOLOv8 and Faster R-CNN) as well as Pyramid Vision Transformers. To further investigate the distributional impact of these visual distortions, we compute the Kullback-Leibler (K-L) divergence between three distinct data distributions, applying various broken-glass filters to a custom dataset (captured through a cracked windshield), as well as the KITTI and Kaggle cats-and-dogs datasets. The K-L divergence analysis suggests that these broken-glass filters do not introduce significant distributional shifts.
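
The K-L analysis can be reproduced in miniature by histogramming intensities from filtered and clean image sets; a sketch with synthetic stand-in data:

```python
import numpy as np

def kl_divergence(p_samples: np.ndarray, q_samples: np.ndarray, bins: int = 64) -> float:
    """K-L divergence D(P || Q) between two empirical intensity distributions."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi), density=True)
    p = p + 1e-12; q = q + 1e-12          # avoid log(0) in empty bins
    p /= p.sum(); q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
clean = rng.normal(0.5, 0.1, 10_000)              # stand-in for clean image pixels
cracked = 0.9 * clean + 0.1 * rng.random(10_000)  # stand-in for glass-filtered pixels
print(kl_divergence(cracked, clean))  # small values indicate little distributional shift
```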

[203] VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions

Yash Garg, Saketh Bachu, Arindam Dutta, Rohit Lal, Sarosij Bose, Calvin-Khang Ta, M. Salman Asif, Amit Roy-Chowdhury

Main category: cs.CV

TL;DR: The paper introduces VOccl3D, a realistic video-based occlusion dataset for 3D human pose and shape estimation, addressing gaps in existing datasets. It also improves HPS methods and human detection under occlusion.

DetailsMotivation: Existing datasets for occlusion in HPS estimation lack realism, using artificial occlusions like random patches or clipart. This limits their applicability to real-world scenarios.

Method: The authors created VOccl3D using advanced rendering techniques, incorporating diverse occlusions, clothing, and motions. They fine-tuned HPS methods (CLIFF, BEDLAM-CLIFF) and YOLO11 for detection.

Result: Fine-tuning on VOccl3D led to significant improvements in HPS estimation and human detection under occlusion, outperforming state-of-the-art methods on public datasets and their test split.

Conclusion: VOccl3D provides a realistic benchmark for occlusion research, enhancing HPS estimation and detection. It encourages future work in handling real-world occlusions.

Abstract: Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the project page for code and dataset: https://yashgarg98.github.io/VOccl3D-dataset/

[204] VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation

Brunó B. Englert, Gijs Dubbelman

Main category: cs.CV

TL;DR: VFM-UDA++ improves UDA by leveraging multi-scale features, adapting feature distance losses for ViT-based VFMs, and scaling data, achieving +1.4 mIoU on GTA5→Cityscapes and +2.4 mIoU with more data.

DetailsMotivation: Explore how UDA can best leverage Vision Foundation Models (VFMs) for better generalization, addressing limitations of prior work (VFM-UDA) like incompatible feature distance losses and lack of multi-scale biases.

Method: Proposes VFM-UDA++: (1) investigates multi-scale features, (2) adapts feature distance loss for ViT-based VFMs, (3) evaluates UDA with increased synthetic and real data.

Result: Improves performance by +1.4 mIoU on GTA5→Cityscapes. With more data, gains +2.4 mIoU, showing scalability.

Conclusion: VFM-UDA++ effectively leverages VFMs for UDA, outperforming prior methods and demonstrating scalability with data.

Abstract: Unsupervised Domain Adaptation (UDA) enables strong generalization from a labeled source domain to an unlabeled target domain, often with limited data. In parallel, Vision Foundation Models (VFMs) pretrained at scale without labels have also shown impressive downstream performance and generalization. This motivates us to explore how UDA can best leverage VFMs. Prior work (VFM-UDA) demonstrated that replacing a standard ImageNet-pretrained encoder with a VFM improves generalization. However, it also showed that commonly used feature distance losses harm performance when applied to VFMs. Additionally, VFM-UDA does not incorporate multi-scale inductive biases, which are known to improve semantic segmentation. Building on these insights, we propose VFM-UDA++, which (1) investigates the role of multi-scale features, (2) adapts feature distance loss to be compatible with ViT-based VFMs, and (3) evaluates how UDA benefits from increased synthetic source and real target data. By addressing these questions, we can improve performance on the standard GTA5 → Cityscapes benchmark by +1.4 mIoU. While prior non-VFM UDA methods did not scale with more data, VFM-UDA++ shows consistent improvement and achieves a further +2.4 mIoU gain when scaling the data, demonstrating that VFM-based UDA continues to benefit from increased data availability.

[205] SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding

Zihao Sheng, Zilin Huang, Yen-Jung Chen, Yansong Qu, Yuhao Luo, Yue Leng, Sikai Chen

Main category: cs.CV

TL;DR: SafePLUG enhances multimodal large language models (MLLMs) with pixel-level understanding and temporal grounding for fine-grained traffic accident analysis, outperforming existing methods.

DetailsMotivation: Existing MLLMs lack fine-grained visual and temporal analysis for traffic accidents, limiting their practical use.

Method: Proposes SafePLUG, supporting pixel-level segmentation, region-aware QA, and temporal event recognition.

Result: Achieves strong performance in region-based QA, segmentation, and temporal event localization.

Conclusion: SafePLUG advances fine-grained traffic scene understanding, improving safety and situational awareness in smart transportation.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG

[206] Exploring Video-Based Driver Activity Recognition under Noisy Labels

Linjuan Fan, Di Wen, Kunyu Peng, Kailun Yang, Jiaming Zhang, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiamin Wu, Xudong Han, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: A novel label noise learning approach for driver activity recognition, combining clustering, co-refinement, and flexible sample selection to improve model performance on noisy data.

DetailsMotivation: Real-world video data for driver activity recognition often contains mislabeled samples, reducing model reliability. Existing label noise learning methods are underexplored in this field.

Method: Proposes clustering-friendly low-dimensional representations, co-refinement within clusters, and a hyperparameter-free sample selection strategy with class balancing.

Result: Outperforms other label-denoising methods on the Drive&Act dataset across all granularity levels.

Conclusion: The approach effectively addresses label noise in driver activity recognition, offering a robust solution with superior performance.

Abstract: As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.
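
A toy sketch of the cluster-then-co-refine and sample-selection ideas, assuming `embeddings` and softmax `probs` come from the video model; the cluster count, blending weight, and per-class median criterion are illustrative simplifications of the paper's hyperparameter-free rules:

```python
import numpy as np
from sklearn.cluster import KMeans

def co_refine(embeddings: np.ndarray, probs: np.ndarray, n_clusters: int = 10) -> np.ndarray:
    """Smooth per-sample class posteriors by averaging within embedding clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    refined = probs.copy()
    for c in range(n_clusters):
        mask = labels == c
        # Blend each sample's prediction with its cluster's mean prediction.
        refined[mask] = 0.5 * probs[mask] + 0.5 * probs[mask].mean(axis=0)
    return refined

def select_clean(refined: np.ndarray, noisy_labels: np.ndarray) -> np.ndarray:
    """Keep samples whose refined posterior for the given label exceeds the per-class
    median; the per-class split enforces class balancing by construction."""
    conf = refined[np.arange(len(noisy_labels)), noisy_labels]
    keep = np.zeros(len(noisy_labels), dtype=bool)
    for c in np.unique(noisy_labels):
        mask = noisy_labels == c
        keep[mask] = conf[mask] >= np.median(conf[mask])
    return keep
```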

[207] DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging

Noe Bertramo, Gabriel Duguey, Vivek Gopalakrishnan

Main category: cs.CV

TL;DR: DiffUS is a differentiable ultrasound renderer that synthesizes realistic B-mode images from MRI scans, aiding intraoperative guidance by bridging preoperative planning and real-time imaging.

DetailsMotivation: Intraoperative ultrasound interpretation is challenging due to noise, artifacts, and misalignment with preoperative scans. DiffUS aims to improve this by generating realistic ultrasound images from MRI data.

Method: DiffUS converts MRI scans to acoustic impedance volumes, simulates ultrasound beam propagation via ray tracing, and reconstructs B-mode images with realistic artifacts. It uses differentiable tensor operations in PyTorch.

Result: DiffUS successfully generates anatomically accurate ultrasound images from brain MRI data, as validated on the ReMIND dataset.

Conclusion: DiffUS provides a physics-based, differentiable solution for realistic ultrasound synthesis, enabling applications like slice-to-volume registration and volumetric reconstruction.

Abstract: Intraoperative ultrasound imaging provides real-time guidance during numerous surgical procedures, but its interpretation is complicated by noise, artifacts, and poor alignment with high-resolution preoperative MRI/CT scans. To bridge the gap between preoperative planning and intraoperative guidance, we present DiffUS, a physics-based, differentiable ultrasound renderer that synthesizes realistic B-mode images from volumetric imaging. DiffUS first converts 3D MRI scans into acoustic impedance volumes using a machine learning approach. Next, we simulate ultrasound beam propagation using ray tracing with coupled reflection-transmission equations. DiffUS formulates wave propagation as a sparse linear system that captures multiple internal reflections. Finally, we reconstruct B-mode images via depth-resolved echo extraction across fan-shaped acquisition geometry, incorporating realistic artifacts including speckle noise and depth-dependent degradation. DiffUS is entirely implemented as differentiable tensor operations in PyTorch, enabling gradient-based optimization for downstream applications such as slice-to-volume registration and volumetric reconstruction. Evaluation on the ReMIND dataset demonstrates DiffUS’s ability to generate anatomically accurate ultrasound images from brain MRI data.
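
The underlying acoustics can be sketched in one dimension: echoes arise at impedance discontinuities with reflectivity R = ((Z2 - Z1)/(Z2 + Z1))^2. A first-order A-mode toy follows (the paper's sparse linear system additionally captures multiple internal reflections; impedance values and attenuation are illustrative):

```python
import numpy as np

def a_mode_echo(impedance: np.ndarray, attenuation: float = 0.995) -> np.ndarray:
    """First-order echoes along one ray from the reflection-transmission relations."""
    z1, z2 = impedance[:-1], impedance[1:]
    reflectivity = ((z2 - z1) / (z2 + z1)) ** 2
    transmitted = np.cumprod(1 - reflectivity)            # energy surviving each interface
    depth_decay = attenuation ** (2 * np.arange(len(reflectivity)))  # round-trip loss
    # Echo at interface i: energy arriving at i, reflected once, attenuated both ways.
    incoming = np.concatenate(([1.0], transmitted[:-1]))
    return incoming * reflectivity * depth_decay

# Toy tissue column: soft tissue -> bone -> soft tissue (impedances in MRayl, illustrative)
impedance = np.concatenate([np.full(100, 1.6), np.full(20, 7.8), np.full(100, 1.6)])
echoes = a_mode_echo(impedance)  # strong returns at the two tissue/bone interfaces
```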

[208] Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling

Aarav Mehta, Priya Deshmukh, Vikram Singh, Siddharth Malhotra, Krishnan Menon Iyer, Tanvi Iyer

Main category: cs.CV

TL;DR: A novel medically focused crisp edge detector improves organ boundary localization in medical imaging, enhancing tasks like segmentation and registration.

DetailsMotivation: Precise organ boundary localization is crucial for medical imaging tasks, but current ConvNet edge detectors lack the required accuracy.

Method: The method uses a top-down backward refinement architecture, fusing high-level semantic features with low-level cues, and extends to anisotropic volumes with 2D slice-wise refinement and 3D context.

Result: The approach outperforms baselines in boundary localization metrics and improves downstream tasks like segmentation and registration.

Conclusion: The proposed crisp edge detector provides clinically valuable, high-resolution organ boundaries, significantly enhancing medical-imaging applications.

Abstract: Accurate localization of organ boundaries is critical in medical imaging for segmentation, registration, surgical planning, and radiotherapy. While deep convolutional networks (ConvNets) have advanced general-purpose edge detection to near-human performance on natural images, their outputs often lack precise localization, a limitation that is particularly harmful in medical applications where millimeter-level accuracy is required. Building on a systematic analysis of ConvNet edge outputs, we propose a medically focused crisp edge detector that adapts a novel top-down backward refinement architecture to medical images (2D and volumetric). Our method progressively upsamples and fuses high-level semantic features with fine-grained low-level cues through a backward refinement pathway, producing high-resolution, well-localized organ boundaries. We further extend the design to handle anisotropic volumes by combining 2D slice-wise refinement with light 3D context aggregation to retain computational efficiency. Evaluations on several CT and MRI organ datasets demonstrate substantially improved boundary localization under strict criteria (boundary F-measure, Hausdorff distance) compared to baseline ConvNet detectors and contemporary medical edge/contour methods. Importantly, integrating our crisp edge maps into downstream pipelines yields consistent gains in organ segmentation (higher Dice scores, lower boundary errors), more accurate image registration, and improved delineation of lesions near organ interfaces. The proposed approach produces clinically valuable, crisp organ edges that materially enhance common medical-imaging tasks.
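
A minimal PyTorch sketch of one backward-refinement step, fusing upsampled high-level semantics with low-level cues; channel sizes are illustrative, and the paper adds sub-pixel upsampling and 3D context on top:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineBlock(nn.Module):
    """One top-down step: upsample coarse semantics, inject fine low-level edges, fuse."""
    def __init__(self, high_ch: int, low_ch: int, out_ch: int):
        super().__init__()
        self.reduce_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.reduce_low = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        high = F.interpolate(self.reduce_high(high), size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([high, self.reduce_low(low)], dim=1))

block = RefineBlock(high_ch=256, low_ch=64, out_ch=128)
out = block(torch.randn(1, 256, 16, 16), torch.randn(1, 64, 64, 64))  # -> (1, 128, 64, 64)
```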

[209] Inference-Time Gaze Refinement for Micro-Expression Recognition: Enhancing Event-Based Eye Tracking with Motion-Aware Post-Processing

Nuwan Bandara, Thivya Kandappu, Archan Misra

Main category: cs.CV

TL;DR: A framework enhances event-based gaze estimation models with post-processing modules for smoother, more accurate gaze signals, improving cognitive state inference.

DetailsMotivation: To improve the temporal smoothness and spatial accuracy of event-based gaze signals for better cognitive state inference like attention or fatigue.

Method: Introduces two post-processing modules: Motion-Aware Median Filtering and Optical Flow-Based Local Refinement, plus a novel Jitter Metric for evaluation.

Result: Significant improvements in gaze signal consistency across baseline models, validated on controlled datasets.

Conclusion: The framework advances event-based gaze tracking for real-world applications like affect recognition, with code available for further use.

Abstract: Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments. Our code implementations can be found at https://github.com/eye-tracking-for-physiological-sensing/EyeLoRiN.
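
Both post-processing modules operate on a (T, 2) gaze trajectory; a sketch with an assumed spike threshold and a second-difference proxy for the Jitter Metric (the paper's exact metric also weighs local signal complexity):

```python
import numpy as np

def motion_aware_median(gaze: np.ndarray, window: int = 5,
                        spike_thresh: float = 30.0) -> np.ndarray:
    """Median-filter only where frame-to-frame displacement spikes (e.g., blinks),
    leaving smooth natural gaze dynamics untouched."""
    out = gaze.copy()
    speed = np.linalg.norm(np.diff(gaze, axis=0), axis=1)
    half = window // 2
    for t in np.where(speed > spike_thresh)[0] + 1:
        lo, hi = max(0, t - half), min(len(gaze), t + half + 1)
        out[t] = np.median(gaze[lo:hi], axis=0)
    return out

def jitter(gaze: np.ndarray) -> float:
    """Velocity-regularity proxy: mean norm of the second difference (acceleration)."""
    return float(np.linalg.norm(np.diff(gaze, n=2, axis=0), axis=1).mean())

rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(0, 1, (200, 2)), axis=0)
traj[50] += 80  # inject a blink-like spike
print(jitter(traj), jitter(motion_aware_median(traj)))  # jitter drops after filtering
```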

[210] Dual-Resolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation

Vikram Singh, Kabir Malhotra, Rohan Desai, Ananya Shankaracharya, Priyadarshini Chatterjee, Krishnan Menon Iyer

Main category: cs.CV

TL;DR: A novel dual-resolution ResNet architecture for precise melanocytic tumor segmentation in dermoscopic images, addressing artifacts and boundary localization.

DetailsMotivation: Accurate lesion segmentation is crucial for skin cancer screening but faces challenges like subtle variations, artifacts, and precise boundary needs.

Method: Dual-resolution architecture with full-resolution and pooled streams, boundary-aware connections, channel attention, artifact suppression, and multi-task training.

Result: Significantly improves boundary adherence and segmentation metrics on public benchmarks.

Conclusion: The method is a practical solution for automated melanoma assessment, requiring minimal post-processing.

Abstract: Accurate segmentation of melanocytic tumors in dermoscopic images is a critical step for automated skin cancer screening and clinical decision support. Unlike natural scene segmentation, lesion delineation must reconcile subtle texture and color variations, frequent artifacts (hairs, rulers, bubbles), and a strong need for precise boundary localization to support downstream diagnosis. In this paper we introduce a novel ResNet-inspired dual-resolution architecture specifically designed for melanocytic tumor segmentation. Our method maintains a full-resolution stream that preserves fine-grained boundary information while a complementary pooled stream aggregates multi-scale contextual cues for robust lesion recognition. The streams are tightly coupled by boundary-aware residual connections that inject high-frequency edge information into deep feature maps, and by a channel attention module that adapts color and texture sensitivity to dermoscopic appearance. To further address common imaging artifacts and the limited size of clinical datasets, we propose a lightweight artifact suppression block and a multi-task training objective that combines a Dice-Tversky segmentation loss with an explicit boundary loss and a contrastive regularizer for feature stability. The combined design yields pixel-accurate masks without requiring heavy post-processing or complex pre-training protocols. Extensive experiments on public dermoscopic benchmarks demonstrate that our method significantly improves boundary adherence and clinically relevant segmentation metrics compared to standard encoder-decoder baselines, making it a practical building block for automated melanoma assessment systems.

[211] VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation

Ayaan Nooruddin Siddiqui, Mahnoor Zaidi, Ayesha Nazneen Shahbaz, Priyadarshini Chatterjee, Krishnan Menon Iyer

Main category: cs.CV

TL;DR: A weakly supervised framework for subcutaneous vessel segmentation uses sparse annotations and a differentiable random walk model to generate dense, probabilistic supervision, improving accuracy and reducing annotation burden.

DetailsMotivation: The challenge of scarce and expensive ground truth for vessel segmentation, along with low contrast and noisy images across patients and modalities, drives the need for a cost-effective and accurate solution.

Method: The framework leverages sparse annotations (e.g., centerline traces, dot markers) and expands them into dense supervision using a differentiable random walk model. It incorporates image-driven vesselness cues, tubular continuity priors, and uncertainty-weighted loss to avoid overfitting. A CNN-based predictor and topology-aware regularizer are jointly learned.

Result: The method outperforms naive training on sparse labels and conventional pseudo-labeling, producing more complete vascular maps and better-calibrated uncertainty.

Conclusion: The approach reduces annotation burden while preserving clinically relevant vessel topology, making it practical for clinical use.

Abstract: Accurate segmentation of subcutaneous vessels from clinical images is hampered by scarce, expensive ground truth and by the low-contrast, noisy appearance of vessels across patients and modalities. We present a novel weakly supervised training framework tailored for subcutaneous vessel segmentation that leverages inexpensive sparse annotations (e.g., centerline traces, dot markers, or short scribbles). Sparse labels are expanded into dense, probabilistic supervision via a differentiable random walk label propagation model whose transition weights incorporate image-driven vesselness cues and tubular continuity priors. The propagation yields per-pixel hitting probabilities together with calibrated uncertainty estimates; these are incorporated into an uncertainty-weighted loss to avoid overfitting to ambiguous regions. Crucially, the label propagator is learned jointly with a CNN-based segmentation predictor, enabling the system to discover vessel edges and continuity constraints without explicit edge supervision. We further introduce a topology-aware regularizer that encourages centerline connectivity and penalizes spurious branches, improving clinical usability. In experiments on clinical subcutaneous imaging datasets, our method consistently outperforms naive training on sparse labels and conventional dense pseudo-labeling, producing more complete vascular maps and better-calibrated uncertainty for downstream decision making. The approach substantially reduces annotation burden while preserving clinically relevant vessel topology.
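
The propagation step can be pictured as iterating a row-stochastic transition matrix from sparse seed labels; a tiny 1D chain example where the affinity weights stand in for learned vesselness cues:

```python
import numpy as np

def propagate(affinity: np.ndarray, seeds: dict[int, float], iters: int = 200) -> np.ndarray:
    """Random-walk hitting probabilities: iterate p <- T p while clamping seed nodes."""
    T = affinity / affinity.sum(axis=1, keepdims=True)  # row-stochastic transitions
    p = np.full(len(affinity), 0.5)
    for node, value in seeds.items():
        p[node] = value
    for _ in range(iters):
        p = T @ p
        for node, value in seeds.items():               # clamp annotated pixels
            p[node] = value
    return p

# 1D chain of 7 "pixels"; neighbor weights model vesselness similarity (illustrative)
n = 7
affinity = np.eye(n) * 0.1
for i in range(n - 1):
    affinity[i, i + 1] = affinity[i + 1, i] = 1.0
probs = propagate(affinity, seeds={0: 1.0, 6: 0.0})    # vessel seed at 0, background at 6
print(probs.round(2))  # decays smoothly from the vessel seed toward the background seed
```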

[212] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content

Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong, Jun Chen, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: The paper introduces AU-IQA, a benchmark dataset for assessing perceptual quality of AI-enhanced UGC, evaluates existing models, and analyzes their performance.

DetailsMotivation: The lack of specialized quality assessment models for AI-enhanced UGC limits user experience and advancement of enhancement methods.

Method: Constructed AU-IQA dataset with 4,800 AI-UGC images from three enhancement types (super-resolution, low-light enhancement, denoising) and evaluated traditional IQA and multimodal models.

Result: Comprehensive analysis of current models’ performance on AI-UGC quality assessment.

Conclusion: AU-IQA addresses the gap in assessing AI-enhanced UGC quality, providing insights for future improvements.

Abstract: AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant limiting factor in this field, restricting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to AU-IQA is https://github.com/WNNGGU/AU-IQA-Dataset.

[213] Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification

Taha Mustapha Nehdi, Nairouz Mrabah, Atif Belal, Marco Pedersoli, Eric Granger

Main category: cs.CV

TL;DR: SAGE-reID is a source-free multi-source domain adaptation method for person re-identification, using low-rank adapters and a gating network for efficient cross-domain knowledge transfer.

DetailsMotivation: Current MSDA methods for person reID require domain-specific backbones or source data, increasing computational costs. SAGE-reID aims to address this by being source-free and cost-effective.

Method: SAGE-reID trains source-specific low-rank adapters (LoRA) via source-free UDA and uses a lightweight gating network to dynamically merge LoRA experts for knowledge transfer.

Result: SAGE-reID outperforms state-of-the-art methods on benchmarks (Market-1501, DukeMTMC-reID, MSMT17) while maintaining computational efficiency.

Conclusion: SAGE-reID provides an efficient, scalable, and source-free solution for multi-source domain adaptation in person reID.

Abstract: Adapting person re-identification (reID) models to new target environments remains a challenging problem that is typically addressed using unsupervised domain adaptation (UDA) methods. Recent works show that when labeled data originates from several distinct sources (e.g., datasets and cameras), considering each source separately and applying multi-source domain adaptation (MSDA) typically yields higher accuracy and robustness compared to blending the sources and performing conventional UDA. However, state-of-the-art MSDA methods learn domain-specific backbone models or require access to source domain data during adaptation, resulting in significant growth in training parameters and computational cost. In this paper, a Source-free Adaptive Gated Experts (SAGE-reID) method is introduced for person reID. Our SAGE-reID is a cost-effective, source-free MSDA method that first trains individual source-specific low-rank adapters (LoRA) through source-free UDA. Next, a lightweight gating network is introduced and trained to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer. While the number of backbone parameters remains constant across source domains, LoRA experts scale linearly but remain negligible in size (<= 2% of the backbone), reducing both the memory consumption and risk of overfitting. Extensive experiments conducted on three challenging benchmarks: Market-1501, DukeMTMC-reID, and MSMT17 indicate that SAGE-reID outperforms state-of-the-art methods while being computationally efficient.
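
A minimal sketch of gated merging of frozen LoRA experts at a single linear layer; the dimensions and the per-input softmax gate are illustrative of the idea rather than the authors' exact design:

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """y = W x + sum_k g_k(x) * (B_k A_k) x, with frozen base W and frozen experts."""
    def __init__(self, dim: int, rank: int, n_experts: int):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)
        self.A = nn.Parameter(torch.randn(n_experts, rank, dim) * 0.02)
        self.B = nn.Parameter(torch.zeros(n_experts, dim, rank))
        self.gate = nn.Sequential(nn.Linear(dim, n_experts), nn.Softmax(dim=-1))
        self.base.requires_grad_(False)   # source-free: base and experts stay frozen
        self.A.requires_grad_(False)
        self.B.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate(x)                                   # (batch, n_experts)
        low = torch.einsum("erd,bd->ber", self.A, x)       # per-expert low-rank projection
        delta = torch.einsum("edr,ber->bed", self.B, low)  # back to model dimension
        return self.base(x) + torch.einsum("be,bed->bd", g, delta)

layer = GatedLoRALinear(dim=256, rank=8, n_experts=3)
out = layer(torch.randn(4, 256))  # only the gating network receives gradients
```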

[214] Hybrid Machine Learning Framework for Predicting Geometric Deviations from 3D Surface Metrology

Hamidreza Samadi, Md Manjurul Ahsan, Shivakumar Raman

Main category: cs.CV

TL;DR: A hybrid machine learning framework improves geometric deviation forecasting in manufacturing, achieving 73% better accuracy than traditional methods.

DetailsMotivation: Accurate forecasting of geometric deviations in manufactured components is challenging, especially for complex geometries, despite advancements in manufacturing.

Method: Uses a high-resolution 3D scanner for multi-angle surface data, processed with alignment, noise reduction, and merging. A hybrid ML framework (CNNs for feature extraction and gradient-boosted decision trees for prediction) is developed.

Result: Achieved 0.012 mm prediction accuracy at 95% confidence, 73% improvement over conventional methods. Revealed hidden correlations between manufacturing parameters and deviations.

Conclusion: The approach enhances automated quality control, predictive maintenance, and design optimization, with the dataset aiding future research.

Abstract: This study addresses the challenge of accurately forecasting geometric deviations in manufactured components using advanced 3D surface analysis. Despite progress in modern manufacturing, maintaining dimensional precision remains difficult, particularly for complex geometries. We present a methodology that employs a high-resolution 3D scanner to acquire multi-angle surface data from 237 components produced across different batches. The data were processed through precise alignment, noise reduction, and merging techniques to generate accurate 3D representations. A hybrid machine learning framework was developed, combining convolutional neural networks for feature extraction with gradient-boosted decision trees for predictive modeling. The proposed system achieved a prediction accuracy of 0.012 mm at a 95% confidence level, representing a 73% improvement over conventional statistical process control methods. In addition to improved accuracy, the model revealed hidden correlations between manufacturing parameters and geometric deviations. This approach offers significant potential for automated quality control, predictive maintenance, and design optimization in precision manufacturing, and the resulting dataset provides a strong foundation for future predictive modeling research.
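
The hybrid pattern is straightforward to sketch: a small CNN embeds each scan, and gradient-boosted trees regress the deviation. All shapes and data below are placeholders:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in feature extractor; the paper trains a CNN on 3D surface representations.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),   # -> 16 * 4 * 4 = 256 features
)

rng = np.random.default_rng(0)
scans = torch.randn(237, 1, 64, 64)          # placeholder "depth maps" of components
deviation_mm = rng.normal(0.0, 0.05, 237)    # placeholder measured deviations

with torch.no_grad():
    features = cnn(scans).numpy()

gbdt = GradientBoostingRegressor(n_estimators=200, max_depth=3)
gbdt.fit(features[:200], deviation_mm[:200])
pred = gbdt.predict(features[200:])          # predicted geometric deviation (mm)
```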

[215] AGIC: Attention-Guided Image Captioning to Improve Caption Relevance

L. D. M. S. Sai Teja, Ashok Urlana, Pruthwik Mishra

Main category: cs.CV

TL;DR: AGIC improves image captioning by focusing on salient visual regions and using a hybrid decoding strategy, outperforming state-of-the-art models in speed and performance.

DetailsMotivation: Generating accurate and descriptive captions for images is still challenging despite progress in the field.

Method: AGIC amplifies salient visual regions in feature space and employs a hybrid decoding strategy (deterministic and probabilistic sampling) for balanced fluency and diversity.

Result: AGIC matches or surpasses state-of-the-art models on Flickr8k and Flickr30k datasets, with faster inference and strong performance across metrics.

Conclusion: AGIC provides a scalable and interpretable solution for image captioning, combining accuracy, speed, and diversity.

Abstract: Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.
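
One way the hybrid decoding could work is to take the greedy token when the model is confident and sample otherwise; the threshold rule below is an assumption, not the paper's exact formulation:

```python
import numpy as np

def hybrid_decode_step(logits: np.ndarray, confidence_thresh: float = 0.6,
                       rng: np.random.Generator | None = None) -> int:
    """Deterministic when the model is confident, probabilistic otherwise."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if probs.max() >= confidence_thresh:
        return int(probs.argmax())               # greedy: fluency on easy tokens
    return int(rng.choice(len(probs), p=probs))  # sampling: diversity when uncertain

rng = np.random.default_rng(0)
confident = np.array([5.0, 1.0, 0.5])   # -> greedy picks token 0
uncertain = np.array([1.0, 0.9, 0.8])   # -> sampled
print(hybrid_decode_step(confident, rng=rng), hybrid_decode_step(uncertain, rng=rng))
```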

[216] A Joint Sparse Self-Representation Learning Method for Multiview Clustering

Mengxue Jia, Zhihua Allen-Zhao, You Zhao, Sanyang Liu

Main category: cs.CV

TL;DR: A novel joint sparse self-representation learning model for multiview clustering (MC) is proposed, using cardinality constraints for local information extraction and a low-rank constraint for global structure. An alternating quadratic penalty (AQP) method ensures convergence, outperforming eight state-of-the-art algorithms.

DetailsMotivation: To improve multiview clustering by extracting reliable local and global structure information without relying on Graph-Laplacian regularization, addressing the challenge of nonconvex, nonsmooth models.

Method: Introduces cardinality constraints for view-specific local information and a low-rank constraint for global structure. Uses an AQP method with closed-form solutions for convergence.

Result: Empirical results on six datasets show the model and AQP method outperform eight state-of-the-art algorithms.

Conclusion: The proposed model and AQP method effectively address the limitations of existing techniques, offering superior performance in multiview clustering.

Abstract: Multiview clustering (MC) aims to group samples using consistent and complementary information across various views. Subspace clustering, as a fundamental technique of MC, has attracted significant attention. In this paper, we propose a novel joint sparse self-representation learning model for MC, where a distinguishing feature is the extraction of view-specific local information by introducing cardinality (i.e., $\ell_0$-norm) constraints instead of Graph-Laplacian regularization. Specifically, under each view, cardinality constraints directly restrict the samples used in the self-representation stage to extract reliable local and global structure information, while the low-rank constraint aids in revealing a global coherent structure in the consensus affinity matrix during merging. The attendant challenge is that Augmented Lagrangian Method (ALM)-based alternating minimization algorithms cannot guarantee convergence when applied directly to our nonconvex, nonsmooth model, resulting in poor generalization ability. To address this, we develop an alternating quadratic penalty (AQP) method with global convergence, where two subproblems are iteratively solved via closed-form solutions. Empirical results on six standard datasets demonstrate the superiority of our model and AQP method compared to eight state-of-the-art algorithms.

[217] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

Jianxiang He, Shaoguang Wang, Weiyu Guo, Meisheng Hong, Jungang Li, Yijie Xu, Ziyang Chen, Hui Xiong

Main category: cs.CV

TL;DR: VSI improves keyframe search in long videos by integrating subtitles and visual data, outperforming baselines in accuracy and QA tasks.

DetailsMotivation: Addressing weak multimodal alignment and temporal semantic capture in keyframe retrieval for long videos.

Method: Dual-stream search (Video Search Stream and Subtitle Match Stream) integrating subtitles, timestamps, and scene boundaries.

Result: 40.00% keyframe localization accuracy and 68.48% Video-QA accuracy, surpassing baselines by 20.35% and 15.79%.

Conclusion: VSI is robust and generalizable, achieving SOTA in medium-to-long video-QA tasks.

Abstract: Long video understanding presents a significant challenge to multimodal large language models (MLLMs), primarily due to the immense data scale. A critical and widely adopted strategy for making this task computationally tractable is keyframe retrieval, which seeks to identify a sparse set of video frames that are most salient to a given textual query. However, the efficacy of this approach is hindered by weak multimodal alignment between textual queries and visual content, and it fails to capture the complex temporal semantic information required for precise reasoning. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process. The proposed method captures the visual information of video frames as well as complementary textual information through a dual-stream search mechanism, the Video Search Stream and the Subtitle Match Stream, and improves keyframe search accuracy through the interaction of the two streams. Experimental results show that VSI achieves 40.00% keyframe localization accuracy on the text-relevant subset of LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing competitive baselines by 20.35% and 15.79%, respectively. Furthermore, VSI achieves state-of-the-art (SOTA) performance on medium-to-long video-QA tasks on LongVideoBench, demonstrating the robustness and generalizability of the proposed multimodal search strategy.
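
A sketch of fusing the two streams via timestamps: subtitle-match scores are spread over the frames their time spans cover, then blended with visual similarity (the fusion rule and `alpha` are assumptions):

```python
import numpy as np

def keyframe_scores(frame_sims: np.ndarray, subtitle_sims: np.ndarray,
                    frame_times: np.ndarray, sub_spans: list[tuple[float, float]],
                    alpha: float = 0.5) -> np.ndarray:
    """Fuse a visual search stream with a subtitle match stream via timestamps."""
    text_scores = np.zeros(len(frame_times))
    for sim, (start, end) in zip(subtitle_sims, sub_spans):
        inside = (frame_times >= start) & (frame_times <= end)
        text_scores[inside] = np.maximum(text_scores[inside], sim)  # best covering subtitle
    return alpha * frame_sims + (1 - alpha) * text_scores

frame_times = np.arange(0, 60, 2.0)                             # one frame every 2 s (toy)
frame_sims = np.random.default_rng(0).random(len(frame_times))  # query-frame similarity
subtitle_sims = np.array([0.9, 0.2])                            # query-subtitle similarity
sub_spans = [(10.0, 18.0), (40.0, 52.0)]                        # subtitle spans (seconds)
top_k = np.argsort(keyframe_scores(frame_sims, subtitle_sims, frame_times, sub_spans))[-5:]
```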

[218] NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective

Maoxun Yuan, Duanni Meng, Ziteng Xi, Tianyi Zhao, Shiji Zhao, Yimian Dai, Xingxing Wei

Main category: cs.CV

TL;DR: A novel noise-suppression feature pyramid network (NS-FPN) is proposed for infrared small target detection and segmentation, addressing false alarms by focusing on noise suppression in the frequency domain.

DetailsMotivation: The challenge of dim, shapeless targets and background clutter in IRSTDS tasks, along with the limitations of existing CNN-based methods that increase false alarms, motivates the need for a noise-suppression approach.

Method: The NS-FPN integrates a low-frequency guided feature purification (LFP) module for noise suppression and a spiral-aware feature sampling (SFS) module for target-relevant feature fusion.

Result: Extensive experiments show NS-FPN significantly reduces false alarms and outperforms existing methods on IRSTDS tasks.

Conclusion: The proposed NS-FPN is a lightweight, effective solution that enhances IRSTDS performance by focusing on noise suppression and feature fusion.

Abstract: Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they focus only on enhancing feature representation to offset the impact of noise, which aggravates the false-alarm problem. In this paper, by analyzing the problem in the frequency domain, we pioneer improving performance from a noise-suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in the feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on public IRSTDS datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS tasks.
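
The LFP idea of purifying high-frequency components can be sketched as a Fourier-domain low-pass over feature maps; the cutoff is illustrative, and the actual module learns how to combine the components:

```python
import torch

def low_frequency_purify(feat: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Zero out high-frequency components of a (B, C, H, W) feature map."""
    B, C, H, W = feat.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    mask = torch.zeros(H, W, dtype=torch.bool)
    h, w = int(H * keep_ratio / 2), int(W * keep_ratio / 2)
    mask[H // 2 - h : H // 2 + h, W // 2 - w : W // 2 + w] = True  # centered low-pass
    spectrum = spectrum * mask
    return torch.fft.ifft2(torch.fft.ifftshift(spectrum, dim=(-2, -1))).real

feat = torch.randn(2, 64, 32, 32)
purified = low_frequency_purify(feat)  # same shape, high-frequency noise suppressed
```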

[219] BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu

Main category: cs.CV

TL;DR: The paper proposes BASIC, a method to improve visual-textual alignment in MLLMs by directly supervising visual embeddings using refined embeddings from LLM layers, enhancing performance without extra models or annotations.

DetailsMotivation: Current MLLMs treat visual embeddings as contextual cues without direct supervision, limiting finer alignment. The paper addresses this gap by leveraging refined visual embeddings for direct guidance.

Method: BASIC uses refined visual embeddings from LLM shallow layers to supervise the vision projector. It optimizes embedding directions and semantic matching by reducing angles and logit distribution disparities.

Result: BASIC significantly improves MLLM performance across benchmarks, validating the effectiveness of direct visual supervision.

Conclusion: Direct visual supervision via BASIC enhances visual-textual alignment in MLLMs, demonstrating its potential for improving multimodal understanding.

Abstract: Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM’s shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision.
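
A sketch of the two supervision terms, an angle term between initial and refined (supervisory) embeddings and a KL term between their vocabulary-logit distributions; the temperature, equal weighting, and `vocab_proj` head are assumptions:

```python
import torch
import torch.nn.functional as F

def basic_loss(initial_emb: torch.Tensor, refined_emb: torch.Tensor,
               vocab_proj: torch.nn.Linear, temp: float = 1.0) -> torch.Tensor:
    """Direction alignment + logit-distribution matching on visual embeddings."""
    # (i) reduce the angle between initial and supervisory (refined) embeddings
    angle_loss = 1 - F.cosine_similarity(initial_emb, refined_emb.detach(), dim=-1).mean()
    # (ii) match the distributions induced by projecting both onto the vocabulary
    p_init = F.log_softmax(vocab_proj(initial_emb) / temp, dim=-1)
    p_ref = F.softmax(vocab_proj(refined_emb.detach()) / temp, dim=-1)
    kl_loss = F.kl_div(p_init, p_ref, reduction="batchmean")
    return angle_loss + kl_loss

proj = torch.nn.Linear(512, 32000, bias=False)  # stand-in for the LLM's output head
loss = basic_loss(torch.randn(8, 512, requires_grad=True), torch.randn(8, 512), proj)
```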

[220] Advancements in Chinese font generation since deep learning era: A survey

Weiran Chen, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu

Main category: cs.CV

TL;DR: A survey of deep learning-based Chinese font generation methods, categorizing them into many-shot and few-shot approaches, and discussing challenges and future directions.

DetailsMotivation: To address the challenge of improving the quality of generated Chinese character images and provide a comprehensive review of recent deep learning techniques in this field.

Method: The paper categorizes existing methods into many-shot and few-shot font generation, reviews fundamentals like architectures and datasets, and analyzes strengths and limitations of representative approaches.

Result: A detailed review of current methods, highlighting their pros and cons, and identifying key challenges in Chinese font generation.

Conclusion: The paper concludes with future research directions to advance the field, offering insights for researchers.

Abstract: Chinese font generation aims to create a new Chinese font library based on some reference samples. It is a topic of great concern to many font designers and typographers. Over the past years, with the rapid development of deep learning algorithms, various new techniques have made flourishing progress. Nevertheless, how to improve the overall quality of generated Chinese character images remains a tough issue. In this paper, we conduct a holistic survey of recent deep learning-based Chinese font generation approaches. Specifically, we first illustrate the research background of the task. Then, we outline our literature selection and analysis methodology, and review a series of related fundamentals, including classical deep learning architectures, font representation formats, public datasets, and frequently used evaluation metrics. After that, based on the number of reference samples required to generate a new font, we categorize the existing methods into two major groups: many-shot font generation and few-shot font generation methods. Within each category, representative approaches are summarized, and their strengths and limitations are discussed in detail. Finally, we conclude with the challenges and future directions, with the expectation of providing valuable insights for researchers in this field.

[221] eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos

Xuecheng Wu, Dingkang Yang, Danlei Huang, Xinyi Yin, Yifan Wang, Jia Zhang, Jiayu Nie, Liangyu Fu, Yang Liu, Junxiao Xue, Hadi Amirpour, Wei Zhou

Main category: cs.CV

TL;DR: The paper introduces eMotions, a large-scale dataset for video emotion analysis (VEA) in short-form videos (SVs), and proposes AV-CANet, an audio-visual fusion network to address challenges like semantic gaps and emotional inconsistencies.

DetailsMotivation: The rise of SVs and their multimodal complexity necessitates advanced VEA, but existing datasets and methods are limited for SVs.

Method: Proposes AV-CANet, an end-to-end audio-visual fusion network with a Local-Global Fusion Module and EP-CE Loss for optimization.

Result: AV-CANet shows effectiveness across multiple datasets, including eMotions, and ablation studies validate its components.

Conclusion: The work advances VEA for SVs, offering a robust dataset and method, with potential for future research.

Abstract: Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SVs emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide the category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains a challenging task. The challenge arises from the broader content diversity, which introduces more distinct semantic gaps and complicates the representations learning of emotion-related features. Furthermore, the prevalence of audio-visual co-expressions in SVs leads to the local biases and collective information gaps caused by the inconsistencies in emotional expressions. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages video transformer to capture semantically relevant representations. We further introduce the Local-Global Fusion Module designed to progressively capture the correlations of audio-visual features. Besides, EP-CE Loss is constructed to globally steer optimizations with tripolar penalties. Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. Dataset and code will be made available at Github.

[222] A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation

Chao Yin, Jide Li, Xiaoqiang Li

Main category: cs.CV

TL;DR: The paper introduces IAPF, a training-free COS method that converts task-generic prompts into fine-grained instance masks, outperforming existing methods.

DetailsMotivation: Current training-free COS methods using SAM produce coarse semantic masks, failing in scenarios with multiple camouflaged instances.

Method: IAPF uses a three-step process: generating image-specific tags, creating instance-level prompts, and voting for the most consistent mask.

Result: IAPF significantly outperforms state-of-the-art training-free COS methods on standard benchmarks.

Conclusion: IAPF effectively addresses the limitation of coarse masks in training-free COS, providing fine-grained instance segmentation.

Abstract: Camouflaged Object Segmentation (COS) remains highly challenging due to the intrinsic visual similarity between target objects and their surroundings. While training-based COS methods achieve good performance, their performance degrades rapidly with increased annotation sparsity. To circumvent this limitation, recent studies have explored training-free COS methods, leveraging the Segment Anything Model (SAM) by automatically generating visual prompts from a single task-generic prompt (e.g., “camouflaged animal”) uniformly applied across all test images. However, these methods typically produce only semantic-level visual prompts, causing SAM to output coarse semantic masks and thus failing to handle scenarios with multiple discrete camouflaged instances effectively. To address this critical limitation, we propose a simple yet powerful Instance-Aware Prompting Framework (IAPF), the first training-free COS pipeline that explicitly converts a task-generic prompt into fine-grained instance masks. Specifically, the IAPF comprises three steps: (1) Text Prompt Generator, utilizing task-generic queries to prompt a Multimodal Large Language Model (MLLM) for generating image-specific foreground and background tags; (2) Instance Mask Generator, leveraging Grounding DINO to produce precise instance-level bounding box prompts, alongside the proposed Single-Foreground Multi-Background Prompting strategy to sample region-constrained point prompts within each box, enabling SAM to yield a candidate instance mask; (3) Self-consistency Instance Mask Voting, which selects the final COS prediction by identifying the candidate mask most consistent across multiple candidate instance masks. Extensive evaluations on standard COS benchmarks demonstrate that the proposed IAPF significantly surpasses existing state-of-the-art training-free COS methods.
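
The three-step pipeline lends itself to a short control-flow sketch. The version below is a hedged approximation: mllm_tag, grounding_dino, and sam are hypothetical stand-ins for the three models, the Single-Foreground Multi-Background point sampling is elided, and voting is implemented as mean pairwise IoU.

```python
import numpy as np

def iou(a, b):
    """IoU between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def iapf(image, mllm_tag, grounding_dino, sam, prompt="camouflaged animal", n_votes=5):
    fg_tags, bg_tags = mllm_tag(image, prompt)             # step 1: image-specific tags
    instance_masks = []
    for box in grounding_dino(image, fg_tags):             # step 2: per-instance boxes
        masks = [sam(image, box) for _ in range(n_votes)]  # candidate masks per box
        # step 3: self-consistency voting by mean pairwise IoU
        scores = [np.mean([iou(m, o) for o in masks if o is not m]) for m in masks]
        instance_masks.append(masks[int(np.argmax(scores))])
    return instance_masks

# Toy run with random stand-ins for the MLLM, Grounding DINO, and SAM.
rng = np.random.default_rng(0)
demo = iapf(
    image=None,
    mllm_tag=lambda img, p: (["animal"], ["rock", "moss"]),
    grounding_dino=lambda img, tags: [(0, 0, 8, 8), (8, 8, 16, 16)],
    sam=lambda img, box: rng.random((16, 16)) > 0.5,
)
```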

[223] MultiRef: Controllable Image Generation with Multiple Visual References

Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna

Main category: cs.CV

TL;DR: The paper introduces MultiRef-bench, a framework for evaluating multi-reference image generation, and highlights challenges in current models.

DetailsMotivation: Current image generation frameworks rely on single-source inputs, limiting creative flexibility. The paper aims to address this by focusing on multi-reference conditioning.

Method: The authors develop MultiRef-bench, a dataset with synthetic and real-world samples, and evaluate it using three interleaved image-text models and six agentic frameworks.

Result: State-of-the-art models struggle with multi-reference tasks, with the best model achieving 66.6% (synthetic) and 79.0% (real-world) accuracy compared to golden answers.

Conclusion: The findings highlight the need for more flexible, human-like creative tools and provide a dataset (MultiRef) to support future research.

Abstract: Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs – either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.

[224] MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification

Jinhao Li, Zijian Chen, Lirong Deng, Changbo Wang, Guangtao Zhai

Main category: cs.CV

TL;DR: The paper introduces MMReID-Bench, a multi-task multi-modal benchmark for person re-identification (ReID), leveraging multi-modal large language models (MLLMs) to address limitations of traditional uni-modal ReID models.

DetailsMotivation: Traditional person ReID models lack generalization in multi-modal data (e.g., RGB, thermal, infrared). MLLMs show promise but are underutilized.

Method: Developed MMReID-Bench with 20,710 multi-modal queries and gallery images across 10 ReID tasks to evaluate MLLMs.

Result: MLLMs demonstrate effective and versatile ReID but struggle with thermal and infrared data.

Conclusion: MMReID-Bench aims to advance robust multi-modal foundation models for person ReID.

Abstract: Person re-identification (ReID) aims to retrieve the images of an interested person in the gallery images, with wide applications in medical rehabilitation, abnormal behavior detection, and public security. However, traditional person ReID models suffer from uni-modal capability, leading to poor generalization ability in multi-modal data, such as RGB, thermal, infrared, sketch images, textual descriptions, etc. Recently, the emergence of multi-modal large language models (MLLMs) shows a promising avenue for addressing this problem. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, which do not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. To bridge this gap, we introduce MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. The MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Comprehensive experiments demonstrate the remarkable capabilities of MLLMs in delivering effective and versatile person ReID. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope MMReID-Bench can facilitate the community to develop more robust and generalizable multimodal foundation models for person ReID.

[225] Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing

Shichao Ma, Yunhe Guo, Jiahao Su, Qihe Huang, Zhengyang Zhou, Yang Wang

Main category: cs.CV

TL;DR: Talk2Image is a multi-agent system for interactive image generation and editing in multi-turn dialogues, addressing intention drift and incoherence in existing methods.

DetailsMotivation: Existing text-to-image systems struggle with iterative, multi-turn creative tasks, often causing intention drift and incoherent edits.

Method: Talk2Image integrates intention parsing, task decomposition across specialized agents, and feedback-driven refinement via multi-view evaluation.

Result: The system outperforms baselines in controllability, coherence, and user satisfaction for iterative tasks.

Conclusion: Talk2Image effectively aligns with user intentions and ensures consistent image editing in multi-turn scenarios.

Abstract: Text-to-image generation tasks have driven remarkable advances in diverse media applications, yet most focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Recent dialogue-based systems attempt to bridge this gap, but their single-agent, sequential paradigm often causes intention drift and incoherent edits. To address these limitations, we present Talk2Image, a novel multi-agent system for interactive image generation and editing in multi-turn dialogue scenarios. Our approach integrates three key components: intention parsing from dialogue history, task decomposition and collaborative execution across specialized agents, and feedback-driven refinement based on a multi-view evaluation mechanism. Talk2Image enables step-by-step alignment with user intention and consistent image editing. Experiments demonstrate that Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks.

[226] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, Guorui Zhou

Main category: cs.CV

TL;DR: AR-GRPO integrates online RL training into autoregressive image generation models, improving image quality and human preference over standard AR baselines.

DetailsMotivation: To enhance autoregressive image generation models by leveraging reinforcement learning for better perceptual quality, realism, and semantic fidelity.

Method: Adapts the Group Relative Policy Optimization (GRPO) algorithm with reward functions evaluating multiple quality dimensions.

Result: Significant improvements in image quality and human preference across class- and text-conditional tasks.

Conclusion: RL-based optimization is viable for AR image generation, enabling controllable and high-quality synthesis.

Abstract: Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models’ outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: https://github.com/Kwai-Klear/AR-GRPO.
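
GRPO's core step, which AR-GRPO adapts to image tokens, is to normalize each sample's reward against its own group of generations for the same prompt. A minimal sketch, assuming scalar per-image rewards (the blend of quality dimensions shown is illustrative):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """rewards: (num_prompts, group_size) scalar rewards per sampled image."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)    # group-relative advantage

# Illustrative composite reward over three quality dimensions (random
# stand-ins for perceptual quality, realism, and semantic fidelity scores).
r = 0.4 * torch.rand(8, 4) + 0.3 * torch.rand(8, 4) + 0.3 * torch.rand(8, 4)
adv = grpo_advantages(r)   # weights the log-probs of each sampled image's tokens
```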

[227] CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing

Weiyan Xie, Han Gao, Didan Deng, Kaican Li, April Hua Liu, Yongxiang Huang, Nevin L. Zhang

Main category: cs.CV

TL;DR: CannyEdit is a training-free framework for precise regional image editing using selective Canny Control and dual-prompt guidance, outperforming existing methods in text adherence, context fidelity, and seamlessness.

DetailsMotivation: Existing text-to-image models struggle with balancing text adherence, context fidelity, and seamless integration in regional edits.

Method: Uses Selective Canny Control for precise edits and Dual-Prompt Guidance for coherent scene interactions.

Result: Achieves a 2.93-10.49% improvement in balancing text adherence and context fidelity, with greater seamlessness: only 49.2% of general users and 42.0% of AIGC experts identified its results as AI-edited.

Conclusion: CannyEdit effectively addresses the challenges of regional image editing, offering superior performance and seamless results.

Abstract: Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses these challenges through two key innovations: (1) Selective Canny Control, which masks the structural guidance of Canny ControlNet in user-specified editable regions while strictly preserving details of the source images in unedited areas via inversion-phase ControlNet information retention. This enables precise, text-driven edits without compromising contextual integrity. (2) Dual-Prompt Guidance, which combines local prompts for object-specific edits with a global target prompt to maintain coherent scene interactions. On real-world image editing tasks (addition, replacement, removal), CannyEdit outperforms prior methods like KV-Edit, achieving a 2.93 to 10.49 percent improvement in the balance of text adherence and context fidelity. In terms of editing seamlessness, user studies reveal only 49.2 percent of general users and 42.0 percent of AIGC experts identified CannyEdit’s results as AI-edited when paired with real images without edits, versus 76.08 to 89.09 percent for competitor methods.
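
A minimal sketch of the Selective Canny Control idea, assuming the editable region is given as a binary mask: the Canny map that conditions ControlNet is simply blanked inside that region so structure there remains free to change. The thresholds and the OpenCV-based implementation are assumptions.

```python
import cv2
import numpy as np

def selective_canny(image_bgr, edit_mask, lo=100, hi=200):
    """edit_mask: nonzero where edits are allowed; returns masked edge map."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, lo, hi)
    edges[edit_mask.astype(bool)] = 0   # no structural guidance in the edit region
    return edges

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1
control = selective_canny(img, mask)    # condition Canny ControlNet on this map
```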

[228] Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification

Qin Xu, Lili Zhu, Xiaoxia Cheng, Bo Jiang

Main category: cs.CV

TL;DR: SCOPE is a novel method for fine-grained visual classification (FGVC) that adaptively enhances spatial-domain features, overcoming limitations of fixed-scale frequency-domain methods.

DetailsMotivation: Current frequency-domain methods lack adaptability to image content and cannot dynamically adjust feature extraction for discriminative requirements.

Method: SCOPE uses two modules: Subtle Detail Extractor (SDE) for enhancing shallow features and Salient Semantic Refiner (SSR) for refining high-level features. These are cascaded to combine local details with global semantics.

Result: Achieves state-of-the-art performance on four FGVC benchmarks.

Conclusion: SCOPE effectively addresses the limitations of fixed-scale frequency-domain methods by adaptively enhancing spatial features, improving FGVC performance.

Abstract: The crux of resolving fine-grained visual classification (FGVC) lies in capturing discriminative and class-specific cues that correspond to subtle visual characteristics. Recently, frequency decomposition/transform based approaches have attracted considerable interest owing to their ability to mine discriminative cues. However, the frequency-domain methods are based on fixed basis functions, lacking adaptability to image content and unable to dynamically adjust feature extraction according to the discriminative requirements of different images. To address this, we propose a novel method for FGVC, named Subtle-Cue Oriented Perception Engine (SCOPE), which adaptively enhances the representational capability of low-level details and high-level semantics in the spatial domain, breaking through the limitations of fixed scales in the frequency domain and improving the flexibility of multi-scale fusion. The core of SCOPE lies in two modules: the Subtle Detail Extractor (SDE), which dynamically enhances subtle details such as edges and textures from shallow features, and the Salient Semantic Refiner (SSR), which learns semantically coherent and structure-aware refinement features from the high-level features guided by the enhanced shallow features. The SDE and SSR are cascaded stage-by-stage to progressively combine local details with global semantics. Extensive experiments demonstrate that our method achieves new state-of-the-art on four popular fine-grained image classification benchmarks.

[229] Adversarial Video Promotion Against Text-to-Video Retrieval

Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Shuai Liu, Chao Shen

Main category: cs.CV

TL;DR: The paper introduces ViPro, the first adversarial attack for promoting videos in text-to-video retrieval (T2VR), and proposes Modal Refinement (MoRe) to improve transferability. It demonstrates superior performance over baselines and highlights vulnerabilities in T2VR systems.

DetailsMotivation: Existing T2VR attacks focus on suppressing video ranks, but promoting videos is more impactful for financial or misinformation gains. This gap motivates the development of ViPro.

Method: ViPro adversarially promotes videos in T2VR. MoRe enhances black-box transferability by refining cross-modal interactions. Experiments cover multiple models, datasets, and scenarios.

Result: ViPro outperforms baselines by over 30%/10%/4% in white/grey/black-box settings. It also evaluates defenses and imperceptibility.

Conclusion: The work exposes a vulnerability in T2VR, provides bounds for attacks, and suggests counterplays. Code will be publicly released.

Abstract: Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluate our attacks against defences and assess their imperceptibility. Overall, ViPro surpasses other baselines by over 30%/10%/4% for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.
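
While ViPro's exact objective is not spelled out here, rank promotion against a contrastive retrieval model can be sketched as PGD-style ascent on the similarity between the video embedding and several target query embeddings; the encoder, budget, and step size below are placeholders.

```python
import torch
import torch.nn.functional as F

def promote(frames, text_emb, video_encoder, eps=8/255, alpha=1/255, steps=10):
    """frames: (T, C, H, W) in [0, 1]; text_emb: (num_queries, dim)."""
    adv = frames.clone()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        v = F.normalize(video_encoder(adv), dim=-1)          # video embedding
        sim = (F.normalize(text_emb, dim=-1) @ v).mean()     # multi-target score
        grad, = torch.autograd.grad(sim, adv)
        adv = adv + alpha * grad.sign()                      # ascend similarity
        adv = frames + (adv - frames).clamp(-eps, eps)       # L-inf projection
        adv = adv.clamp(0, 1)
    return adv.detach()

# Toy stand-ins: a mean-pooling "video tower" and random query embeddings.
enc = lambda x: x.mean(dim=(0, 2, 3))
adv_frames = promote(torch.rand(4, 3, 8, 8), torch.randn(3, 3), enc)
```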

[230] Evaluating Fisheye-Compatible 3D Gaussian Splatting Methods on Real Images Beyond 180 Degree Field of View

Ulas Gunes, Matias Turkulainen, Juho Kannala, Esa Rahtu

Main category: cs.CV

TL;DR: Evaluation of fisheye-based 3D Gaussian Splatting methods (Fisheye-GS and 3DGUT) on real images with extreme distortion, proposing a depth-based initialization strategy for improved performance.

DetailsMotivation: To assess the performance of fisheye-based 3D Gaussian Splatting methods in real-world settings with extreme distortion and wide fields of view.

Method: Comparison of Fisheye-GS and 3DGUT under varying fields of view (200°, 160°, 120°), and introduction of a depth-based initialization strategy using UniK3D predictions.

Result: Fisheye-GS performs better with reduced FoV (160°), while 3DGUT maintains stability and quality at full 200° FoV. UniK3D-based initialization matches SfM quality in challenging scenes.

Conclusion: Fisheye-based 3DGS methods are viable for wide-angle 3D reconstruction, even with sparse and distortion-heavy inputs.

Abstract: We present the first evaluation of fisheye-based 3D Gaussian Splatting methods, Fisheye-GS and 3DGUT, on real images with fields of view exceeding 180 degrees. Our study covers both indoor and outdoor scenes captured with 200-degree fisheye cameras and analyzes how each method handles extreme distortion in real-world settings. We evaluate performance under varying fields of view (200, 160, and 120 degrees) to study the tradeoff between peripheral distortion and spatial coverage. Fisheye-GS benefits from field of view (FoV) reduction, particularly at 160 degrees, while 3DGUT remains stable across all settings and maintains high perceptual quality at the full 200-degree view. To address the limitations of SfM-based initialization, which often fails under strong distortion, we also propose a depth-based strategy using UniK3D predictions from only 2-3 fisheye images per scene. Although UniK3D is not trained on real fisheye data, it produces dense point clouds that enable reconstruction quality on par with SfM, even in difficult scenes with fog, glare, or sky. Our results highlight the practical viability of fisheye-based 3DGS methods for wide-angle 3D reconstruction from sparse and distortion-heavy image inputs.
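
The depth-based initialization can be sketched as unprojecting a dense depth map into a seed point cloud. The snippet below assumes an equidistant fisheye model (r = f·θ) and depth measured along each ray; UniK3D's own camera handling may differ.

```python
import numpy as np

def fisheye_depth_to_points(depth, f, cx, cy):
    """depth: (H, W) distances along each ray; f in pixels per radian."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    du, dv = u - cx, v - cy
    r = np.hypot(du, dv)
    theta = r / f                              # angle from the optical axis
    phi = np.arctan2(dv, du)                   # azimuth in the image plane
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)  # unit rays per pixel
    return (dirs * depth[..., None]).reshape(-1, 3)

# e.g., a flat 2 m depth map seeding Gaussians for a 64x64 fisheye crop
pts = fisheye_depth_to_points(np.full((64, 64), 2.0), f=40.0, cx=32, cy=32)
```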

[231] WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering

Yixin Zhu, Zuoliang Zhu, Miloš Hašan, Jian Yang, Jin Xie, Beibei Wang

Main category: cs.CV

TL;DR: WeatherDiffusion is a diffusion-based framework for forward and inverse rendering in autonomous driving scenes, addressing challenges of weather and illumination. It uses text-guided intrinsic maps and a novel attention mechanism for high-quality results.

DetailsMotivation: Complex weather and illumination in autonomous driving scenes challenge rendering tasks. Existing diffusion models lack control and robustness.

Method: Proposes WeatherDiffusion with text-guided intrinsic maps and Intrinsic map-aware attention (MAA) for inverse rendering. Introduces synthetic (WeatherSynthetic) and real-world (WeatherReal) datasets.

Result: Outperforms state-of-the-art methods on benchmarks and improves robustness in downstream tasks like object detection and segmentation.

Conclusion: WeatherDiffusion effectively handles diverse weather and lighting conditions, enhancing autonomous driving applications.

Abstract: Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (\ie WeatherSynthetic) and a real-world dataset (\ie WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.

[232] TADoc: Robust Time-Aware Document Image Dewarping

Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Yu Zhou

Main category: cs.CV

TL;DR: The paper introduces TADoc, a lightweight framework for document image dewarping, modeling it as a dynamic process and proposing a new metric (DLS) for evaluation.

DetailsMotivation: Existing methods struggle with complex document structures and high deformation in real-world scenarios, prompting a need for a progressive approach.

Method: The task is reformulated as a dynamic process with intermediate states, addressed by the TADoc framework.

Result: TADoc shows strong robustness and outperforms benchmarks across various document types and distortion levels.

Conclusion: The proposed dynamic modeling and DLS metric enhance dewarping effectiveness, validated by extensive experiments.

Abstract: Flattening curved, wrinkled, and rotated document images captured by portable photographing devices, termed document image dewarping, has become an increasingly important task with the rise of digital economy and online working. Although many methods have been proposed recently, they often struggle to achieve satisfactory results when confronted with intricate document structures and higher degrees of deformation in real-world scenarios. Our main insight is that, unlike other document restoration tasks (e.g., deblurring), dewarping in real physical scenes is a progressive motion rather than a one-step transformation. Based on this, we have undertaken two key initiatives. Firstly, we reformulate this task, modeling it for the first time as a dynamic process that encompasses a series of intermediate states. Secondly, we design a lightweight framework called TADoc (Time-Aware Document Dewarping Network) to address the geometric distortion of document images. In addition, due to the inadequacy of OCR metrics for document images containing sparse text, the comprehensiveness of evaluation is insufficient. To address this shortcoming, we propose a new metric – DLS (Document Layout Similarity) – to evaluate the effectiveness of document dewarping in downstream tasks. Extensive experiments and in-depth evaluations have been conducted and the results indicate that our model possesses strong robustness, achieving superiority on several benchmarks with different document types and degrees of distortion.

[233] OctreeNCA: Single-Pass 184 MP Segmentation on Consumer Hardware

Nick Lemke, John Kalkhof, Niklas Babendererde, Anirban Mukhopadhyay

Main category: cs.CV

TL;DR: OctreeNCA, a bio-inspired model, segments large medical inputs efficiently with minimal VRAM usage by leveraging an octree data structure and CUDA implementation.

DetailsMotivation: Medical segmentation tasks require handling large inputs (e.g., MRIs, pathology slices) with global context, but current models like UNets or Vision Transformers suffer from high VRAM consumption and poor scalability.

Method: Proposes OctreeNCA, extending Neural Cellular Automata (NCA) with an octree-based neighborhood for global knowledge traversal, and implements a CUDA-based inference function for efficiency.

Result: OctreeNCA reduces VRAM usage by 90% compared to UNets, enabling segmentation of 184 Megapixel images or 1-minute videos at once.

Conclusion: OctreeNCA offers a scalable, efficient solution for high-resolution medical segmentation tasks, outperforming traditional models in VRAM efficiency and speed.

Abstract: Medical applications demand segmentation of large inputs, like prostate MRIs, pathology slices, or videos of surgery. These inputs should ideally be inferred at once to provide the model with proper spatial or temporal context. When segmenting large inputs, the VRAM consumption of the GPU becomes the bottleneck. Architectures like UNets or Vision Transformers scale very poorly in VRAM consumption, resulting in patch- or frame-wise approaches that compromise global consistency and inference speed. The lightweight Neural Cellular Automaton (NCA) is a bio-inspired model that is by construction size-invariant. However, due to its local-only communication rules, it lacks global knowledge. We propose OctreeNCA by generalizing the neighborhood definition using an octree data structure. Our generalized neighborhood definition enables the efficient traversal of global knowledge. Since deep learning frameworks are mainly developed for large multi-layer networks, their implementation does not fully leverage the advantages of NCAs. We implement an NCA inference function in CUDA that further reduces VRAM demands and increases inference speed. Our OctreeNCA segments high-resolution images and videos quickly while occupying 90% less VRAM than a UNet during evaluation. This allows us to segment 184 Megapixel pathology slices or 1-minute surgical videos at once.
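
A toy 2D rendition of the octree-neighborhood idea: a residual NCA update that perceives local neighbors plus a pooled coarse-level "parent" cell broadcast back to full resolution, so information can traverse the hierarchy in few steps. Sizes, the update rule, and the use of average pooling are illustrative, not the paper's model or CUDA kernel.

```python
import torch
import torch.nn.functional as F

class OctreeNCASketch(torch.nn.Module):
    def __init__(self, ch=16, levels=3):
        super().__init__()
        self.levels = levels
        self.perceive = torch.nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        # the update mixes local perception with the coarse parent state
        self.update = torch.nn.Conv2d(ch * 2, ch, kernel_size=1)

    def forward(self, state, steps=8):
        for _ in range(steps):
            coarse = state
            for _ in range(self.levels - 1):            # climb the hierarchy
                coarse = F.avg_pool2d(coarse, 2)
            parent = F.interpolate(coarse, size=state.shape[-2:],
                                   mode="nearest")      # broadcast back down
            x = torch.cat([self.perceive(state), parent], dim=1)
            state = state + self.update(x)              # residual NCA step
        return state

out = OctreeNCASketch()(torch.randn(1, 16, 64, 64))
```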

[234] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision

Huihui Xu, Jin Ye, Hongqiu Wang, Changkai Ji, Jiashi Lin, Ming Hu, Ziyan Huang, Ying Chen, Chenglong Ma, Tianbin Li, Lihao Liu, Junjun He, Lei Zhu

Main category: cs.CV

TL;DR: The paper introduces S2-UniSeg, a scalable self-supervised universal segmentation model, using a novel pseudo-mask algorithm (UniAP) and a pretext task (QuerySD) for continuous pretraining, outperforming SOTA models.

DetailsMotivation: Address the inefficiency of multi-stage pretraining and sub-optimal solutions in self-supervised image segmentation by proposing a faster pseudo-mask generation method and continuous pretraining.

Method: Develops Fast Universal Agglomerative Pooling (UniAP) for rapid pseudo-mask generation and S2-UniSeg with a student-teacher framework and QuerySD pretext task for local-to-global correspondence learning.

Result: S2-UniSeg outperforms UnSAM with AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, RQ+8.0 on Cityscapes, and further improves with larger datasets.

Conclusion: S2-UniSeg offers a scalable, efficient solution for self-supervised segmentation, achieving superior performance and adaptability to larger datasets.

Abstract: Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between each training epoch. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn the local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio-mlhui/S2-UniSeg
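
One layer of the agglomerative-pooling idea can be sketched as grouping grid nodes whose 4-neighbor feature similarity clears a threshold, resolved in one shot via connected components; the cosine similarity and threshold below are illustrative choices, not UniAP's exact rule.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def agglomerate(feats, thresh=0.9):
    """feats: (H, W, D) unit-normalized features -> (H, W) group labels."""
    h, w, _ = feats.shape
    idx = np.arange(h * w).reshape(h, w)
    rows, cols = [], []
    # horizontal and vertical neighbor pairs, scored in parallel
    for sa, sb in [((slice(None), slice(0, -1)), (slice(None), slice(1, None))),
                   ((slice(0, -1), slice(None)), (slice(1, None), slice(None)))]:
        sim = (feats[sa] * feats[sb]).sum(-1)      # cosine sim of neighbors
        keep = sim > thresh
        rows.append(idx[sa][keep])
        cols.append(idx[sb][keep])
    r, c = np.concatenate(rows), np.concatenate(cols)
    adj = coo_matrix((np.ones_like(r), (r, c)), shape=(h * w, h * w))
    _, labels = connected_components(adj, directed=False)
    return labels.reshape(h, w)

f = np.random.randn(32, 32, 8)
f /= np.linalg.norm(f, axis=-1, keepdims=True)
masks = agglomerate(f)   # stack such layers for multi-granular pseudo-masks
```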

[235] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

Main category: cs.CV

TL;DR: HiMat is a lightweight, diffusion-based framework for generating 4K-resolution SVBRDFs, addressing consistency across maps without altering the DiT backbone, using a CrossStitch module.

DetailsMotivation: The need for detailed SVBRDFs in 3D content creation and the challenge of retargeting text-to-image models for multi-map generation while maintaining efficiency and consistency.

Method: Introduces HiMat, leveraging a diffusion-based framework with a CrossStitch module to capture inter-map dependencies without modifying the DiT backbone.

Result: HiMat successfully generates 4K SVBRDFs with structural coherence and high-frequency details, validated by text prompts and generalized to tasks like intrinsic decomposition.

Conclusion: HiMat provides an efficient solution for high-resolution SVBRDF generation, preserving prior model capabilities while ensuring map consistency.

Abstract: Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.
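
The zero-initialization trick described for CrossStitch can be sketched directly: a small convolution mixes features across the stacked SVBRDF maps but starts at zero, so the finetuned network initially reproduces the frozen DiT's behavior. Shapes and placement inside the network are assumptions.

```python
import torch
import torch.nn as nn

class CrossStitchSketch(nn.Module):
    def __init__(self, maps=4, ch=64, k=3):
        super().__init__()
        self.mix = nn.Conv2d(maps * ch, maps * ch, k, padding=k // 2)
        nn.init.zeros_(self.mix.weight)     # identity behavior at init
        nn.init.zeros_(self.mix.bias)

    def forward(self, x):                   # x: (B, maps*ch, H, W)
        return x + self.mix(x)              # residual inter-map coupling

x = torch.randn(1, 4 * 64, 32, 32)
assert torch.equal(CrossStitchSketch()(x), x)   # unchanged before finetuning
```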

[236] TerraMAE: Learning Spatial-Spectral Representations from Hyperspectral Earth Observation Data via Adaptive Masked Autoencoders

Tanjim Bin Faruk, Abdul Matin, Shrideep Pallickara, Sangmi Lee Pallickara

Main category: cs.CV

TL;DR: TerraMAE is a novel hyperspectral image encoding framework that improves spatial-spectral embedding learning for geospatial tasks.

DetailsMotivation: Existing self-supervised methods like Masked Autoencoders struggle with hyperspectral imagery due to its complexity.

Method: TerraMAE uses adaptive channel grouping and an enhanced reconstruction loss to capture spatial-spectral correlations.

Result: It achieves high-fidelity reconstruction and strong performance on crop identification, land cover classification, and soil texture prediction.

Conclusion: TerraMAE effectively addresses the challenges of hyperspectral imagery and enhances geospatial analysis.

Abstract: Hyperspectral satellite imagery offers sub-30 m views of Earth in hundreds of contiguous spectral bands, enabling fine-grained mapping of soils, crops, and land cover. While self-supervised Masked Autoencoders excel on RGB and low-band multispectral data, they struggle to exploit the intricate spatial-spectral correlations in 200+ band hyperspectral images. We introduce TerraMAE, a novel HSI encoding framework specifically designed to learn highly representative spatial-spectral embeddings for diverse geospatial analyses. TerraMAE features an adaptive channel grouping strategy, based on statistical reflectance properties to capture spectral similarities, and an enhanced reconstruction loss function that incorporates spatial and spectral quality metrics. We demonstrate TerraMAE’s effectiveness through superior spatial-spectral information preservation in high-fidelity image reconstruction. Furthermore, we validate its practical utility and the quality of its learned representations through strong performance on three key downstream geospatial tasks: crop identification, land cover classification, and soil texture prediction.
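
The adaptive channel grouping can be approximated by clustering per-band reflectance statistics so spectrally similar bands are masked and reconstructed together; k-means over band mean and standard deviation is an illustrative stand-in for the paper's statistical grouping rule.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_bands(cube, n_groups=8):
    """cube: (bands, H, W) reflectance -> list of band-index groups."""
    stats = np.stack([cube.mean(axis=(1, 2)), cube.std(axis=(1, 2))], axis=1)
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(stats)
    return [np.where(labels == g)[0] for g in range(n_groups)]

cube = np.random.rand(200, 32, 32).astype(np.float32)
groups = group_bands(cube)   # mask/reconstruct per group inside the MAE
```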

[237] DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents

Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo

Main category: cs.CV

TL;DR: DocRefine is a framework for refining and summarizing scientific PDFs using a multi-agent system, outperforming baselines in accuracy and fidelity.

DetailsMotivation: Addressing the limitations of traditional methods and LLMs/LVLMs in handling complex PDF layouts and multimodal content for scientific documents.

Method: Uses a multi-agent system with six specialized agents for layout analysis, content understanding, refinement, summarization, and verification.

Result: Achieves 86.7% SCS, 93.9% LFI, and 85.0% IAR on DocEditBench, outperforming state-of-the-art methods.

Conclusion: DocRefine advances automated scientific document processing by ensuring semantic and visual accuracy.

Abstract: The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. DocRefine leverages the power of advanced LVLMs (e.g., GPT-4o) by orchestrating a sophisticated multi-agent system comprising six specialized and collaborative agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization & Generation, and Fidelity & Consistency Verification. This closed-loop feedback architecture ensures high semantic accuracy and visual fidelity. Evaluated on the comprehensive DocEditBench dataset, DocRefine consistently outperforms state-of-the-art baselines across various tasks, achieving overall scores of 86.7% for Semantic Consistency Score (SCS), 93.9% for Layout Fidelity Index (LFI), and 85.0% for Instruction Adherence Rate (IAR). These results demonstrate DocRefine’s superior capability in handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency, marking a significant advancement in automated scientific document processing.

[238] MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering

Jingwei Peng, Jiehao Chen, Mateo Alejandro Rojas, Meilin Zhang

Main category: cs.CV

TL;DR: MV-CoRe enhances Complex VQA by fusing global and fine-grained visual-linguistic features, outperforming LVLMs with 77.5% accuracy on GQA.

DetailsMotivation: Existing LVLMs struggle with Complex VQA due to reliance on high-level global features, lacking deep multimodal reasoning.

Method: MV-CoRe integrates global embeddings from VLMs/LLMs with fine-grained features (object detection, scene graphs) using a Multimodal Fusion Transformer.

Result: Achieves 77.5% accuracy on GQA, outperforming baselines, with ablation studies confirming key feature contributions.

Conclusion: MV-CoRe excels in Complex VQA, validated by human evaluations, demonstrating robust multimodal reasoning and understanding.

Abstract: Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multi-modal reasoning and external knowledge integration, present significant challenges for existing large vision-language models (LVLMs) often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to enhance Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe meticulously integrates global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained semantic-aware visual features, including object detection characteristics and scene graph representations. An innovative Multimodal Fusion Transformer then processes and deeply integrates these diverse feature sets, enabling rich cross-modal attention and facilitating complex reasoning. We evaluate MV-CoRe on challenging Complex VQA benchmarks, including GQA, A-OKVQA, and OKVQA, after training on VQAv2. Our experimental results demonstrate that MV-CoRe consistently outperforms established LVLM baselines, achieving an overall accuracy of 77.5% on GQA. Ablation studies confirm the critical contribution of both object and scene graph features, and human evaluations further validate MV-CoRe’s superior factual correctness and reasoning depth, underscoring its robust capabilities for deep visual and conceptual understanding.
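
The fusion step can be sketched as concatenating global VLM/LLM embeddings with object-detection and scene-graph tokens and passing them through a transformer encoder; the dimensions, token counts, and pooled answer head below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    def __init__(self, d=256, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.fuser = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(d, 3000)   # answer vocabulary (placeholder size)

    def forward(self, global_tokens, object_tokens, graph_tokens):
        tokens = torch.cat([global_tokens, object_tokens, graph_tokens], dim=1)
        fused = self.fuser(tokens)               # cross-modal attention
        return self.head(fused.mean(dim=1))      # pooled answer logits

logits = FusionSketch()(torch.randn(2, 4, 256),   # global VLM/LLM embeddings
                        torch.randn(2, 10, 256),  # object-detection tokens
                        torch.randn(2, 6, 256))   # scene-graph tokens
```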

[239] Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation

Juntong Fan, Shuyi Fan, Debesh Jha, Changsheng Fang, Tieyong Zeng, Hengyong Yu, Dayang Wang

Main category: cs.CV

TL;DR: FOCUS-Med is a novel method for polyp segmentation in endoscopic images, combining graph-based and attention mechanisms to improve accuracy and boundary delineation, achieving state-of-the-art results.

DetailsMotivation: The challenges of low contrast, specular highlights, and indistinct boundaries in endoscopic images hinder accurate polyp segmentation, crucial for early colorectal cancer detection.

Method: FOCUS-Med integrates a Dual-GCN module for spatial and structural dependencies, location-fused self-attention for global context, and a weighted fusion strategy for multi-scale aggregation. It also uses an LLM for qualitative evaluation.

Result: FOCUS-Med achieves state-of-the-art performance on public benchmarks across five key metrics.

Conclusion: FOCUS-Med demonstrates strong clinical potential for AI-assisted colonoscopy by effectively addressing segmentation challenges.

Abstract: Accurate endoscopic image segmentation on the polyps is critical for early colorectal cancer detection. However, this task remains challenging due to low contrast with surrounding mucosa, specular highlights, and indistinct boundaries. To address these challenges, we propose FOCUS-Med, which stands for Fusion of spatial and structural graph with attentional context-aware polyp segmentation in endoscopic medical imaging. FOCUS-Med integrates a Dual Graph Convolutional Network (Dual-GCN) module to capture contextual spatial and topological structural dependencies. This graph-based representation enables the model to better distinguish polyps from background tissues by leveraging topological cues and spatial connectivity, which are often obscured in raw image intensities. It enhances the model’s ability to preserve boundaries and delineate complex shapes typical of polyps. In addition, a location-fused stand-alone self-attention is employed to strengthen global context integration. To bridge the semantic gap between encoder-decoder layers, we incorporate a trainable weighted fast normalized fusion strategy for efficient multi-scale aggregation. Notably, we are the first to introduce the use of a Large Language Model (LLM) to provide detailed qualitative evaluations of segmentation quality. Extensive experiments on public benchmarks demonstrate that FOCUS-Med achieves state-of-the-art performance across five key metrics, underscoring its effectiveness and clinical potential for AI-assisted colonoscopy.
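
The "trainable weighted fast normalized fusion" named above resembles EfficientDet-style fast normalized fusion, sketched below for aggregating same-shape multi-scale features; its exact placement inside FOCUS-Med is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):               # feats: list of same-shape tensors
        w = F.relu(self.w)                  # keep weights non-negative
        w = w / (w.sum() + self.eps)        # cheap normalization, no softmax
        return sum(wi * f for wi, f in zip(w, feats))

fused = FastNormalizedFusion(3)([torch.randn(1, 32, 16, 16) for _ in range(3)])
```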

[240] TeSO: Representing and Compressing 3D Point Cloud Scenes with Textured Surfel Octree

Yueyu Hu, Ran Gong, Tingyu Fan, Yao Wang

Main category: cs.CV

TL;DR: The paper introduces Textured Surfel Octree (TeSO), a novel 3D representation for high-quality rendering and efficient compression, outperforming existing methods like point clouds and 3D Gaussians.

DetailsMotivation: Existing 3D representations (point clouds, meshes, 3D Gaussians) have limitations in rendering quality, surface definition, and compressibility, necessitating a more versatile solution.

Method: TeSO organizes cube-bounded surfels on an octree, each with a texture patch, reducing primitives while retaining texture details. A compression scheme leverages the octree structure for efficient encoding.

Result: TeSO achieves higher rendering quality at lower bit-rates compared to point cloud and 3D Gaussian baselines.

Conclusion: TeSO is a promising 3D representation for streaming applications, balancing quality and compression efficiency.

Abstract: 3D visual content streaming is a key technology for emerging 3D telepresence and AR/VR applications. One fundamental element underlying the technology is a versatile 3D representation that is capable of producing high-quality renders and can be efficiently compressed at the same time. Existing 3D representations like point clouds, meshes and 3D Gaussians each have limitations in terms of rendering quality, surface definition, and compressibility. In this paper, we present the Textured Surfel Octree (TeSO), a novel 3D representation that is built from point clouds but addresses the aforementioned limitations. It represents a 3D scene as cube-bounded surfels organized on an octree, where each surfel is further associated with a texture patch. By approximating a smooth surface with a large surfel at a coarser level of the octree, it reduces the number of primitives required to represent the 3D scene, and yet retains the high-frequency texture details through the texture map attached to each surfel. We further propose a compression scheme to encode the geometry and texture efficiently, leveraging the octree structure. The proposed textured surfel octree combined with the compression scheme achieves higher rendering quality at lower bit-rates compared to multiple point cloud and 3D Gaussian-based baselines.
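
An illustrative data layout for a textured surfel octree node, as described: each cube-bounded cell either recurses into children or carries a surfel with an attached texture patch, so coarse leaves approximate smooth surfaces with one large textured surfel. All field names here are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class TeSONode:
    center: np.ndarray                     # cube center, shape (3,)
    size: float                            # cube edge length
    normal: Optional[np.ndarray] = None    # surfel orientation, shape (3,)
    texture: Optional[np.ndarray] = None   # (T, T, 3) patch for this surfel
    children: List["TeSONode"] = field(default_factory=list)

    def is_leaf_surfel(self) -> bool:
        # a coarse leaf stands in for a smooth surface region
        return self.normal is not None and not self.children

root = TeSONode(center=np.zeros(3), size=2.0,
                normal=np.array([0.0, 0.0, 1.0]),
                texture=np.zeros((16, 16, 3), dtype=np.uint8))
```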

[241] ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting

Sandro Papais, Letian Wang, Brian Cheong, Steven L. Waslander

Main category: cs.CV

TL;DR: ForeSight is a joint detection and forecasting framework for autonomous vehicles, combining tasks to improve performance.

DetailsMotivation: Traditional methods separate detection and forecasting, missing temporal cues. ForeSight integrates them for better accuracy.

Method: Uses multi-task streaming and bidirectional learning with query memory sharing. Features forecast-aware detection and streaming forecast transformers.

Result: Achieves state-of-the-art performance: 54.9% EPA, 9.3% better than prior methods, and top mAP and minADE scores.

Conclusion: ForeSight outperforms tracking-based methods, eliminating object association needs and scaling efficiently.

Abstract: We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, achieving an EPA of 54.9%, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.

[242] Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration

Yue Hu, Juntong Peng, Yunqiao Yang, Siheng Chen

Main category: cs.CV

TL;DR: HyComm is a communication-efficient LiDAR-based collaborative 3D detection system that adaptively integrates perceptual outputs and raw observations to optimize performance and bandwidth.

DetailsMotivation: To address the trade-off between detection performance and communication bandwidth in collaborative 3D detection.

Method: Proposes hybrid collaboration, integrating compact perceptual outputs and richer raw observations while prioritizing critical data.

Result: HyComm outperforms previous methods, achieving a 2,006× lower communication volume and better AP50 on DAIR-V2X.

Conclusion: HyComm offers adaptable compression and standardized formats, ensuring superior performance-bandwidth trade-offs across diverse scenarios.

Abstract: Collaborative 3D detection can substantially boost detection performance by allowing agents to exchange complementary information. It inherently results in a fundamental trade-off between detection performance and communication bandwidth. To tackle this bottleneck issue, we propose a novel hybrid collaboration that adaptively integrates two types of communication messages: perceptual outputs, which are compact, and raw observations, which offer richer information. This approach focuses on two key aspects: i) integrating complementary information from two message types and ii) prioritizing the most critical data within each type. By adaptively selecting the most critical set of messages, it ensures optimal perceptual information and adaptability, effectively meeting the demands of diverse communication scenarios. Building on this hybrid collaboration, we present HyComm, a communication-efficient LiDAR-based collaborative 3D detection system. HyComm boasts two main benefits: i) it facilitates adaptable compression rates for messages, addressing various communication requirements, and ii) it uses standardized data formats for messages. This ensures they are independent of specific detection models, fostering adaptability across different agent configurations. To evaluate HyComm, we conduct experiments on both real-world and simulation datasets: DAIR-V2X and OPV2V. HyComm consistently outperforms previous methods and achieves a superior performance-bandwidth trade-off regardless of whether agents use the same or varied detection models. It reduces communication volume by more than 2,006× while still outperforming Where2comm on DAIR-V2X in terms of AP50. The related code will be released.

[243] AugLift: Boosting Generalization in Lifting-based 3D Human Pose Estimation

Nikolai Warner, Wenjin Zhang, Irfan Essa, Apaar Sadhwani

Main category: cs.CV

TL;DR: AugLift improves 3D Human Pose Estimation by enriching 2D keypoints with confidence scores and depth estimates, enhancing generalization without extra data.

DetailsMotivation: Standard lifting-based methods for 3D HPE generalize poorly. AugLift addresses this by leveraging existing pre-trained models to enhance input signals.

Method: AugLift augments 2D keypoints with confidence scores and depth estimates from pre-trained models, integrating modularly into existing lifting pipelines.

Result: AugLift boosts cross-dataset performance by 10.1% and in-distribution performance by 4.0%, consistently across architectures.

Conclusion: AugLift offers a practical, modular solution to improve generalization in lifting-based 3D HPE, validated by extensive experiments.

Abstract: Lifting-based methods for 3D Human Pose Estimation (HPE), which predict 3D poses from detected 2D keypoints, often generalize poorly to new datasets and real-world settings. To address this, we propose AugLift, a simple yet effective reformulation of the standard lifting pipeline that significantly improves generalization performance without requiring additional data collection or sensors. AugLift sparsely enriches the standard input – the 2D keypoint coordinates (x, y) – by augmenting it with a keypoint detection confidence score c and a corresponding depth estimate d. These additional signals are computed from the image using off-the-shelf, pre-trained models (e.g., for monocular depth estimation), thereby inheriting their strong generalization capabilities. Importantly, AugLift serves as a modular add-on and can be readily integrated into existing lifting architectures. Our extensive experiments across four datasets demonstrate that AugLift boosts cross-dataset performance on unseen datasets by an average of 10.1%, while also improving in-distribution performance by 4.0%. These gains are consistent across various lifting architectures, highlighting the robustness of our method. Our analysis suggests that these sparse, keypoint-aligned cues provide robust frame-level context, offering a practical way to significantly improve the generalization of any lifting-based pose estimation model. Code will be made publicly available.
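
Because the enrichment is so simple, it can be sketched end to end: sample an off-the-shelf depth map at each detected keypoint and concatenate (x, y, c, d) per joint before lifting. The array layout is an assumption.

```python
import numpy as np

def auglift_inputs(keypoints_xyc, depth_map):
    """keypoints_xyc: (J, 3) of (x, y, confidence); depth_map: (H, W)."""
    xs = keypoints_xyc[:, 0].round().astype(int).clip(0, depth_map.shape[1] - 1)
    ys = keypoints_xyc[:, 1].round().astype(int).clip(0, depth_map.shape[0] - 1)
    d = depth_map[ys, xs]                                    # depth at each joint
    return np.concatenate([keypoints_xyc, d[:, None]], axis=1)  # (J, 4)

kps = np.array([[120.0, 80.0, 0.9], [64.0, 200.0, 0.4]])
feats = auglift_inputs(kps, depth_map=np.random.rand(256, 256))
# feed the (x, y, c, d) tokens to any existing lifting network
```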

[244] Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays

Gregory Schuit, Denis Parra, Cecilia Besa

Main category: cs.CV

TL;DR: The paper evaluates GANs and Diffusion Models for synthesizing chest X-rays with abnormalities, finding DMs more realistic but GANs better for specific conditions, highlighting perceptual gaps and the need for refinement.

DetailsMotivation: Address data scarcity in medical imaging, especially for rare anomalies, and assess the fidelity and clinical utility of synthetic images to improve AI diagnostic tools.

Method: Evaluated GANs and DMs using a benchmark of real (MIMIC-CXR) and synthetic images, conducting a reader study with radiologists to assess realism and abnormality consistency.

Result: DMs produce more realistic images overall, but GANs perform better for specific conditions like absence of ECS. Radiologists identified visual cues distinguishing synthetic images.

Conclusion: GANs and DMs have complementary strengths; further refinement is needed to ensure reliable augmentation of training datasets for AI diagnostics.

Abstract: Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity, especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. In this study, we evaluate the effectiveness of state-of-the-art generative models, Generative Adversarial Networks (GANs) and Diffusion Models (DMs), for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can achieve better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.

[245] CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance

Yingtie Lei, Fanghai Yi, Yihang Dong, Weihuang Liu, Xiaofeng Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen

Main category: cs.CV

TL;DR: CMAMRNet is a novel network for mural restoration, addressing challenges like inconsistent mask guidance and degradation patterns by using MAUDS and CFA components, outperforming existing methods.

DetailsMotivation: Murals deteriorate due to environmental and human factors, requiring digital restoration that preserves artistic authenticity, which current methods struggle with.

Method: CMAMRNet uses Mask-Aware Up/Down-Sampler (MAUDS) for consistent mask sensitivity and Co-Feature Aggregator (CFA) for multi-scale feature extraction.

Result: CMAMRNet outperforms state-of-the-art methods, preserving structural integrity and artistic details in murals.

Conclusion: The proposed framework effectively restores murals while maintaining authenticity, with code available for public use.

Abstract: Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at https://github.com/CXH-Research/CMAMRNet.
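
A minimal sketch of the mask-aware downsampling idea behind MAUDS: features are downsampled together with the damage mask so that every resolution keeps an explicit, aligned mask signal. The fusion rule and module structure here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAwareDown(nn.Module):
    """Downsample features while propagating the damage mask alongside."""

    def __init__(self, dim):
        super().__init__()
        # Fuse features and mask channel-wise while halving resolution.
        self.proj = nn.Conv2d(dim + 1, dim, 3, stride=2, padding=1)

    def forward(self, feat, mask):  # feat: (B, C, H, W), mask: (B, 1, H, W)
        out = self.proj(torch.cat([feat, mask], dim=1))
        mask_ds = F.max_pool2d(mask, 2)  # max-pool keeps damaged-region coverage
        return out, mask_ds

feat, mask = torch.randn(1, 32, 64, 64), torch.ones(1, 1, 64, 64)
out, mask_ds = MaskAwareDown(32)(feat, mask)  # (1, 32, 32, 32), (1, 1, 32, 32)
```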

[246] Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models

Xuanhan Wang, Huimin Deng, Ke Liu, Jun Wang, Lianli Gao, Jingkuan Song

Main category: cs.CV

TL;DR: DPAL is a distillation-based pretraining framework for lightweight human-centric vision models (HVMs) that achieves strong generalization by aligning visual patterns at multiple levels.

DetailsMotivation: To address the impracticality of large HVMs due to their dependence on massive pretraining data and large architectures, DPAL aims to distill knowledge into lightweight models efficiently.

Method: DPAL uses a dynamic pattern decoder (D-PaDe) with specialized experts for visual patterns and three alignment objectives (global, local, instance) to minimize generalization gaps.

Result: DPAL-trained lightweight models (e.g., DPAL-ViT/Ti) achieve generalization comparable to large HVMs and outperform other distillation methods on 15 datasets.

Conclusion: DPAL enables lightweight HVMs to generalize effectively across tasks, offering a practical alternative to large models.

Abstract: Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception is highly dependent on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To achieve generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe), acting as a dynamic Mixture of Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting these typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight HVMs and large HVMs at the global image level, local pixel level, and instance relation level. With these two deliberate designs, DPAL effectively guides the lightweight model to learn all typical human visual patterns from large HVMs, enabling generalization to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves generalizability similar to existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M) by a large margin.
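
A minimal sketch of the three-level alignment objective, assuming teacher and student expose global (image-level), local (pixel/patch-level), and instance-level features. The distance functions, feature names, and weights are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def dpal_alignment_loss(student, teacher, w=(1.0, 1.0, 1.0)):
    """student/teacher: dicts with 'global' (B, D), 'local' (B, N, D),
    and 'inst' (B, K, D) features; teacher features are detached."""
    # Global image-level alignment (cosine distance, an assumed choice).
    l_global = 1 - F.cosine_similarity(student["global"],
                                       teacher["global"].detach()).mean()
    # Local pixel/patch-level alignment.
    l_local = F.mse_loss(student["local"], teacher["local"].detach())

    # Instance-relation alignment: match pairwise relation matrices.
    def rel(x):  # (B, K, D) -> (B, K, K) normalized relation matrix
        x = F.normalize(x, dim=-1)
        return x @ x.transpose(1, 2)

    l_inst = F.mse_loss(rel(student["inst"]), rel(teacher["inst"]).detach())
    return w[0] * l_global + w[1] * l_local + w[2] * l_inst
```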

[247] Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction

Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong

Main category: cs.CV

TL;DR: A diffusion-based framework improves pedestrian trajectory prediction by modeling short-term and long-term intentions, enhancing accuracy and context-awareness.

DetailsMotivation: Addressing the lack of explicit semantic modeling of pedestrian intent in diffusion-based methods, which can lead to misinterpreted behaviors and reduced prediction accuracy.

Method: Combines short-term intent (residual polar representation) and long-term intent (token-based endpoint predictor) with adaptive guidance and residual noise predictor in the diffusion process.

Result: Achieves competitive performance on ETH, UCY, and SDD benchmarks.

Conclusion: The framework effectively integrates intent modeling into diffusion-based prediction, improving accuracy and multimodal behavior capture.

Abstract: Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods.
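
A minimal sketch of a residual polar representation for short-term intent: each step displacement is decoupled into a magnitude and a heading change relative to the previous step. This is an illustrative reading of the abstract, not the authors' implementation.

```python
import numpy as np

def to_residual_polar(traj):
    """traj: (T, 2) positions -> (T-1, 2) array of (d_angle, magnitude)."""
    deltas = np.diff(traj, axis=0)                 # per-step displacement
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])
    mags = np.linalg.norm(deltas, axis=1)
    # Residual angle: change in heading relative to the previous step
    # (first entry is zero by construction).
    d_angles = np.diff(angles, prepend=angles[:1])
    return np.stack([d_angles, mags], axis=1)

traj = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
print(to_residual_polar(traj))  # direction decoupled from magnitude
```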

[248] SketchAnimator: Animate Sketch via Motion Customization of Text-to-Video Diffusion Models

Ruolin Yang, Da Li, Honggang Zhang, Yi-Zhe Song

Main category: cs.CV

TL;DR: SketchAnimator is a novel model for animating sketches by integrating appearance and motion from a reference video, making the process accessible to amateurs.

DetailsMotivation: Animating sketches is time-consuming and requires professional skills, limiting accessibility for amateurs.

Method: Divides sketch animation into three stages: Appearance Learning, Motion Learning, and Video Prior Distillation, using LoRA and SDS techniques.

Result: Produces sketch videos retaining original appearance while mirroring reference video dynamics, outperforming alternatives in one-shot motion customization.

Conclusion: SketchAnimator effectively bridges the gap between professional and amateur sketch animation, offering a creative and accessible solution.

Abstract: Sketching is a uniquely human tool for expressing ideas and creativity. The animation of sketches infuses life into these static drawings, opening a new dimension for designers. Animating sketches is a time-consuming process that demands professional skills and extensive experience, often proving daunting for amateurs. In this paper, we propose a novel sketch animation model, SketchAnimator, which enables adding creative motion to a given sketch, like “a jumping car”. Namely, given an input sketch and a reference video, we divide the sketch animation into three stages: Appearance Learning, Motion Learning and Video Prior Distillation. In stages 1 and 2, we utilize LoRA to integrate sketch appearance information and motion dynamics from the reference video into a pre-trained text-to-video (T2V) model. In the third stage, we utilize Score Distillation Sampling (SDS) to update the parameters of the Bezier curves in each sketch frame according to the acquired motion information. Consequently, our model produces a sketch video that not only retains the original appearance of the sketch but also mirrors the dynamic movements of the reference video. We compare our method with alternative approaches and demonstrate that it generates the desired sketch video under the challenge of one-shot motion customization.

[249] CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion

Xiaotong Lin, Tianming Liang, Jian-Fang Hu, Kun-Yu Lin, Yulei Kang, Chunwei Tian, Jianhuang Lai, Wei-Shi Zheng

Main category: cs.CV

TL;DR: CoopDiff decouples human and object motion modeling using contact points for coherent 3D HOI anticipation, outperforming SOTA methods.

DetailsMotivation: Existing works ignore distinct motion patterns of humans and objects, treating them uniformly. CoopDiff addresses this by decoupling their dynamics.

Method: Uses two branches (human and object motion) linked by shared contact points, with a human-driven interaction module for consistency.

Result: Outperforms state-of-the-art methods on BEHAVE and Human-object Interaction datasets.

Conclusion: Decoupling motion modeling with contact consistency improves 3D HOI anticipation accuracy.

Abstract: 3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, the articulated humans and rigid objects exhibit different motion patterns, due to their distinct intrinsic physical properties. However, this distinction is ignored by most of the existing works, which intend to capture the dynamics of both humans and objects within a single prediction model. In this work, we propose a novel contact-consistent decoupled diffusion framework CoopDiff, which employs two distinct branches to decouple human and object motion modeling, with the human-object contact points as shared anchors to bridge the motion generation across branches. The human dynamics branch is aimed to predict highly structured human motion, while the object dynamics branch focuses on the object motion with rigid translations and rotations. These two branches are bridged by a series of shared contact points with consistency constraint for coherent human-object motion prediction. To further enhance human-object consistency and prediction reliability, we propose a human-driven interaction module to guide object motion modeling. Extensive experiments on the BEHAVE and Human-object Interaction datasets demonstrate that our CoopDiff outperforms state-of-the-art methods.
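
A minimal sketch of the shared-anchor idea: both branches predict the future positions of a common set of contact points, and a consistency loss ties the two predictions together. Tensor names and the choice of MSE are assumptions.

```python
import torch
import torch.nn.functional as F

def contact_consistency_loss(human_pred, object_pred,
                             contact_idx_h, contact_idx_o):
    """human_pred:  (T, Jh, 3) predicted human keypoints;
    object_pred:    (T, Jo, 3) predicted object points;
    contact_idx_*:  indices of the shared contact anchors in each branch."""
    contacts_from_human = human_pred[:, contact_idx_h]    # (T, C, 3)
    contacts_from_object = object_pred[:, contact_idx_o]  # (T, C, 3)
    # Both branches must agree on where the contact anchors move.
    return F.mse_loss(contacts_from_human, contacts_from_object)

h = torch.randn(10, 22, 3)   # 10 future frames, 22 joints
o = torch.randn(10, 8, 3)    # 10 future frames, 8 object points
loss = contact_consistency_loss(h, o, [7, 11], [0, 3])
```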

[250] Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection

Yunpeng Shi, Lei Chen, Xiaolu Shen, Yanju Guo

Main category: cs.CV

TL;DR: A lightweight multi-scale feature extraction layer (LMF layer) is proposed for salient object detection, achieving state-of-the-art performance with minimal parameters.

DetailsMotivation: Addressing the trade-off between efficiency and performance in lightweight networks for multi-scale feature extraction in computer vision tasks.

Method: Proposes the LMF layer using depthwise separable dilated convolutions in a fully connected structure, integrated into LMFNet.

Result: LMFNet achieves competitive results on five benchmarks with only 0.81M parameters, outperforming traditional and lightweight models.

Conclusion: The LMFNet effectively balances efficiency and performance, showcasing potential for broader image processing applications.

Abstract: In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. The related code files are available at https://github.com/Shi-Yun-peng/LMFNet.
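
A minimal sketch of an LMF-style layer: several depthwise separable dilated convolutions at different rates, fused so that every branch output contributes to the result. The dilation rates, fusion rule, and residual connection are assumptions based on the abstract, not the released code.

```python
import torch
import torch.nn as nn

class LMFLayer(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                # Depthwise dilated 3x3 conv (padding=d keeps spatial size)...
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),
                # ...followed by a pointwise conv (the separable design).
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        # "Fully connected" fusion: all branch outputs feed one projection.
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x  # residual connection

x = torch.randn(1, 32, 64, 64)
print(LMFLayer(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```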

[251] EventRR: Event Referential Reasoning for Referring Video Object Segmentation

Huihui Xu, Jiashi Lin, Haoyu Chen, Junjun He, Lei Zhu

Main category: cs.CV

TL;DR: EventRR framework improves RVOS by decoupling it into object summarization and referent reasoning, leveraging semantic event structures for better performance.

DetailsMotivation: Current RVOS methods overlook the semantic structure of referring expressions, especially for videos, which include event attributes and temporal relations.

Method: EventRR uses bottleneck tokens for object summarization and constructs a Referential Event Graph (REG) for reasoning, guided by TCRR.

Result: EventRR outperforms state-of-the-art RVOS methods on four benchmark datasets.

Conclusion: EventRR effectively addresses the complexity of video-referring expressions, enhancing RVOS performance through structured reasoning.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment out the object in a video referred by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting their crucial semantic structure essential for referent reasoning. Besides, in contrast to image-referring expressions whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured reasoning image approaches. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into object summarization part and referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For reasoning part, EventRR extracts semantic eventful structure of a video-referring expression into highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from REG leaf nodes to root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in REG. Extensive experiments across four widely recognized benchmark datasets, show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at https://github.com/bio-mlhui/EventRR

[252] Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset

Junyi He, Liuling Chen, Hongyang Zhou, Zhang xiaoxing, Xiaobin Zhu, Shengxiang Yu, Jingyan Qin, Xu-Cheng Yin

Main category: cs.CV

TL;DR: A Depth-Guided Network (DGN) for image restoration leverages depth information to improve quality, using a dual-branch approach and a new high-resolution dataset.

DetailsMotivation: Existing methods overlook depth, causing issues in similarity matching and attention. Depth guidance can enhance restoration.

Method: DGN combines a depth estimation branch and an image restoration branch, using progressive window-based self-attention and sparse non-local attention.

Result: State-of-the-art performance on benchmarks and strong generalization to unseen plant images.

Conclusion: DGN effectively integrates depth for superior restoration, validated by a novel dataset.

Abstract: Image restoration has seen substantial progress in recent years. However, existing methods often neglect depth information, which hurts similarity matching, results in attention distractions in shallow depth-of-field (DoF) scenarios, and excessive enhancement of background content in deep DoF settings. To overcome these limitations, we propose a novel Depth-Guided Network (DGN) for image restoration, together with a novel large-scale high-resolution dataset. Specifically, the network consists of two interactive branches: a depth estimation branch that provides structural guidance, and an image restoration branch that performs the core restoration task. In addition, the image restoration branch exploits intra-object similarity through progressive window-based self-attention and captures inter-object similarity via sparse non-local attention. Through joint training, depth features contribute to improved restoration quality, while the enhanced visual features from the restoration branch in turn help refine depth estimation. Notably, we also introduce a new dataset for training and evaluation, consisting of 9,205 high-resolution images from 403 plant species, with diverse depth and texture variations. Extensive experiments show that our method achieves state-of-the-art performance on several standard benchmarks and generalizes well to unseen plant images, demonstrating its effectiveness and robustness.

[253] Unsupervised Real-World Super-Resolution via Rectified Flow Degradation Modelling

Hongyang Zhou, Xiaobin Zhu, Liuling Chen, Junyi He, Jingyan Qin, Xu-Cheng Yin, Zhang xiaoxing

Main category: cs.CV

TL;DR: Proposes unsupervised real-world super-resolution (SR) using rectified flow and Fourier prior to model degradation, improving SR performance on real-world data.

DetailsMotivation: Addresses the challenge of unknown and complex degradation distributions in real-world SR, where synthetic LR-HR pairs fail to generalize.

Method: Introduces Rectified Flow Degradation Module (RFDM) and Fourier Prior Guided Degradation Module (FGDM) to model degradation and generate realistic LR-HR pairs.

Result: Significantly enhances SR performance on real-world datasets.

Conclusion: The method effectively bridges the domain gap in real-world SR by modeling degradation more accurately.

Abstract: Unsupervised real-world super-resolution (SR) faces critical challenges due to the complex, unknown degradation distributions in practical scenarios. Existing methods struggle to generalize from synthetic low-resolution (LR) and high-resolution (HR) image pairs to real-world data due to a significant domain gap. In this paper, we propose an unsupervised real-world SR method based on rectified flow to effectively capture and model real-world degradation, synthesizing LR-HR training pairs with realistic degradation. Specifically, given unpaired LR and HR images, we propose a novel Rectified Flow Degradation Module (RFDM) that introduces degradation-transformed LR (DT-LR) images as intermediaries. By modeling the degradation trajectory in a continuous and invertible manner, RFDM better captures real-world degradation and enhances the realism of generated LR images. Additionally, we propose a Fourier Prior Guided Degradation Module (FGDM) that leverages structural information embedded in Fourier phase components to ensure more precise modeling of real-world degradation. Finally, the LR images are processed by both FGDM and RFDM, producing final synthetic LR images with real-world degradation. The synthetic LR images are paired with the given HR images to train the off-the-shelf SR networks. Extensive experiments on real-world datasets demonstrate that our method significantly enhances the performance of existing SR approaches in real-world scenarios.
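
A minimal sketch of the rectified-flow view of degradation: clean and degraded images are treated as endpoints of a straight-line trajectory, and a network learns the constant velocity between them. This is the generic rectified-flow training objective, shown here to illustrate how a degradation trajectory can be modelled continuously and invertibly; `velocity_net` is a placeholder, and this is not the authors' code.

```python
import torch

def rectified_flow_step(velocity_net, x_clean, x_degraded):
    """One training step of a velocity field on the clean->degraded path.

    velocity_net: any network taking (x_t, t); an assumed interface.
    """
    t = torch.rand(x_clean.size(0), 1, 1, 1, device=x_clean.device)
    x_t = (1 - t) * x_clean + t * x_degraded  # point on the straight path
    target_v = x_degraded - x_clean           # constant velocity target
    pred_v = velocity_net(x_t, t)
    return ((pred_v - target_v) ** 2).mean()
```

Once trained, integrating the learned velocity from a clean image yields a degradation-transformed image, and integrating in reverse inverts the degradation, which is what makes the trajectory modelling invertible.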

[254] Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization

Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang

Main category: cs.CV

TL;DR: A new model, CMB-Net, improves image manipulation localization by integrating semantic analysis from LLMs and addressing hallucination errors with ITCAM, ITIM, and RED modules.

DetailsMotivation: Existing IML models lack semantic analysis, which is crucial for detecting manipulated regions due to disrupted content relationships.

Method: CMB-Net uses LLMs for semantic analysis, ITCAM for weighting text features, ITIM for aligning visual-text features, and RED for boundary preservation.

Result: CMB-Net outperforms existing IML models in experiments.

Conclusion: Integrating semantic analysis and addressing LLM hallucinations enhances IML accuracy.

Abstract: Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationships between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that erroneous text induced by LLM hallucination would damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.

[255] Generic Calibration: Pose Ambiguity/Linear Solution and Parametric-hybrid Pipeline

Yuqi Han, Qi Cai, Yuanxin Wu

Main category: cs.CV

TL;DR: A hybrid camera calibration method combining generic and parametric models is proposed to address pose ambiguity and improve accuracy in complex scenarios.

DetailsMotivation: Existing offline camera calibration methods face issues like reliance on user experience for parametric models and complexity or lack of intrinsic parameters in generic methods.

Method: A linear solver and nonlinear optimization address pose ambiguity, followed by a global optimization hybrid method integrating generic and parametric models.

Result: The hybrid method improves extrinsic parameter accuracy, mitigates overfitting, and performs well across lens types and noise.

Conclusion: The hybrid calibration method is reliable and accurate for complex scenarios.

Abstract: Offline camera calibration techniques typically employ parametric or generic camera models. Selecting parametric models relies heavily on user experience, and an inappropriate camera model can significantly affect calibration accuracy. Meanwhile, generic calibration methods involve complex procedures and cannot provide traditional intrinsic parameters. This paper reveals a pose ambiguity in the pose solutions of generic calibration methods that irreversibly impacts subsequent pose estimation. A linear solver and a nonlinear optimization are proposed to address this ambiguity. A global-optimization hybrid calibration method is then introduced to integrate generic and parametric models, improving the extrinsic parameter accuracy of generic calibration and mitigating overfitting and numerical instability in parametric calibration. Simulation and real-world experimental results demonstrate that the generic-parametric hybrid calibration method consistently excels across various lens types and noise contamination, hopefully serving as a reliable and accurate solution for camera calibration in complex scenarios.

[256] Landmark Guided Visual Feature Extractor for Visual Speech Recognition with Limited Resource

Lei Yang, Junshan Jin, Mingyuan Zhang, Yi He, Bofan Chen, Shilin Wang

Main category: cs.CV

TL;DR: The paper proposes a landmark-guided visual feature extractor for visual speech recognition, improving accuracy with limited data and reducing user-specific feature influence.

DetailsMotivation: Deep learning methods for visual speech recognition are sensitive to visual disturbances and require large datasets. The paper aims to enhance performance with limited data and reduce reliance on user-specific features.

Method: A landmark-guided visual feature extractor is introduced, using facial landmarks as auxiliary information. A spatio-temporal multi-graph convolutional network and multi-level lip dynamic fusion framework are designed to combine landmark and visual features.

Result: The approach performs well with limited data and improves accuracy on unseen speakers.

Conclusion: The proposed method effectively addresses challenges in visual speech recognition by leveraging facial landmarks and spatio-temporal features, offering a cost-effective solution with improved performance.

Abstract: Visual speech recognition is a technique to identify spoken content in silent speech videos, which has attracted significant attention in recent years. Advancements in data-driven deep learning methods have significantly improved both the speed and accuracy of recognition. However, these deep learning methods can be affected by visual disturbances, such as lighting conditions, skin texture, and other user-specific features. Data-driven approaches could reduce the performance degradation caused by these visual disturbances using models pretrained on large-scale datasets, but such methods often require large amounts of training data and computational resources, making them costly. To reduce the influence of user-specific features and enhance performance with limited data, this paper proposes a landmark-guided visual feature extractor. Facial landmarks are used as auxiliary information to aid in training the visual feature extractor. A spatio-temporal multi-graph convolutional network is designed to fully exploit the spatial locations and spatio-temporal features of facial landmarks. Additionally, a multi-level lip dynamic fusion framework is introduced to combine the spatio-temporal features of the landmarks with the visual features extracted from the raw video frames. Experimental results show that this approach performs well with limited data and also improves the model’s accuracy on unseen speakers.

[257] ASM-UNet: Adaptive Scan Mamba Integrating Group Commonalities and Individual Variations for Fine-Grained Segmentation

Bo Wang, Mengyuan Xu, Yue Yan, Yuqun Yang, Kechen Shu, Wei Ping, Xu Tang, Wei Jiang, Zheng You

Main category: cs.CV

TL;DR: ASM-UNet, a Mamba-based model, improves fine-grained segmentation (FGS) by dynamically adapting scanning orders using adaptive scan scores, outperforming existing methods on public and new datasets.

DetailsMotivation: Existing coarse-grained segmentation (CGS) methods struggle with FGS due to individual variations in small-scale anatomical structures, and fixed scanning orders in Mamba-based models limit adaptability.

Method: Proposes ASM-UNet, which uses adaptive scan scores to dynamically guide scanning orders, combining group-level commonalities and individual-level variations.

Result: ASM-UNet achieves superior performance on public datasets (ACDC, Synapse) and a new biliary tract FGS dataset (BTMS) for both CGS and FGS tasks.

Conclusion: ASM-UNet addresses FGS challenges by dynamically adapting to individual variations, offering improved performance and adaptability in clinical scenarios.

Abstract: Precise lesion resection depends on accurately identifying fine-grained anatomical structures. While many coarse-grained segmentation (CGS) methods have been successful in large-scale segmentation (e.g., organs), they fall short in clinical scenarios requiring fine-grained segmentation (FGS), which remains challenging due to frequent individual variations in small-scale anatomical structures. Although recent Mamba-based models have advanced medical image segmentation, they often rely on fixed manually-defined scanning orders, which limit their adaptability to individual variations in FGS. To address this, we propose ASM-UNet, a novel Mamba-based architecture for FGS. It introduces adaptive scan scores to dynamically guide the scanning order, generated by combining group-level commonalities and individual-level variations. Experiments on two public datasets (ACDC and Synapse) and a newly proposed challenging biliary tract FGS dataset, namely BTMS, demonstrate that ASM-UNet achieves superior performance in both CGS and FGS tasks. Our code and dataset are available at https://github.com/YqunYang/ASM-UNet.
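
A minimal sketch of adaptive scanning: tokens are reordered by a learned per-token scan score before being fed to a sequence model (e.g., a Mamba block), then scattered back to their original positions. The scoring head and the composition of group-level and individual-level scores are simplified assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveScan(nn.Module):
    def __init__(self, dim, seq_model):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token adaptive scan score
        self.seq_model = seq_model      # any sequence model, e.g., a Mamba block

    def forward(self, tokens):          # tokens: (B, N, D)
        s = self.score(tokens).squeeze(-1)  # (B, N) scan scores
        order = s.argsort(dim=1)            # dynamic scan order per sample
        scanned = torch.gather(tokens, 1, order.unsqueeze(-1).expand_as(tokens))
        out = self.seq_model(scanned)
        # Scatter processed tokens back to their original layout.
        inv = order.argsort(dim=1)
        return torch.gather(out, 1, inv.unsqueeze(-1).expand_as(out))

block = AdaptiveScan(64, nn.Identity())  # identity stands in for a Mamba block
y = block(torch.randn(2, 16, 64))        # (2, 16, 64)
```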

[258] Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers

Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen

Main category: cs.CV

TL;DR: MiraMo is a framework enhancing image animation with efficient linear attention, motion residual learning, and DCT-based noise refinement for better consistency and smoothness.

DetailsMotivation: Addressing challenges in appearance consistency, motion smoothness, and computational efficiency in image animation.

Method: Uses a text-to-video architecture with linear attention, motion residual learning, and DCT-based noise refinement.

Result: Outperforms state-of-the-art methods in generating consistent, smooth animations with faster inference.

Conclusion: MiraMo offers superior performance and versatility in image animation and related tasks.

Abstract: Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
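
A minimal sketch of DCT-based noise refinement: high-frequency DCT coefficients of the per-frame noise are attenuated to suppress sudden motion artifacts. The hard cutoff and the third-party torch-dct package are illustrative assumptions; the paper's refinement strategy may differ.

```python
import torch
import torch_dct as dct  # pip install torch-dct (assumed dependency)

def refine_noise(noise, keep_ratio=0.5):
    """noise: (T, C, H, W) latent noise; damp high-frequency DCT bands."""
    coeffs = dct.dct_2d(noise)  # 2D DCT per frame/channel
    _, _, h, w = coeffs.shape
    mask = torch.zeros_like(coeffs)
    # Keep only the low-frequency corner of the DCT spectrum.
    mask[..., : int(h * keep_ratio), : int(w * keep_ratio)] = 1.0
    return dct.idct_2d(coeffs * mask)  # low-pass refined noise

refined = refine_noise(torch.randn(16, 4, 32, 32))  # smoother noise per frame
```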

[259] SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking

Fengchao Xiong, Zhenxing Wu, Sen Jia, Yuntao Qian

Main category: cs.CV

TL;DR: The paper proposes a method to improve hyperspectral video tracking by focusing on spectral interactions, using Transformers and a spectral loss for better performance.

DetailsMotivation: Existing tracking methods neglect spectral interactions, leading to suboptimal performance in cluttered backgrounds and small object scenarios.

Method: The approach uses Transformers for band-wise spatial relationships and models spectral interactions via the inclusion-exclusion principle. A spectral loss enforces material distribution alignment.

Result: The tracker achieves state-of-the-art performance, validated through extensive experiments.

Conclusion: The method effectively integrates spectral and spatial cues, enhancing tracking robustness, with code and models made available for reproducibility.

Abstract: Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.
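
A minimal sketch of a spectral loss enforcing material-distribution alignment: mean spectral signatures over the template and predicted regions are normalized into band distributions and compared with KL divergence. The distribution construction and distance choice are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def spectral_loss(template_pixels, predicted_pixels):
    """Each input: (N, B) hyperspectral pixels with B bands."""
    p = F.softmax(template_pixels.mean(0), dim=0)       # template band distribution
    q = F.log_softmax(predicted_pixels.mean(0), dim=0)  # predicted (log) distribution
    return F.kl_div(q, p, reduction="sum")              # KL(p || q)

tmpl = torch.randn(120, 16)   # 120 pixels, 16 spectral bands
pred = torch.randn(95, 16)
print(spectral_loss(tmpl, pred))
```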

[260] Understanding Dynamic Scenes in Ego Centric 4D Point Clouds

Junsheng Huang, Shengyu Hao, Bocheng Hu, Gaoang Wang

Main category: cs.CV

TL;DR: EgoDynamic4D introduces a QA benchmark for dynamic 4D scenes with unified annotations and 12 tasks, proposing a spatio-temporal reasoning framework that outperforms baselines.

DetailsMotivation: Existing datasets lack unified 4D annotations and task-driven evaluation for fine-grained spatio-temporal reasoning in dynamic scenes.

Method: Proposes an end-to-end framework with instance-aware feature encoding, time/camera encoding, and adaptive down-sampling for 4D scene compression.

Result: The method outperforms baselines on EgoDynamic4D, validating multimodal temporal modeling.

Conclusion: EgoDynamic4D and the proposed framework advance egocentric dynamic scene understanding.

Abstract: Understanding dynamic 4D scenes from an egocentric perspective, i.e., modeling changes in 3D spatial structure over time, is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially on the motion of objects and humans, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.

[261] Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM

Sihan Yang, Huitong Ji, Shaolin Lu, Jiayi Chen, Binxiao Xu, Ming Lu, Yuanxing Zhang, Wenhui Dong, Wentao Zhang

Main category: cs.CV

TL;DR: A collaborative framework (SLC) combines small and large VLMs for efficient personalization, leveraging small VLMs for personalized info and large VLMs for accurate responses, with a reflection strategy to mitigate hallucinations.

DetailsMotivation: Large VLMs are costly and restricted, while small VLMs lack reasoning. SLC bridges this gap for efficient personalization.

Method: SLC uses small VLMs for personalized info and large VLMs for integration, with a test-time reflection strategy to ensure accuracy.

Result: SLC is training-efficient, works with open/closed-source VLMs, and shows effectiveness in experiments.

Conclusion: SLC enables broader real-world personalized applications efficiently.

Abstract: Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as the chain of thought (CoT). While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Inspired by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, where the small VLM is responsible for generating personalized information, while the large model integrates this personalized information to deliver accurate responses. To effectively incorporate personalized information, we develop a test-time reflection strategy, preventing the potential hallucination of the small VLM. Since SLC only needs to train a meta personalized small VLM for the large VLMs, the overall process is training-efficient. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs, enabling broader real-world personalized applications. We conduct thorough experiments across various benchmarks and large VLMs to demonstrate the effectiveness of the proposed SLC framework. The code will be released at https://github.com/Hhankyangg/SLC.

[262] Representation Understanding via Activation Maximization

Hongbo Zhu, Angelo Cangelosi

Main category: cs.CV

TL;DR: A unified feature visualization framework for CNNs and ViTs, extending to intermediate layers and exploring adversarial example generation.

DetailsMotivation: To improve interpretability of DNNs by understanding internal feature representations and revealing vulnerabilities.

Method: Uses Activation Maximization (AM) to synthesize inputs for strong neuron responses, applied to intermediate and output layers.

Result: Effective visualization for CNNs and ViTs, revealing hierarchical features and adversarial vulnerabilities.

Conclusion: The framework generalizes well, offering deeper insights and interpretability for DNNs.

Abstract: Understanding internal feature representations of deep neural networks (DNNs) is a fundamental step toward model interpretability. Inspired by neuroscience methods that probe biological neurons using visual stimuli, recent deep learning studies have employed Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons. In this work, we propose a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Unlike prior efforts that predominantly focus on the last output-layer neurons in CNNs, we extend feature visualization to intermediate layers as well, offering deeper insights into the hierarchical structure of learned feature representations. Furthermore, we investigate how activation maximization can be leveraged to generate adversarial examples, revealing potential vulnerabilities and decision boundaries of DNNs. Our experiments demonstrate the effectiveness of our approach in both traditional CNNs and modern ViTs, highlighting its generalizability and interpretive value.
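
A minimal sketch of the general Activation Maximization recipe: gradient ascent on an input image to maximize a chosen unit's activation, probed at an intermediate layer via a forward hook. The model, layer, unit index, learning rate, and L2 regularizer are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
activation = {}

def hook(_module, _inputs, output):
    activation["feat"] = output

model.layer3.register_forward_hook(hook)  # probe an intermediate layer

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
unit = 7  # channel whose activation we maximize

for _ in range(200):
    opt.zero_grad()
    model(img)
    # Maximize the mean activation of one channel; lightly penalize
    # extreme pixel values to keep the synthesized input plausible.
    loss = -activation["feat"][0, unit].mean() + 1e-4 * img.norm()
    loss.backward()
    opt.step()
```

Flipping the sign of the activation term (or maximizing a wrong class logit under a small perturbation budget) turns the same loop into the adversarial-example generation the abstract mentions.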

[263] SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations

Zhiqiang Shen, Peng Cao, Xiaoli Liu, Jinzhu Yang, Osmar R. Zaiane

Main category: cs.CV

TL;DR: SynMatch addresses label scarcity in medical image segmentation by synthesizing images to match pseudo labels, improving consistency without additional training parameters.

DetailsMotivation: Label scarcity in medical image segmentation limits deep learning performance, and existing pseudo-label methods suffer from inconsistencies.

Method: SynMatch synthesizes images using texture and shape features from the same model generating pseudo labels, ensuring consistency.

Result: SynMatch outperforms strong-weak pseudo supervision by 29.71% and 10.05% in polyp segmentation with 5% and 10% scribble annotations.

Conclusion: SynMatch is effective for semi-, weakly-, and barely-supervised learning, especially in challenging settings with limited annotations.

Abstract: Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose \textbf{SynMatch}, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71% and 10.05% on the polyp segmentation task with 5% and 10% scribble annotations, respectively. The code will be released at https://github.com/Senyh/SynMatch.

[264] BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation

Ping-Mao Huang, I-Tien Chao, Ping-Chia Huang, Jia-Wei Liao, Yung-Yu Chuang

Main category: cs.CV

TL;DR: BEVANet introduces LKA and SDLSKA for efficient real-time semantic segmentation, achieving 79.3% mIoU (81.0% with pretraining) at 33 FPS.

DetailsMotivation: Addressing the challenge of balancing efficient architectures with large receptive fields for semantic understanding and detailed contours in real-time segmentation.

Method: Uses LKA, SDLSKA, CKS, DLKPPM, and BGAF modules to expand receptive fields, adapt dynamically, and enhance boundary delineation.

Result: Achieves 79.3% mIoU without pretraining and 81.0% mIoU with pretraining on Cityscapes at 33 FPS.

Conclusion: BEVANet demonstrates state-of-the-art performance in real-time semantic segmentation.

Abstract: Real-time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long-range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state-of-the-art performance. The code and model are available at https://github.com/maomao0819/BEVANet.
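
A minimal sketch of a basic large-kernel-attention block of the kind BEVANet builds on: a large receptive field is approximated by stacking a depthwise conv, a depthwise dilated conv, and a pointwise conv, whose output gates the input multiplicatively. The paper's SDLSKA (sparse, separable kernels with kernel selection) is more elaborate; kernel sizes here follow the common LKA decomposition and are assumptions.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # 5x5 depthwise + 7x7 depthwise dilated (rate 3) approximates a
        # large kernel; the 1x1 conv mixes channels.
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # attention as a multiplicative gate

x = torch.randn(1, 64, 128, 128)
print(LargeKernelAttention(64)(x).shape)  # torch.Size([1, 64, 128, 128])
```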

[265] DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices

Md Zahurul Haquea, Yeahyea Sarker, Muhammed Farhan Sadique Mahi, Syed Jubayer Jaman, Md Robiul Islam

Main category: cs.CV

TL;DR: DragonFruitQualityNet is a lightweight CNN for real-time dragon fruit quality assessment, achieving 93.98% accuracy and integrated into a mobile app for practical use.

DetailsMotivation: Rising global demand for dragon fruit necessitates efficient quality inspection to improve productivity and reduce post-harvest losses.

Method: A lightweight CNN (DragonFruitQualityNet) was trained on a diverse dataset of 13,789 images classified into four categories (fresh, immature, mature, defective).

Result: The model achieved 93.98% accuracy, outperforming existing methods, and was embedded into a mobile app for real-time use.

Conclusion: The research provides an efficient, scalable AI solution for dragon fruit quality control, supporting digital agriculture and sustainable farming.

Abstract: Dragon fruit, renowned for its nutritional benefits and economic value, has experienced rising global demand due to its affordability and local availability. As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and minimizing post-harvest losses. This study presents DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment of dragon fruits on mobile devices. We curated a diverse dataset of 13,789 images, integrating self-collected samples with public datasets (dataset from Mendeley Data), and classified them into four categories: fresh, immature, mature, and defective fruits to ensure robust model training. The proposed model achieves an impressive 93.98% accuracy, outperforming existing methods in fruit quality classification. To facilitate practical adoption, we embedded the model into an intuitive mobile application, enabling farmers and agricultural stakeholders to conduct on-device, real-time quality inspections. This research provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality control, supporting digital agriculture and empowering smallholder farmers with accessible technology. By bridging the gap between research and real-world application, our work advances post-harvest management and promotes sustainable farming practices.
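
A minimal sketch of a lightweight four-class quality classifier in the spirit of DragonFruitQualityNet (fresh / immature / mature / defective). The depth, widths, and input size are assumptions; the abstract does not specify the exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True), nn.MaxPool2d(2))

# Three small conv stages keep the parameter count mobile-friendly.
model = nn.Sequential(
    conv_block(3, 16), conv_block(16, 32), conv_block(32, 64),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 4))

logits = model(torch.randn(1, 3, 224, 224))  # 4 quality-class logits
```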

[266] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark

Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang

Main category: cs.CV

TL;DR: MCITlib is a code library for multimodal continual learning, implementing 8 algorithms and evaluated on 2 benchmarks to advance research in this field.

DetailsMotivation: To address the challenges of multimodal continual learning, including catastrophic forgetting and cross-modal interactions, by providing a tool for researchers.

Method: Developed MCITlib, a code library with 8 algorithms for multimodal continual instruction tuning, tested on 2 benchmarks.

Result: MCITlib is introduced as a resource for advancing multimodal continual learning, with ongoing updates planned.

Conclusion: MCITlib supports research in multimodal continual learning and will evolve with the field.

Abstract: Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib.

[267] MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang

Main category: cs.CV

TL;DR: MobileViCLIP introduces an efficient video-text model for mobile devices, achieving faster inference speeds and strong zero-shot performance compared to existing models.

DetailsMotivation: Existing video pre-trained models are inefficient for mobile deployment, lacking lightweight architectures.

Method: Temporal structural reparameterization is applied to an efficient image-text model, trained on a large-scale video-text dataset.

Result: MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14, with comparable or better zero-shot retrieval performance.

Conclusion: MobileViCLIP bridges the gap for efficient video-text models on mobile devices, offering speed and performance advantages.

Abstract: Efficient lightweight neural networks are receiving increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale, high-quality video-text dataset, resulting in an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities, termed MobileViCLIP. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small obtains performance similar to InternVideo2-L14 and performs 6.9% better than InternVideo2-S14 on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
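
A minimal sketch of structural reparameterization, the general technique the paper extends temporally: a training-time two-branch form (conv + identity) is merged into a single conv for fast inference. The temporal variant used by MobileViCLIP is not shown; this is the standard RepVGG-style spatial case.

```python
import torch
import torch.nn as nn

train_conv = nn.Conv2d(8, 8, 3, padding=1, bias=True)

# Merge "conv + identity" into one kernel by adding a Dirac (identity)
# kernel to the trained weights; the bias is carried over unchanged.
merged = nn.Conv2d(8, 8, 3, padding=1, bias=True)
identity_kernel = torch.zeros_like(train_conv.weight)
for c in range(8):
    identity_kernel[c, c, 1, 1] = 1.0
with torch.no_grad():
    merged.weight.copy_(train_conv.weight + identity_kernel)
    merged.bias.copy_(train_conv.bias)

x = torch.randn(1, 8, 16, 16)
branch_out = train_conv(x) + x  # training-time two-branch form
assert torch.allclose(merged(x), branch_out, atol=1e-5)  # same output, one conv
```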

[268] DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding

Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, Houqiang Li

Main category: cs.CV

TL;DR: DocR1, an MLLM trained with EviGRPO, improves multi-page document understanding via evidence-aware RL and achieves SOTA performance.

DetailsMotivation: Addressing the challenge of multi-page document understanding in MLLMs, which requires fine-grained visual and multi-hop reasoning.

Method: Introduces EviGRPO, an RL framework with evidence-aware rewards, and a two-stage annotation pipeline with curriculum learning.

Result: DocR1 outperforms on multi-page tasks and maintains strong single-page performance, validated on EviBench and ArxivFullQA datasets.

Conclusion: DocR1 with EviGRPO effectively enhances multi-page document understanding with limited supervision, setting new benchmarks.

Abstract: Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
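
A minimal sketch of an evidence-aware reward in the spirit of EviGRPO: a rollout is rewarded both for retrieving the correct evidence pages and for answering correctly, encouraging the coarse-to-fine strategy of finding pages before answering. The F1-based page reward, exact-match answer reward, and weights are assumptions, not the paper's exact reward.

```python
def evigrpo_reward(pred_pages, gold_pages, pred_answer, gold_answer,
                   w_evidence=0.5, w_answer=0.5):
    pred_set, gold_set = set(pred_pages), set(gold_pages)
    # Page-retrieval reward: F1 between predicted and gold evidence pages.
    if pred_set and gold_set:
        p = len(pred_set & gold_set) / len(pred_set)
        r = len(pred_set & gold_set) / len(gold_set)
        evidence = 2 * p * r / (p + r) if p + r > 0 else 0.0
    else:
        evidence = 0.0
    # Answer reward: normalized exact match (a simplifying assumption).
    answer = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    return w_evidence * evidence + w_answer * answer

print(evigrpo_reward([2, 5], [2, 7], "Paris", "paris"))  # 0.75
```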

[269] RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning

Jinjing Gu, Tianbao Qin, Yuanyuan Pu, Zhengpeng Zhao

Main category: cs.CV

TL;DR: RORPCap proposes a retrieval-based method for image captioning, using object and relation prompts to enrich embeddings, achieving competitive performance with minimal training time.

DetailsMotivation: Existing image captioning models face issues like redundant detection info, complex GCN construction, and high training costs. RORPCap leverages image-text retrieval for semantic enrichment.

Method: Extracts object/relation words, embeds them into prompts, maps CLIP image embeddings to visual-text embeddings, and uses GPT-2 for caption generation.

Result: Achieves 120.5% CIDEr and 22.0% SPICE on MS-COCO with only 2.6 hours of training, matching detector/GCN models.

Conclusion: RORPCap is a viable alternative for image captioning, offering efficiency and competitive performance.

Abstract: Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with Graph Convolutional Network (GCN). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and Relations Extraction Model to extract object and relation words from the image. These words are then incorporated into predefined prompt templates and encoded as prompt embeddings. Next, a Mamba-based mapping network is designed to quickly map image embeddings extracted by CLIP to visual-text embeddings. Finally, the resulting prompt embeddings and visual-text embeddings are concatenated to form text-enriched feature embeddings, which are fed into a GPT-2 model for caption generation. Extensive experiments conducted on the widely used MS-COCO dataset show that RORPCap requires only 2.6 hours of training under cross-entropy loss, achieving a 120.5% CIDEr score and a 22.0% SPICE score on the “Karpathy” test split. RORPCap achieves performance comparable to detector-based and GCN-based models with the shortest training time, demonstrating its potential as an alternative for image captioning.
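
To make the prompt stage concrete, here is a hypothetical template filler for the retrieved object and relation words; the template wording is an assumption, and in the actual pipeline the encoded prompt is concatenated with the Mamba-mapped CLIP embeddings before GPT-2:

```python
def build_rorp_prompt(objects: list[str],
                      relations: list[tuple[str, str, str]]) -> str:
    """Fill retrieved object and relation words into a fixed template;
    the wording of the template is an illustrative assumption."""
    obj_str = ", ".join(objects)
    rel_str = "; ".join(f"{s} {r} {o}" for s, r, o in relations)
    return f"There are {obj_str} in the image. Relations: {rel_str}."

# build_rorp_prompt(["dog", "frisbee"], [("dog", "catching", "frisbee")])
# -> "There are dog, frisbee in the image. Relations: dog catching frisbee."
```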

[270] Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos

Tuyen Tran, Thao Minh Le, Quang-Hung Le, Truyen Tran

Main category: cs.CV

TL;DR: Planner-Refiner is a framework for aligning vision and language in videos by iteratively refining visual representations guided by language, achieving superior performance on complex tasks.

DetailsMotivation: Addressing the challenges of aligning evolving visual entities and complex language in videos, including semantic gaps and action chains.

Method: Uses a Planner module to decompose language prompts into short sentence chains and a Refiner to iteratively refine visual tokens’ space-time representations.

Result: Demonstrates superior performance on Referring Video Object Segmentation and Temporal Grounding tasks, especially with complex prompts.

Conclusion: Planner-Refiner effectively bridges semantic gaps in video-language alignment, outperforming state-of-the-art methods.

Abstract: Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements’ space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens’ self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner’s effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models’ capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach’s potential, especially for complex prompts.

[271] Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu

Main category: cs.CV

TL;DR: This paper systematically studies design choices for fine-tuning Multimodal Large Language Models (MLLMs) for visual grounding (VG), using LLaVA-1.5 as a baseline. It identifies effective paradigms and optimizes grounding data, improving performance by up to 7.0%.

DetailsMotivation: To address the lack of systematic verification in existing approaches for fine-tuning MLLMs for VG, this study aims to provide a comprehensive analysis of design choices and their impact.

Method: The study uses LLaVA-1.5 to explore visual grounding paradigms and conducts ablation studies on grounding data design.

Result: The findings lead to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g datasets over LLaVA-1.5.

Conclusion: The systematic study of design choices and grounding data optimization significantly enhances MLLM performance for visual grounding tasks.

Abstract: Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs’ fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over LLaVA-1.5.

[272] CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation

Fangtai Wu, Mushui Liu, Weijie He, Wanggui He, Hao Jiang, Zhao Wang, Yunlong Yu

Main category: cs.CV

TL;DR: CoAR is a novel framework for injecting subject concepts into unified AR models without fine-tuning, using minimal parameters and addressing overfitting and language drift.

DetailsMotivation: Existing methods for customized image generation are costly and prone to overfitting or catastrophic forgetting.

Method: CoAR uses Layerwise Multimodal Context Learning and regularization to preserve pre-trained distributions and improve subject fidelity.

Result: CoAR achieves superior performance in subject and style personalization with less than 0.05% of parameters tuned.

Conclusion: CoAR is efficient, effective, and outperforms recent methods like Proxy-Tuning.

Abstract: The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose CoAR, a novel framework for injecting subject concepts into unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than 0.05% of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: https://github.com/KZF-kzf/CoAR
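
The abstract leaves Layerwise Multimodal Context Learning at a high level; a common way to realize "minimal parameters with a frozen backbone" is a few learnable tokens per layer. The sketch below assumes that reading (token count, width, and the prepend site are illustrative, not the released design):

```python
import torch
import torch.nn as nn

class LayerwiseContext(nn.Module):
    """A few learnable context tokens per transformer layer, prepended to
    the frozen backbone's hidden states; only these tokens receive
    gradients, keeping the tuned parameter count tiny."""

    def __init__(self, n_layers: int, n_tokens: int = 4, dim: int = 768):
        super().__init__()
        self.ctx = nn.ParameterList(
            [nn.Parameter(torch.zeros(n_tokens, dim)) for _ in range(n_layers)])

    def prepend(self, layer_idx: int, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); inject this layer's subject tokens
        ctx = self.ctx[layer_idx].unsqueeze(0).expand(hidden.size(0), -1, -1)
        return torch.cat([ctx, hidden], dim=1)
```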

[273] SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal

Tingyu Yang, Jue Gong, Jinpei Guo, Wenbo Li, Yong Guo, Yulun Zhang

Main category: cs.CV

TL;DR: SODiff is a semantic-oriented one-step diffusion model for JPEG artifact removal, outperforming existing methods by leveraging semantic guidance and adaptive denoising.

DetailsMotivation: Existing deep learning methods for JPEG artifact removal often fail to recover complex textures, leading to over-smoothed results.

Method: SODiff uses a semantic-aligned image prompt extractor (SAIPE) and a quality factor-aware time predictor to guide the diffusion process.

Result: SODiff achieves superior visual quality and quantitative metrics compared to recent leading methods.

Conclusion: SODiff effectively removes JPEG artifacts by combining semantic guidance and adaptive denoising, setting a new benchmark in the field.

Abstract: JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics. Code is available at: https://github.com/frakenation/SODiff
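
The quality factor-aware time predictor admits a simple mental model: the predicted QF picks how noisy the diffusion start state should be. A hedged sketch of that mapping (the linear schedule and the bounds are assumptions, not the paper's learned predictor):

```python
def select_start_timestep(qf_pred: float, t_max: int = 1000,
                          t_min: int = 200) -> int:
    """Map a predicted JPEG quality factor in [0, 100] to a denoising start
    step: heavier compression (low QF) starts from a noisier latent."""
    qf = max(0.0, min(100.0, qf_pred))
    return int(t_max - (t_max - t_min) * qf / 100.0)

# QF 10 (strong artifacts) -> step 920; QF 90 (mild artifacts) -> step 280
```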

[274] GS4Buildings: Prior-Guided Gaussian Splatting for 3D Building Reconstruction

Qilin Zhang, Olaf Wysocki, Boris Jutzi

Main category: cs.CV

TL;DR: GS4Buildings improves 2D Gaussian Splatting for urban building reconstruction by using semantic 3D models and prior depth/normal maps, achieving better completeness and accuracy.

DetailsMotivation: 2D Gaussian Splatting struggles with large-scale urban scenes due to occlusions and incomplete reconstructions. GS4Buildings aims to address this by leveraging semantic 3D building models.

Method: Initializes Gaussians from LoD2 semantic 3D models, incorporates prior depth/normal maps for optimization, and offers a building-focused mode to reduce primitives.

Result: Improves reconstruction completeness by 20.5% and geometric accuracy by 32.8%, with a 71.8% reduction in Gaussian primitives in building-focused mode.

Conclusion: Semantic building model integration enhances GS-based reconstruction, making it viable for urban applications like smart cities and digital twins.

Abstract: Recent advances in Gaussian Splatting (GS) have demonstrated its effectiveness in photo-realistic rendering and 3D reconstruction. Among these, 2D Gaussian Splatting (2DGS) is particularly suitable for surface reconstruction due to its flattened Gaussian representation and integrated normal regularization. However, its performance often degrades in large-scale and complex urban scenes with frequent occlusions, leading to incomplete building reconstructions. We propose GS4Buildings, a novel prior-guided Gaussian Splatting method leveraging the ubiquity of semantic 3D building models for robust and scalable building surface reconstruction. Instead of relying on traditional Structure-from-Motion (SfM) pipelines, GS4Buildings initializes Gaussians directly from low-level Level of Detail (LoD)2 semantic 3D building models. Moreover, we generate prior depth and normal maps from the planar building geometry and incorporate them into the optimization process, providing strong geometric guidance for surface consistency and structural accuracy. We also introduce an optional building-focused mode that limits reconstruction to building regions, achieving a 71.8% reduction in Gaussian primitives and enabling a more efficient and compact representation. Experiments on urban datasets demonstrate that GS4Buildings improves reconstruction completeness by 20.5% and geometric accuracy by 32.8%. These results highlight the potential of semantic building model integration to advance GS-based reconstruction toward real-world urban applications such as smart cities and digital twins. Our project is available: https://github.com/zqlin0521/GS4Buildings.
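
The prior depth and normal maps enter the optimization as extra supervision on the rendered geometry; one plausible loss, with illustrative forms and weights rather than the paper's exact terms, looks like this:

```python
import torch.nn.functional as F

def prior_guided_loss(render_depth, render_normal, prior_depth, prior_normal,
                      w_depth: float = 1.0, w_normal: float = 0.1):
    """Supervise rendered depth/normals with maps derived from the planar
    LoD2 building geometry. render_normal/prior_normal: (B, 3, H, W);
    depths: (B, 1, H, W). Loss forms and weights are assumptions."""
    l_depth = F.l1_loss(render_depth, prior_depth)
    l_normal = (1.0 - F.cosine_similarity(render_normal, prior_normal,
                                          dim=1)).mean()
    return w_depth * l_depth + w_normal * l_normal
```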

[275] Training and Inference within 1 Second – Tackle Cross-Sensor Degradation of Real-World Pansharpening with Efficient Residual Feature Tailoring

Tianyu Xin, Jin-Liang Xiao, Zeyu Xia, Shan Yin, Liang-Jian Deng

Main category: cs.CV

TL;DR: A method using modular decomposition and a Feature Tailor improves cross-sensor pansharpening with fast training and inference, outperforming existing techniques in speed and generalization.

DetailsMotivation: Addressing poor generalization of pretrained pansharpening models across different sensors without time-consuming retraining or extra data.

Method: Modular decomposition identifies a critical interface; a Feature Tailor is integrated here and trained with physics-aware unsupervised losses. Patch-wise training and parallel inference boost efficiency.

Result: Achieves state-of-the-art quality and efficiency, with sub-second training and inference times, significantly faster than zero-shot methods.

Conclusion: The method enhances generalization and reduces costs, making it practical for real-world cross-sensor pansharpening applications.

Abstract: Deep learning methods for pansharpening have advanced rapidly, yet models pretrained on data from a specific sensor often generalize poorly to data from other sensors. Existing methods to tackle such cross-sensor degradation include retraining the model or zero-shot methods, but they are highly time-consuming or even need extra training data. To address these challenges, our method first performs modular decomposition on deep learning-based pansharpening models, revealing a general yet critical interface where high-dimensional fused features begin mapping to the channel space of the final image. A Feature Tailor is then integrated at this interface to address cross-sensor degradation at the feature level, and is trained efficiently with physics-aware unsupervised losses. Moreover, our method operates in a patch-wise manner, training on partial patches and performing parallel inference on all patches to boost efficiency. Our method offers two key advantages: (1) Improved Generalization Ability: it significantly enhances performance in cross-sensor cases. (2) Low Generalization Cost: it achieves sub-second training and inference, requiring only partial test inputs and no external data, whereas prior methods often take minutes or even hours. Experiments on real-world data from multiple datasets demonstrate that our method achieves state-of-the-art quality and efficiency in tackling cross-sensor degradation. For example, at the fastest setting on a commonly used RTX 3090 GPU, training and inference complete within 0.2 seconds for a $512\times512\times8$ image and within 3 seconds for a $4000\times4000\times8$ image, over 100 times faster than zero-shot methods.
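
A Feature Tailor is not defined in the abstract beyond "integrated at this interface"; a lightweight residual adapter trained in isolation is one natural realization. A sketch under that assumption (layer sizes are illustrative):

```python
import torch.nn as nn

class FeatureTailor(nn.Module):
    """Small residual adapter inserted at the interface where fused
    high-dimensional features start mapping to output image channels;
    only this module is trained on the new sensor, backbone frozen."""

    def __init__(self, dim: int):
        super().__init__()
        self.adapt = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, feats):  # feats: (B, dim, H, W) fused features
        return feats + self.adapt(feats)  # residual cross-sensor correction
```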

[276] DIP-GS: Deep Image Prior For Gaussian Splatting Sparse View Recovery

Rajaei Khatib, Raja Giryes

Main category: cs.CV

TL;DR: DIP-GS enhances 3D Gaussian Splatting (3DGS) for sparse-view reconstruction using Deep Image Prior (DIP), achieving SOTA results without pre-trained models.

DetailsMotivation: 3DGS struggles with sparse-view reconstruction due to limited input views and low overlaps.

Method: Proposes DIP-GS, integrating DIP with 3DGS in a coarse-to-fine manner to leverage internal structure and patterns.

Result: Achieves competitive SOTA results on sparse-view reconstruction tasks.

Conclusion: DIP-GS effectively addresses sparse-view limitations of 3DGS without external pre-trained models.

Abstract: 3D Gaussian Splatting (3DGS) is a leading 3D scene reconstruction method, obtaining high-quality reconstruction with real-time rendering runtime performance. The main idea behind 3DGS is to represent the scene as a collection of 3D gaussians, while learning their parameters to fit the given views of the scene. While achieving superior performance in the presence of many views, 3DGS struggles with sparse view reconstruction, where the input views are sparse and do not fully cover the scene and have low overlaps. In this paper, we propose DIP-GS, a Deep Image Prior (DIP) 3DGS representation. By using the DIP prior, which utilizes internal structure and patterns, with coarse-to-fine manner, DIP-based 3DGS can operate in scenarios where vanilla 3DGS fails, such as sparse view recovery. Note that our approach does not use any pre-trained models such as generative models and depth estimation, but rather relies only on the input frames. Among such methods, DIP-GS obtains state-of-the-art (SOTA) competitive results on various sparse-view reconstruction tasks, demonstrating its capabilities.

[277] LET-US: Long Event-Text Understanding of Scenes

Rui Chen, Xingyu Chen, Shaoan Wang, Shihan Kong, Junzhi Yu

Main category: cs.CV

TL;DR: LET-US is a framework for long event-stream–text comprehension, using adaptive compression and a two-stage optimization paradigm to bridge the modality gap between event streams and text. It outperforms existing MLLMs in accuracy and comprehension.

DetailsMotivation: Existing MLLMs struggle with interpreting long event streams or fail to handle them effectively, limiting their application in event-based visual perception.

Method: LET-US employs adaptive compression, text-guided cross-modal queries, hierarchical clustering, and similarity computation to distill representative event features. It uses a two-stage optimization paradigm and a large-scale event-text dataset for training.

Result: LET-US outperforms prior MLLMs in descriptive accuracy and semantic comprehension for long-duration event streams.

Conclusion: The framework sets a new standard for cross-modal understanding of event streams, with publicly available datasets, codes, and models.

Abstract: Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream–text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively equips our model with the capacity to interpret event-based scenes. To handle the voluminous temporal information inherent in long event streams, we leverage text-guided cross-modal queries for feature reduction, augmented by hierarchical clustering and similarity computation to distill the most representative event features. Moreover, we curate and construct a large-scale event-text aligned dataset to train our model, achieving tighter alignment of event features within the LLM embedding space. We also develop a comprehensive benchmark covering a diverse set of tasks – reasoning, captioning, classification, temporal localization and moment retrieval. Experimental results demonstrate that LET-US outperforms prior state-of-the-art MLLMs in both descriptive accuracy and semantic comprehension on long-duration event streams. All datasets, codes, and models will be publicly available.
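
Of the reduction machinery (text-guided queries, hierarchical clustering, similarity computation), the text-guided step is the easiest to sketch: score event tokens against the query embedding and keep the top ones. A stand-in, with `keep` and the similarity choice assumed:

```python
import torch
import torch.nn.functional as F

def distill_event_tokens(event_feats: torch.Tensor,
                         text_feat: torch.Tensor,
                         keep: int = 256) -> torch.Tensor:
    """Text-guided reduction: retain only the event tokens most similar
    to the query embedding, shrinking a long stream before the LLM.
    event_feats: (N, D), text_feat: (D,)."""
    sims = F.cosine_similarity(event_feats, text_feat.unsqueeze(0), dim=-1)
    idx = sims.topk(min(keep, event_feats.size(0))).indices
    return event_feats[idx]
```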

[278] ForensicsSAM: Toward Robust and Unified Image Forgery Detection and Localization Resisting to Adversarial Attack

Rongxuan Peng, Shunquan Tan, Chenqi Kong, Anwei Luo, Alex C. Kot, Jiwu Huang

Main category: cs.CV

TL;DR: ForensicsSAM is a robust framework for image forgery detection and localization, addressing adversarial vulnerabilities in PEFT-based methods by integrating forgery and adversary experts, and a lightweight detector.

DetailsMotivation: Existing PEFT-based methods for adapting large vision models overlook their vulnerability to adversarial attacks, degrading performance in image forgery tasks.

Method: ForensicsSAM enhances robustness by injecting forgery experts into transformer blocks, designing a lightweight adversary detector, and adaptively activating adversary experts to correct adversarial noise.

Result: ForensicsSAM outperforms existing methods in resisting adversarial attacks and achieves state-of-the-art performance in forgery detection and localization.

Conclusion: ForensicsSAM provides a unified, adversarial-resistant solution for IFDL, validated by extensive benchmarks.

Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a popular strategy for adapting large vision foundation models, such as the Segment Anything Model (SAM) and LLaVA, to downstream tasks like image forgery detection and localization (IFDL). However, existing PEFT-based approaches overlook their vulnerability to adversarial attacks. In this paper, we show that highly transferable adversarial images can be crafted solely via the upstream model, without accessing the downstream model or training data, significantly degrading IFDL performance. To address this, we propose ForensicsSAM, a unified IFDL framework with built-in adversarial robustness. Our design is guided by three key ideas: (1) To compensate for the lack of forgery-relevant knowledge in the frozen image encoder, we inject forgery experts into each transformer block to enhance its ability to capture forgery artifacts. These forgery experts are always activated and shared across all input images. (2) To detect adversarial images, we design a lightweight adversary detector that learns to capture structured, task-specific artifacts in the RGB domain, enabling reliable discrimination across various attack methods. (3) To resist adversarial attacks, we inject adversary experts into the global attention layers and MLP modules to progressively correct feature shifts induced by adversarial noise. These adversary experts are adaptively activated by the adversary detector, thereby avoiding unnecessary interference with clean images. Extensive experiments across multiple benchmarks demonstrate that ForensicsSAM achieves superior resistance to various adversarial attack methods, while also delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. The resource is available at https://github.com/siriusPRX/ForensicsSAM.
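
The routing logic in (1)-(3) can be summarized in a few lines: forgery experts always fire, adversary experts fire only when the detector flags the input. Shapes and the hard threshold below are assumptions, not the released code:

```python
def forensics_block(x, forgery_expert, adversary_expert, adv_detector,
                    threshold: float = 0.5):
    """One block's expert routing as described in the abstract.
    x: (B, N, C) tokens; adv_detector returns (B,) adversarial probabilities."""
    x = x + forgery_expert(x)                      # always-active forgery expert
    p_adv = adv_detector(x)                        # adversarial-image probability
    gate = (p_adv > threshold).float().view(-1, 1, 1)
    return x + gate * adversary_expert(x)          # correct shifts only if attacked
```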

[279] CharacterShot: Controllable and Consistent 4D Character Animation

Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen, Kai Chen, Yanan Sun, Cairong Zhao

Main category: cs.CV

TL;DR: CharacterShot is a framework for creating 4D character animations from a single image and 2D pose sequence, using a 2D-to-3D lifting approach and 4D Gaussian splatting.

DetailsMotivation: To enable individual designers to create dynamic 3D character animations easily and consistently.

Method: Pretrains a 2D animation model, lifts it to 3D using dual-attention and camera prior, and optimizes with 4D Gaussian splatting.

Result: Outperforms state-of-the-art methods on the new CharacterBench benchmark.

Conclusion: CharacterShot provides a scalable and effective solution for 4D character animation.

Abstract: In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows any 2D pose sequence to serve as a controllable signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D Gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a large-scale dataset Character4D, containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at https://github.com/Jeoyal/CharacterShot.

[280] CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization

Youqi Wang, Shunquan Tan, Rongxuan Peng, Bin Li, Jiwu Huang

Main category: cs.CV

TL;DR: CLUE repurposes Stable Diffusion 3 and SAM to detect and localize digital forgeries with high accuracy and robustness.

DetailsMotivation: The rise of AI-generated forgeries threatens digital media authenticity, necessitating advanced detection tools.

Method: CLUE uses LoRA to adapt SD3 for forensic feature extraction, leveraging noise injection and SAM for semantic context.

Result: CLUE outperforms prior methods in generalization and robustness against attacks and OSNs.

Conclusion: CLUE offers a parameter-efficient, high-fidelity solution for forgery localization, with public code availability.

Abstract: The increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries, compromising the authenticity of digital media. In this paper, in addition to leveraging distortions from conventional forgeries, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process, turning it into a high-fidelity forgery localization tool. To this end, we propose CLUE (Capture Latent Uncovered Evidence), a framework that employs Low-Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor. Our approach begins with the strategic use of SD3’s Rectified Flow (RF) mechanism to inject noise at varying intensities into the latent representation, thereby steering the LoRA-tuned denoising process to amplify subtle statistical inconsistencies indicative of a forgery. To complement the latent analysis with high-level semantic context and precise spatial details, our method incorporates contextual features from the image encoder of the Segment Anything Model (SAM), which is parameter-efficiently adapted to better trace the boundaries of forged regions. Extensive evaluations demonstrate CLUE’s SOTA generalization performance, significantly outperforming prior methods. Furthermore, CLUE shows superior robustness against common post-processing attacks and Online Social Networks (OSNs). Code is publicly available at https://github.com/SZAISEC/CLUE.
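
SD3's rectified flow makes "noise at varying intensities" concrete: the latent moves along a straight line toward Gaussian noise. A minimal sketch of that perturbation:

```python
import torch

def rf_inject_noise(latent: torch.Tensor, t: float) -> torch.Tensor:
    """Rectified-flow style perturbation at intensity t in [0, 1]: linear
    interpolation between the clean latent (t=0) and Gaussian noise (t=1),
    matching the straight-path formulation used by SD3."""
    noise = torch.randn_like(latent)
    return (1.0 - t) * latent + t * noise

# probe several intensities: [rf_inject_noise(z, t) for t in (0.2, 0.5, 0.8)]
```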

[281] Freeze and Reveal: Exposing Modality Bias in Vision-Language Models

Vivek Hruday Kavuri, Vysishtya Karanam, Venkata Jahnavi Venkamsetty, Kriti Madumadukala, Lakshmipathi Balaji Darur, Ponnurangam Kumaraguru

Main category: cs.CV

TL;DR: The paper investigates gender bias in Vision Language Models (VLMs) and introduces methods to mitigate it, showing that bias stems from both vision and text modalities.

DetailsMotivation: VLMs inherit gender biases from training data, impacting multi-modal performance. The study aims to dissect and reduce these biases.

Method: Targeted debiasing using Counterfactual Data Augmentation (CDA) and a novel method, DAUDoS, with a new metric, Degree of Stereotypicality.

Result: CDA reduces gender gap by 6%, DAUDoS by 3% with less data. Both improve gender identification by 3%. CLIP’s vision encoder is more biased; PaliGemma2’s text encoder is more biased.

Conclusion: Identifying bias sources enables targeted mitigation, improving future multi-modal systems.

Abstract: Vision Language Models achieve impressive multi-modal performance but often inherit gender biases from their training data. This bias might come from both the vision and text modalities. In this work, we dissect the contributions of the vision and text backbones to these biases by applying targeted debiasing using Counterfactual Data Augmentation and Task Vector methods. Inspired by data-efficient approaches in hate-speech classification, we introduce a novel metric, Degree of Stereotypicality, and a corresponding debiasing method, Data Augmentation Using Degree of Stereotypicality (DAUDoS), to reduce bias with minimal computational cost. We curate a gender-annotated dataset and evaluate all methods on the VisoGender benchmark to quantify improvements and identify the dominant source of bias. Our results show that CDA reduces the gender gap by 6% and DAUDoS by 3% while using only one-third of the data. Both methods also improve the model’s ability to correctly identify gender in images by 3%, with DAUDoS achieving this improvement using only about one-third of the training data. From our experiments, we observed that CLIP’s vision encoder is more biased whereas PaliGemma2’s text encoder is more biased. By identifying whether bias stems more from the vision or text encoder, our work enables more targeted and effective bias mitigation strategies in future multi-modal systems.
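
Counterfactual Data Augmentation itself is simple to illustrate: swap gendered terms so the model sees both variants of each caption. A toy version with a deliberately small word list (the full mapping used in the paper is not reproduced here):

```python
SWAPS = {"he": "she", "she": "he", "him": "her", "his": "her",
         "man": "woman", "woman": "man", "men": "women", "women": "men"}

def counterfactual(caption: str) -> str:
    """Counterfactual Data Augmentation: flip gendered terms word by word.
    'her' is omitted because it maps to either 'him' or 'his' depending
    on context, which a token-level swap cannot resolve."""
    return " ".join(SWAPS.get(w.lower(), w) for w in caption.split())

# counterfactual("a man walks his dog") -> "a woman walks her dog"
```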

[282] Leveraging Learning Bias for Noisy Anomaly Detection

Yuxin Zhang, Yunkang Cao, Yuqi Cheng, Yihan Sun, Weiming Shen

Main category: cs.CV

TL;DR: A two-stage framework leverages learning bias to improve unsupervised image anomaly detection in contaminated training data.

DetailsMotivation: Real-world training data often contains unlabeled anomalies, degrading conventional methods that assume anomaly-free data.

Method: The framework uses learning bias (statistical dominance of normal samples and feature-space divergence) to filter anomalies in stage 1 and trains a final detector in stage 2.

Result: Superior performance on the Real-IAD benchmark, with validated resilience to contamination.

Conclusion: The model-agnostic design offers a practical solution for real-world scenarios with imperfect training data.

Abstract: This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework’s contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
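
Stage 1 is an explicit algorithm: partition, train sub-models, aggregate scores, keep the most normal samples. A sketch with placeholder `train_fn`/`score_fn` hooks (the split count and keep ratio are illustrative, not the paper's exact values):

```python
import numpy as np

def purify_training_set(images, n_subsets, train_fn, score_fn,
                        keep_ratio: float = 0.9):
    """Stage 1 of the two-stage framework: split the contaminated training
    set, train one sub-model per split, aggregate every sub-model's anomaly
    score for every image, and keep the lowest-scoring fraction as the
    purified set for stage 2."""
    rng = np.random.default_rng(seed=0)
    splits = np.array_split(rng.permutation(len(images)), n_subsets)
    models = [train_fn([images[i] for i in split]) for split in splits]
    scores = np.mean([[score_fn(m, img) for img in images] for m in models],
                     axis=0)                      # cross-model aggregation
    keep = np.argsort(scores)[: int(keep_ratio * len(images))]
    return [images[i] for i in keep]              # purified data for stage 2
```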

[283] Health Care Waste Classification Using Deep Learning Aligned with Nepal’s Bin Color Guidelines

Suman Kunwar, Prabesh Rai

Main category: cs.CV

TL;DR: The study benchmarks waste classification models for healthcare waste (HCW) in Nepal, finding YOLOv5-s as the most accurate (95.06%) but slightly slower than YOLOv8-n. EfficientNet-B0 showed promise but was slower. YOLOv5-s was deployed for public use.

DetailsMotivation: Improper HCW management in Nepal poses health risks, necessitating accurate waste classification.

Method: Benchmarked ResNeXt-50, EfficientNet-B0, MobileNetV3-S, YOLOv8-n, and YOLOv5-s using Stratified K-fold (5 folds) on HCW data.

Result: YOLOv5-s achieved 95.06% accuracy, slightly slower than YOLOv8-n. EfficientNet-B0 had 93.22% accuracy but highest inference time.

Conclusion: YOLOv5-s was deployed for public use; further work on data and localized context is suggested.

Abstract: The increasing number of health care facilities in Nepal has added to the challenges of managing health care waste (HCW). Improper segregation and disposal of HCW lead to contamination, the spread of infectious diseases, and risks for waste handlers. This study benchmarks state-of-the-art waste classification models: ResNeXt-50, EfficientNet-B0, MobileNetV3-S, YOLOv8-n, and YOLOv5-s, using stratified 5-fold cross-validation on combined HCW data. YOLOv5-s achieved the highest accuracy of 95.06% but trailed the YOLOv8-n model by a few milliseconds in inference speed, while EfficientNet-B0 showed a promising 93.22% accuracy but took the longest inference time. A repeated-measures ANOVA was performed to assess statistical significance, and the best-performing model (YOLOv5-s) was deployed to the web with bin colors mapped according to Nepal’s HCW management standards for public use. Further work on the data and on the localized context is suggested.
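
The evaluation protocol (stratified 5-fold cross-validation over the combined HCW data) is straightforward to reproduce with scikit-learn; `fit_predict` below is a placeholder for training any of the benchmarked models:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_5fold_accuracy(X: np.ndarray, y: np.ndarray, fit_predict):
    """Stratified 5-fold benchmarking as described in the study.
    fit_predict(X_train, y_train, X_test) trains one candidate model
    and returns its predictions on the held-out fold."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        preds = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        accuracies.append(float(np.mean(preds == y[test_idx])))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```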

[284] AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning

Siminfar Samakoush Galougah, Rishie Raj, Sanjoy Chowdhury, Sayan Nag, Ramani Duraiswami

Main category: cs.CV

TL;DR: AURA is a new benchmark for evaluating cross-modal reasoning in AV-LLMs and OLMs, focusing on reasoning fidelity rather than just accuracy. It introduces AuraScore to assess factual consistency and logical validity, revealing a gap between model accuracy and reasoning quality.

DetailsMotivation: Existing AV benchmarks prioritize answer accuracy over reasoning, masking flawed logic or hallucinations. AURA addresses this by forcing models to rely on both audio and video inputs.

Method: AURA includes questions across six cognitive domains, unanswerable from a single modality. AuraScore evaluates reasoning fidelity via factual consistency and core inference.

Result: SOTA models show high accuracy (up to 92%) but low reasoning fidelity (below 45%), exposing flawed logic behind correct answers.

Conclusion: AURA highlights the need for robust multimodal evaluation, paving the way for models with genuine comprehension.

Abstract: Current audio-visual (AV) benchmarks focus on final answer accuracy, overlooking the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers derived through flawed reasoning or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains, such as causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling, explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal shortcuts. To assess reasoning traces, we propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity. It decomposes reasoning into two aspects: (i) Factual Consistency - whether reasoning is grounded in perceptual evidence, and (ii) Core Inference - the logical validity of each reasoning step. Evaluations of SOTA models on AURA reveal a critical reasoning gap: although models achieve high accuracy (up to 92% on some tasks), their Factual Consistency and Core Inference scores fall below 45%. This discrepancy highlights that models often arrive at correct answers through flawed logic, underscoring the need for our benchmark and paving the way for more robust multimodal evaluation.

[285] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, Ruiyi Zhang

Main category: cs.CV

TL;DR: VisR-Bench is a multilingual benchmark for multimodal retrieval in long documents, addressing gaps in existing benchmarks by including diverse languages and question types.

DetailsMotivation: To bridge the gap in multilingual and multimodal document retrieval benchmarks, which currently focus on English or single-page QA.

Method: Introduces VisR-Bench with 35K QA pairs across 1.2K documents in 16 languages, featuring diverse question types and queries without explicit answers.

Result: MLLMs outperform text-based and multimodal encoders but struggle with structured tables and low-resource languages.

Conclusion: VisR-Bench highlights challenges in multilingual visual retrieval and provides a robust evaluation framework for future research.

Abstract: Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs, providing insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.

[286] FormCoach: Lift Smarter, Not Harder

Xiaoye Zuo, Nikos Athanasiou, Ginger Delmas, Yiming Huang, Xingyu Fu, Lingjie Liu

Main category: cs.CV

TL;DR: FormCoach uses AI to provide real-time fitness form feedback via a camera, leveraging vision-language models, and benchmarks VLMs on a dataset of 1,700 expert-annotated videos.

DetailsMotivation: To bridge the gap for at-home fitness enthusiasts lacking expert feedback by providing accessible, real-time form correction.

Method: Utilizes vision-language models (VLMs) to analyze user form, benchmarked on a dataset of 1,700 expert-annotated videos, and includes a web interface for interaction.

Result: Benchmarks show gaps compared to human-level coaching, highlighting challenges in nuanced movement analysis.

Conclusion: FormCoach pioneers embodied AI by framing form correction as a collaborative human-machine process, releasing datasets and tools to advance research.

Abstract: Good form is the difference between strength and strain, yet for the fast-growing community of at-home fitness enthusiasts, expert feedback is often out of reach. FormCoach transforms a simple camera into an always-on, interactive AI training partner, capable of spotting subtle form errors and delivering tailored corrections in real time, leveraging vision-language models (VLMs). We showcase this capability through a web interface and benchmark state-of-the-art VLMs on a dataset of 1,700 expert-annotated user-reference video pairs spanning 22 strength and mobility exercises. To accelerate research in AI-driven coaching, we release both the dataset and an automated, rubric-based evaluation pipeline, enabling standardized comparison across models. Our benchmarks reveal substantial gaps compared to human-level coaching, underscoring both the challenges and opportunities in integrating nuanced, context-aware movement analysis into interactive AI systems. By framing form correction as a collaborative and creative process between humans and machines, FormCoach opens a new frontier in embodied AI.

[287] From Field to Drone: Domain Drift Tolerant Automated Multi-Species and Damage Plant Semantic Segmentation for Herbicide Trials

Artzai Picon, Itziar Eguskiza, Daniel Mugica, Javier Romero, Carlos Javier Jimenez, Eric White, Gabriel Do-Lago-Junqueira, Christian Klukas, Ramon Navarra-Mestre

Main category: cs.CV

TL;DR: An improved segmentation model for automated crop and weed monitoring enhances efficiency and consistency in herbicide research, outperforming traditional manual methods and showing robustness under domain shifts.

DetailsMotivation: Traditional manual visual assessments in herbicide research are time-consuming, labor-intensive, and subjective. Automating species and damage identification can improve efficiency and consistency.

Method: The model combines a self-supervised visual model with hierarchical inference based on botanical taxonomy, trained on multi-year, multi-location datasets and tested across devices and geographies.

Result: The model significantly improved species identification and damage classification, maintaining strong performance under domain shifts (e.g., drone imagery).

Conclusion: The model’s robustness and real-world applicability are confirmed, and it is now deployed in BASF’s phenotyping pipeline for large-scale, automated monitoring.

Abstract: Field trials are vital in herbicide research and development to assess effects on crops and weeds under varied conditions. Traditionally, evaluations rely on manual visual assessments, which are time-consuming, labor-intensive, and subjective. Automating species and damage identification is challenging due to subtle visual differences, but it can greatly enhance efficiency and consistency. We present an improved segmentation model combining a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy. Trained on a multi-year dataset (2018-2020) from Germany and Spain using digital and mobile cameras, the model was tested on digital camera data (year 2023) and drone imagery from the United States, Germany, and Spain (year 2024) to evaluate robustness under domain shift. This cross-device evaluation marks a key step in assessing generalization across platforms of the model. Our model significantly improved species identification (F1-score: 0.52 to 0.85, R-squared: 0.75 to 0.98) and damage classification (F1-score: 0.28 to 0.44, R-squared: 0.71 to 0.87) over prior methods. Under domain shift (drone images), it maintained strong performance with moderate degradation (species: F1-score 0.60, R-squared 0.80; damage: F1-score 0.41, R-squared 0.62), where earlier models failed. These results confirm the model’s robustness and real-world applicability. It is now deployed in BASF’s phenotyping pipeline, enabling large-scale, automated crop and weed monitoring across diverse geographies.

[288] Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, Jaesik Park

Main category: cs.CV

TL;DR: The paper analyzes MM-DiT’s bidirectional attention mechanism, proposes a prompt-based editing method, and bridges U-Net and MM-DiT approaches.

DetailsMotivation: To address challenges in editing MM-DiT models due to their unified bidirectional attention, differing from traditional unidirectional methods.

Method: Decomposes MM-DiT’s attention matrices into four blocks, analyzes their characteristics, and develops a prompt-based editing technique.

Result: A robust editing method for MM-DiT, supporting global to local edits across variants, including few-step models.

Conclusion: The findings provide insights into MM-DiT’s behavior and bridge the gap between U-Net and MM-DiT architectures.

Abstract: Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches have relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT’s attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports global to local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT’s behavioral patterns.
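
The four-block decomposition is easy to state in code: over the concatenated [text; image] token sequence, the attention map splits into text-to-text, text-to-image, image-to-text, and image-to-image blocks, which is the starting point of the paper's analysis:

```python
import torch

def split_mmdit_attention(attn: torch.Tensor, n_text: int):
    """Decompose an MM-DiT attention map over the concatenated
    [text; image] sequence into its four blocks.
    attn: (..., n_text + n_img, n_text + n_img)."""
    t2t = attn[..., :n_text, :n_text]   # text queries  -> text keys
    t2i = attn[..., :n_text, n_text:]   # text queries  -> image keys
    i2t = attn[..., n_text:, :n_text]   # image queries -> text keys
    i2i = attn[..., n_text:, n_text:]   # image queries -> image keys
    return t2t, t2i, i2t, i2i
```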

[289] A DICOM Image De-identification Algorithm in the MIDI-B Challenge

Hongzhu Jiang, Sihan Xie, Zhiyu Wan

Main category: cs.CV

TL;DR: The paper discusses the MIDI-B Challenge for evaluating DICOM image de-identification methods, highlighting the importance of privacy compliance and utility. The authors’ algorithm achieved 99.92% accuracy, ranking 2nd.

DetailsMotivation: To address the need for compliant and effective de-identification of medical images, ensuring patient privacy while maintaining data utility for research and diagnostics.

Method: Applied techniques like pixel masking, date shifting, text recognition, and removal, adhering to standards like HIPAA and DICOM PS3.15.

Result: The algorithm achieved 99.92% accuracy in de-identification, ranking 2nd out of 10 teams in the MIDI-B Challenge.

Conclusion: The study underscores the effectiveness of the proposed methods but acknowledges limitations, suggesting future improvements.

Abstract: Image de-identification is essential for the public sharing of medical images, particularly in the widely used Digital Imaging and Communications in Medicine (DICOM) format as required by various regulations and standards, including Health Insurance Portability and Accountability Act (HIPAA) privacy rules, the DICOM PS3.15 standard, and best practices recommended by the Cancer Imaging Archive (TCIA). The Medical Image De-Identification Benchmark (MIDI-B) Challenge at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024) was organized to evaluate rule-based DICOM image de-identification algorithms with a large dataset of clinical DICOM images. In this report, we explore the critical challenges of de-identifying DICOM images, emphasize the importance of removing personally identifiable information (PII) to protect patient privacy while ensuring the continued utility of medical data for research, diagnostics, and treatment, and provide a comprehensive overview of the standards and regulations that govern this process. Additionally, we detail the de-identification methods we applied, such as pixel masking, date shifting, date hashing, text recognition, text replacement, and text removal, to process datasets during the test phase in strict compliance with these standards. According to the final leaderboard of the MIDI-B challenge, the latest version of our solution algorithm correctly executed 99.92% of the required actions and ranked 2nd out of 10 teams that completed the challenge (from a total of 22 registered teams). Finally, we conducted a thorough analysis of the resulting statistics and discussed the limitations of current approaches and potential avenues for future improvement.
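
Of the listed techniques, date shifting is a good example of the utility/privacy trade-off: dates move, but intervals between a patient's studies survive. One illustrative scheme (the hash-derived offset is an assumption, not the challenge specification):

```python
import datetime
import hashlib

def shift_date(dicom_date: str, patient_id: str,
               max_shift_days: int = 365) -> str:
    """Date shifting for de-identification: every date belonging to one
    patient moves by the same secret offset, preserving study intervals.
    DICOM DA values use the YYYYMMDD format."""
    digest = hashlib.sha256(patient_id.encode()).hexdigest()
    offset = int(digest, 16) % max_shift_days + 1   # 1..max_shift_days
    date = datetime.datetime.strptime(dicom_date, "%Y%m%d").date()
    return (date - datetime.timedelta(days=offset)).strftime("%Y%m%d")

# shift_date("20240131", "PAT-007") shifts every "PAT-007" date consistently
```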

[290] Enhancing Reliability of Medical Image Diagnosis through Top-rank Learning with Rejection Module

Xiaotong Ji, Ryoma Bise, Seiichi Uchida

Main category: cs.CV

TL;DR: A novel approach enhances top-rank learning in medical image processing by integrating a rejection module to handle noisy labels and class-ambiguous instances, improving diagnosis accuracy.

DetailsMotivation: Accurate medical diagnosis is critical, but noisy labels and ambiguous instances hinder top-rank learning, necessitating a solution to mitigate these outliers.

Method: The proposed method integrates a rejection module, cooptimized with top-rank loss, to identify and mitigate outliers using a rejection function.

Result: Experimental validation shows the method effectively detects and mitigates outliers, enhancing reliability and accuracy in medical image diagnoses.

Conclusion: The approach successfully addresses challenges in top-rank learning, improving diagnostic performance in medical image processing.

Abstract: In medical image processing, accurate diagnosis is of paramount importance. Leveraging machine learning techniques, particularly top-rank learning, shows significant promise by focusing on the most crucial instances. However, challenges arise from noisy labels and class-ambiguous instances, which can severely hinder the top-rank objective, as they may be erroneously placed among the top-ranked instances. To address these, we propose a novel approach that enhances top-rank learning by integrating a rejection module. Co-optimized with the top-rank loss, this module identifies and mitigates the impact of outliers that hinder training effectiveness. The rejection module functions as an additional branch, assessing instances based on a rejection function that measures their deviation from the norm. Through experimental validation on a medical dataset, our methodology demonstrates its efficacy in detecting and mitigating outliers, improving the reliability and accuracy of medical image diagnoses.
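
A minimal way to picture the co-optimization: the rejection branch outputs a per-instance weight that discounts suspected outliers in the top-rank loss, with a penalty so it cannot reject everything. The sketch below is an assumption about the form, not the paper's exact loss:

```python
import torch.nn.functional as F

def toprank_loss_with_rejection(pos_scores, neg_scores, pos_reject,
                                margin: float = 1.0, penalty: float = 0.1):
    """Top-rank objective (positives should outscore the *highest* negative)
    with a co-optimized rejection weight for suspected noisy positives.
    pos_scores: (P,), neg_scores: (N,), pos_reject: (P,) in [0, 1]."""
    hardest_neg = neg_scores.max()
    keep = 1.0 - pos_reject                              # down-weight outliers
    hinge = F.relu(margin - (pos_scores - hardest_neg))  # per-positive violation
    return (keep * hinge).mean() + penalty * pos_reject.mean()
```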

[291] ShoulderShot: Generating Over-the-Shoulder Dialogue Videos

Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou, Zenghui Lu, Peng Shu

Main category: cs.CV

TL;DR: ShoulderShot is a framework for generating over-the-shoulder dialogue videos, addressing challenges like character consistency, spatial continuity, and long dialogues efficiently.

DetailsMotivation: Over-the-shoulder dialogue videos are crucial for visual storytelling but underexplored in video generation research due to technical challenges.

Method: ShoulderShot combines dual-shot generation with looping video to maintain character consistency and spatial continuity while enabling extended dialogues.

Result: The framework outperforms existing methods in shot-reverse-shot layout, spatial continuity, and dialogue length flexibility.

Conclusion: ShoulderShot advances practical dialogue video generation, offering new possibilities for filmmakers and advertisers.

Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers’ emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at https://shouldershot.github.io.

[292] Enhanced Generative Structure Prior for Chinese Text Image Super-resolution

Xiaoming Li, Wangmeng Zuo, Chen Change Loy

Main category: cs.CV

TL;DR: A framework for high-quality Chinese text image super-resolution using a novel structure prior within StyleGAN, ensuring precise stroke restoration.

DetailsMotivation: Existing methods focus on English text, neglecting complex scripts like Chinese. The goal is to restore degraded Chinese characters accurately.

Method: Proposes a structure prior integrated into StyleGAN, using a codebook for character structures and StyleGAN’s vector $w$ for style control.

Result: The framework effectively restores clear strokes in degraded Chinese characters, even with irregular layouts.

Conclusion: The structure prior provides robust guidance for accurate Chinese text image super-resolution, outperforming prior methods.

Abstract: Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character’s style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts. Our code and pre-trained models will be available at https://github.com/csxmli2016/MARCONetPlusPlus
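
A toy rendering of the codebook/style split, with the character count (GB2312's 6763) and the fusion layer as illustrative assumptions:

```python
import torch
import torch.nn as nn

class StructurePrior(nn.Module):
    """One learned structure code per character class, fused with the
    StyleGAN style vector w that carries typeface, orientation, and
    location; a minimal sketch of the codebook idea."""

    def __init__(self, n_chars: int = 6763, code_dim: int = 256,
                 w_dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(n_chars, code_dim)  # structure per character
        self.fuse = nn.Linear(code_dim + w_dim, w_dim)   # structure-conditioned w

    def forward(self, char_ids: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        code = self.codebook(char_ids)                   # (B, code_dim)
        return self.fuse(torch.cat([code, w], dim=-1))   # feed to the generator
```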

[293] Domain Generalization of Pathological Image Segmentation by Patch-Level and WSI-Level Contrastive Learning

Yuki Shigeyasu, Shota Harada, Akihiko Yoshizawa, Kazuhiro Terada, Naoki Nakazima, Mariyo Kurata, Hiroyuki Abe, Tetsuo Ushiku, Ryoma Bise

Main category: cs.CV

TL;DR: A method for addressing domain shifts in pathological images by leveraging intra-hospital shifts using contrastive learning.

DetailsMotivation: Traditional methods rely on multi-hospital data, which is often impractical due to collection challenges. This work focuses on intra-hospital shifts like patient characteristics and tissue thickness.

Method: Clusters WSI-level features from non-tumor regions as domains and applies a two-stage contrastive learning approach (WSI-level and patch-level) to reduce feature gaps.

Result: Effectively minimizes domain shifts by leveraging intra-hospital variations.

Conclusion: The proposed method offers a practical solution for domain generalization in pathological images without relying on multi-hospital data.

Abstract: In this paper, we address domain shifts in pathological images by focusing on shifts within whole slide images (WSIs), such as patient characteristics and tissue thickness, rather than shifts between hospitals. Traditional approaches rely on multi-hospital data, but data collection challenges often make this impractical. Therefore, the proposed domain generalization method captures and leverages intra-hospital domain shifts by clustering WSI-level features from non-tumor regions and treating these clusters as domains. To mitigate domain shift, we apply contrastive learning to reduce feature gaps between WSI pairs from different clusters. The proposed method introduces a two-stage contrastive learning approach, combining WSI-level and patch-level contrastive learning, to minimize these gaps effectively.
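
The two stages could look roughly like the sketch below: WSI-level features are clustered into pseudo-domains, and an InfoNCE-style term pulls together features of WSIs drawn from different clusters. Feature dimensions, the cluster count, and the exact loss form are assumptions.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

# Stage 1 (illustrative): cluster WSI-level features of non-tumor regions
# into pseudo-domains. Feature dimension and cluster count are assumed.
wsi_feats = torch.randn(100, 512)                      # one feature per WSI
domains = KMeans(n_clusters=4, n_init=10).fit_predict(wsi_feats.numpy())

# Stage 2 (illustrative): a contrastive term that aligns features of WSIs
# drawn from *different* pseudo-domains to shrink the domain gap.
def cross_domain_alignment_loss(f_a, f_b, temperature=0.1):
    """InfoNCE-style loss treating (f_a[i], f_b[i]) as positive pairs."""
    f_a, f_b = F.normalize(f_a, dim=1), F.normalize(f_b, dim=1)
    logits = f_a @ f_b.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(f_a.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: pair features sampled from two different pseudo-domains.
loss = cross_domain_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```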

[294] SOFA: Deep Learning Framework for Simulating and Optimizing Atrial Fibrillation Ablation

Yunsung Chung, Chanho Lim, Ghassan Bidaoui, Christian Massad, Nassir Marrouche, Jihun Hamm

Main category: cs.CV

TL;DR: SOFA is a deep-learning framework for predicting AF recurrence and optimizing ablation strategies using patient-specific data and procedural parameters.

DetailsMotivation: To address the variability in AF ablation outcomes by predicting recurrence and optimizing procedural parameters.

Method: SOFA simulates ablation outcomes using pre-ablation LGE-MRI and procedural parameters, predicts recurrence risk, and optimizes parameters to minimize risk.

Result: SOFA reduces model-predicted recurrence risk by 22.18% and accurately synthesizes post-ablation images.

Conclusion: SOFA is the first framework to integrate simulation, prediction, and optimization for personalized AF ablation.

Abstract: Atrial fibrillation (AF) is a prevalent cardiac arrhythmia often treated with catheter ablation procedures, but procedural outcomes are highly variable. Evaluating and improving ablation efficacy is challenging due to the complex interaction between patient-specific tissue and procedural factors. This paper asks two questions: Can AF recurrence be predicted by simulating the effects of procedural parameters? How should we ablate to reduce AF recurrence? We propose SOFA (Simulating and Optimizing Atrial Fibrillation Ablation), a novel deep-learning framework that addresses these questions. SOFA first simulates the outcome of an ablation strategy by generating a post-ablation image depicting scar formation, conditioned on a patient’s pre-ablation LGE-MRI and the specific procedural parameters used (e.g., ablation locations, duration, temperature, power, and force). During this simulation, it predicts AF recurrence risk. Critically, SOFA then introduces an optimization scheme that refines these procedural parameters to minimize the predicted risk. Our method leverages a multi-modal, multi-view generator that processes 2.5D representations of the atrium. Quantitative evaluations show that SOFA accurately synthesizes post-ablation images and that our optimization scheme leads to a 22.18% reduction in the model-predicted recurrence risk. To the best of our knowledge, SOFA is the first framework to integrate the simulation of procedural effects, recurrence prediction, and parameter optimization, offering a novel tool for personalizing AF ablation.
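
The optimization scheme is, at its core, gradient descent on procedural parameters through a frozen, differentiable recurrence-risk predictor. A toy sketch follows; the stand-in `risk_model`, the parameter set, and the normalization are assumptions (the real SOFA model is image-conditioned).

```python
import torch

# Stand-in for SOFA's risk predictor: maps 5 procedural parameters
# (e.g., duration, temperature, power, force, location index) to a risk.
risk_model = torch.nn.Sequential(torch.nn.Linear(5, 32), torch.nn.ReLU(),
                                 torch.nn.Linear(32, 1), torch.nn.Sigmoid())
for p in risk_model.parameters():
    p.requires_grad_(False)                      # freeze the predictor

# Parameters normalized to [0, 1], initialized from the planned procedure.
params = torch.tensor([0.5, 0.5, 0.5, 0.5, 0.5], requires_grad=True)
opt = torch.optim.Adam([params], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    risk = risk_model(params).squeeze()          # predicted AF recurrence risk
    risk.backward()
    opt.step()
    with torch.no_grad():                        # keep parameters physically valid
        params.clamp_(0.0, 1.0)

print(float(risk_model(params)))                 # risk after refinement
```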

[295] CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts

Junuk Cha, Jihyeon Kim

Main category: cs.CV

TL;DR: The paper introduces CoT-Pose, a framework integrating chain-of-thought reasoning to generate 3D human poses from abstract prompts, addressing the gap between high-level language and detailed pose descriptions.

DetailsMotivation: Existing text-to-pose models require low-level prompts, unlike human communication which uses abstract language. This mismatch limits real-world deployment.

Method: The framework incorporates CoT reasoning to interpret abstract prompts and a data synthesis pipeline for generating training triplets (abstract prompts, detailed prompts, 3D poses).

Result: CoT-Pose effectively generates plausible and semantically aligned poses from abstract inputs.

Conclusion: The work emphasizes high-level understanding in pose generation and suggests reasoning-enhanced approaches as a promising direction.

Abstract: Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch results in a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for the training process. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.
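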

[296] Commentary Generation for Soccer Highlights

Chidaksh Ravuru

Main category: cs.CV

TL;DR: The paper extends MatchVoice for soccer commentary generation using the GOAL dataset, evaluates its performance, and suggests integrating broader video-language techniques.

DetailsMotivation: To address the challenge of fine-grained alignment between video content and commentary in soccer, building on frameworks like SoccerNet-Caption and MatchTime.

Method: Extends MatchVoice for soccer highlights using the GOAL dataset, conducts experiments to reproduce MatchTime results, and evaluates training configurations and hardware limitations.

Result: MatchVoice shows promising generalization but benefits from integrating broader video-language techniques.

Conclusion: The study highlights the potential of MatchVoice for commentary generation but underscores the need for further enhancements from video-language domains.

Abstract: Automated soccer commentary generation has evolved from template-based systems to advanced neural architectures, aiming to produce real-time descriptions of sports events. While frameworks like SoccerNet-Caption laid foundational work, their inability to achieve fine-grained alignment between video content and commentary remains a significant challenge. Recent efforts such as MatchTime, with its MatchVoice model, address this issue through coarse and fine-grained alignment techniques, achieving improved temporal synchronization. In this paper, we extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset, which emphasizes short clips over entire games. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup, highlighting the impact of different training configurations and hardware limitations. Furthermore, we explore the effect of varying window sizes on zero-shot performance. While MatchVoice exhibits promising generalization capabilities, our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance. Our code is available at https://github.com/chidaksh/SoccerCommentary.

[297] TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding

Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu, Fei Yu, Chengjiang Long

Main category: cs.CV

TL;DR: The paper introduces TAR-TVG, a framework for Temporal Video Grounding that uses timestamp anchors to supervise reasoning, ensuring progressively accurate predictions. It includes a three-stage training strategy for robust anchor generation.

DetailsMotivation: Existing reinforcement learning approaches lack explicit constraints on reasoning quality for temporal predictions in TVG.

Method: Proposes TAR-TVG with timestamp anchors for intermediate supervision and a three-stage training strategy (GRPO, SFT, GRPO).

Result: Achieves state-of-the-art performance with interpretable reasoning chains and refined temporal estimations.

Conclusion: TAR-TVG effectively improves reasoning quality and prediction accuracy in TVG through explicit supervision and progressive refinement.

Abstract: Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.
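
One plausible way to reward progressively accurate anchors is to score each reasoning step's predicted interval against the ground truth and pay a bonus for monotone improvement across the chain. The sketch below illustrates that idea only; the paper's actual GRPO reward is not specified here, so treat this as an assumption.

```python
# Illustrative reward term for the anchor constraint: each intermediate
# timestamp anchor should overlap the ground-truth segment more than the last.
def interval_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def anchor_progress_reward(anchors, gt, bonus=0.1):
    """anchors: list of (start, end) predictions emitted during reasoning."""
    ious = [interval_iou(a, gt) for a in anchors]
    # Reward monotone improvement across the reasoning chain, plus final IoU.
    steps_improved = sum(1 for i in range(1, len(ious)) if ious[i] >= ious[i - 1])
    return ious[-1] + bonus * steps_improved

print(anchor_progress_reward([(10, 40), (12, 35), (14, 32)], gt=(15, 30)))
```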

[298] Adaptive Pseudo Label Selection for Individual Unlabeled Data by Positive and Unlabeled Learning

Takehiro Yamane, Itaru Tsuge, Susumu Saito, Ryoma Bise

Main category: cs.CV

TL;DR: A novel pseudo-labeling method for medical image segmentation using PU learning to select effective pseudo-labels on individual images.

DetailsMotivation: To improve medical image segmentation by selecting effective pseudo-labels for learning on individual images.

Method: Uses Positive and Unlabeled Learning (PU learning) to discriminate foreground and background regions on unlabeled images.

Result: The method effectively selects pseudo-labels for various background regions.

Conclusion: The proposed PU learning-based approach is effective for medical image segmentation.

Abstract: This paper proposes a novel pseudo-labeling method for medical image segmentation that can perform learning on “individual images” to select effective pseudo-labels. We introduce Positive and Unlabeled Learning (PU learning), which uses only positive and unlabeled data for binary classification problems, to obtain the appropriate metric for discriminating foreground and background regions on each unlabeled image. Our PU learning approach makes it easy to select pseudo-labels for various background regions. The experimental results show the effectiveness of our method.
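
For concreteness, a common instantiation of PU learning is the non-negative PU risk estimator (Kiryo et al., 2017); whether this paper uses that exact estimator is an assumption. A minimal sketch on classifier logits:

```python
import torch
import torch.nn.functional as F

# Non-negative PU risk sketch. `pi` is the (assumed known) positive class prior.
def nnpu_loss(scores_pos, scores_unl, pi=0.3):
    loss = lambda s, y: F.binary_cross_entropy_with_logits(
        s, torch.full_like(s, y))
    risk_pos = pi * loss(scores_pos, 1.0)                 # positives as positive
    # Negative risk estimated from unlabeled data, corrected by positives
    risk_neg = loss(scores_unl, 0.0) - pi * loss(scores_pos, 0.0)
    return risk_pos + torch.clamp(risk_neg, min=0.0)      # non-negativity fix

scores_p = torch.randn(16)    # classifier logits on pixels labeled positive
scores_u = torch.randn(128)   # logits on unlabeled pixels of the same image
print(nnpu_loss(scores_p, scores_u).item())
```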

[299] DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models

Licheng Zhang, Bach Le, Naveed Akhtar, Tuan Ngo

Main category: cs.CV

TL;DR: A semi-automated pipeline combines deep object detection and LLM for efficient multi-class door detection in floor plans, reducing manual effort.

DetailsMotivation: Accurate door detection is crucial for building compliance and indoor scene understanding, but lacks specialized datasets.

Method: Uses a deep object detector for unified door detection, an LLM for classification, and human-in-the-loop for quality assurance.

Result: Produces a high-quality dataset with reduced annotation cost, suitable for benchmarking neural models.

Conclusion: Demonstrates the effectiveness of combining deep learning and multimodal reasoning for dataset construction in complex domains.

Abstract: Accurate detection and classification of diverse door types in floor plan drawings is critical for multiple applications, such as building compliance checking and indoor scene understanding. Despite their importance, publicly available datasets specifically designed for fine-grained multi-class door detection remain scarce. In this work, we present a semi-automated pipeline that leverages a state-of-the-art object detector and a large language model (LLM) to construct a multi-class door detection dataset with minimal manual effort. Doors are first detected as a unified category using a deep object detection model. Next, an LLM classifies each detected instance based on its visual and contextual features. Finally, a human-in-the-loop stage ensures high-quality labels and bounding boxes. Our method significantly reduces annotation cost while producing a dataset suitable for benchmarking neural models in floor plan analysis. This work demonstrates the potential of combining deep learning and multimodal reasoning for efficient dataset construction in complex real-world domains.

[300] Decoupled Functional Evaluation of Autonomous Driving Models via Feature Map Quality Scoring

Ludan Zhang, Sihan Wang, Yuqi Dai, Shuofei Qiao, Lei He

Main category: cs.CV

TL;DR: The paper proposes an independent evaluation method (FMCS) for feature maps in autonomous driving models, improving interpretability and performance.

DetailsMotivation: Addressing the lack of explicit supervision for intermediate modules in end-to-end autonomous driving models, which limits interpretability and evaluation.

Method: Introduces Feature Map Convergence Score (FMCS) and a Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) for unified evaluation. Develops CLIP-FMQE-Net for real-time quality analysis.

Result: Experiments on NuScenes show a 3.89% NDS improvement in 3D object detection.

Conclusion: The method effectively enhances feature representation quality and model performance.

Abstract: End-to-end models are emerging as the mainstream in autonomous driving perception and planning. However, the lack of explicit supervision signals for intermediate functional modules leads to opaque operational mechanisms and limited interpretability, making it challenging for traditional methods to independently evaluate and train these modules. Pioneering in the issue, this study builds upon the feature map-truth representation similarity-based evaluation framework and proposes an independent evaluation method based on Feature Map Convergence Score (FMCS). A Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) is constructed, formulating a unified quantitative metric - Feature Map Quality Score - to enable comprehensive evaluation of the quality of feature maps generated by functional modules. A CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net) is further developed, combining feature-truth encoders and quality score prediction heads to enable real-time quality analysis of feature maps generated by functional modules. Experimental results on the NuScenes dataset demonstrate that integrating our evaluation module into the training improves 3D object detection performance, achieving a 3.89 percent gain in NDS. These results verify the effectiveness of our method in enhancing feature representation quality and overall model performance.

[301] Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation

Minghao Yin, Yukang Cao, Songyou Peng, Kai Han

Main category: cs.CV

TL;DR: Splat4D is a novel framework for generating high-quality 4D content from monocular videos, addressing challenges like temporal-spatial consistency and detail preservation. It outperforms existing methods and supports diverse applications.

DetailsMotivation: The need for high-fidelity 4D content generation from monocular videos for digital humans and AR/VR, ensuring consistency, detail preservation, and user guidance.

Method: Splat4D uses multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement.

Result: State-of-the-art performance on benchmarks, with applications in text/image-conditioned 4D generation, human generation, and text-guided editing.

Conclusion: Splat4D effectively addresses key challenges in 4D content generation, demonstrating versatility and superior performance.

Abstract: Generating high-quality 4D content from monocular videos for applications such as digital humans and AR/VR poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions.

[302] Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models

Khanh-Binh Nguyen, Phuoc-Nguyen Bui, Hyunseung Choo, Duc Thanh Nguyen

Main category: cs.CV

TL;DR: ACE framework improves VLMs’ robustness to distribution shifts by dynamically refining class-specific thresholds and maintaining a selective cache of high-confidence embeddings.

DetailsMotivation: Address performance degradation of VLMs under distribution shifts in downstream tasks without labeled data.

Method: Uses Adaptive Cache Enhancement (ACE) with dynamic, class-specific thresholds and selective storage of high-confidence/low-entropy embeddings.

Result: ACE achieves state-of-the-art performance on 15 benchmark datasets, outperforming existing TTA methods.

Conclusion: ACE enhances robustness and generalization of VLMs in out-of-distribution scenarios.

Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
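
A bare-bones version of the cache logic might look like the following sketch: per-class entropy thresholds refined by an exponential moving average gate which embeddings enter a fixed-capacity cache. The constants, initialization, and omission of the exploration-augmented update are simplifying assumptions.

```python
import torch

# Illustrative class-wise cache with EMA-refined entropy thresholds.
class AdaptiveCache:
    def __init__(self, num_classes, capacity=8, momentum=0.99):
        self.thresh = [1.0] * num_classes   # per-class entropy threshold
        self.cache = {c: [] for c in range(num_classes)}
        self.capacity, self.m = capacity, momentum

    def update(self, embedding, probs):
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum().item()
        c = int(probs.argmax())
        # EMA refinement of the class-specific threshold
        self.thresh[c] = self.m * self.thresh[c] + (1 - self.m) * entropy
        if entropy < self.thresh[c]:        # store only low-entropy embeddings
            self.cache[c].append(embedding.detach())
            self.cache[c] = self.cache[c][-self.capacity:]  # bounded capacity

cache = AdaptiveCache(num_classes=10)
cache.update(torch.randn(512), torch.softmax(torch.randn(10), dim=0))
```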

[303] Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification

Zhaorui Tan, Tan Pan, Kaizhu Huang, Weimiao Yu, Kai Yao, Chen Jiang, Qiufeng Wang, Anh Nguyen, Xin Guo, Yuan Cheng, Xi Yang

Main category: cs.CV

TL;DR: The paper explores LayerNorm shifts in Vision Transformers (ViTs) during fine-tuning under data scarcity and domain shifts, proposing a rescaling mechanism and cyclic framework to improve performance.

DetailsMotivation: To understand and optimize LayerNorm fine-tuning dynamics in ViTs, especially under data scarcity and domain shifts, which are underexplored.

Method: Proposes a rescaling mechanism using a scalar λ negatively correlated to FSR and a cyclic framework to align LayerNorm shifts with ideal ones.

Result: Experiments show OOD tasks yield lower FSR and higher λ, indicating under-represented data. Pathological data behaves like ID settings, favoring conservative updates.

Conclusion: The study provides insights into LayerNorm dynamics in transfer learning and practical strategies for fine-tuning, validated across diverse settings.

Abstract: LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are indicative of the transitions between source and target domains; its efficacy is contingent upon the degree to which the target training samples accurately represent the target domain, as quantified by our proposed Fine-tuning Shift Ratio ($FSR$). Building on this, we propose a simple yet effective rescaling mechanism using a scalar $\lambda$ that is negatively correlated to $FSR$ to align learned LayerNorm shifts with those ideal shifts achieved under fully representative data, combined with a cyclic framework that further enhances the LayerNorm fine-tuning. Extensive experiments across natural and pathological images, in both in-distribution (ID) and out-of-distribution (OOD) settings, and various target training sample regimes validate our framework. Notably, OOD tasks tend to yield lower $FSR$ and higher $\lambda$ in comparison to ID cases, especially with scarce data, indicating under-represented target training samples. Moreover, ViTFs fine-tuned on pathological data behave more like ID settings, favoring conservative LayerNorm updates. Our findings illuminate the underexplored dynamics of LayerNorm in transfer learning and provide practical strategies for LayerNorm fine-tuning.
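
The rescaling mechanism amounts to shrinking the learned LayerNorm shift toward the pre-trained parameters: $w \leftarrow w_{pre} + \lambda (w_{ft} - w_{pre})$. A minimal sketch follows, with an illustrative (assumed) mapping from $FSR$ to $\lambda$; the paper's actual mapping and cyclic framework are not reproduced here.

```python
import torch

# Shrink learned LayerNorm shifts toward pre-trained values by a scalar
# lambda that decreases as FSR increases (the exact mapping is an assumption).
def rescale_layernorm(model_ft, model_pre, fsr):
    lam = 1.0 / (1.0 + fsr)            # illustrative: lambda falls as FSR rises
    ft = dict(model_ft.named_parameters())
    pre = dict(model_pre.named_parameters())
    with torch.no_grad():
        for name, p_ft in ft.items():
            if "norm" in name:                       # LayerNorm weights/biases
                shift = p_ft - pre[name]             # fine-tuning shift
                p_ft.copy_(pre[name] + lam * shift)  # rescaled update

# Toy usage on a pair of (pre-trained, fine-tuned) transformer layers.
pre = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4)
ft = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4)
rescale_layernorm(ft, pre, fsr=0.4)
```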

[304] UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models

Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao

Main category: cs.CV

TL;DR: The paper introduces UniSVG, a dataset for training Multi-modal Large Language Models (MLLMs) to understand and generate scalable vector graphics (SVG), addressing challenges in precision and multi-modal processing.

DetailsMotivation: AI-driven SVG understanding and generation face challenges due to high precision requirements and diverse conditional constraints. The rise of MLLMs offers potential solutions.

Method: The authors propose UniSVG, a dataset with 525k items, designed for MLLM training and evaluation in SVG tasks.

Result: Training on UniSVG improves MLLM performance on SVG tasks, outperforming state-of-the-art models like GPT-4V.

Conclusion: UniSVG unlocks MLLM capabilities for SVG tasks, providing a unified solution for generation and understanding, with released resources for further research.

Abstract: Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled, frequently employed in computer vision and artistic design in the representation of SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, the rapid growth of Multi-modal Large Language Models (MLLMs) have demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLM’s capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To our best knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs’ performance on various SVG U&G tasks, surpassing SOTA close-source MLLMs like GPT-4V. We release dataset, benchmark, weights, codes and experiment details on https://ryanlijinke.github.io/.

[305] GAPNet: A Lightweight Framework for Image and Video Salient Object Detection via Granularity-Aware Paradigm

Yu-Huan Wu, Wei Liu, Zi-Xuan Zhu, Zizhou Wang, Yong Liu, Liangli Zhen

Main category: cs.CV

TL;DR: GAPNet is a lightweight network for salient object detection (SOD) using granularity-aware supervision and efficient feature fusion, achieving state-of-the-art performance with low computational cost.

DetailsMotivation: Heavyweight backbones in SOD models increase computational costs, limiting practical use, especially on edge devices. GAPNet addresses this by offering a lightweight solution.

Method: GAPNet uses granularity-aware supervision for multi-scale decoder outputs, granular pyramid convolution (GPC), cross-scale attention (CSA), and a self-attention module for global information.

Result: GAPNet achieves state-of-the-art performance in lightweight image and video SOD models.

Conclusion: GAPNet optimizes feature utilization and semantic interpretation, providing an efficient and effective solution for SOD with minimal computational overhead.

Abstract: Recent salient object detection (SOD) models predominantly rely on heavyweight backbones, incurring substantial computational cost and hindering their practical application in various real-world settings, particularly on edge devices. This paper presents GAPNet, a lightweight network built on the granularity-aware paradigm for both image and video SOD. We assign saliency maps of different granularities to supervise the multi-scale decoder side-outputs: coarse object locations for high-level outputs and fine-grained object boundaries for low-level outputs. Specifically, our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity, respectively. To support these connections, we design granular pyramid convolution (GPC) and cross-scale attention (CSA) modules for efficient fusion of low-scale and high-scale features, respectively. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost. Unlike traditional U-Net-based approaches, our proposed method optimizes feature utilization and semantic interpretation while applying appropriate supervision at each processing stage. Extensive experiments show that the proposed method achieves a new state-of-the-art performance among lightweight image and video SOD models. Code is available at https://github.com/yuhuan-wu/GAPNet.

[306] Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP

Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Yueyi Luo

Main category: cs.CV

TL;DR: The paper proposes an Architectural Co-Design framework to improve Zero-Shot Anomaly Detection (ZSAD) by refining feature representation and cross-modal fusion in Vision-Language Models (VLMs).

DetailsMotivation: VLMs struggle with ZSAD due to lacking local inductive biases for dense prediction and inflexible feature fusion.

Method: The framework integrates Conv-LoRA for fine-grained representation and Dynamic Fusion Gateway (DFG) for adaptive cross-modal fusion.

Result: Experiments show superior accuracy and robustness on industrial and medical benchmarks.

Conclusion: The co-design framework effectively adapts foundation models to dense perception tasks.

Abstract: Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.
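
As a sketch of what a Conv-LoRA adapter can look like, the code below adds a zero-initialized low-rank convolutional path on top of a frozen projection, injecting local inductive bias into ViT patch tokens. The rank, kernel size, and placement inside CLIP are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative Conv-LoRA adapter: frozen base projection + low-rank conv path.
class ConvLoRA(nn.Module):
    def __init__(self, dim, rank=8, kernel=3):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)       # frozen pre-trained weight
        self.base.bias.requires_grad_(False)
        self.down = nn.Conv2d(dim, rank, kernel, padding=kernel // 2)
        self.up = nn.Conv2d(rank, dim, 1)
        nn.init.zeros_(self.up.weight)                # start as identity adapter
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                             # x: (B, H*W, dim) patch tokens
        b, n, d = x.shape
        h = w = int(n ** 0.5)
        tokens = self.base(x)
        spatial = x.transpose(1, 2).reshape(b, d, h, w)
        delta = self.up(self.down(spatial)).reshape(b, d, n).transpose(1, 2)
        return tokens + delta                         # base output + local bias

out = ConvLoRA(dim=768)(torch.randn(2, 196, 768))     # 14x14 ViT patch grid
print(out.shape)
```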

[307] From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users

Shahroz Tariq, Simon S. Woo, Priyanka Singh, Irena Irmalasari, Saakshi Gupta, Dev Gupta

Main category: cs.CV

TL;DR: DF-P2E is a multimodal framework for interpretable deepfake detection, combining visual, semantic, and narrative explanations to enhance transparency and usability.

DetailsMotivation: Deepfake detection lacks interpretability, hindering real-world usability, especially for non-experts. DF-P2E addresses this by integrating explanations into the detection process.

Method: The framework includes a deepfake classifier with Grad-CAM, a visual captioning module, and a narrative refinement module using an LLM. Evaluated on the DF40 dataset.

Result: Achieves competitive detection performance while providing high-quality, aligned explanations.

Conclusion: DF-P2E advances interpretable deepfake detection, supporting trustworthy AI in adversarial media.

Abstract: The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This lack of interpretability hinders their usability in real-world decision-making contexts, especially for non-expert users. In this paper, we present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned Large Language Model (LLM) to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems in adversarial media environments.

[308] LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang

Main category: cs.CV

TL;DR: LaVieID is a local autoregressive video diffusion framework for identity-preserving text-to-video generation, improving spatial and temporal identity consistency.

DetailsMotivation: To address identity loss in diffusion transformers by enhancing spatial and temporal modeling of facial features.

Method: Introduces a local router for fine-grained facial structure representation and a temporal autoregressive module for inter-frame consistency.

Result: Achieves high-fidelity personalized videos and state-of-the-art performance.

Conclusion: LaVieID effectively preserves identity in text-to-video generation, with code and models publicly available.

Abstract: In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.

[309] Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images

Shunya Nagashima, Komei Sugiura

Main category: cs.CV

TL;DR: Deep SWM, a deep state space model, improves solar flare prediction by handling long-range dependencies and using a novel pretraining strategy, outperforming baselines and human experts.

DetailsMotivation: Accurate solar flare prediction is critical for infrastructure protection, but existing methods lack representation learning or struggle with temporal dependencies.

Method: Deep SWM uses multiple deep state space models, a sparse masked autoencoder, and a two-phase masking pretraining strategy to process solar images and dependencies.

Result: Deep SWM outperformed baseline methods and human experts in performance and reliability on the FlareBench benchmark.

Conclusion: Deep SWM offers a robust solution for solar flare prediction, validated by a comprehensive benchmark.

Abstract: Accurate, reliable solar flare prediction is crucial for mitigating potential disruptions to critical infrastructure, while predicting solar flares remains a significant challenge. Existing methods based on heuristic physical features often lack representation learning from solar images. On the other hand, end-to-end learning approaches struggle to model long-range temporal dependencies in solar images. In this study, we propose Deep Space Weather Model (Deep SWM), which is based on multiple deep state space models for handling both ten-channel solar images and long-range spatio-temporal dependencies. Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed baseline methods and even human expert performance on standard metrics in terms of performance and reliability. The project page can be found at https://keio-smilab25.github.io/DeepSWM.

[310] X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu

Main category: cs.CV

TL;DR: The paper introduces X2Edit Dataset for image editing tasks and a plug-and-play module compatible with generative models, achieving competitive performance.

DetailsMotivation: Addressing the lack of high-quality open-source datasets and a compatible editing module for prevalent generative models.

Method: Constructing X2Edit Dataset with 3.7M high-quality data using expert models and scoring mechanisms, and designing task-aware MoE-LoRA training with contrastive learning.

Result: The model achieves competitive editing performance, and the dataset outperforms existing open-source datasets.

Conclusion: X2Edit provides a valuable dataset and efficient module for image editing, with open-source resources available.

Abstract: Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize the industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality data with balanced categories. Second, to better integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model’s editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.

[311] An Iterative Reconstruction Method for Dental Cone-Beam Computed Tomography with a Truncated Field of View

Hyoung Suk Park, Kiwan Jeon

Main category: cs.CV

TL;DR: A two-stage method using Implicit Neural Representation (INR) and iterative reconstruction reduces truncation artifacts in dental CBCT, improving image quality.

DetailsMotivation: Small detectors in dental CBCT cause truncated FOV, degrading image quality in iterative reconstruction due to accumulated discrepancies.

Method: 1. Use INR to generate a prior image covering the full head. 2. Correct discrepancies and perform iterative reconstruction in the truncated region.

Result: The approach effectively suppresses truncation artifacts, enhancing CBCT image quality.

Conclusion: The two-stage method with INR and iterative reconstruction is a viable solution for truncation artifacts in dental CBCT.

Abstract: In dental cone-beam computed tomography (CBCT), compact and cost-effective system designs often use small detectors, resulting in a truncated field of view (FOV) that does not fully encompass the patient’s head. In iterative reconstruction approaches, the discrepancy between the actual projection and the forward projection within the truncated FOV accumulates over iterations, leading to significant degradation in the reconstructed image quality. In this study, we propose a two-stage approach to mitigate truncation artifacts in dental CBCT. In the first stage, we employ Implicit Neural Representation (INR), leveraging its superior representation power, to generate a prior image over an extended region so that its forward projection fully covers the patient’s head. To reduce computational and memory burdens, INR reconstruction is performed with a coarse voxel size. The forward projection of this prior image is then used to estimate the discrepancy due to truncated FOV in the measured projection data. In the second stage, the discrepancy-corrected projection data is utilized in a conventional iterative reconstruction process within the truncated region. Our numerical results demonstrate that the proposed two-grid approach effectively suppresses truncation artifacts, leading to improved CBCT image quality.
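
The discrepancy correction in stage two can be pictured with a toy projector: the prior's forward projection restricted to the region outside the truncated FOV is subtracted from the measured data, leaving projections consistent with the FOV alone. Everything below, including `forward_project`, is a simplified stand-in for a real CBCT geometry.

```python
import numpy as np

# Toy parallel-beam projector: sum along one axis, optionally masked.
def forward_project(volume, mask=None):
    vol = volume if mask is None else volume * mask
    return vol.sum(axis=0)

prior = np.random.rand(64, 64, 64)          # INR prior over the extended region
inside_fov = np.zeros_like(prior, dtype=bool)
inside_fov[:, 16:48, 16:48] = True          # truncated reconstruction region

measured = forward_project(prior)           # pretend this is the real scan
outside_contribution = forward_project(prior, mask=~inside_fov)
corrected = measured - outside_contribution # data consistent with the FOV only
# `corrected` now feeds a conventional iterative reconstruction of the FOV.
print(corrected.shape)
```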

[312] Enhancing Egocentric Object Detection in Static Environments using Graph-based Spatial Anomaly Detection and Correction

Vishakha Lall, Yisi Liu

Main category: cs.CV

TL;DR: A graph-based post-processing pipeline improves object detection by modeling spatial relationships, correcting anomalies, and boosting performance by up to 4% mAP@50.

DetailsMotivation: Current object detectors fail to leverage spatial consistency in static environments, leading to errors in cluttered or occluded scenes.

Method: A graph neural network (GNN) models spatial relationships to correct detection anomalies, trained on annotated data.

Result: The method improves detection performance, achieving up to 4% mAP@50 gain when used with standard detectors like YOLOv7 and RT-DETR.

Conclusion: Spatial reasoning enhances object detection reliability, demonstrating the value of leveraging environmental structure.

Abstract: In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment’s spatial structure to improve reliability in object detection systems.
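
A minimal version of the idea: detections become graph nodes carrying class and box features, edges encode spatial proximity, and a round of message passing predicts corrected labels. The feature choices and single-layer mean aggregation below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Toy detection-correction GNN: one round of neighbour aggregation.
class DetectionGNN(nn.Module):
    def __init__(self, num_classes, box_dim=4, hidden=64):
        super().__init__()
        self.enc = nn.Linear(num_classes + box_dim, hidden)
        self.msg = nn.Linear(hidden, hidden)
        self.cls = nn.Linear(hidden, num_classes)   # corrected label per node

    def forward(self, node_feats, adj):
        h = torch.relu(self.enc(node_feats))
        # Mean aggregation over spatial neighbours (adjacency matrix `adj`)
        neigh = adj @ h / adj.sum(1, keepdim=True).clamp_min(1.0)
        h = torch.relu(h + self.msg(neigh))
        return self.cls(h)

n, num_classes = 5, 10
feats = torch.cat([torch.eye(num_classes)[torch.randint(0, 10, (n,))],
                   torch.rand(n, 4)], dim=1)        # one-hot class + box (cx,cy,w,h)
adj = (torch.rand(n, n) > 0.5).float()              # spatial proximity graph
logits = DetectionGNN(num_classes)(feats, adj)
print(logits.argmax(1))                             # corrected class labels
```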

[313] Selective Contrastive Learning for Weakly Supervised Affordance Grounding

WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

Main category: cs.CV

TL;DR: The paper introduces a method for weakly supervised affordance grounding (WSAG) that uses selective prototypical and pixel contrastive objectives to learn affordance-relevant cues at part and object levels, improving accuracy by focusing on meaningful regions.

DetailsMotivation: Current WSAG methods rely heavily on classification, often focusing on irrelevant patterns rather than affordance-relevant parts. The paper aims to address this by learning affordance cues adaptively.

Method: The approach leverages CLIP to identify action-associated objects in egocentric and exocentric images, then cross-references these to find part-level affordance clues. It uses selective prototypical and pixel contrastive objectives to focus on relevant regions.

Result: The method effectively shifts activation from irrelevant areas to meaningful affordance cues, demonstrating improved performance in experiments.

Conclusion: The proposed approach advances WSAG by adaptively learning affordance-relevant cues, outperforming previous methods that rely on isolated part-level learning.

Abstract: Facilitating an entity’s interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method. Codes are available at github.com/hynnsk/SelectiveCL.

[314] A Trustworthy Method for Multimodal Emotion Recognition

Junxiao Xue, Xiaozhen Liu, Jie Wang, Xuecheng Wu, Bin Wu

Main category: cs.CV

TL;DR: The paper introduces Trusted Emotion Recognition (TER), a method that uses uncertainty estimation to improve reliability in emotion recognition, especially for noisy or corrupted data. It combines multimodal results based on confidence values and introduces new evaluation metrics for trusted performance.

DetailsMotivation: Existing emotion recognition methods focus on performance but lack reliability for noisy or out-of-distribution data. TER addresses this by incorporating uncertainty estimation.

Method: TER uses uncertainty estimation to calculate prediction confidence, combines multimodal results based on confidence, and introduces trusted precision, recall, Acc., and F1 score for evaluation.

Result: TER achieves state-of-the-art performance (82.40% Acc. on Music-video) and superior trusted F1 scores (0.7511 on IEMOCAP, 0.9035 on Music-video).

Conclusion: TER enhances reliability and robustness in emotion recognition, validated by experimental results, and outperforms existing methods in trusted performance.

Abstract: Existing emotion recognition methods mainly focus on enhancing performance by employing complex deep models, typically resulting in significantly higher model complexity. Although effective, it is also crucial to ensure the reliability of the final decision, especially for noisy, corrupted and out-of-distribution data. To this end, we propose a novel emotion recognition method called trusted emotion recognition (TER), which utilizes uncertainty estimation to calculate the confidence value of predictions. TER combines the results from multiple modalities based on their confidence values to output the trusted predictions. We also provide a new evaluation criterion to assess the reliability of predictions. Specifically, we incorporate trusted precision and trusted recall to determine the trusted threshold and formulate the trusted Acc. and trusted F1 score to evaluate the model’s trusted performance. The proposed framework incorporates a confidence module that endows the model with reliability and robustness against possible noise or corruption. The extensive experimental results validate the effectiveness of our proposed model. TER achieves state-of-the-art performance on the Music-video dataset with 82.40% accuracy. In terms of trusted performance, TER outperforms other methods on the IEMOCAP and Music-video datasets, achieving trusted F1 scores of 0.7511 and 0.9035, respectively.
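
A stripped-down sketch of confidence-based fusion follows, using negative entropy as the uncertainty estimate; the paper's confidence module and trusted-threshold computation are richer than this, so treat the specifics as assumptions.

```python
import torch

# Fuse per-modality predictions weighted by an entropy-derived confidence.
def trusted_fusion(modality_probs):
    """modality_probs: list of (num_classes,) probability vectors."""
    fused, total = 0.0, 0.0
    for p in modality_probs:
        entropy = -(p * p.clamp_min(1e-8).log()).sum()
        conf = torch.exp(-entropy)            # high confidence = low entropy
        fused, total = fused + conf * p, total + conf
    fused = fused / total
    return fused, fused.max()                 # fused probs and confidence value

audio = torch.softmax(torch.randn(7), dim=0)  # e.g., 7 emotion classes
video = torch.softmax(torch.randn(7), dim=0)
probs, conf = trusted_fusion([audio, video])
# A prediction counts as "trusted" only if conf exceeds the trusted threshold.
print(int(probs.argmax()), float(conf))
```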

[315] AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

Dejie Yang, Zijing Zhao, Yang Liu

Main category: cs.CV

TL;DR: AR-VRM improves robot manipulation by explicitly imitating human actions from hand keypoints, outperforming prior methods in few-shot scenarios.

DetailsMotivation: Existing methods for visual robot manipulation (VRM) rely on web data or implicit training, limiting generalization. AR-VRM addresses this by leveraging human action videos explicitly.

Method: AR-VRM uses a keypoint Vision-Language Model (VLM) to learn human actions from hand keypoints and applies analogical reasoning to map human motions to robot tasks.

Result: AR-VRM achieves leading performance on the CALVIN benchmark and excels in few-shot scenarios, demonstrating superior generalization.

Conclusion: Explicit imitation of human actions via keypoints and analogical reasoning significantly enhances robot manipulation under data scarcity.

Abstract: Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and in real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity.

[316] NeeCo: Image Synthesis of Novel Instrument States Based on Dynamic and Deformable 3D Gaussian Reconstruction

Tianle Zeng, Junlei Hu, Gerardo Loza Galindo, Sharib Ali, Duygu Sarikaya, Pietro Valdastri, Dominic Jones

Main category: cs.CV

TL;DR: A novel dynamic Gaussian Splatting technique is introduced to address data scarcity in surgical image datasets, enabling realistic synthetic data generation and improving model performance.

DetailsMotivation: Current data-driven approaches in surgical automation require large labeled datasets, which are scarce. This work aims to overcome this limitation by generating synthetic data.

Method: Proposes a dynamic Gaussian model for surgical scenes, dynamic training adjustment for camera pose challenges, and automatic annotation generation for synthetic data. Evaluated on a new dataset with 14,000 frames.

Result: Achieves photo-realistic labeled datasets with high PSNR (29.87). Models trained on synthetic data outperform state-of-the-art augmentation by 10%, improving overall performance by 15%.

Conclusion: The dynamic Gaussian Splatting technique effectively addresses data scarcity, enhancing surgical automation with realistic synthetic data and improved model training.

Abstract: Computer vision-based technologies significantly enhance surgical automation by advancing tool tracking, detection, and localization. However, current data-driven approaches are data-voracious, requiring large, high-quality labeled image datasets, which limits their application in surgical data science. Our work introduces a novel dynamic Gaussian Splatting technique to address the data scarcity in surgical image datasets. We propose a dynamic Gaussian model to represent dynamic surgical scenes, enabling the rendering of surgical instruments from unseen viewpoints and deformations with real tissue backgrounds. We utilize a dynamic training adjustment strategy to address challenges posed by poorly calibrated camera poses from real-world scenarios. Additionally, we propose a method based on dynamic Gaussians for automatically generating annotations for our synthetic data. For evaluation, we constructed a new dataset featuring seven scenes with 14,000 frames of tool and camera motion and tool jaw articulation, with a background of an ex-vivo porcine model. Using this dataset, we synthetically replicate the scene deformation from the ground truth data, allowing direct comparisons of synthetic image quality. Experimental results illustrate that our method generates photo-realistic labeled image datasets with the highest values in Peak Signal-to-Noise Ratio (29.87). We further evaluate the performance of medical-specific neural networks trained on real and synthetic images using an unseen real-world image dataset. Our results show that models trained on synthetic images generated by the proposed method outperform those trained with state-of-the-art standard data augmentation by 10%, leading to an overall improvement in model performance by nearly 15%.

[317] LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

Xiaohang Zhan, Dingming Liu

Main category: cs.CV

TL;DR: A training-free method for precise occlusion control in image generation using volume rendering principles in latent space.

DetailsMotivation: Existing methods lack precision in controlling occlusion relationships between objects in generated images.

Method: Leverages volume rendering principles in latent space of a pre-trained diffusion model, guided by occlusion relationships and transmittance estimates.

Result: Outperforms existing methods in occlusion accuracy and enables effects like transparency, density, and light adjustments.

Conclusion: The method provides accurate occlusion control without retraining, expanding creative possibilities in image generation.

Abstract: We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to “render” the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.
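
The volume-rendering intuition behind the method can be sketched as front-to-back alpha compositing of per-object latents, where each object's opacity attenuates the transmittance reaching the objects behind it. A minimal sketch under assumed tensor shapes follows; the paper performs this inside a pre-trained diffusion model's latent space with estimated transmittance, which is not reproduced here.

```python
# Minimal sketch of occlusion-aware compositing of per-object latents,
# in the spirit of front-to-back volume rendering. Hypothetical shapes.
import torch

def composite_latents(latents: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """latents: (K, C, H, W) per-object latents ordered front-to-back.
    alphas:  (K, 1, H, W) per-object opacity maps in [0, 1]."""
    out = torch.zeros_like(latents[0])
    transmittance = torch.ones_like(alphas[0])  # light not yet absorbed
    for lat, a in zip(latents, alphas):
        out = out + transmittance * a * lat   # nearer objects occlude farther ones
        transmittance = transmittance * (1.0 - a)
    return out

K, C, H, W = 3, 4, 64, 64
z = composite_latents(torch.randn(K, C, H, W), torch.rand(K, 1, H, W))
print(z.shape)  # torch.Size([4, 64, 64])
```

Lowering an object's alpha in this scheme is what yields the transparency and density effects described in the abstract.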

[318] Collaborative Learning of Scattering and Deep Features for SAR Target Recognition with Noisy Labels

Yimin Fu, Zhunga Liu, Dongxiu Guo, Longfei Wang

Main category: cs.CV

TL;DR: Proposes CLSDF for SAR ATR with noisy labels by integrating scattering and deep features, using GMMs for label division and semi-supervised learning, achieving state-of-the-art results.

DetailsMotivation: High-quality labeled SAR data is hard to acquire, leading to noisy labels and degraded ATR performance. Existing methods focus on image data, which doesn't suit SAR's non-intuitive characteristics.

Method: Collaborative learning of scattering and deep features (CLSDF) via multi-model fusion, dynamic graph ASCs, GMMs for label division, and semi-supervised learning with joint distribution alignment.

Result: Achieves state-of-the-art performance on MSTAR dataset under various noisy label conditions.

Conclusion: CLSDF effectively addresses noisy label challenges in SAR ATR by leveraging physical and deep features, improving robustness and accuracy.

Abstract: The acquisition of high-quality labeled synthetic aperture radar (SAR) data is challenging due to the demanding requirement for expert knowledge. Consequently, the presence of unreliable noisy labels is unavoidable, which results in performance degradation of SAR automatic target recognition (ATR). Existing research on learning with noisy labels mainly focuses on image data. However, the non-intuitive visual characteristics of SAR data are insufficient to achieve noise-robust learning. To address this problem, we propose collaborative learning of scattering and deep features (CLSDF) for SAR ATR with noisy labels. Specifically, a multi-model feature fusion framework is designed to integrate scattering and deep features. The attributed scattering centers (ASCs) are treated as dynamic graph structure data, and the extracted physical characteristics effectively enrich the representation of deep image features. Then, the samples with clean and noisy labels are divided by modeling the loss distribution with multiple class-wise Gaussian Mixture Models (GMMs). Afterward, two divergent branches are trained in a semi-supervised manner, each using the clean/noisy division produced by the other. Moreover, a joint distribution alignment strategy is introduced to enhance the reliability of co-guessed labels. Extensive experiments have been done on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, and the results show that the proposed method can achieve state-of-the-art performance under different operating conditions with various label noises.
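
The clean/noisy division step follows a familiar recipe: fit a two-component Gaussian Mixture Model to per-sample losses and treat the low-mean component as the clean one. A minimal sketch is shown below with a single global GMM for brevity, whereas the paper fits class-wise GMMs; the threshold is a hypothetical parameter.

```python
# Minimal sketch of dividing samples into clean/noisy sets by fitting a
# two-component GMM to per-sample losses (DivideMix-style).
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """losses: (N,) per-sample training losses. Returns a boolean clean mask."""
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(losses.reshape(-1, 1))
    clean_comp = int(np.argmin(gmm.means_.ravel()))   # low-loss component
    p_clean = gmm.predict_proba(losses.reshape(-1, 1))[:, clean_comp]
    return p_clean > threshold

losses = np.concatenate([np.random.gamma(2.0, 0.1, 900),   # mostly clean
                         np.random.gamma(8.0, 0.3, 100)])  # noisy tail
print(split_clean_noisy(losses).sum(), "samples flagged as clean")
```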

[319] Undress to Redress: A Training-Free Framework for Virtual Try-On

Zhiying Li, Junhao Wu, Yeying Jin, Daiheng Gao, Yun Ji, Kaichuan Kong, Lei Yu, Hao Xu, Kai Chen, Bruce Gu, Nana Wang, Zhaoxin Fan

Main category: cs.CV

TL;DR: UR-VTON improves virtual try-on by addressing long-sleeve-to-short-sleeve conversions with a training-free, two-step ‘undress-to-redress’ mechanism, outperforming existing methods.

DetailsMotivation: Existing VTON methods struggle with long-sleeve-to-short-sleeve conversions due to inaccurate skin restoration, limiting realism.

Method: UR-VTON uses an ‘undress-to-redress’ approach, Dynamic Classifier-Free Guidance, and Structural Refiner for enhanced detail.

Result: UR-VTON outperforms state-of-the-art methods in detail preservation and image quality.

Conclusion: UR-VTON offers a practical, effective solution for challenging VTON scenarios, validated by a new benchmark (LS-TON).

Abstract: Virtual try-on (VTON) is a crucial task for enhancing user experience in online shopping by generating realistic garment previews on personal photos. Although existing methods have achieved impressive results, they struggle with long-sleeve-to-short-sleeve conversions-a common and practical scenario-often producing unrealistic outputs when exposed skin is underrepresented in the original image. We argue that this challenge arises from the ‘‘majority’’ completion rule in current VTON models, which leads to inaccurate skin restoration in such cases. To address this, we propose UR-VTON (Undress-Redress Virtual Try-ON), a novel, training-free framework that can be seamlessly integrated with any existing VTON method. UR-VTON introduces an ‘‘undress-to-redress’’ mechanism: it first reveals the user’s torso by virtually ‘‘undressing,’’ then applies the target short-sleeve garment, effectively decomposing the conversion into two more manageable steps. Additionally, we incorporate Dynamic Classifier-Free Guidance scheduling to balance diversity and image quality during DDPM sampling, and employ Structural Refiner to enhance detail fidelity using high-frequency cues. Finally, we present LS-TON, a new benchmark for long-sleeve-to-short-sleeve try-on. Extensive experiments demonstrate that UR-VTON outperforms state-of-the-art methods in both detail preservation and image quality. Code will be released upon acceptance.

[320] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu

Main category: cs.CV

TL;DR: Omni-Effects is a unified framework for generating prompt-guided and spatially controllable composite visual effects (VFX), overcoming limitations of current methods by using LoRA-MoE and SAP innovations.

DetailsMotivation: Current VFX generation methods are limited to single effects due to per-effect LoRA training, hindering applications requiring multiple spatially controllable effects.

Method: Proposes Omni-Effects with LoRA-MoE for diverse effect integration and SAP for spatial control, along with an IIF module to isolate control signals.

Result: Achieves precise spatial control and diverse effect generation, validated by extensive experiments on the Omni-VFX dataset.

Conclusion: Omni-Effects enables users to specify effect categories and locations, advancing VFX production capabilities.

Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
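
The LoRA-MoE idea, a frozen base layer plus a gated mixture of low-rank expert adapters, can be sketched as follows. Sizes, gating, and initialization here are assumptions for illustration; the paper's routing and training setup differ in detail.

```python
# Minimal sketch of a LoRA-based Mixture-of-Experts linear layer: a frozen
# base projection plus a softly-gated sum of low-rank expert adapters.
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # frozen pre-trained weight
        self.down = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.gate = nn.Linear(d_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, d_in). Softly route each input over the expert LoRAs.
        w = torch.softmax(self.gate(x), dim=-1)                        # (B, E)
        lora = torch.einsum("bi,eir,ero->beo", x, self.down, self.up)  # (B, E, d_out)
        return self.base(x) + torch.einsum("be,beo->bo", w, lora)

layer = LoRAMoELinear(64, 64)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```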

[321] Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing

Weitao Wang, Haoran Xu, Jun Meng, Haoqian Wang

Main category: cs.CV

TL;DR: A tuning-free, plug-and-play method for aligning edited 3D assets with original geometry, improving multi-view consistency and mesh quality.

DetailsMotivation: Addressing the gap in 3D editing tools by preserving geometry while enhancing color, style, and lighting, avoiding quality degradation from 2D editing methods.

Method: Proposes a geometry preservation module using original input normal latents and an injection switcher to control supervision, ensuring alignment between edited and original views.

Result: Consistently improves multi-view consistency and mesh quality across various multi-view diffusion models and editing methods.

Conclusion: The method effectively bridges the gap between 2D editing and 3D generation, enhancing the quality of personalized 3D content.

Abstract: As 3D generation techniques continue to flourish, the demand for generating personalized content is rapidly rising. Users increasingly seek to apply various editing methods to polish generated 3D content, aiming to enhance its color, style, and lighting without compromising the underlying geometry. However, most existing editing tools focus on the 2D domain, and directly feeding their results into 3D generation methods (like multi-view diffusion models) will introduce information loss, degrading the quality of the final 3D assets. In this paper, we propose a tuning-free, plug-and-play scheme that aligns edited assets with their original geometry in a single inference run. Central to our approach is a geometry preservation module that guides the edited multi-view generation with original input normal latents. Besides, an injection switcher is proposed to deliberately control the supervision extent of the original normals, ensuring the alignment between the edited color and normal views. Extensive experiments show that our method consistently improves both the multi-view consistency and mesh quality of edited 3D assets, across multiple combinations of multi-view diffusion models and editing methods.

[322] Multi-view Normal and Distance Guidance Gaussian Splatting for Surface Reconstruction

Bo Jia, Yanan Guo, Ying Chang, Benkui Zhang, Ying Xie, Kangning Du, Lin Cao

Main category: cs.CV

TL;DR: The paper introduces a multi-view normal and distance-guided Gaussian splatting method to address biases in 3DGS by unifying geometric depth and aligning 3D normals, improving reconstruction accuracy.

DetailsMotivation: Biases in 3D Gaussian Splatting (3DGS) arise when normals align within single-view planes, causing inconsistencies in nearby views. The goal is to improve multi-view scene reconstruction.

Method: Proposes multi-view distance reprojection regularization and normal enhancement modules to align Gaussians and ensure consistency across views.

Result: Outperforms baseline in quantitative and qualitative evaluations, enhancing 3DGS’s surface reconstruction.

Conclusion: The method effectively addresses multi-view challenges, improving geometric accuracy and consistency in 3DGS.

Abstract: 3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS.

[323] A Registration-Based Star-Shape Segmentation Model and Fast Algorithms

Daoping Zhang, Xue-Cheng Tai, Lok Ming Lui

Main category: cs.CV

TL;DR: A star-shape segmentation model using level set representation and registration framework is proposed for accurate segmentation in corrupted images, handling full/partial star-shapes and landmark constraints.

DetailsMotivation: Accurate image segmentation is challenging due to occlusions, obscurities, or noise. Star-shape priors are explored to address this.

Method: Combines level set representation with registration framework, constrains deformed level set function, and uses alternating direction method of multipliers.

Result: Effective segmentation of star-shape objects, accommodating single/multiple centers and landmark constraints, validated on synthetic and real images.

Conclusion: The proposed model successfully achieves accurate star-shape segmentation, even in challenging conditions.

Abstract: Image segmentation plays a crucial role in extracting objects of interest and identifying their boundaries within an image. However, accurate segmentation becomes challenging when dealing with occlusions, obscurities, or noise in corrupted images. To tackle this challenge, prior information is often utilized, with recent attention on star-shape priors. In this paper, we propose a star-shape segmentation model based on the registration framework. By combining the level set representation with the registration framework and imposing constraints on the deformed level set function, our model enables both full and partial star-shape segmentation, accommodating single or multiple centers. Additionally, our approach allows for the enforcement of identified boundaries to pass through specified landmark locations. We tackle the proposed models using the alternating direction method of multipliers. Through numerical experiments conducted on synthetic and real images, we demonstrate the efficacy of our approach in achieving accurate star-shape segmentation.

[324] Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting

Ting Xiang, Changjian Chen, Zhuo Tang, Qifeng Zhang, Fei Lyu, Li Yang, Jiapeng Zhang, Kenli Li

Main category: cs.CV

TL;DR: TriReWeight, a triplet-connection-based sample re-weighting method, enhances generative data augmentation by addressing noisy images, outperforming SOTA methods by 7.9% on natural and 3.4% on medical datasets.

DetailsMotivation: Limited performance of computer vision models due to scarce datasets, especially in medical diagnosis, motivates the use of generative models for data augmentation. Noisy images from uncontrolled generation processes pose a challenge.

Method: Theoretical analysis of three supervision types for generated images leads to TriReWeight, a method that re-weights noisy images without degrading performance. It integrates with any generative augmentation method.

Result: TriReWeight achieves superior performance, with a 7.9% average improvement on natural image datasets and 3.4% on medical datasets, and maintains optimal generalization.

Conclusion: TriReWeight effectively addresses noisy image issues in generative data augmentation, enhancing model performance across diverse datasets and methods.

Abstract: The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation methods and never downgrade their performance. Moreover, its generalization approaches the optimal in the order $O(\sqrt{d\ln (n)/n})$. Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by 7.9% on average over six natural image datasets and by 3.4% on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.

[325] Grouped Speculative Decoding for Autoregressive Image Generation

Junhyuk So, Juncheol Shin, Hyunho Kook, Eunhyeok Park

Main category: cs.CV

TL;DR: Grouped Speculative Decoding (GSD) accelerates autoregressive image models by 3.7x without extra training, leveraging token redundancy and diversity.

DetailsMotivation: Autoregressive image models have slow inference due to sequential nature; existing speculative decoding methods are inefficient for image tokens.

Method: Proposes GSD, a training-free method that evaluates clusters of visually valid tokens dynamically, unlike traditional single-token SD.

Result: GSD achieves 3.7x speedup on average while maintaining image quality.

Conclusion: GSD effectively addresses the inefficiency of traditional SD for image tokens, offering scalable acceleration for AR image models.

Abstract: Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality, all without requiring any additional training. The source code is available at https://github.com/junhyukso/GSD
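
The grouped-acceptance intuition can be illustrated by relaxing the single-token acceptance test of standard speculative decoding to membership in a group of plausible target tokens, here approximated by the top-p nucleus. This is only a sketch of the intuition; the paper's actual acceptance rule and its dynamic clustering of visually valid tokens are more involved.

```python
# Minimal sketch: accept a draft token if it lies inside a group of "valid"
# tokens under the target distribution (here: the top-p nucleus), instead of
# requiring it to match the single most-likely target token.
import numpy as np

def grouped_accept(draft_token: int, target_probs: np.ndarray, top_p: float = 0.9) -> bool:
    order = np.argsort(target_probs)[::-1]          # tokens by target probability
    cum = np.cumsum(target_probs[order])
    group = set(order[: int(np.searchsorted(cum, top_p)) + 1])  # nucleus group
    return draft_token in group

probs = np.array([0.35, 0.30, 0.20, 0.10, 0.05])    # toy target distribution
print(grouped_accept(draft_token=2, target_probs=probs))  # True: token 2 is in the nucleus
```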

[326] Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion

Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim

Main category: cs.CV

TL;DR: Contrastive Inversion is a novel method for extracting common concepts from small image sets without relying on additional guidance, improving generation quality.

DetailsMotivation: Addressing the limitations of existing methods that depend on manual guidance, which can degrade generation quality by failing to fully separate auxiliary features.

Method: Uses contrastive learning to train target and auxiliary text tokens, followed by disentangled cross-attention fine-tuning for better concept fidelity.

Result: Achieves superior performance in concept representation and editing compared to existing techniques.

Conclusion: The proposed method effectively identifies and represents common concepts without additional guidance, enhancing generation quality.

Abstract: The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. Then we apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves a balanced, high-level performance in both concept representation and editing, outperforming existing techniques.

[327] Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

Haoran Wang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi

Main category: cs.CV

TL;DR: CAV-SAM adapts SAM2 for downstream tasks by treating reference-target pairs as pseudo videos, improving segmentation performance by over 5%.

DetailsMotivation: Large vision models like SAM struggle with downstream tasks; reference segmentation offers a solution but existing methods are costly.

Method: CAV-SAM uses pseudo videos to leverage SAM2’s iVOS capabilities, with DBST for semantic transformation and TTGA for geometric alignment.

Result: Achieves over 5% improvement in segmentation performance compared to SOTA methods.

Conclusion: CAV-SAM provides a lightweight, effective adaptation of SAM2 for downstream tasks.

Abstract: Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and brings massive data and computational cost. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.

[328] Hyperspectral Imaging

Danfeng Hong, Chenyu Li, Naoto Yokoya, Bing Zhang, Xiuping Jia, Antonio Plaza, Paolo Gamba, Jon Atli Benediktsson, Jocelyn Chanussot

Main category: cs.CV

TL;DR: A primer on hyperspectral imaging (HSI) covering principles, methods, applications, challenges, and future directions, emphasizing its cross-disciplinary potential.

DetailsMotivation: To provide a comprehensive overview of HSI, highlighting its non-invasive, label-free capabilities for analyzing material, chemical, and biological properties.

Method: Discusses physical principles, sensor architectures, data acquisition, calibration, and analysis techniques like dimensionality reduction, classification, and AI-driven methods.

Result: HSI enables advanced monitoring and decision-making across fields like agriculture, biomedicine, and security, though challenges like data complexity persist.

Conclusion: HSI’s future lies in scalable, real-time systems and cross-disciplinary applications, driven by innovations like sensor miniaturization and AI.

Abstract: Hyperspectral imaging (HSI) is an advanced sensing modality that simultaneously captures spatial and spectral information, enabling non-invasive, label-free analysis of material, chemical, and biological properties. This Primer presents a comprehensive overview of HSI, from the underlying physical principles and sensor architectures to key steps in data acquisition, calibration, and correction. We summarize common data structures and highlight classical and modern analysis methods, including dimensionality reduction, classification, spectral unmixing, and AI-driven techniques such as deep learning. Representative applications across Earth observation, precision agriculture, biomedicine, industrial inspection, cultural heritage, and security are also discussed, emphasizing HSI’s ability to uncover sub-visual features for advanced monitoring, diagnostics, and decision-making. Persistent challenges, such as hardware trade-offs, acquisition variability, and the complexity of high-dimensional data, are examined alongside emerging solutions, including computational imaging, physics-informed modeling, cross-modal fusion, and self-supervised learning. Best practices for dataset sharing, reproducibility, and metadata documentation are further highlighted to support transparency and reuse. Looking ahead, we explore future directions toward scalable, real-time, and embedded HSI systems, driven by sensor miniaturization, self-supervised learning, and foundation models. As HSI evolves into a general-purpose, cross-disciplinary platform, it holds promise for transformative applications in science, technology, and society.

[329] Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation

Xiaoyan Liu, Kangrui Li, Jiaxin Liu

Main category: cs.CV

TL;DR: Dream4D introduces a novel framework combining controllable video generation and neural 4D reconstruction to create spatiotemporally coherent 4D content, outperforming existing methods in quality.

DetailsMotivation: Current methods struggle with maintaining view consistency and handling complex dynamics in large-scale environments, prompting the need for a unified approach.

Method: Dream4D uses a two-stage architecture: predicting camera trajectories from a single image via few-shot learning, then generating multi-view sequences with a pose-conditioned diffusion process, and converting them into a 4D representation.

Result: The framework leverages temporal priors and geometric awareness, achieving higher quality (e.g., mPSNR, mSSIM) compared to existing methods.

Conclusion: Dream4D successfully bridges the gap in 4D content synthesis, offering improved consistency and quality for complex scenes.

Abstract: The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach seamlessly combines a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. This framework is the first to leverage both rich temporal priors from video diffusion models and geometric awareness of the reconstruction models, which significantly facilitates 4D generation and achieves higher quality (e.g., mPSNR, mSSIM) than existing methods.

[330] GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking

Xudong Han, Pengcheng Fang, Yueying Tian, Jianhui Yu, Xiaohao Cai, Daniel Roggen, Philip Birch

Main category: cs.CV

TL;DR: GRASPTrack integrates depth estimation and instance segmentation into tracking-by-detection for 3D geometric reasoning, improving MOT robustness in occluded and complex scenes.

DetailsMotivation: Addressing occlusions and depth ambiguity in monocular MOT, which conventional TBD methods struggle with due to lack of geometric awareness.

Method: Combines monocular depth estimation and instance segmentation to generate 3D point clouds, uses voxel-based 3D IoU for association, and incorporates depth-aware noise compensation and motion cues.

Result: Achieves competitive performance on MOT17, MOT20, and DanceTrack benchmarks, especially in occluded and complex scenes.

Conclusion: GRASPTrack enhances MOT robustness by leveraging 3D geometric reasoning and adaptive depth-aware techniques.

Abstract: Multi-object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking-by-detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high-fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth-aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth-enhanced Observation-Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion-based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns.
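
The voxel-based 3D IoU reduces to quantizing each point cloud into occupied voxel indices and comparing the two occupancy sets, as in the minimal sketch below; the voxel size is a hypothetical parameter and the paper's voxelization details may differ.

```python
# Minimal sketch of a voxel-based 3D IoU between two point clouds: quantize
# points into voxel indices and compare the occupied sets.
import numpy as np

def voxel_iou(points_a: np.ndarray, points_b: np.ndarray, voxel: float = 0.05) -> float:
    """points_*: (N, 3) arrays of 3D points (e.g., back-projected detections)."""
    vox_a = {tuple(v) for v in np.floor(points_a / voxel).astype(int)}
    vox_b = {tuple(v) for v in np.floor(points_b / voxel).astype(int)}
    union = len(vox_a | vox_b)
    return len(vox_a & vox_b) / union if union else 0.0

a = np.random.rand(500, 3)
b = a + 0.02  # slightly shifted copy: high but imperfect overlap
print(round(voxel_iou(a, b), 3))
```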

[331] Prototype-Guided Curriculum Learning for Zero-Shot Learning

Lei Wang, Shiming Chen, Guo-Sen Xie, Ziming Hong, Chaojian Yu, Qinmu Peng, Xinge You

Main category: cs.CV

TL;DR: The paper proposes CLZSL, a prototype-guided curriculum learning framework for Zero-Shot Learning (ZSL), addressing instance-level mismatches and class-level imprecision in semantic prototypes to improve knowledge transfer.

DetailsMotivation: Existing embedding-based ZSL methods suffer from noisy supervision due to instance-level mismatches and class-level imprecision in manually defined semantic prototypes, hindering effective knowledge transfer.

Method: The CLZSL framework includes a Prototype-Guided Curriculum Learning (PCL) module to prioritize well-aligned samples and a Prototype Update (PUP) module to dynamically refine class-level prototypes.

Result: Experiments on AWA2, SUN, and CUB datasets demonstrate the effectiveness of CLZSL in improving visual-semantic mapping.

Conclusion: CLZSL successfully mitigates noisy supervision in ZSL, enhancing knowledge transfer to unseen classes.

Abstract: In Zero-Shot Learning (ZSL), embedding-based methods enable knowledge transfer from seen to unseen classes by learning a visual-semantic mapping from seen-class images to class-level semantic prototypes (e.g., attributes). However, these semantic prototypes are manually defined and may introduce noisy supervision for two main reasons: (i) instance-level mismatch: variations in perspective, occlusion, and annotation bias will cause discrepancies between individual sample and the class-level semantic prototypes; and (ii) class-level imprecision: the manually defined semantic prototypes may not accurately reflect the true semantics of the class. Consequently, the visual-semantic mapping will be misled, reducing the effectiveness of knowledge transfer to unseen classes. In this work, we propose a prototype-guided curriculum learning framework (dubbed as CLZSL), which mitigates instance-level mismatches through a Prototype-Guided Curriculum Learning (PCL) module and addresses class-level imprecision via a Prototype Update (PUP) module. Specifically, the PCL module prioritizes samples with high cosine similarity between their visual mappings and the class-level semantic prototypes, and progressively advances to less-aligned samples, thereby reducing the interference of instance-level mismatches to achieve accurate visual-semantic mapping. Besides, the PUP module dynamically updates the class-level semantic prototypes by leveraging the visual mappings learned from instances, thereby reducing class-level imprecision and further improving the visual-semantic mapping. Experiments were conducted on standard benchmark datasets (AWA2, SUN, and CUB) to verify the effectiveness of our method.
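
The PCL module's scheduling can be sketched as ranking samples by the cosine similarity between their visual mappings and the class prototype and progressively enlarging the training pool. The linear pacing function below is an assumption for illustration, not the paper's schedule.

```python
# Minimal sketch of prototype-guided curriculum selection: start from the
# best-aligned samples and grow the pool toward less-aligned ones over epochs.
import numpy as np

def curriculum_subset(vis_map: np.ndarray, prototypes: np.ndarray,
                      labels: np.ndarray, epoch: int, total_epochs: int) -> np.ndarray:
    """vis_map: (N, d) visual-semantic mappings; prototypes: (C, d); labels: (N,)."""
    proto = prototypes[labels]
    cos = (vis_map * proto).sum(1) / (
        np.linalg.norm(vis_map, axis=1) * np.linalg.norm(proto, axis=1) + 1e-8)
    frac = min(1.0, 0.3 + 0.7 * epoch / max(1, total_epochs - 1))  # pacing function
    k = max(1, int(frac * len(cos)))
    return np.argsort(cos)[::-1][:k]  # indices of the k best-aligned samples

X = np.random.randn(100, 16); P = np.random.randn(10, 16)
y = np.random.randint(0, 10, 100)
print(len(curriculum_subset(X, P, y, epoch=0, total_epochs=10)))  # 30 samples at start
```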

[332] Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)

Lennart Bastian, Mohammad Rashed, Nassir Navab, Tolga Birdal

Main category: cs.CV

TL;DR: The paper proposes a method for robustly modeling and extrapolating 3D rotation trajectories using Neural Controlled Differential Equations and SO(3) Savitzky-Golay paths, addressing challenges like unknown dynamics and noisy observations.

DetailsMotivation: Challenges in SO(3) extrapolation include unknown dynamics (e.g., moment of inertia), non-conservative forces, and noisy observations, which limit existing methods relying on energy conservation or constant velocity assumptions.

Method: The approach uses Neural Controlled Differential Equations guided by SO(3) Savitzky-Golay paths to model trajectories in a physically and geometrically meaningful way, without relying on energy or momentum conservation.

Result: The model achieves robust extrapolation in simulations and real-world scenarios, generalizing well to trajectories with unknown physical parameters and noisy inputs.

Conclusion: The proposed method is versatile, robust to noise, and easily integrable into existing pipelines, making it suitable for complex, non-inertial systems.

Abstract: Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings. Code is available at https://github.com/bastianlb/forecasting-rotational-dynamics

[333] GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences

Saihui Hou, Chenye Wang, Wenpeng Lang, Zhengxiang Lan, Yongzhen Huang

Main category: cs.CV

TL;DR: The paper proposes a novel gait recognition method by treating gait as individualized actions (snippets), addressing limitations of set-based and sequence-based approaches. It achieves high accuracy on benchmark datasets.

DetailsMotivation: Existing gait recognition methods (set-based and sequence-based) have limitations in capturing temporal context. The paper aims to address these by modeling gait as snippets of actions.

Method: The approach involves Snippet Sampling and Snippet Modeling to capture multi-scale temporal context. It uses a 2D convolution-based backbone for implementation.

Result: The method achieves rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW, demonstrating its effectiveness.

Conclusion: The snippet-based approach shows promise for gait recognition, offering a balanced way to capture temporal dependencies.

Abstract: Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.
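
Snippet Sampling can be sketched as drawing a continuous segment from the silhouette sequence and randomly selecting frames inside it; the segment and snippet sizes below are hypothetical parameters for illustration.

```python
# Minimal sketch of snippet sampling: pick a continuous segment of the
# sequence, then randomly select frames inside it.
import random

def sample_snippet(seq_len: int, segment_len: int = 16, snippet_size: int = 4):
    start = random.randint(0, max(0, seq_len - segment_len))  # segment start
    segment = range(start, min(seq_len, start + segment_len))
    return sorted(random.sample(list(segment), min(snippet_size, len(segment))))

random.seed(0)
# A few snippets from a 64-frame sequence; together they expose the model to
# multiple temporal scales of the gait cycle.
print([sample_snippet(64) for _ in range(3)])
```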

[334] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Zhonghao Yan, Muxi Diao, Yuxuan Yang, Jiayuan Xu, Kaizhou Zhang, Ruoyan Jing, Lele Yang, Yanxi Liu, Kongming Liang, Zhanyu Ma

Main category: cs.CV

TL;DR: The paper introduces Unified Medical Reasoning Grounding (UMRG), a novel task for clinical reasoning and pixel-level grounding in medical imaging, and presents MedReasoner, a modular framework achieving state-of-the-art performance.

DetailsMotivation: Current medical-grounding pipelines rely on supervised fine-tuning with explicit spatial hints, which fail to handle implicit clinical queries.

Method: Proposes UMRG task and U-MRG-14K dataset, and introduces MedReasoner, a framework separating reasoning (optimized with reinforcement learning) from segmentation (using a frozen expert).

Result: MedReasoner achieves state-of-the-art performance on U-MRG-14K and generalizes well to unseen clinical queries.

Conclusion: Reinforcement learning shows promise for interpretable medical grounding, as demonstrated by MedReasoner.

Abstract: Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.

[335] Boosting Active Defense Persistence: A Two-Stage Defense Framework Combining Interruption and Poisoning Against Deepfake

Hongrui Zheng, Yuezun Li, Liejun Wang, Yunfeng Diao, Zhiqing Guo

Main category: cs.CV

TL;DR: The paper proposes a Two-Stage Defense Framework (TSDF) to counter deepfake threats by distorting forged content and disrupting attackers’ retraining pipelines, ensuring long-term defense effectiveness.

DetailsMotivation: Current active defenses against deepfakes lack persistence, as attackers bypass them by retraining models on protected samples. The paper aims to create a defense that not only distorts content but also blocks model adaptation.

Method: The TSDF uses dual-function adversarial perturbations: distorting forged results and poisoning the data source to disrupt attackers’ retraining pipelines.

Result: Experiments show TSDF outperforms traditional methods, maintaining strong defense even under adversarial retraining.

Conclusion: TSDF provides a persistent and effective defense against deepfakes by combining distortion and data poisoning.

Abstract: Active defense strategies have been developed to counter the threat of deepfake technology. However, a primary challenge is their lack of persistence, as their effectiveness is often short-lived. Attackers can bypass these defenses by simply collecting protected samples and retraining their models. This means that static defenses inevitably fail when attackers retrain their models, which severely limits practical use. We argue that an effective defense not only distorts forged content but also blocks the model’s ability to adapt, which occurs when attackers retrain their models on protected images. To achieve this, we propose an innovative Two-Stage Defense Framework (TSDF). Benefiting from the intensity separation mechanism designed in this paper, the framework uses dual-function adversarial perturbations to perform two roles. First, it can directly distort the forged results. Second, it acts as a poisoning vehicle that disrupts the data preparation process essential for an attacker’s retraining pipeline. By poisoning the data source, TSDF aims to prevent the attacker’s model from adapting to the defensive perturbations, thus ensuring the defense remains effective long-term. Comprehensive experiments show that the performance of traditional interruption methods degrades sharply when it is subjected to adversarial retraining. However, our framework shows a strong dual defense capability, which can improve the persistence of active defense. Our code will be available at https://github.com/vpsg-research/TSDF.

[336] Power Battery Detection

Xiaoqi Zhao, Peiqian Cao, Lihe Zhang, Zonglei Feng, Hanqi Liu, Jiaming Zuo, Youwei Pang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu

Main category: cs.CV

TL;DR: The paper introduces PBD5K, a benchmark for power battery detection (PBD) using X-ray images, and proposes MDCNeXt, a model for point-level segmentation to address challenges like dense endpoints and low contrast.

DetailsMotivation: Manual inspection of power batteries is inefficient, and traditional vision algorithms fail due to dense plates, low contrast, and imaging artifacts. The study aims to automate and improve this process.

Method: Developed PBD5K, a dataset with 5,000 X-ray images, and MDCNeXt, a model integrating multi-dimensional clues (point, line, count) with state space modules for better segmentation.

Result: MDCNeXt effectively localizes dense endpoints and suppresses visual interference, supported by a distance-adaptive mask generation strategy.

Conclusion: The work advances PBD with a scalable benchmark and robust model, encouraging further research in battery quality inspection.

Abstract: Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and drive more attention into this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at PBD5K (https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD).

[337] MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks

Yushen Xu, Xiaosong Li, Zhenyu Kuang, Xiaoqi Cheng, Haishu Tan, Huafeng Li

Main category: cs.CV

TL;DR: MambaTrans, a novel multimodal fusion image translator, improves downstream task performance by adapting fused images to models trained on visible images, leveraging multimodal descriptions and semantic masks.

DetailsMotivation: Existing pre-trained models on visible images struggle with multimodal fused images due to pixel distribution differences, degrading downstream task performance.

Method: MambaTrans uses multimodal large language model descriptions and semantic masks, combining mask-image-text cross-attention and a 3D-Selective Scan Module to enhance visual capabilities.

Result: Experiments show MambaTrans effectively improves multimodal image performance in downstream tasks without adjusting pre-trained model parameters.

Conclusion: MambaTrans successfully bridges the gap between multimodal fused images and visible image-trained models, enhancing downstream task performance.

Abstract: The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre-training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention and a 3D-Selective Scan Module, enhancing pure visual capabilities. By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long-term dependencies among text, masks, and images. This enables favorable results in pre-trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.

[338] OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution

Zhiqiang Wu, Zhaomang Sun, Tong Zhou, Bingtao Fu, Ji Cong, Yitong Dong, Huaqi Zhang, Xuan Tang, Mingsong Chen, Xian Wei

Main category: cs.CV

TL;DR: OMGSR is a framework for one-step Real-ISR using DDPM/FM models, addressing the gap between LQ image and Gaussian latent distributions by injecting LQ at mid-timesteps and refining the latent distribution.

DetailsMotivation: The gap between LQ image latent distribution and Gaussian noisy latent distribution limits generative prior utilization in one-step Real-ISR.

Method: OMGSR injects LQ image latent at mid-timesteps, uses Latent Distribution Refinement loss, and Overlap-Chunked LPIPS/GAN loss to avoid artifacts.

Result: OMGSR variants (OMGSR-S/F) achieve balanced/excellent performance at 512-resolution, with OMGSR-F dominating reference metrics. Scaling to 1k/2k resolution yields detailed results.

Conclusion: OMGSR effectively bridges the latent distribution gap and enhances Real-ISR performance, especially at higher resolutions.

Abstract: Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching (FM) generative models show promising potential for one-step Real-World Image Super-Resolution (Real-ISR). Recent one-step Real-ISR models typically inject a Low-Quality (LQ) image latent distribution at the initial timestep. However, a fundamental gap exists between the LQ image latent distribution and the Gaussian noisy latent distribution, limiting the effective utilization of generative priors. We observe that the noisy latent distribution at DDPM/FM mid-timesteps aligns more closely with the LQ image latent distribution. Based on this insight, we present One Mid-timestep Guidance Real-ISR (OMGSR), a universal framework applicable to DDPM/FM-based generative models. OMGSR injects the LQ image latent distribution at a pre-computed mid-timestep, incorporating the proposed Latent Distribution Refinement loss to alleviate the latent distribution gap. We also design the Overlap-Chunked LPIPS/GAN loss to eliminate checkerboard artifacts in image generation. Within this framework, we instantiate OMGSR for DDPM/FM-based generative models with two variants: OMGSR-S (SD-Turbo) and OMGSR-F (FLUX.1-dev). Experimental results demonstrate that OMGSR-S/F achieves balanced/excellent performance across quantitative and qualitative metrics at 512-resolution. Notably, OMGSR-F establishes overwhelming dominance in all reference metrics. We further train a 1k-resolution OMGSR-F to match the default resolution of FLUX.1-dev, which yields excellent results, especially in the details of the image generation. We also generate 2k-resolution images with the 1k-resolution OMGSR-F using our two-stage Tiled VAE & Diffusion.
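
Mid-timestep injection amounts to noising the LQ latent with the standard DDPM forward kernel, $z_t = \sqrt{\bar\alpha_t}\, z_{LQ} + \sqrt{1-\bar\alpha_t}\,\epsilon$, at a pre-computed mid timestep rather than starting from pure noise. A minimal sketch follows; the schedule and $t_{mid}$ below are assumptions, whereas the paper pre-computes the mid timestep at which the noisy latent distribution best matches the LQ latent distribution.

```python
# Minimal sketch of mid-timestep guidance: noise the LQ latent to a
# mid timestep with the DDPM forward kernel, then denoise from there.
import torch

def inject_at_mid_timestep(z_lq: torch.Tensor, alphas_cumprod: torch.Tensor,
                           t_mid: int) -> torch.Tensor:
    """z_lq: (B, C, H, W) LQ image latent; returns the noised latent z_{t_mid}."""
    a_bar = alphas_cumprod[t_mid]
    noise = torch.randn_like(z_lq)
    return a_bar.sqrt() * z_lq + (1.0 - a_bar).sqrt() * noise

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # toy linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
z_mid = inject_at_mid_timestep(torch.randn(1, 4, 64, 64), alphas_cumprod, t_mid=T // 2)
print(z_mid.shape)  # the one-step generator then denoises z_mid directly
```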

[339] Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

Bao Li, Xiaomei Zhang, Miao Xu, Zhaoxin Fan, Xiangyu Zhu, Zhen Lei

Main category: cs.CV

TL;DR: Pose-RFT is a reinforcement fine-tuning framework for 3D human pose generation in multimodal large language models (MLLMs), addressing ambiguity and alignment issues via hybrid action reinforcement learning.

DetailsMotivation: Existing pose-specific MLLMs struggle with ambiguity and task-specific alignment in 3D pose generation due to supervised objectives like SMPL parameter regression.

Method: Proposes Pose-RFT, using HyGRPO (a hybrid reinforcement learning algorithm) to jointly optimize discrete language prediction and continuous pose generation with task-specific rewards.

Result: Pose-RFT outperforms existing pose-specific MLLMs on multiple benchmarks, demonstrating improved spatial and semantic alignment.

Conclusion: Hybrid action reinforcement fine-tuning is effective for accurate 3D human pose generation in multimodal tasks.

Abstract: Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
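
The group-wise reward normalization at the heart of HyGRPO can be sketched in a few lines: sample a group of responses per prompt and use the within-group standardized reward as each response's advantage. Reward values below are toy numbers; the paper's task-specific rewards and its hybrid discrete/continuous action handling are not reproduced here.

```python
# Minimal sketch of group-wise reward normalization (GRPO-style advantages).
import numpy as np

def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: (G,) task-specific rewards for G sampled responses to one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([0.2, 0.9, 0.5, 0.1])  # e.g., pose-alignment scores
print(group_normalized_advantages(rewards).round(3))
```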

[340] DiTVR: Zero-Shot Diffusion Transformer for Video Restoration

Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte

Main category: cs.CV

TL;DR: DiTVR is a zero-shot video restoration framework using a diffusion transformer with trajectory-aware attention and a wavelet-guided sampler, achieving state-of-the-art results with superior temporal consistency and detail preservation.

DetailsMotivation: Traditional methods produce unrealistic details and require paired datasets, while generative diffusion models struggle with temporal consistency.

Method: DiTVR combines a diffusion transformer with trajectory-aware attention and a flow-guided sampler, focusing on temporal dynamics and motion correspondences.

Result: DiTVR achieves state-of-the-art performance on video restoration benchmarks, excelling in temporal consistency and robustness to noise and occlusions.

Conclusion: DiTVR offers an effective zero-shot solution for video restoration, balancing detail preservation and temporal consistency.

Abstract: Video restoration aims to reconstruct high-quality video sequences from low-quality inputs, addressing tasks such as super-resolution, denoising, and deblurring. Traditional regression-based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero-shot video restoration framework that couples a diffusion transformer with trajectory-aware attention and a wavelet-guided, flow-consistent sampler. Unlike prior 3D convolutional or frame-wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow-guided sampler injects data consistency only into low-frequency bands, preserving high-frequency priors while accelerating convergence. DiTVR establishes a new zero-shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.
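
The trajectory-aware idea, aligning each token with the token its optical-flow vector points to in the next frame, can be sketched with simple nearest-neighbor rounding; the paper's attention, vital-layer weighting, and neighbour cache are more elaborate than this illustration.

```python
# Minimal sketch of gathering flow-aligned token neighbors: each token at
# frame t is paired with the token its flow vector points to at frame t+1.
import numpy as np

def flow_neighbors(flow: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) per-token (dx, dy). Returns (H, W, 2) neighbor coords."""
    H, W, _ = flow.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    nx = np.clip(np.rint(xs + flow[..., 0]), 0, W - 1).astype(int)
    ny = np.clip(np.rint(ys + flow[..., 1]), 0, H - 1).astype(int)
    return np.stack([ny, nx], axis=-1)

flow = np.random.randn(8, 8, 2)
nbr = flow_neighbors(flow)
# tokens_next[nbr[..., 0], nbr[..., 1]] would gather each token's flow-aligned
# counterpart in the next frame for trajectory-aware attention.
print(nbr.shape)  # (8, 8, 2)
```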

[341] Cut2Next: Generating Next Shot via In-Context Tuning

Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, Ziwei Liu

Main category: cs.CV

TL;DR: Cut2Next introduces Next Shot Generation (NSG) for high-quality, cinematically coherent shots using a Diffusion Transformer (DiT) with Hierarchical Multi-Prompting and architectural innovations like CACI and HAM.

DetailsMotivation: Current methods lack narrative sophistication and cinematic integrity, focusing only on visual consistency.

Method: Uses a Diffusion Transformer (DiT) with Hierarchical Multi-Prompting (Relational and Individual Prompts) and innovations like Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM).

Result: Cut2Next outperforms in visual consistency, text fidelity, and user preference for editing patterns and cinematic continuity.

Conclusion: Cut2Next successfully generates narratively expressive and cinematically coherent shots, validated by user studies.

Abstract: Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.

[342] Semi-supervised Multiscale Matching for SAR-Optical Image

Jingze Gai, Changchun Li

Main category: cs.CV

TL;DR: A semi-supervised SAR-optical image matching method (S2M2-SAR) is proposed to reduce reliance on manual annotation by leveraging labeled and unlabeled data, achieving competitive performance with fully supervised methods.

DetailsMotivation: Manual annotation for SAR-optical image matching is time-consuming and complex, limiting the availability of labeled data.

Method: S2M2-SAR uses pseudo-labeling for unlabeled data and a cross-modal feature enhancement module to disentangle shared and specific features.

Result: S2M2-SAR outperforms semi-supervised methods and matches fully supervised state-of-the-art performance.

Conclusion: The method efficiently reduces annotation dependency while maintaining high performance, showing practical potential.

Abstract: Driven by the complementary nature of optical and synthetic aperture radar (SAR) images, SAR-optical image matching has garnered significant interest. Most existing SAR-optical image matching methods aim to capture effective matching features by employing the supervision of pixel-level matched correspondences within SAR-optical image pairs, which, however, suffers from time-consuming and complex manual annotation, making it difficult to collect sufficient labeled SAR-optical image pairs. To handle this, we design a semi-supervised SAR-optical image matching pipeline that leverages both scarce labeled and abundant unlabeled image pairs and propose a semi-supervised multiscale matching for SAR-optical image matching (S2M2-SAR). Specifically, we pseudo-label those unlabeled SAR-optical image pairs with pseudo ground-truth similarity heatmaps by combining both deep and shallow level matching results, and train the matching model by employing labeled and pseudo-labeled similarity heatmaps. In addition, we introduce a cross-modal feature enhancement module trained using a cross-modality mutual independence loss, which requires no ground-truth labels. This unsupervised objective promotes the separation of modality-shared and modality-specific features by encouraging statistical independence between them, enabling effective feature disentanglement across optical and SAR modalities. To evaluate the effectiveness of S2M2-SAR, we compare it with existing competitors on benchmark datasets. Experimental results demonstrate that S2M2-SAR not only surpasses existing semi-supervised methods but also achieves performance competitive with fully supervised SOTA methods, demonstrating its efficiency and practical potential.
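
The pseudo-labeling step fuses deep- and shallow-level similarity heatmaps and keeps only confident pairs for training. A schematic version (the fusion weight and confidence threshold are illustrative, not the paper's values):

```python
import torch

def make_pseudo_heatmaps(deep_hm, shallow_hm, alpha=0.5, tau=0.8):
    """deep_hm, shallow_hm: (B, H, W) similarity heatmaps for unlabeled pairs.
    Returns fused pseudo ground-truth heatmaps and a mask of pairs whose
    peak confidence is high enough to train on."""
    fused = alpha * deep_hm + (1 - alpha) * shallow_hm
    peak = fused.flatten(-2).max(dim=-1).values   # (B,) peak response per pair
    return fused, peak > tau
```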

[343] Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models

Chenyue Song, Chen Hui, Haiqi Zhu, Feng Jiang, Yachun Mi, Wei Zhang, Shaohui Liu

Main category: cs.CV

TL;DR: RSFIQA is a fine-grained NR-IQA model that integrates region-level distortion info using SAM and MLLM, enhanced by a Region-Aware Semantic Attention mechanism, achieving competitive results.

DetailsMotivation: Existing NR-IQA methods lack sensitivity to local quality variations or fail to focus on semantically salient regions.

Method: Uses SAM for dynamic image partitioning, MLLM for distortion perception, and RSA for attention aggregation.

Result: Achieves robust and competitive performance across multiple benchmark datasets.

Conclusion: RSFIQA effectively combines local semantics and quality degradation for superior NR-IQA.

Abstract: No-reference image quality assessment (NR-IQA) aims to simulate the process of perceiving image quality aligned with subjective human perception. However, existing NR-IQA methods either focus on global representations, which yields limited insight into semantically salient regions, or employ uniform weighting of region features, which weakens sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce a Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.
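
The aggregation half of RSA reduces to learned attention over per-region descriptors. A minimal sketch of such a region-attention head (shapes and the linear scoring head are our assumptions):

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Pool per-region descriptors (e.g., from SAM regions) into one global
    quality feature using learned attention weights over regions."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, region_feats):                        # (B, R, dim)
        weights = torch.softmax(self.score(region_feats), dim=1)
        return (weights * region_feats).sum(dim=1)          # (B, dim)
```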

[344] MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain, Alexandros Stergiou

Main category: cs.CV

TL;DR: MIMIC is a framework to visualize internal representations of Vision Language Models (VLMs) by synthesizing visual concepts, using inversion and feature alignment with regularizers for quality and realism.

DetailsMotivation: VLMs lack transparency due to their complex architectures, limiting trust and interpretability.

Method: MIMIC employs joint VLM-based inversion and feature alignment, with regularizers for spatial alignment, smoothness, and semantic realism.

Result: Evaluated on free-form VLM output texts, MIMIC shows strong performance in visual quality and semantic metrics.

Conclusion: MIMIC is the first model inversion approach for visual interpretation of VLM concepts, enhancing transparency.

Abstract: Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM’s autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics as well as semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
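
Model inversion of this kind typically optimizes an input image so that its encoding matches a target concept, with a smoothness prior keeping the result natural. A generic sketch with a total-variation regularizer; feature_loss_fn stands in for MIMIC's joint inversion and feature-alignment objective:

```python
import torch

def total_variation(img):
    """Penalize abrupt neighboring-pixel changes (a common smoothness prior)."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def invert(feature_loss_fn, shape, steps=500, lr=0.05, tv_weight=1e-3):
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = feature_loss_fn(x) + tv_weight * total_variation(x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```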

[345] Effortless Vision-Language Model Specialization in Histopathology without Annotation

Jingna Qiu, Nishanth Jain, Jonas Ammeling, Marc Aubreville, Katharina Breininger

Main category: cs.CV

TL;DR: The paper explores annotation-free adaptation of Vision-Language Models (VLMs) in histopathology by continued pretraining on domain-relevant image-caption pairs, improving zero-shot and few-shot performance without manual labeling.

DetailsMotivation: General-purpose VLMs like CONCH and QuiltNet may underperform in specific histopathology tasks. Supervised fine-tuning requires manual labels, which is a limitation.

Method: Annotation-free adaptation via continued pretraining on task-relevant image-caption pairs from existing databases.

Result: Continued pretraining enhances zero-shot and few-shot performance, matching few-shot methods with larger training sizes.

Conclusion: This approach offers a task-agnostic, annotation-free solution for adapting VLMs to histopathology tasks, demonstrating significant promise.

Abstract: Recent advances in Vision-Language Models (VLMs) in histopathology, such as CONCH and QuiltNet, have demonstrated impressive zero-shot classification capabilities across various tasks. However, their general-purpose design may lead to suboptimal performance in specific downstream applications. While supervised fine-tuning methods address this issue, they require manually labeled samples for adaptation. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Our experiments on two VLMs, CONCH and QuiltNet, across three downstream tasks reveal that these pairs substantially enhance both zero-shot and few-shot performance. Notably, with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Its effectiveness, task-agnostic design, and annotation-free workflow make it a promising pathway for adapting VLMs to new histopathology tasks. Code is available at https://github.com/DeepMicroscopy/Annotation-free-VLM-specialization.
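
Continued pretraining of CLIP-style VLMs such as CONCH or QuiltNet usually means further steps of the symmetric contrastive objective on the new image-caption pairs. A compact version of that loss (the abstract does not spell out the training objective, so treat this as the standard assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-caption pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(img), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```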

[346] CBDES MoE: Hierarchically Decoupled Mixture-of-Experts for Functional Modules in Autonomous Driving

Qi Xiang, Kunsong Shi, Zhigui Lin, Lei He

Main category: cs.CV

TL;DR: The paper proposes CBDES MoE, a modular Mixture-of-Experts framework for BEV perception in autonomous driving, improving adaptability and performance over single-expert models.

DetailsMotivation: Existing multi-modal BEV methods face limitations in input adaptability, modeling capacity, and generalization.

Method: Introduces CBDES MoE, a hierarchically decoupled Mixture-of-Experts architecture with a Self-Attention Router for dynamic expert path selection.

Result: Outperforms single-expert baselines on nuScenes dataset, with a 1.6-point mAP and 4.1-point NDS improvement.

Conclusion: CBDES MoE is effective and practical for enhancing BEV perception in autonomous driving.

Abstract: Bird’s Eye View (BEV) perception systems based on multi-sensor feature fusion have become a fundamental cornerstone for end-to-end autonomous driving. However, existing multi-modal BEV methods commonly suffer from limited input adaptability, constrained modeling capacity, and suboptimal generalization. To address these challenges, we propose a hierarchically decoupled Mixture-of-Experts architecture at the functional module level, termed Computing Brain DEvelopment System Mixture-of-Experts (CBDES MoE). CBDES MoE integrates multiple structurally heterogeneous expert networks with a lightweight Self-Attention Router (SAR) gating mechanism, enabling dynamic expert path selection and sparse, input-aware efficient inference. To the best of our knowledge, this is the first modular Mixture-of-Experts framework constructed at the functional module granularity within the autonomous driving domain. Extensive evaluations on the real-world nuScenes dataset demonstrate that CBDES MoE consistently outperforms fixed single-expert baselines in 3D object detection. Compared to the strongest single-expert model, CBDES MoE achieves a 1.6-point increase in mAP and a 4.1-point improvement in NDS, demonstrating the effectiveness and practical advantages of the proposed approach.
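
Sparse, input-aware routing of this kind is typically a gate that scores the experts per input and activates only the top-k. A sketch of such a router; the self-attention pooling is our guess at what SAR does, since the abstract only names the mechanism:

```python
import torch
import torch.nn as nn

class SelfAttentionRouter(nn.Module):
    """Score heterogeneous experts from self-attended input tokens and keep
    the top-k, so only k expert networks run per input. dim must be
    divisible by heads."""
    def __init__(self, dim, num_experts, k=2, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, tokens):                            # (B, N, dim)
        pooled, _ = self.attn(tokens, tokens, tokens)
        logits = self.gate(pooled.mean(dim=1))            # (B, num_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        return torch.softmax(top_vals, dim=-1), top_idx   # weights, experts
```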

[347] Morphological Analysis of Semiconductor Microstructures using Skeleton Graphs

Noriko Nitta, Rei Miyata, Naoto Oishi

Main category: cs.CV

TL;DR: Graph convolutional networks and PCA were used to analyze Ge surface microstructures from electron microscopy images, revealing irradiation angle’s greater impact on morphology than fluence.

DetailsMotivation: To understand how ion beam irradiation parameters (angle and fluence) affect the morphological properties of Ge surfaces.

Method: Processed electron microscopy images into skeleton graphs, embedded them using a graph convolutional network, and analyzed embeddings with PCA and Davies-Bouldin index.

Result: Irradiation angle has a more significant impact on Ge surface morphology than fluence.

Conclusion: The study highlights the importance of irradiation angle in controlling Ge surface microstructures.

Abstract: In this paper, electron microscopy images of microstructures formed on Ge surfaces by ion beam irradiation were processed to extract topological features as skeleton graphs, which were then embedded using a graph convolutional network. The resulting embeddings were analyzed using principal component analysis, and cluster separability in the resulting PCA space was evaluated using the Davies-Bouldin index. The results indicate that variations in irradiation angle have a more significant impact on the morphological properties of Ge surfaces than variations in irradiation fluence.
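
The analysis pipeline is short enough to sketch end to end; skeleton_to_graph and embed (the trained graph convolutional network) are hypothetical stand-ins for the paper's components:

```python
import numpy as np
from skimage.morphology import skeletonize
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score

# binary_images: thresholded microscopy images; labels: irradiation condition
# per image. skeleton_to_graph and embed are placeholders for the paper's
# graph extraction and trained GCN encoder.
skeletons = [skeletonize(img) for img in binary_images]
embeddings = np.stack([embed(skeleton_to_graph(s)) for s in skeletons])

coords = PCA(n_components=2).fit_transform(embeddings)
dbi = davies_bouldin_score(coords, labels)   # lower = better-separated clusters
```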

[348] Tracking Any Point Methods for Markerless 3D Tissue Tracking in Endoscopic Stereo Images

Konrad Reuter, Suresh Guttikonda, Sarah Latus, Lennart Maack, Christian Betz, Tobias Maurer, Alexander Schlaefer

Main category: cs.CV

TL;DR: A novel method for markerless 3D tissue tracking in minimally invasive surgery using 2D TAP networks, combining temporal tracking and stereo matching for accurate 3D motion estimation.

DetailsMotivation: Address challenges like dynamic tissue motion and limited field of view in minimally invasive surgery to improve surgical guidance, safety, and robotic assistance.

Method: Combines two CoTracker models (temporal tracking and stereo matching) to estimate 3D motion from stereo endoscopic images. Evaluated on synthetic and chicken tissue phantoms.

Result: Achieved reliable tracking with Euclidean distance errors as low as 1.1 mm at 10 mm/s velocity on chicken tissue.

Conclusion: Demonstrates the potential of TAP-based models for accurate, markerless 3D tracking in surgical scenarios.

Abstract: Minimally invasive surgery presents challenges such as dynamic tissue motion and a limited field of view. Accurate tissue tracking has the potential to support surgical guidance, improve safety by helping avoid damage to sensitive structures, and enable context-aware robotic assistance during complex procedures. In this work, we propose a novel method for markerless 3D tissue tracking by leveraging 2D Tracking Any Point (TAP) networks. Our method combines two CoTracker models, one for temporal tracking and one for stereo matching, to estimate 3D motion from stereo endoscopic images. We evaluate the system using a clinical laparoscopic setup and a robotic arm simulating tissue motion, with experiments conducted on a synthetic 3D-printed phantom and a chicken tissue phantom. Tracking on the chicken tissue phantom yielded more reliable results, with Euclidean distance errors as low as 1.1 mm at a velocity of 10 mm/s. These findings highlight the potential of TAP-based models for accurate, markerless 3D tracking in challenging surgical scenarios.
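
Once one CoTracker supplies a temporal track in the left view and the other the matching point in the right view, 3D positions follow from standard rectified-stereo triangulation:

```python
import numpy as np

def triangulate(u_l, v_l, u_r, fx, fy, cx, cy, baseline):
    """Recover a 3D point (camera frame) from a rectified stereo match:
    depth from disparity, then back-projection through the intrinsics."""
    disparity = u_l - u_r              # pixels; positive for rectified pairs
    Z = fx * baseline / disparity
    X = (u_l - cx) * Z / fx
    Y = (v_l - cy) * Z / fy
    return np.array([X, Y, Z])
```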

[349] Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model

Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu

Main category: cs.CV

TL;DR: The paper introduces Being-M0.5, a real-time, controllable vision-language-motion model (VLMM) addressing key limitations in human motion generation, leveraging the HuMo100M dataset and a novel part-aware residual quantization technique.

DetailsMotivation: Existing VLMMs lack controllability in diverse human commands, pose initialization, long-term sequences, unseen scenarios, and fine-grained body part control, hindering practical deployment.

Method: The approach uses HuMo100M, a large-scale dataset with detailed annotations, and introduces part-aware residual quantization for motion tokenization to enable precise control.

Result: Being-M0.5 achieves state-of-the-art performance across motion benchmarks and demonstrates real-time capabilities.

Conclusion: The contributions of HuMo100M and Being-M0.5 advance motion generation technology, facilitating real-world adoption.

Abstract: Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5’s superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at https://beingbeyond.github.io/Being-M0.5.
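
Residual quantization encodes a vector as a sequence of codebook indices, each stage quantizing what the previous stages missed; the part-aware variant would presumably run one such quantizer per body part. A generic sketch:

```python
import torch

def residual_quantize(x, codebooks):
    """x: (N, dim) motion features; codebooks: list of (K, dim) tensors.
    Returns per-stage code indices and the final unquantized residual."""
    residual, codes = x, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest code vector
        codes.append(idx)
        residual = residual - cb[idx]                   # quantize the leftover
    return torch.stack(codes, dim=-1), residual
```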

[350] CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang

Main category: cs.CV

TL;DR: CATP is a training-free pruning method for multimodal in-context learning (ICL) that reduces image token redundancy, improving efficiency and performance.

DetailsMotivation: Existing token pruning methods overlook multimodal ICL, where redundancy is greater and efficiency is critical, leading to unstable performance and accuracy drops.

Method: CATP performs progressive pruning in two stages to account for cross-modal interactions, removing 77.8% of redundant image tokens.

Result: CATP improves performance by 0.6% and reduces inference latency by 10.78% on average across benchmarks.

Conclusion: CATP enhances multimodal ICL’s practical value and sets a foundation for future work in interleaved image-text scenarios.

Abstract: Modern large vision-language models (LVLMs) convert each input image into a large set of tokens, far outnumbering the text tokens. Although this improves visual perception, it introduces severe image token redundancy. Because image tokens carry sparse information, many add little to reasoning, yet greatly increase inference cost. The emerging image token pruning methods tackle this issue by identifying the most important tokens and discarding the rest. These methods can raise efficiency with only modest performance loss. However, most of them only consider single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is greater and efficiency is more critical. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and cause unstable performance. Applying existing pruning methods in this setting leads to large accuracy drops, exposing a clear gap and the need for new techniques. Thus, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal ICL. CATP consists of two stages that perform progressive pruning to fully account for the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, clearly exceeding all baselines. Meanwhile, it effectively improves efficiency by achieving an average reduction of 10.78% in inference latency. CATP enhances the practical value of multimodal ICL and lays the groundwork for future progress in interleaved image-text scenarios.
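
Token pruning of this family scores image tokens by how much the textual context attends to them and keeps only the top fraction. A one-stage simplification (CATP itself prunes progressively in two stages; the scoring rule below is a plain attention heuristic, not the paper's procedure):

```python
import torch

def prune_image_tokens(tokens, attn_from_text, keep_ratio=0.222):
    """tokens: (B, N, D) image tokens; attn_from_text: (B, T, N) attention
    from text queries to image tokens. Keeps ~22.2% of tokens, i.e. removes
    the 77.8% reported by CATP."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = attn_from_text.mean(dim=1)                       # (B, N)
    keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values # preserve order
    return tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))
```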

[351] SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models

Zheng Liu, Hao Liang, Bozhou Li, Wentao Xiong, Chong Chen, Conghui He, Wentao Zhang, Bin Cui

Main category: cs.CV

TL;DR: SynthVLM introduces a novel method for synthesizing high-quality image-caption pairs using diffusion models, outperforming traditional datasets and enabling SOTA performance in vision-language tasks.

DetailsMotivation: Training VLMs requires large datasets, but existing methods face challenges in efficiency, effectiveness, and data quality. SynthVLM addresses this by synthesizing and curating precise image-text pairs.

Method: SynthVLM uses advanced diffusion models to generate images from high-quality captions, creating aligned image-text pairs. It also introduces SynthVLM-100K, a curated dataset.

Result: SynthVLM-100K outperforms traditional datasets in evaluations. Models trained on it (SynthVLM-7B and SynthVLM-13B) achieve SOTA performance in VQA tasks and MMLU benchmarks.

Conclusion: SynthVLM’s synthesis method and dataset enable superior performance in vision-language tasks, demonstrating the value of high-quality synthetic data.

Abstract: Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which brings challenges related to efficiency, effectiveness, and quality of web data. In this paper, we introduce SynthVLM, a new data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to synthesize and select images from text captions, thereby creating precisely aligned image-text pairs. We further introduce SynthVLM-100K, a high-quality dataset consisting of 100K curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various vision question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% pretrain data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities.
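
A curation step for synthesized pairs can be as simple as scoring image-caption alignment with CLIP and keeping the best fraction; the abstract does not detail SynthVLM's selection criterion, so the following is a generic illustration:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_best_pairs(images, captions, top_frac=0.5):
    """Score each generated image against its source caption and keep the
    best-aligned fraction of pairs."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image.diag()  # per-pair alignment
    keep = sims.topk(max(1, int(len(images) * top_frac))).indices
    return [(images[i], captions[i]) for i in keep.tolist()]
```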

[352] TAP: Parameter-efficient Task-Aware Prompting for Adverse Weather Removal

Hanting Wang, Shengpeng Ji, Shulei Wang, Hai Huang, Xiao Jin, Qifei Zhang, Tao Jin

Main category: cs.CV

TL;DR: A parameter-efficient All-in-One image restoration framework using task-aware enhanced prompts to handle various adverse weather degradations with minimal parameters.

DetailsMotivation: Existing methods rely on dedicated modules for each degradation type, leading to high parameter overhead and overlooking inter-task relatedness.

Method: Two-stage training: pretraining for general restoration knowledge and prompt-tuning with task-aware enhanced prompts using low-rank decomposition and contrastive constraints.

Result: Achieves superior performance with only 2.75M parameters, validated by t-SNE analysis.

Conclusion: The proposed framework improves parameter efficiency and task modeling accuracy for multi-task image restoration.

Abstract: Image restoration under adverse weather conditions has been extensively explored, leading to numerous high-performance methods. In particular, recent advances in All-in-One approaches have shown impressive results by training on multi-task image restoration datasets. However, most of these methods rely on dedicated network modules or parameters for each specific degradation type, resulting in a significant parameter overhead. Moreover, the relatedness across different restoration tasks is often overlooked. In light of these issues, we propose a parameter-efficient All-in-One image restoration framework that leverages task-aware enhanced prompts to tackle various adverse weather degradations. Specifically, we adopt a two-stage training paradigm consisting of a pretraining phase and a prompt-tuning phase to mitigate parameter conflicts across tasks. We first employ supervised learning to acquire general restoration knowledge, and then adapt the model to handle specific degradations via trainable soft prompts. Crucially, we enhance these task-specific prompts in a task-aware manner. We apply low-rank decomposition to these prompts to capture both task-general and task-specific characteristics, and impose contrastive constraints to better align them with the actual inter-task relatedness. These enhanced prompts not only improve the parameter efficiency of the restoration model but also enable more accurate task modeling, as evidenced by t-SNE analysis. Experimental results on different restoration tasks demonstrate that the proposed method achieves superior performance with only 2.75M parameters.
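
The low-rank enhancement can be read as: each task's soft prompt is a shared component plus a task-specific low-rank correction. A sketch of that parameterization (rank, shapes, and initialization scale are illustrative):

```python
import torch
import torch.nn as nn

class LowRankTaskPrompt(nn.Module):
    """Soft prompt = task-general part + low-rank task-specific part, so the
    per-task overhead is only prompt_len * rank + rank * dim parameters."""
    def __init__(self, num_tasks, prompt_len, dim, rank=4):
        super().__init__()
        self.shared = nn.Parameter(0.02 * torch.randn(prompt_len, dim))
        self.U = nn.Parameter(0.02 * torch.randn(num_tasks, prompt_len, rank))
        self.V = nn.Parameter(0.02 * torch.randn(num_tasks, rank, dim))

    def forward(self, task_id):
        return self.shared + self.U[task_id] @ self.V[task_id]
```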

[353] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li

Main category: cs.CV

TL;DR: Stand-In is a lightweight, plug-and-play framework for identity-preserving video generation, using minimal additional parameters and outperforming full-parameter methods.

DetailsMotivation: Existing methods for high-fidelity human video generation are resource-intensive and lack compatibility with other AIGC tools.

Method: Introduces a conditional image branch into a pre-trained video model, using restricted self-attentions and conditional position mapping for identity control.

Result: Achieves excellent video quality and identity preservation with only ~1% additional parameters, outperforming full-parameter methods.

Conclusion: Stand-In is efficient, versatile, and compatible with various tasks like subject-driven video generation and face swapping.

Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.

[354] MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts

Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang, Mingan Lin, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang

Main category: cs.CV

TL;DR: MathScape is a new benchmark for evaluating multimodal LLMs’ mathematical reasoning in real-world contexts, revealing gaps in current models’ capabilities.

DetailsMotivation: Existing benchmarks for multimodal math reasoning rely on digitally rendered content, missing real-world complexity. MathScape aims to address this gap.

Method: MathScape includes 1,369 real-world math problems paired with human-captured images. It evaluates 19 MLLMs, including SOTA and smaller-scale models.

Result: SOTA models struggle with real-world math tasks, performing worse than humans. Synthetic image performance doesn’t translate to real-world effectiveness.

Conclusion: MathScape highlights limitations in current MLLMs and underscores the need for benchmarks that reflect real-world challenges in multimodal math reasoning.

Abstract: With the rapid progress of Multimodal LLMs, evaluating their mathematical reasoning capabilities has become an increasingly important research direction. In particular, visual-textual mathematical reasoning serves as a key indicator of an MLLM’s ability to comprehend and solve complex, multi-step quantitative problems. While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios. To bridge this gap, we introduce MathScape, a novel benchmark focused on assessing MLLMs’ reasoning ability in realistic mathematical contexts. MathScape comprises 1,369 high-quality math problems paired with human-captured real-world images, closely reflecting the challenges encountered in practical educational settings. We conduct a thorough multi-dimensional evaluation across nine leading closed-source MLLMs, three open-source MLLMs with over 20 billion parameters, and seven smaller-scale MLLMs. Our results show that even SOTA models struggle with real-world math tasks, lagging behind human performance – highlighting critical limitations in current model capabilities. Moreover, we find that strong performance on synthetic or digitally rendered images does not guarantee similar effectiveness on real-world tasks. This underscores the necessity of MathScape in the next stage of multimodal mathematical reasoning.

[355] CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality

Marco Peer, Anna Scius-Bertrand, Andreas Fischer

Main category: cs.CV

TL;DR: A self-training method using CTC alignment improves handwritten text recognition for historical documents by addressing annotation errors, particularly hyphenation, in the Bullinger dataset.

DetailsMotivation: Historical handwritten text recognition is hindered by handwriting variability, degraded sources, and annotation errors like hyphenation.

Method: A self-training method based on CTC alignment algorithm matches transcriptions to text line images using dynamic programming and model output probabilities.

Result: Performance improves by 1.1 percentage points CER, with weaker models yielding more accurate alignments, enabling iterative training.

Conclusion: The approach enhances alignment accuracy and CER, with potential for iterative application; code and corrected dataset are released.

Abstract: Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors - particularly hyphenation issues - in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.

[356] Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

Hongcheng Gao, Tianyu Pang, Chao Du, Taihang Hu, Zhijie Deng, Min Lin

Main category: cs.CV

TL;DR: The paper introduces meta-unlearning to prevent relearning of unlearned harmful or copyrighted concepts in diffusion models (DMs) after malicious finetuning.

DetailsMotivation: To address the issue of DMs relearning unlearned concepts due to malicious finetuning, which exploits related benign concepts.

Method: Proposes meta-unlearning, where a DM self-destructs related benign concepts during malicious finetuning to hinder relearning. Compatible with existing unlearning methods.

Result: Validated on Stable Diffusion models (SD-v1-4 and SDXL) with empirical experiments and ablation studies.

Conclusion: Meta-unlearning effectively prevents relearning of unlearned concepts in DMs, enhancing model safety.

Abstract: With the rapid progress of diffusion-based content generation, significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to relearn the unlearned concepts. This occurs partly because certain benign concepts (e.g., “skin”) retained in DMs are related to the unlearned ones (e.g., “nudity”), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to self-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies. Our code is available at https://github.com/sail-sg/Meta-Unlearning.
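
Schematically, the meta objective simulates one malicious finetuning step on the unlearned concept and then asks the adapted model to be bad at it while the unadapted model stays useful. A MAML-style sketch, where loss_fn(params, batch) stands in for the diffusion training loss (the paper's actual objective differs in detail):

```python
import torch

def meta_unlearn_step(params, loss_fn, unlearn_batch, benign_batch,
                      inner_lr=1e-4):
    """params: list of parameter tensors (requires_grad=True). The inner
    step mimics an attacker finetuning on the unlearned concept; the outer
    loss rewards failing to relearn it while preserving benign behavior."""
    inner = loss_fn(params, unlearn_batch)
    grads = torch.autograd.grad(inner, params, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(params, grads)]
    return -loss_fn(adapted, unlearn_batch) + loss_fn(params, benign_batch)
```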

[357] Generative Video Matting

Yongtao Ge, Kangyang Xie, Guangkai Xu, Mingyu Liu, Li Ke, Longtao Huang, Hui Xue, Hao Chen, Chunhua Shen

Main category: cs.CV

TL;DR: The paper addresses video matting limitations by proposing large-scale pre-training with synthetic data and a novel video matting approach leveraging pre-trained video diffusion models for better generalization and temporal consistency.

DetailsMotivation: Traditional video matting suffers from poor generalization due to lack of high-quality ground-truth data and reliance on imperfect human annotations.

Method: The approach combines large-scale pre-training using diverse synthetic and pseudo-labeled datasets with a scalable synthetic data generation pipeline. It also introduces a video matting model leveraging pre-trained video diffusion models for strong priors and temporal consistency.

Result: The method outperforms benchmarks across three datasets, showing superior performance and strong generalization in real-world scenarios.

Conclusion: The proposed approach effectively bridges the domain gap between synthetic and real-world scenes, ensuring high-quality video matting with temporal consistency.

Abstract: Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach’s superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.

[358] Safeguarding Generative AI Applications in Preclinical Imaging through Hybrid Anomaly Detection

Jakub Binda, Valentina Paneta, Vasileios Eleftheriadis, Hongkyou Chung, Panagiotis Papadimitroulas, Neo Christopher Chung

Main category: cs.CV

TL;DR: A hybrid anomaly detection framework improves reliability and regulatory compliance of generative AI in nuclear medicine applications like synthetic X-ray generation and radiation dose estimation.

DetailsMotivation: To address the need for robust mechanisms in high-stakes biomedical imaging, ensuring safe and reliable use of generative AI.

Method: Development and implementation of a hybrid anomaly detection framework to monitor and manage unexpected model behavior in GenAI systems.

Result: Enhanced reliability, reduced manual oversight, and real-time quality control in applications like Pose2Xray and DosimetrEYE.

Conclusion: The framework strengthens industrial viability of GenAI in preclinical settings by improving robustness, scalability, and compliance.

Abstract: Generative AI holds great potential to automate and enhance data synthesis in nuclear medicine. However, the high-stakes nature of biomedical imaging necessitates robust mechanisms to detect and manage unexpected or erroneous model behavior. We present the development and implementation of a hybrid anomaly detection framework to safeguard GenAI models in BIOEMTECH’s eyes(TM) systems. Two applications are demonstrated: Pose2Xray, which generates synthetic X-rays from photographic mouse images, and DosimetrEYE, which estimates 3D radiation dose maps from 2D SPECT/CT scans. In both cases, our outlier detection (OD) enhances reliability, reduces manual oversight, and supports real-time quality control. This approach strengthens the industrial viability of GenAI in preclinical settings by increasing robustness, scalability, and regulatory compliance.

[359] WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Yuci Liang, Xinheng Lyu, Meidan Ding, Wenting Chen, Jipeng Zhang, Yuexiang Ren, Xiangjian He, Song Wu, Xiyue Wang, Sen Yang, Xiaohan Xing, Linlin Shen

Main category: cs.CV

TL;DR: WSI-LLaVA, a new framework for gigapixel WSI understanding, outperforms existing models in morphological analysis, aided by the WSI-Bench benchmark and specialized metrics.

DetailsMotivation: Address limitations of current MLLMs in analyzing WSIs comprehensively and capturing crucial morphological features for diagnosis.

Method: Introduce WSI-Bench benchmark and WSI-LLaVA framework with a three-stage training approach (WSI-text alignment, feature space alignment, task-specific tuning) and specialized metrics (WSI-Precision, WSI-Relevance).

Result: WSI-LLaVA outperforms existing models, showing significant improvement in morphological analysis and diagnostic accuracy.

Conclusion: The framework establishes a clear link between morphological understanding and diagnostic accuracy, advancing computational pathology.

Abstract: Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs’ understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.

[360] Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction

Xudong Cai, Shuo Wang, Peng Wang, Yongcai Wang, Zhaoxin Fan, Wanting Li, Tianbao Zhang, Jianrong Tao, Yeying Jin, Deying Li

Main category: cs.CV

TL;DR: Mem4D proposes a dual-memory framework to resolve the memory demand dilemma in dynamic scene reconstruction, balancing static stability and dynamic detail.

DetailsMotivation: The challenge of reconstructing dense geometry for dynamic scenes from monocular videos, where existing methods compromise between static stability and dynamic detail.

Method: Mem4D uses a dual-memory architecture: Transient Dynamics Memory (TDM) for dynamic motion and Persistent Structure Memory (PSM) for static geometry.

Result: Achieves state-of-the-art performance, maintaining global consistency for static elements and high-fidelity dynamic reconstructions.

Conclusion: Mem4D effectively addresses the memory demand dilemma, offering a balanced and efficient solution for dynamic scene reconstruction.

Abstract: Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Codes will be publicly available.

[361] RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering

Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad

Main category: cs.CV

TL;DR: The paper introduces RSVLM-QA, a large-scale VQA dataset for remote sensing, addressing limitations in existing datasets by leveraging LLMs and automated processes for rich annotations and diverse questions.

DetailsMotivation: Existing RS VQA datasets lack annotation richness, question diversity, and specific reasoning assessments, necessitating a more comprehensive dataset.

Method: RSVLM-QA integrates data from multiple RS datasets, using GPT-4.1 for automatic annotation generation and a specialized process for object counting QA pairs.

Result: The dataset comprises 13,820 images and 162,373 VQA pairs, exceeding existing benchmarks in annotation depth and breadth. Benchmark tests on six mainstream VLMs validate its effectiveness.

Conclusion: RSVLM-QA is a pivotal resource for RS VQA and VLM research, expected to drive advancements in the field.

Abstract: Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA’s annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
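
The counting track turns segmentation annotations directly into QA pairs; a minimal version of that extraction (the question template is illustrative, and the paper phrases the final answers with GPT-4.1):

```python
from collections import Counter

def counting_qa(instance_classes, class_names):
    """instance_classes: one class id per annotated instance in the image;
    class_names: class id -> readable name. Returns (question, answer)
    pairs with exact counts taken from the labels."""
    counts = Counter(instance_classes)
    return [(f"How many instances of {class_names[c]} appear in the image?",
             str(n))
            for c, n in counts.items()]
```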

[362] TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding

Jin-Seop Lee, SungJoon Lee, Jaehan Ahn, YunSeok Choi, Jee-Hyong Lee

Main category: cs.CV

TL;DR: The paper introduces TAG, a temporal-aware method for zero-shot video temporal grounding, addressing issues like semantic fragmentation and skewed similarity distributions without relying on LLMs.

DetailsMotivation: Existing zero-shot VTG methods suffer from semantic fragmentation, skewed similarity distributions, and reliance on expensive LLMs.

Method: TAG uses temporal pooling, temporal coherence clustering, and similarity adjustment to capture temporal context and correct similarity distributions.

Result: Achieves state-of-the-art performance on Charades-STA and ActivityNet Captions datasets without LLMs.

Conclusion: TAG is an effective, training-free solution for zero-shot VTG, outperforming existing methods.

Abstract: Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. Also, they rely on skewed similarity distributions for localization, making it difficult to select the optimal segment. Furthermore, they heavily depend on the use of LLMs, which require expensive inference. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video temporal Grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment. Our proposed method effectively captures the temporal context of videos and addresses distorted similarity distributions without training. Our approach achieves state-of-the-art results on the Charades-STA and ActivityNet Captions benchmark datasets without relying on LLMs. Our code is available at https://github.com/Nuetee/TAG
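
Temporal coherence clustering can be approximated by greedily merging adjacent frames whose features stay similar, so each resulting segment carries one semantics. A simplified reading of that step (the threshold is illustrative):

```python
import torch
import torch.nn.functional as F

def coherence_segments(frame_feats, thresh=0.9):
    """frame_feats: (T, dim) per-frame features. Returns contiguous
    (start, end) segments, split wherever adjacent similarity drops."""
    f = F.normalize(frame_feats, dim=-1)
    sims = (f[:-1] * f[1:]).sum(-1)          # cosine sim of adjacent frames
    segments, start = [], 0
    for t in range(1, len(f)):
        if sims[t - 1] < thresh:             # semantics changed: close segment
            segments.append((start, t - 1))
            start = t
    segments.append((start, len(f) - 1))
    return segments
```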

[363] VOIDFace: A Privacy-Preserving Multi-Network Face Recognition With Enhanced Security

Ajnas Muhammed, Iurri Medvedev, Nuno Gonçalves

Main category: cs.CV

TL;DR: VOIDFace introduces a privacy-preserving facial recognition framework using visual secret sharing and patch-based multi-training to eliminate data replication and enhance user control over personal data.

DetailsMotivation: Addressing privacy and ethical concerns in facial recognition by reducing data replication and improving user control over personal face data.

Method: Uses visual secret sharing for secure data storage and a patch-based multi-training network for robust, privacy-preserving recognition.

Result: Achieves Right-To-Be-Forgotten, improved data control, security, and privacy while maintaining competitive performance on the VGGFace2 dataset.

Conclusion: VOIDFace successfully enhances privacy, security, and efficiency in facial recognition training while empowering users with data control.

Abstract: Advancement of machine learning techniques, combined with the availability of large-scale datasets, has significantly improved the accuracy and efficiency of facial recognition. Modern facial recognition systems are trained using large face datasets collected from diverse individuals or public repositories. However, for training, these datasets are often replicated and stored across multiple workstations, resulting in data replication, which complicates database management and oversight. Currently, once a user submits their face for dataset preparation, they lose control over how their data is used, raising significant privacy and ethical concerns. This paper introduces VOIDFace, a novel framework for facial recognition systems that addresses two major issues. First, it eliminates the need for data replication and improves data control by securely storing training face data using visual secret sharing. Second, it proposes a patch-based multi-training network that uses this novel training data storage mechanism to develop a robust, privacy-preserving facial recognition system. By integrating these advancements, VOIDFace aims to improve the privacy, security, and efficiency of facial recognition training, while ensuring greater control over sensitive personal face data. VOIDFace also enables users to exercise their Right-To-Be-Forgotten to control their personal data. Experimental evaluations on the VGGFace2 dataset show that VOIDFace provides Right-To-Be-Forgotten, improved data control, security, and privacy while maintaining competitive facial recognition performance. Code is available at: https://github.com/ajnasmuhammed89/VOIDFace
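
Secret sharing of face data can be illustrated with the XOR n-of-n scheme: any proper subset of shares is statistically pure noise, and only all shares together reconstruct the image. VOIDFace's exact visual-secret-sharing construction is not specified in the abstract, so treat this as a generic illustration:

```python
import numpy as np

def split_into_shares(face_img: np.ndarray, n: int = 3):
    """face_img: uint8 image. Returns n shares; each workstation can store
    one share, so no single site holds a usable face image."""
    shares = [np.random.randint(0, 256, face_img.shape, dtype=np.uint8)
              for _ in range(n - 1)]
    last = face_img.copy()
    for s in shares:
        last ^= s                     # XOR in each random mask
    return shares + [last]

def reconstruct(shares):
    out = np.zeros_like(shares[0])
    for s in shares:
        out ^= s
    return out                        # equals the original image
```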

[364] TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking

Tony Danjun Wang, Christian Heiliger, Nassir Navab, Lennart Bastian

Main category: cs.CV

TL;DR: TrackOR is a framework for long-term multi-person tracking and re-identification in operating rooms, using 3D geometric signatures to improve tracking accuracy and enable offline trajectory analysis.

DetailsMotivation: To enhance surgical team support and patient outcomes by enabling consistent, long-term tracking of staff in operating rooms.

Method: Proposes TrackOR, leveraging 3D geometric signatures for online tracking and offline trajectory recovery.

Result: Achieves +11% Association Accuracy over baselines, enabling persistent identity tracking and staff-centric analyses.

Conclusion: TrackOR’s 3D geometric approach facilitates personalized intelligent systems in operating rooms, with applications like temporal pathway imprints for actionable insights.

Abstract: Providing intelligent support to surgical teams is a key frontier in automated surgical scene understanding, with the long-term goal of improving patient outcomes. Developing personalized intelligence for all staff members requires maintaining a consistent state of who is located where for long surgical procedures, which still poses numerous computational challenges. We propose TrackOR, a framework for tackling long-term multi-person tracking and re-identification in the operating room. TrackOR uses 3D geometric signatures to achieve state-of-the-art online tracking performance (+11% Association Accuracy over the strongest baseline), while also enabling an effective offline recovery process to create analysis-ready trajectories. Our work shows that by leveraging 3D geometric information, persistent identity tracking becomes attainable, enabling a critical shift towards the more granular, staff-centric analyses required for personalized intelligent systems in the operating room. This new capability opens up various applications, including our proposed temporal pathway imprints that translate raw tracking data into actionable insights for improving team efficiency and safety and ultimately providing personalized support.

[365] The Escalator Problem: Identifying Implicit Motion Blindness in AI for Accessibility

Xiantao Zhang

Main category: cs.CV

TL;DR: The paper highlights a critical flaw in Multimodal Large Language Models (MLLMs) called Implicit Motion Blindness, exemplified by the Escalator Problem, where models fail to perceive motion direction. It calls for a shift towards robust physical perception and human-centered benchmarks.

DetailsMotivation: To address the trustworthiness of MLLMs for the blind and visually impaired (BVI) community by identifying and analyzing a key limitation in current models.

Method: The paper introduces the Escalator Problem as a case study to illustrate Implicit Motion Blindness, stemming from the frame-sampling paradigm in video understanding.

Result: The analysis reveals that current models struggle with continuous, low-signal motion, undermining their reliability in real-world applications.

Conclusion: The paper advocates for a paradigm shift towards physical perception and the development of human-centered benchmarks to improve safety and user trust.

Abstract: Multimodal Large Language Models (MLLMs) hold immense promise as assistive technologies for the blind and visually impaired (BVI) community. However, we identify a critical failure mode that undermines their trustworthiness in real-world applications. We introduce the Escalator Problem – the inability of state-of-the-art models to perceive an escalator’s direction of travel – as a canonical example of a deeper limitation we term Implicit Motion Blindness. This blindness stems from the dominant frame-sampling paradigm in video understanding, which, by treating videos as discrete sequences of static images, fundamentally struggles to perceive continuous, low-signal motion. As a position paper, our contribution is not a new model but rather to: (I) formally articulate this blind spot, (II) analyze its implications for user trust, and (III) issue a call to action. We advocate for a paradigm shift from purely semantic recognition towards robust physical perception and urge the development of new, human-centered benchmarks that prioritize safety, reliability, and the genuine needs of users in dynamic environments.
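
The failure is easy to reproduce in miniature: once a video is reduced to a set of independently encoded frames pooled without regard to order, motion direction becomes mathematically invisible. A toy demonstration with synthetic per-frame features (standing in for any frame encoder; not a model from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": 64 frames of an escalator. Direction of travel only exists
# in the frame ordering, never inside any single frame.
frames_up = rng.normal(size=(64, 512))   # per-frame feature vectors
frames_down = frames_up[::-1]            # same frames, reversed in time

def frame_sampling_pipeline(frames, k=8):
    """Uniformly sample k frames, encode independently, mean-pool."""
    idx = np.linspace(0, len(frames) - 1, k).astype(int)
    return frames[idx].mean(axis=0)      # order-insensitive pooling

up_repr = frame_sampling_pipeline(frames_up)
down_repr = frame_sampling_pipeline(frames_down)
print(np.allclose(up_repr, down_repr))  # True: "up" and "down" collide
```

Because mean pooling is permutation-invariant, footage of an escalator going up and the same footage reversed produce identical representations, no matter how strong the per-frame encoder is.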

[366] Prompt-Guided Relational Reasoning for Social Behavior Understanding with Vision Foundation Models

Thinesh Thiyakesan Ponbagavathi, Chengzheng Yang, Alina Roitberg

Main category: cs.CV

TL;DR: ProGraD improves Group Activity Detection (GAD) by using learnable group prompts and a lightweight transformer, outperforming state-of-the-art methods, especially in multi-group scenarios.

DetailsMotivation: Vision Foundation Models (VFMs) are underexplored for group dynamics, and simple backbone swaps with VFMs yield minimal gains, necessitating structured, group-aware reasoning.

Method: ProGraD employs learnable group prompts to guide VFM attention and a two-layer GroupContext Transformer for actor-group associations and behavior inference.

Result: ProGraD surpasses state-of-the-art on GAD benchmarks, with significant gains in multi-group scenarios (6.5% and 8.2% improvements) using only 10M parameters.

Conclusion: ProGraD effectively bridges the gap in GAD by enhancing VFM performance with interpretable attention maps, offering insights into group reasoning.

Abstract: Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features, but are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. While they are a promising alternative to highly task-specific GAD architectures that require full fine-tuning, our initial investigation reveals that simply swapping CNN backbones used in these methods with VFMs brings little gain, underscoring the need for structured, group-aware reasoning on top. We introduce Prompt-driven Group Activity Detection (ProGraD) – a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. We evaluate our approach on two recent GAD benchmarks: Cafe, which features multiple concurrent social groups, and Social-CAD, which focuses on single-group interactions. While we surpass state-of-the-art in both settings, our method is especially effective in complex multi-group scenarios, where we yield a gain of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) using only 10M trainable parameters. Furthermore, our experiments reveal that ProGraD produces interpretable attention maps, offering insights into actor-group reasoning. Code and models will be released.
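
A minimal sketch of the learnable-group-prompt idea — prepending trainable tokens to frozen VFM patch tokens so attention can organize around social configurations — is shown below. The layer sizes and the single transformer block are illustrative assumptions, not ProGraD's architecture:

```python
import torch
import torch.nn as nn

class GroupPromptedEncoder(nn.Module):
    """Prepend learnable group prompts to frozen VFM patch tokens, letting
    attention route group-relevant evidence into the prompt slots."""
    def __init__(self, dim=768, num_prompts=8, heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) from a frozen VFM such as DINOv2
        B = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([prompts, patch_tokens], dim=1)
        x = self.block(x)
        return x[:, : prompts.size(1)]   # group-aware prompt outputs
```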

[367] Sample-aware RandAugment: Search-free Automatic Data Augmentation for Effective Image Recognition

Anqi Xiao, Weichen Yu, Hongyuan Yu

Main category: cs.CV

TL;DR: SRA is a search-free AutoDA method that dynamically adjusts augmentation policies, achieving high accuracy and compatibility without time-consuming searches.

DetailsMotivation: Address the inefficiency and suboptimal performance of mainstream AutoDA methods by proposing a simpler, adaptive solution.

Method: Sample-aware RandAugment (SRA) uses a heuristic scoring module and asymmetric augmentation to tailor policies per sample.

Result: Achieves 78.31% Top-1 accuracy on ImageNet with ResNet-50 and improves downstream task performance.

Conclusion: SRA offers a simpler, effective, and practical AutoDA design with broad applicability.

Abstract: Automatic data augmentation (AutoDA) plays an important role in enhancing the generalization of neural networks. However, mainstream AutoDA methods often encounter two challenges: either the search process is excessively time-consuming, hindering practical application, or the performance is suboptimal due to insufficient policy adaptation during training. To address these issues, we propose Sample-aware RandAugment (SRA), an asymmetric, search-free AutoDA method that dynamically adjusts augmentation policies while maintaining straightforward implementation. SRA incorporates a heuristic scoring module that evaluates the complexity of the original training data, enabling the application of tailored augmentations for each sample. Additionally, an asymmetric augmentation strategy is employed to maximize the potential of this scoring module. In multiple experimental settings, SRA narrows the performance gap between search-based and search-free AutoDA methods, achieving a state-of-the-art Top-1 accuracy of 78.31% on ImageNet with ResNet-50. Notably, SRA demonstrates good compatibility with existing augmentation pipelines and solid generalization across new tasks, without requiring hyperparameter tuning. The pretrained models leveraging SRA also enhance recognition in downstream object detection tasks. SRA represents a promising step towards simpler, more effective, and practical AutoDA designs applicable to a variety of future tasks. Our code is available at https://github.com/ainieli/Sample-awareRandAugment.
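
As a rough illustration of the sample-aware idea — scoring each training sample and scaling augmentation strength accordingly — consider the sketch below. The loss-based score and the linear magnitude rule are hypothetical stand-ins, not SRA's actual scoring module:

```python
import random
import torch
import torchvision.transforms as T

AUG_POOL = [  # each entry builds a transform from a magnitude m in [0, 1]
    lambda m: T.ColorJitter(brightness=m),
    lambda m: T.RandomRotation(degrees=30 * m),
    lambda m: T.RandomAffine(degrees=0, translate=(0.3 * m, 0.3 * m)),
]

@torch.no_grad()
def difficulty_score(model, image, label):
    """Heuristic: higher loss -> 'harder' sample (one possible score)."""
    logits = model(image.unsqueeze(0))
    return torch.nn.functional.cross_entropy(
        logits, torch.tensor([label])).item()

def sample_aware_augment(model, image, label, max_score=5.0):
    # One plausible asymmetric rule, invented for illustration:
    # easy samples receive stronger augmentation, hard ones gentler.
    score = min(difficulty_score(model, image, label), max_score)
    magnitude = 1.0 - score / max_score   # in [0, 1]
    op = random.choice(AUG_POOL)(magnitude)
    return op(image)
```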

[368] GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles

Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz

Main category: cs.CV

TL;DR: A hierarchical multi-agent framework improves text-to-image models for generating escape room puzzles by ensuring visual appeal, logical solidity, and intellectual stimulation.

DetailsMotivation: Addressing the limitations of base image models in handling spatial relationships and affordance reasoning for complex tasks like escape room puzzles.

Method: Proposes a hierarchical multi-agent framework with stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing, using iterative feedback for collaboration.

Result: Agent collaboration enhances output quality, improving solvability, avoiding shortcuts, and clarifying affordances while maintaining visual quality.

Conclusion: The framework successfully generates high-quality escape room puzzle images by leveraging specialized agents and structured stages.

Abstract: We challenge text-to-image models with generating escape room puzzle images that are visually appealing, logically solid, and intellectually stimulating. While base image models struggle with spatial relationships and affordance reasoning, we propose a hierarchical multi-agent framework that decomposes this task into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents collaborate through iterative feedback to ensure the scene is visually coherent and functionally solvable. Experiments show that agent collaboration improves output quality in terms of solvability, shortcut avoidance, and affordance clarity, while maintaining visual quality.

[369] Mitigating Biases in Surgical Operating Rooms with Geometry

Tony Danjun Wang, Tobias Czempiel, Nassir Navab, Lennart Bastian

Main category: cs.CV

TL;DR: Deep neural networks in ORs often rely on spurious correlations like footwear or eyewear due to standardized attire. Using 3D point cloud sequences to encode personnel avoids these biases, outperforming RGB models in realistic settings.

DetailsMotivation: Standardized attire in ORs introduces biases in deep learning models, hindering accurate recognition of personalized workflow traits like skill level or coordination.

Method: Gradient-based saliency analysis on OR datasets revealed CNN biases. Personnel were encoded as 3D point cloud sequences to separate identity-relevant features from appearance-based confounders.

Result: RGB model accuracy dropped by 12% in realistic clinical settings, while geometric methods maintained performance by capturing meaningful biometric features.

Conclusion: Geometric representations (3D point clouds) are more robust for modeling OR personnel, avoiding biases from standardized attire and improving accuracy in real-world scenarios.

Abstract: Deep neural networks are prone to learning spurious correlations, exploiting dataset-specific artifacts rather than meaningful features for prediction. In surgical operating rooms (OR), these manifest through the standardization of smocks and gowns that obscure robust identifying landmarks, introducing model bias for tasks related to modeling OR personnel. Through gradient-based saliency analysis on two public OR datasets, we reveal that CNN models succumb to such shortcuts, fixating on incidental visual cues such as footwear beneath surgical gowns, distinctive eyewear, or other role-specific identifiers. Avoiding such biases is essential for the next generation of intelligent assistance systems in the OR, which should accurately recognize personalized workflow traits, such as surgical skill level or coordination with other staff members. We address this problem by encoding personnel as 3D point cloud sequences, disentangling identity-relevant shape and motion patterns from appearance-based confounders. Our experiments demonstrate that while RGB and geometric methods achieve comparable performance on datasets with apparent simulation artifacts, RGB models suffer a 12% accuracy drop in realistic clinical settings with decreased visual diversity due to standardization. This performance gap confirms that geometric representations capture more meaningful biometric features, providing an avenue toward developing robust methods for modeling humans in the OR.

[370] PrIINeR: Towards Prior-Informed Implicit Neural Representations for Accelerated MRI

Ziad Al-Haj Hemidi, Eytan Kats, Mattias P. Heinrich

Main category: cs.CV

TL;DR: PrIINeR integrates prior knowledge from pre-trained models into Implicit Neural Representations (INRs) for improved MRI reconstruction, outperforming state-of-the-art methods.

DetailsMotivation: Accelerated MRI often degrades image quality; INRs struggle at high acceleration due to weak prior constraints.

Method: PrIINeR combines population-level knowledge with instance-based optimization and enforces dual data consistency.

Result: Outperforms INR-based and some learning-based methods, improving structural preservation and reducing aliasing artefacts.

Conclusion: PrIINeR bridges deep learning and INR techniques, offering reliable high-quality MRI reconstruction.

Abstract: Accelerating Magnetic Resonance Imaging (MRI) reduces scan time but often degrades image quality. While Implicit Neural Representations (INRs) show promise for MRI reconstruction, they struggle at high acceleration factors due to weak prior constraints, leading to structural loss and aliasing artefacts. To address this, we propose PrIINeR, an INR-based MRI reconstruction method that integrates prior knowledge from pre-trained deep learning models into the INR framework. By combining population-level knowledge with instance-based optimization and enforcing dual data consistency, PrIINeR aligns both with the acquired k-space data and the prior-informed reconstruction. Evaluated on the NYU fastMRI dataset, our method not only outperforms state-of-the-art INR-based approaches but also improves upon several learning-based state-of-the-art methods, significantly improving structural preservation and fidelity while effectively removing aliasing artefacts. PrIINeR bridges deep learning and INR-based techniques, offering a more reliable solution for high-quality, accelerated MRI reconstruction. The code is publicly available at https://github.com/multimodallearning/PrIINeR.
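
The dual data consistency idea — fitting the INR against both the acquired k-space samples and a prior model's output — can be written as a two-term objective. A minimal sketch assuming single-coil Cartesian sampling; the names `inr_image`, `prior_recon`, and `mask` are illustrative:

```python
import torch

def dual_consistency_loss(inr_image, kspace, mask, prior_recon, lam=0.1):
    """inr_image:   (H, W) complex image predicted by the INR
    kspace:      (H, W) acquired k-space (zeros where not sampled)
    mask:        (H, W) binary sampling mask
    prior_recon: (H, W) reconstruction from a pre-trained prior model"""
    pred_kspace = torch.fft.fft2(inr_image)
    # 1) consistency with the actually measured k-space samples
    data_term = ((pred_kspace - kspace) * mask).abs().pow(2).mean()
    # 2) consistency with the prior-informed reconstruction
    prior_term = (inr_image - prior_recon).abs().pow(2).mean()
    return data_term + lam * prior_term
```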

[371] TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation

Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille

Main category: cs.CV

TL;DR: The paper introduces TRIDE, a radar-camera fusion algorithm for depth estimation, incorporating weather-aware fusion and text features, achieving significant performance improvements.

DetailsMotivation: Existing depth estimation methods lack weather adaptation and underutilize language descriptions, despite radar's robustness in adverse conditions.

Method: Proposes TRIDE, combining radar-camera fusion with text features and a weather-aware block to adjust radar weighting dynamically.

Result: Achieves 12.87% MAE and 9.08% RMSE improvements on nuScenes dataset.

Conclusion: TRIDE enhances depth estimation accuracy by integrating radar, weather awareness, and language features, outperforming state-of-the-art methods.

Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: https://github.com/harborsarah/TRIDE
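
One plausible reading of a weather-aware fusion block is a gate that predicts per-channel radar weights from a weather embedding. The module below is a hypothetical sketch along those lines, not the paper's implementation:

```python
import torch
import torch.nn as nn

class WeatherAwareFusion(nn.Module):
    """Gate radar features with a weight predicted from weather conditions."""
    def __init__(self, feat_dim=256, weather_dim=32):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(weather_dim, feat_dim),
            nn.Sigmoid(),  # per-channel radar weight in (0, 1)
        )
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, cam_feat, radar_feat, weather_emb):
        # cam_feat, radar_feat: (B, C, H, W); weather_emb: (B, weather_dim)
        w = self.gate(weather_emb)[:, :, None, None]  # (B, C, 1, 1)
        radar_weighted = radar_feat * w  # adverse weather -> larger weights
        return self.fuse(torch.cat([cam_feat, radar_weighted], dim=1))
```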

[372] MDD-Net: Multimodal Depression Detection through Mutual Transformer

Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray

Main category: cs.CV

TL;DR: A multimodal depression detection network (MDD-Net) using acoustic and visual data from social media outperforms state-of-the-art methods by up to 17.37% in F1-Score.

DetailsMotivation: Depression severely impacts well-being, and social media data offers a simple way to study mental health.

Method: MDD-Net uses mutual transformers to extract and fuse acoustic and visual features for depression detection.

Result: MDD-Net achieves a 17.37% higher F1-Score than existing methods on the D-Vlog dataset.

Conclusion: The proposed multimodal approach is highly effective for depression detection.

Abstract: Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The simple nature of data collection from social media platforms has attracted significant interest in properly utilizing this information for mental health research. A Multimodal Depression Detection Network (MDD-Net), utilizing acoustic and visual data obtained from social media networks, is proposed in this work, where mutual transformers are exploited to efficiently extract and fuse multimodal features for depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing the correlations among the generated features and fusing these features from multiple modalities, and a detection layer for detecting depression using the fused feature representations. Extensive experiments are performed using the multimodal D-Vlog dataset, and the findings reveal that the developed multimodal depression detection network surpasses the state-of-the-art by up to 17.37% in F1-Score, demonstrating the superior performance of the proposed system. The source code is accessible at https://github.com/rezwanh001/Multimodal-Depression-Detection.
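
Mutual-transformer fusion maps naturally onto a pair of cross-attention passes, one per direction. A self-contained sketch (dimensions and the mean-pooling fusion are illustrative choices):

```python
import torch
import torch.nn as nn

class MutualTransformer(nn.Module):
    """Symmetric cross-attention: audio attends to video and vice versa."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, acoustic, visual):
        # acoustic: (B, Ta, dim), visual: (B, Tv, dim)
        a_enriched, _ = self.a2v(acoustic, visual, visual)    # audio queries video
        v_enriched, _ = self.v2a(visual, acoustic, acoustic)  # video queries audio
        fused = torch.cat([a_enriched.mean(1), v_enriched.mean(1)], dim=-1)
        return fused  # (B, 2*dim), fed to a detection head
```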

[373] S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix

Peng Dai, Feitong Tan, Qiangeng Xu, Yihua Huang, David Futschik, Ruofei Du, Sean Fanello, Yinda Zhang, Xiaojuan Qi

Main category: cs.CV

TL;DR: A pose-free, training-free method for converting monocular videos into immersive 3D stereoscopic or spatial videos using depth estimation and a novel frame matrix inpainting framework.

DetailsMotivation: Addressing the underexplored challenge of generating 3D stereoscopic and spatial videos from monocular video models for immersive applications.

Method: Leverages an off-the-shelf monocular video generation model, warps videos into pre-defined camera viewpoints using depth, and applies a frame matrix inpainting framework for consistency. Includes a dual-update scheme to improve quality.

Result: Significant improvement over previous methods, validated on videos from models like Sora, Lumiere, WALT, and Zeroscope.

Conclusion: The proposed method effectively generates high-quality 3D videos without additional training, advancing immersive video synthesis.

Abstract: While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a dual-update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method has a significant improvement over previous methods. Project page at: https://daipengwa.github.io/S-2VG_ProjectPage/

[374] Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations

Alexa R. Tartaglini, Sheridan Feucht, Michael A. Lepori, Wai Keen Vong, Charles Lovering, Brenden M. Lake, Ellie Pavlick

Main category: cs.CV

TL;DR: The paper investigates whether deep neural networks (DNNs) can learn and generalize same-different relations, finding that certain pretrained transformers excel at this task, especially when fine-tuned on abstract shapes.

DetailsMotivation: Prior work shows DNNs struggle with simple abstract relations like same-different, despite excelling in object recognition. This study aims to comprehensively test their ability to learn and generalize such relations.

Method: The study evaluates various DNN architectures, pretraining methods, and fine-tuning datasets, focusing on same-different relation tasks with both within- and out-of-distribution stimuli.

Result: Pretrained transformers achieve near-perfect generalization for same-different relations, particularly when fine-tuned on abstract shapes without texture or color.

Conclusion: With the right approach, DNNs can learn and generalize same-different visual relations effectively.

Abstract: Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images of two same or two different abstract shapes, testing generalization on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within and out-of-distribution using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization. Our results suggest that, with the right approach, deep neural networks can learn generalizable same-different visual relations.

[375] Information Bottleneck-based Causal Attention for Multi-label Medical Image Recognition

Xiaoxiao Cui, Yiran Li, Kai He, Shanzhi Jiang, Mengli Xue, Wentao Li, Junhong Leng, Zhi Liu, Lizhen Cui, Shuo Li

Main category: cs.CV

TL;DR: The paper proposes a novel Information Bottleneck-based Causal Attention (IBCA) method for multi-label classification (MLC) of medical images, addressing the challenge of interpreting class-specific features by filtering irrelevant information and enhancing causal attention.

DetailsMotivation: Current methods for MLC struggle with interpreting true causes due to attention to irrelevant features, limiting accurate diagnosis and interpretability.

Method: The authors introduce a structural causal model (SCM) and IBCA, which uses Gaussian mixture multi-label spatial attention and contrastive enhancement-based causal intervention to filter irrelevant information and align attention patterns.

Result: IBCA outperforms existing methods, showing significant improvements in CR, OR, and mAP on the Endo and MuReD datasets.

Conclusion: The proposed IBCA effectively addresses the limitations of current MLC methods, improving accuracy and interpretability in medical image classification.

Abstract: Multi-label classification (MLC) of medical images aims to identify multiple diseases and holds significant clinical potential. A critical step is to effectively learn class-specific features for accurate diagnosis and improved interpretability. However, current works focus primarily on causal attention to learn class-specific features, yet they struggle to interpret the true cause due to inadvertent attention to class-irrelevant features. To address this challenge, we propose a new structural causal model (SCM) that treats class-specific attention as a mixture of causal, spurious, and noisy factors, and a novel Information Bottleneck-based Causal Attention (IBCA) that is capable of learning the discriminative class-specific attention for MLC of medical images. Specifically, we propose learning Gaussian mixture multi-label spatial attention to filter out class-irrelevant information and capture each class-specific attention pattern. Then a contrastive enhancement-based causal intervention is proposed to gradually mitigate the spurious attention and reduce noise information by aligning multi-head attention with the Gaussian mixture multi-label spatial attention. Quantitative and ablation results on Endo and MuReD show that IBCA outperforms all compared methods. Compared to the second-best results for each metric, IBCA achieves improvements of 6.35% in CR, 7.72% in OR, and 5.02% in mAP on MuReD, and of 1.47% in CR, 1.65% in CF1, and 1.42% in mAP on Endo.

[376] Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning

Yan Wang, Da-Wei Zhou, Han-Jia Ye

Main category: cs.CV

TL;DR: The paper proposes TUNA, a method integrating Task-Specific and Universal Adapters for Class-Incremental Learning (CIL) to improve performance by leveraging both specialized and shared knowledge.

DetailsMotivation: Existing CIL methods using pre-trained models often freeze the network and use lightweight modules, leading to incorrect module selection and overlooking shared knowledge, causing errors in distinguishing similar classes.

Method: TUNA trains task-specific adapters for crucial task features and uses an entropy-based selection mechanism. It also employs adapter fusion to create a universal adapter for shared discriminative features, combining both for inference.

Result: Extensive experiments show TUNA achieves state-of-the-art performance on benchmark datasets.

Conclusion: TUNA effectively addresses challenges in CIL by combining task-specific and universal adapters, improving accuracy and leveraging shared knowledge.

Abstract: Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Existing pre-trained model-based CIL methods often freeze the pre-trained network and adapt to incremental tasks using additional lightweight modules such as adapters. However, incorrect module selection during inference hurts performance, and task-specific modules often overlook shared general knowledge, leading to errors on distinguishing between similar classes across tasks. To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. Specifically, we train task-specific adapters to capture the most crucial features relevant to their respective tasks and introduce an entropy-based selection mechanism to choose the most suitable adapter. Furthermore, we leverage an adapter fusion strategy to construct a universal adapter, which encodes the most discriminative features shared across tasks. We combine task-specific and universal adapter predictions to harness both specialized and general knowledge during inference. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of our approach. Code is available at: https://github.com/LAMDA-CL/ICCV2025-TUNA
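
The entropy-based selection step can be sketched directly: score each task-specific adapter by the entropy of its prediction distribution, keep the most confident one, and combine it with the universal adapter. All names below are hypothetical, and the 50/50 logit average is one simple combination choice:

```python
import torch

def predict_with_tuna(x, task_adapters, universal_adapter):
    """task_adapters: list of callables mapping x -> logits
    universal_adapter: callable mapping x -> logits"""
    best_logits, best_entropy = None, float("inf")
    for adapter in task_adapters:
        logits = adapter(x)
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean().item()
        if entropy < best_entropy:        # most confident adapter wins
            best_entropy, best_logits = entropy, logits
    # combine specialized and shared knowledge for the final prediction
    return 0.5 * (best_logits + universal_adapter(x))
```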

[377] ME-TST+: Micro-expression Analysis via Temporal State Transition with ROI Relationship Awareness

Zizheng Guo, Bochao Zou, Junbao Zhuo, Huimin Ma

Main category: cs.CV

TL;DR: The paper proposes ME-TST and ME-TST+, state space model-based architectures for micro-expression analysis, improving spotting and recognition by replacing window-level classification with video-level regression and leveraging their synergy.

DetailsMotivation: Address limitations of fixed window lengths and hard classification in existing deep learning methods, and the separation of ME spotting and recognition tasks.

Method: Introduces ME-TST and ME-TST+ with temporal state transition mechanisms, multi-granularity ROI modeling, and a slowfast Mamba framework for time-series tasks.

Result: Achieves state-of-the-art performance in ME analysis.

Conclusion: The proposed methods effectively model ME dynamics and enhance analysis performance by integrating spotting and recognition.

Abstract: Micro-expressions (MEs) are regarded as important indicators of an individual’s intrinsic emotions, preferences, and tendencies. ME analysis requires spotting of ME intervals within long video sequences and recognition of their corresponding emotional categories. Previous deep learning approaches commonly employ sliding-window classification networks. However, the use of fixed window lengths and hard classification presents notable limitations in practice. Furthermore, these methods typically treat ME spotting and recognition as two separate tasks, overlooking the essential relationship between them. To address these challenges, this paper proposes two state space model-based architectures, namely ME-TST and ME-TST+, which utilize temporal state transition mechanisms to replace conventional window-level classification with video-level regression. This enables a more precise characterization of the temporal dynamics of MEs and supports the modeling of MEs with varying durations. In ME-TST+, we further introduce multi-granularity ROI modeling and the slowfast Mamba framework to alleviate information loss associated with treating ME analysis as a time-series task. Additionally, we propose a synergy strategy for spotting and recognition at both the feature and result levels, leveraging their intrinsic connection to enhance overall analysis performance. Extensive experiments demonstrate that the proposed methods achieve state-of-the-art performance. The codes are available at https://github.com/zizheng-guo/ME-TST.
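
Video-level regression means the network outputs a continuous per-frame ME intensity, and intervals are read off in post-processing rather than by classifying fixed windows. A toy sketch of that final step (threshold and minimum length are arbitrary here):

```python
import numpy as np

def intensities_to_intervals(scores, thresh=0.5, min_len=3):
    """scores: (T,) per-frame regression output in [0, 1].
    Returns [(start, end)] index pairs for contiguous supra-threshold runs."""
    active = scores >= thresh
    intervals, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            if t - start >= min_len:
                intervals.append((start, t - 1))
            start = None
    if start is not None and len(scores) - start >= min_len:
        intervals.append((start, len(scores) - 1))
    return intervals

print(intensities_to_intervals(np.array([0.1, 0.7, 0.8, 0.9, 0.2, 0.6])))
# [(1, 3)] -- the lone final frame above threshold is shorter than min_len
```

Because the interval boundaries come from the regressed curve itself, MEs of varying durations fall out naturally instead of being forced into a fixed window length.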

[378] SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming-Ming Cheng, Jian Yang

Main category: cs.CV

TL;DR: A new benchmark dataset (SARDet-100K) and open-source method (MSFA) are introduced to address challenges in SAR object detection, improving performance and generalizability.

DetailsMotivation: Limited public datasets and inaccessible source code hinder SAR object detection research.

Method: Created SARDet-100K dataset and proposed Multi-Stage with Filter Augmentation (MSFA) pretraining framework.

Result: MSFA significantly enhances SAR object detection performance and generalizability.

Conclusion: This work advances SAR object detection by providing a high-quality dataset and effective pretraining framework.

Abstract: Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code are available at https://github.com/zcablii/SARDet_100K.
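
Although the abstract does not spell out the filters, a "filter augmentation" input pipeline for bridging the RGB-to-SAR gap might stack handcrafted filter responses into a pseudo-RGB tensor so an RGB-pretrained backbone sees richer, less speckle-dominated structure. A hypothetical sketch:

```python
import numpy as np
from scipy import ndimage

def filter_augment(sar):
    """sar: (H, W) single-channel SAR intensity image.
    Returns a (3, H, W) stack of handcrafted filter responses."""
    smoothed = ndimage.gaussian_filter(sar, sigma=2.0)  # despeckled base
    edges = ndimage.sobel(smoothed)                     # edge response
    laplace = ndimage.laplace(smoothed)                 # fine structure
    chans = [smoothed, edges, laplace]
    # normalize each channel to [0, 1] before feeding the backbone
    chans = [(c - c.min()) / (c.max() - c.min() + 1e-8) for c in chans]
    return np.stack(chans, axis=0)
```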

[379] Matrix-3D: Omnidirectional Explorable 3D World Generation

Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, Eric Li, Yang Liu, Yikai Wang, Hao-Xiang Guo, Yahui Zhou

Main category: cs.CV

TL;DR: Matrix-3D is a framework for generating explorable 3D worlds from single images or text prompts using panoramic representation, combining video generation and 3D reconstruction.

DetailsMotivation: Existing methods for 3D world generation have limited scope in scenes. Matrix-3D aims to overcome this by leveraging panoramic representation for wider coverage.

Method: The framework uses a trajectory-guided panoramic video diffusion model for scene video generation and two reconstruction methods: a feed-forward model for speed and an optimization-based pipeline for accuracy.

Result: Matrix-3D achieves state-of-the-art performance in panoramic video and 3D world generation, validated by extensive experiments.

Conclusion: The proposed framework advances 3D world generation by improving scope and quality, supported by the new Matrix-Pano dataset.

Abstract: Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes panoramic representation for wide-coverage, omnidirectional, explorable 3D world generation, combining conditional video generation and panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as condition, to enable high-quality and geometrically consistent scene video generation. To lift the panoramic scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more at https://matrix-3d.github.io.

[380] Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Yufei Zhan, Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang

Main category: cs.CV

TL;DR: Griffon v2 is a high-resolution generalist model addressing resolution limitations in vision-language tasks, enabling flexible object referring with visual and textual prompts.

DetailsMotivation: Overcome the resolution limitation in large vision-language models to improve performance in complex, dense scenarios and enable nuanced visual-language referring.

Method: Introduces a lightweight down-sampling projector to scale image resolution and a plug-and-play visual tokenizer for visual-language co-referring.

Result: Achieves state-of-the-art performance on REC, phrase grounding, and outperforms expert models in detection, counting, and REG.

Conclusion: Griffon v2 effectively addresses resolution constraints, enhancing multimodal perception and enabling flexible interaction in vision-language tasks.

Abstract: Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model’s potential to achieve nuanced visual and language referring in domains such as GUI agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG. Data and codes are released at https://github.com/jefferyZhan/Griffon.
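
A lightweight down-sampling projector of the kind described — compressing a high-resolution token grid before it reaches the LLM — could look like the following sketch (a strided convolution plus linear projection; all sizes are hypothetical):

```python
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    """Compress a grid of visual tokens before the LLM: a stride-2 conv
    halves each spatial side, so the token count drops by 4x."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.down = nn.Conv2d(vis_dim, vis_dim, kernel_size=2, stride=2)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, vis_dim) from the vision encoder, N = H * W
        B, N, C = tokens.shape
        H, W = grid_hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)
        x = self.down(x)                  # (B, C, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)  # (B, N/4, C)
        return self.proj(x)               # token sequence for the LLM
```

Quartering the token count is what makes high input resolutions affordable within the LLM's context window while keeping local detail in each surviving token.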

[381] 3D Plant Root Skeleton Detection and Extraction

Jiakai Lin, Jinchang Zhang, Ge Jin, Wenzhan Song, Tianming Liu, Guoyu Lu

Main category: cs.CV

TL;DR: A method for extracting 3D root skeletons from images is introduced, aiding in automated breeding by analyzing root architecture.

DetailsMotivation: Roots' complex 3D architecture and lack of texture make 2D studies insufficient; 3D analysis is vital for botany and breeding.

Method: Detects and matches lateral roots, triangulates skeletal structures, and integrates primary and lateral roots from images.

Result: The extracted 3D skeletons closely match ground truth, proving the method’s effectiveness.

Conclusion: This approach enhances automated breeding by improving root trait analysis, boosting efficiency and reducing manual effort.

Abstract: Plant roots typically exhibit a highly complex and dense architecture, incorporating numerous slender lateral roots and branches, which significantly hinders the precise capture and modeling of the entire root system. Additionally, roots often lack sufficient texture and color information, making it difficult to identify and track root traits using visual methods. Previous research on roots has been largely confined to 2D studies; however, exploring the 3D architecture of roots is crucial in botany. Since roots grow in real 3D space, 3D phenotypic information is more critical for studying genetic traits and their impact on root development. We have introduced a 3D root skeleton extraction method that efficiently derives the 3D architecture of plant roots from a few images. This method includes the detection and matching of lateral roots, triangulation to extract the skeletal structure of lateral roots, and the integration of lateral and primary roots. We developed a highly complex root dataset and tested our method on it. The extracted 3D root skeletons showed considerable similarity to the ground truth, validating the effectiveness of the model. This method can play a significant role in automated breeding robots. Through precise 3D root structure analysis, breeding robots can better identify plant phenotypic traits, especially root structure and growth patterns, helping practitioners select seeds with superior root systems. This automated approach not only improves breeding efficiency but also reduces manual intervention, making the breeding process more intelligent and efficient, thus advancing modern agriculture.
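
The triangulation step at the core of the pipeline is standard two-view geometry. A minimal sketch using OpenCV, assuming calibrated projection matrices and already-matched lateral-root keypoints:

```python
import cv2
import numpy as np

def triangulate_lateral_root(P1, P2, pts1, pts2):
    """P1, P2: (3, 4) camera projection matrices for two views.
    pts1, pts2: (2, N) matched lateral-root keypoints in each image.
    Returns (N, 3) 3D points along the root skeleton."""
    pts4d = cv2.triangulatePoints(P1, P2,
                                  pts1.astype(np.float64),
                                  pts2.astype(np.float64))
    pts3d = (pts4d[:3] / pts4d[3]).T  # dehomogenize
    return pts3d
```

Repeating this per matched lateral root and stitching the resulting polylines onto the primary root yields the full 3D skeleton the paper describes.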

[382] TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning

Junzhe Xu, Yuyang Yin, Xi Chen

Main category: cs.CV

TL;DR: TBAC-UniImage integrates a Diffusion Model with a Multimodal Large Language Model (MLLM) for unified understanding and generation, using multiple MLLM layers as generative conditions.

DetailsMotivation: Overcome limitations of shallow connections in diffusion-based unified models and high computational costs of pretraining from scratch.

Method: Uses representations from diverse MLLM layers as generative conditions for the diffusion model, treating it as a ‘ladder’.

Result: Achieves deeper and more fine-grained unification of understanding and generation.

Conclusion: TBAC-UniImage presents a novel, efficient paradigm for multimodal tasks.

Abstract: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM’s final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM’s intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM’s understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
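
The ladder idea — conditioning the diffusion generator on several depths of the MLLM rather than its final state alone — might be wired up as below. Layer indices and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LadderCondition(nn.Module):
    """Project hidden states from several MLLM layers into one condition
    sequence for the diffusion model (layer choice is hypothetical)."""
    def __init__(self, mllm_dim=4096, cond_dim=1024, tap_layers=(8, 16, 24, 32)):
        super().__init__()
        self.tap_layers = tap_layers
        self.projs = nn.ModuleList(
            nn.Linear(mllm_dim, cond_dim) for _ in tap_layers)

    def forward(self, hidden_states):
        # hidden_states: list of (B, T, mllm_dim), one entry per MLLM layer
        conds = [proj(hidden_states[i])
                 for i, proj in zip(self.tap_layers, self.projs)]
        return torch.cat(conds, dim=1)  # (B, 4*T, cond_dim) cross-attn context
```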

[383] Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma

Main category: cs.CV

TL;DR: Follow-Your-Shape is a training-free, mask-free framework for precise object shape editing, preserving non-target content via Trajectory Divergence Maps and Scheduled KV Injection.

DetailsMotivation: Addressing the limitations of flow-based image editing models in large-scale shape transformations, which often degrade background quality or fail to achieve intended edits.

Method: Uses Trajectory Divergence Maps (TDM) to localize editable regions and Scheduled KV Injection for stable editing. Introduces ReShapeBench for evaluation.

Result: Superior editability and visual fidelity, especially in large-scale shape replacement tasks.

Conclusion: The proposed method effectively addresses challenges in shape-aware editing, offering precise control and high-quality results.

Abstract: While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios – particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
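
A Trajectory Divergence Map boils down to a per-token statistic over the velocity gap between the two paths. The sketch below aggregates with a mean over timesteps and thresholds at a quantile, which are illustrative choices rather than the paper's exact recipe:

```python
import torch

def trajectory_divergence_map(v_inversion, v_denoise, quantile=0.7):
    """v_inversion, v_denoise: (T, N, D) token-wise velocities along the
    inversion and denoising paths (T steps, N tokens).
    Returns the (N,) divergence map and a binary mask of editable tokens."""
    # per-token divergence, accumulated over timesteps
    tdm = (v_inversion - v_denoise).norm(dim=-1).mean(dim=0)  # (N,)
    thresh = torch.quantile(tdm, quantile)  # keep the most divergent tokens
    return tdm, (tdm >= thresh).float()
```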

[384] LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang

Main category: cs.CV

TL;DR: LVBench is a new benchmark for long video understanding, addressing gaps in current multimodal models for tasks requiring extended comprehension.

DetailsMotivation: Existing models excel at short videos but fail in real-world applications needing long video understanding (e.g., movies, sports).

Method: LVBench includes diverse tasks and publicly sourced long videos to test models’ long-term memory and comprehension.

Result: Current models underperform on long video tasks, highlighting the need for improvement.

Conclusion: LVBench aims to drive advancements in long video comprehension models; data and code are publicly available.

Abstract: Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: https://lvbench.github.io.

[385] FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting

Yitong Yang, Yinglin Wang, Changshuo Wang, Huajie Wang, Shuting He

Main category: cs.CV

TL;DR: FantasyStyle introduces a 3DGS-based style transfer framework using diffusion model distillation to address multi-view inconsistency and VGG reliance, achieving superior stylization quality.

DetailsMotivation: Current 3DGS-based style transfer methods suffer from multi-view inconsistency and reliance on VGG features, leading to style conflicts and content leakage.

Method: FantasyStyle employs Multi-View Frequency Consistency and Controllable Stylized Distillation, leveraging diffusion model distillation and negative guidance to optimize 3D Gaussians.

Result: The method outperforms state-of-the-art approaches, delivering higher stylization quality and visual realism across diverse scenes and styles.

Conclusion: FantasyStyle effectively addresses key challenges in 3D style transfer, offering a robust and controllable framework for high-quality stylization.

Abstract: The success of 3DGS in generative and editing applications has sparked growing interest in 3DGS-based style transfer. However, current methods still face two major challenges: (1) multi-view inconsistency often leads to style conflicts, resulting in appearance smoothing and distortion; and (2) heavy reliance on VGG features, which struggle to disentangle style and content from style images, often causing content leakage and excessive stylization. To tackle these issues, we introduce FantasyStyle, a 3DGS-based style transfer framework, and the first to rely entirely on diffusion model distillation. It comprises two key components: (1) Multi-View Frequency Consistency. We enhance cross-view consistency by applying a 3D filter to multi-view noisy latents, selectively reducing low-frequency components to mitigate stylized prior conflicts. (2) Controllable Stylized Distillation. To suppress content leakage from style images, we introduce negative guidance to exclude undesired content. In addition, we identify the limitations of Score Distillation Sampling and Delta Denoising Score in 3D style transfer and remove the reconstruction term accordingly. Building on these insights, we propose a controllable stylized distillation that leverages negative guidance to more effectively optimize the 3D Gaussians. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving higher stylization quality and visual realism across various scenes and styles.
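
As a rough single-view analogue of the low-frequency filtering step (the paper applies a 3D filter across views; this sketch filters each latent independently for brevity, with invented cutoff and strength values):

```python
import torch

def attenuate_low_freq(latents, cutoff=0.25, strength=0.5):
    """latents: (V, C, H, W) multi-view noisy latents.
    Damp low-frequency components, where cross-view style conflicts tend
    to live, while leaving high-frequency detail intact."""
    freq = torch.fft.fftshift(torch.fft.fft2(latents), dim=(-2, -1))
    _, _, H, W = latents.shape
    yy = torch.linspace(-0.5, 0.5, H, device=latents.device).view(-1, 1).expand(H, W)
    xx = torch.linspace(-0.5, 0.5, W, device=latents.device).view(1, -1).expand(H, W)
    low_band = ((yy ** 2 + xx ** 2).sqrt() < cutoff).to(latents.dtype)
    scale = 1.0 - strength * low_band   # shrink only the low-frequency band
    freq = freq * scale
    out = torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1)))
    return out.real
```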

[386] ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction

Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, Wenjun Mei

Main category: cs.CV

TL;DR: ReconDreamer-RL integrates video diffusion priors into scene reconstruction for autonomous driving training, narrowing the sim2real gap and improving performance.

DetailsMotivation: The sim2real gap in autonomous driving simulations limits realism and generalization, especially for novel or corner-case scenarios.

Method: Proposes ReconSimulator (video diffusion + kinematic modeling), Dynamic Adversary Agent (DAA) for corner cases, and Cousin Trajectory Generator (CTG) to diversify training data.

Result: Outperforms imitation learning with a 5x reduction in Collision Ratio.

Conclusion: ReconDreamer-RL effectively bridges the sim2real gap and enhances training for end-to-end autonomous driving.

Abstract: Reinforcement learning for training end-to-end autonomous driving models in closed-loop simulations is gaining growing attention. However, most simulation environments differ significantly from real-world conditions, creating a substantial simulation-to-reality (sim2real) gap. To bridge this gap, some approaches utilize scene reconstruction techniques to create photorealistic environments as a simulator. While this improves realistic sensor simulation, these methods are inherently constrained by the distribution of the training data, making it difficult to render high-quality sensor data for novel trajectories or corner case scenarios. Therefore, we propose ReconDreamer-RL, a framework designed to integrate video diffusion priors into scene reconstruction to aid reinforcement learning, thereby enhancing end-to-end autonomous driving training. Specifically, in ReconDreamer-RL, we introduce ReconSimulator, which combines the video diffusion prior for appearance modeling and incorporates a kinematic model for physical modeling, thereby reconstructing driving scenarios from real-world data. This narrows the sim2real gap for closed-loop evaluation and reinforcement learning. To cover more corner-case scenarios, we introduce the Dynamic Adversary Agent (DAA), which adjusts the trajectories of surrounding vehicles relative to the ego vehicle, autonomously generating corner-case traffic scenarios (e.g., cut-in). Finally, the Cousin Trajectory Generator (CTG) is proposed to address the issue of training data distribution, which is often biased toward simple straight-line movements. Experiments show that ReconDreamer-RL improves end-to-end autonomous driving training, outperforming imitation learning methods with a 5x reduction in the Collision Ratio.
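
The Dynamic Adversary Agent's cut-in generation can be pictured as bending a recorded neighbor trajectory toward the ego vehicle. The sketch below is a deliberately simplified toy assuming ego-aligned coordinates; the actual DAA is described only at a high level in the abstract:

```python
import numpy as np

def make_cutin_trajectory(ego_xy, actor_xy, start=20, duration=30, gap=2.0):
    """ego_xy, actor_xy: (T, 2) recorded positions. Blend the actor toward
    a point `gap` meters ahead of the ego over `duration` steps, turning a
    benign pass into a cut-in (toy adversarial scenario edit)."""
    actor = actor_xy.copy()
    T = len(actor)
    for t in range(start, min(start + duration, T)):
        alpha = (t - start) / duration           # ramp from 0 to 1
        target = ego_xy[t] + np.array([gap, 0])  # just ahead of the ego
        actor[t] = (1 - alpha) * actor[t] + alpha * target
    return actor
```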

[387] CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data

Chongke Bi, Xin Gao, Jiangkang Deng, Guan

Main category: cs.CV

TL;DR: CD-TVD is a framework combining contrastive learning and diffusion-based super-resolution to achieve accurate 3D super-resolution from limited high-resolution data, reducing reliance on large datasets.

DetailsMotivation: Large-scale simulations require costly high-resolution data, and existing super-resolution methods need extensive HR training data, limiting their versatility.

Method: CD-TVD uses contrastive learning and an improved diffusion model with local attention, pre-trained on historical data and fine-tuned with minimal new HR data.

Result: Experiments show CD-TVD provides accurate, resource-efficient 3D super-resolution for fluid and atmospheric simulations.

Conclusion: CD-TVD advances data augmentation for scientific simulations by minimizing HR data dependency while recovering fine details.

Abstract: Large-scale scientific simulations require significant resources to generate high-resolution time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of HR training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we propose CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion super-resolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at https://github.com/Xin-Gao-private/CD-TVD.
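
The contrastive pre-training stage presumably pulls matched high/low-resolution embeddings together and pushes mismatched ones apart. An InfoNCE-style sketch of such an objective (not the authors' exact loss):

```python
import torch
import torch.nn.functional as F

def degradation_contrastive_loss(z_hr, z_lr, temperature=0.1):
    """z_hr, z_lr: (B, D) embeddings of matched HR/LR patches from the
    contrastive encoder. Matched pairs attract; all others repel."""
    z_hr = F.normalize(z_hr, dim=-1)
    z_lr = F.normalize(z_lr, dim=-1)
    logits = z_hr @ z_lr.t() / temperature  # (B, B) pairwise similarities
    targets = torch.arange(z_hr.size(0), device=z_hr.device)
    return F.cross_entropy(logits, targets)  # diagonal = positive pairs
```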

[388] In-Situ Fine-Tuning of Wildlife Models in IoT-Enabled Camera Traps for Efficient Adaptation

Mohammad Mehdi Rastikerdar, Jin Huang, Hui Guan, Deepak Ganesan

Main category: cs.CV

TL;DR: WildFit is an autonomous in-situ adaptation framework for IoT devices that addresses domain shifts in deep learning models by combining background-aware synthesis and drift-aware fine-tuning, achieving higher accuracy and resource efficiency.

DetailsMotivation: Resource-constrained IoT devices face accuracy drops due to domain shifts (e.g., lighting, weather changes), and traditional cloud-based retraining is impractical due to limited connectivity and energy constraints.

Method: WildFit uses background-aware synthesis to generate training samples on-device and drift-aware fine-tuning to trigger updates only when necessary, conserving resources.

Result: WildFit outperforms baselines by 7.3%, diffusion models by 3.0%, achieves Pareto optimality with 50% fewer updates, and consumes only 11.2 Wh over 37 days.

Conclusion: WildFit enables battery-powered IoT deployments by efficiently adapting models to changing conditions without relying on cloud retraining.

Abstract: Resource-constrained IoT devices increasingly rely on deep learning models; however, these models experience significant accuracy drops due to domain shifts when encountering variations in lighting, weather, and seasonal conditions. While cloud-based retraining can address this issue, many IoT deployments operate with limited connectivity and energy constraints, making traditional fine-tuning approaches impractical. We explore this challenge through the lens of wildlife ecology, where camera traps must maintain accurate species classification across changing seasons, weather, and habitats without reliable connectivity. We introduce WildFit, an autonomous in-situ adaptation framework that leverages the key insight that background scenes change more frequently than the visual characteristics of monitored species. WildFit combines background-aware synthesis to generate training samples on-device with drift-aware fine-tuning that triggers model updates only when necessary to conserve resources. Our background-aware synthesis surpasses efficient baselines by 7.3% and diffusion models by 3.0% while being orders of magnitude faster, our drift-aware fine-tuning achieves Pareto optimality with 50% fewer updates and 1.5% higher accuracy, and the end-to-end system outperforms domain adaptation approaches by 20–35% while consuming only 11.2 Wh over 37 days – enabling battery-powered deployment.
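
Drift-aware fine-tuning reduces to a cheap online test: keep a reference window of confidence statistics and trigger an update only when a recent window drifts past a threshold. A hypothetical sketch (the statistic and threshold are invented for illustration, not WildFit's detector):

```python
import numpy as np

class DriftAwareTrigger:
    """Fine-tune only when recent prediction confidence drifts away from a
    reference window collected under known-good conditions."""
    def __init__(self, window=200, drop_threshold=0.08):
        self.ref, self.recent = [], []
        self.window, self.drop_threshold = window, drop_threshold

    def update(self, max_softmax_prob):
        if len(self.ref) < self.window:
            self.ref.append(max_softmax_prob)  # build reference statistics
            return False
        self.recent.append(max_softmax_prob)
        self.recent = self.recent[-self.window:]  # sliding recent window
        if len(self.recent) < self.window:
            return False
        drifted = np.mean(self.ref) - np.mean(self.recent) > self.drop_threshold
        if drifted:
            self.recent.clear()  # fine-tune, then start a fresh window
        return drifted
```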

[389] 3D Human Mesh Estimation from Single View RGBD

Ozhan Suat, Bedirhan Uguz, Batuhan Karagoz, Muhammed Can Keles, Emre Akbas

Main category: cs.CV

TL;DR: A method named M$^3$ (Masked Mesh Modeling) is introduced for accurate 3D human mesh estimation from a single RGBD view, leveraging MoCap datasets to overcome data scarcity. It outperforms existing methods on benchmark datasets.

DetailsMotivation: RGBD cameras are underutilized despite their affordability and potential for 3D human mesh estimation. Existing datasets are limited, so the paper aims to address this gap.

Method: The method uses MoCap datasets to simulate partial single-view meshes, trains a masked autoencoder to complete them, and matches depth data to a template mesh during inference.

Result: M$^3$ achieves 16.8 mm and 22.0 mm PVE on SURREAL and CAPE datasets, outperforming methods using full-body point clouds. It also beats an RGB-based method by 18.4 mm on BEHAVE.

Conclusion: The proposed M$^3$ method effectively utilizes RGBD data and outperforms existing approaches, demonstrating the value of depth data in 3D human mesh estimation.

Abstract: Despite significant progress in 3D human mesh estimation from RGB images, RGBD cameras, which offer additional depth data, remain underutilized. In this paper, we present a method for accurate 3D human mesh estimation from a single RGBD view, leveraging the affordability and widespread adoption of RGBD cameras for real-world applications. A fully supervised approach to this problem requires a dataset with RGBD image and 3D mesh label pairs. However, collecting such a dataset is costly and challenging; hence, existing datasets are small and limited in pose and shape diversity. To overcome this data scarcity, we leverage existing Motion Capture (MoCap) datasets. We first obtain complete 3D meshes from the body models found in MoCap datasets, and create partial, single-view versions of them by projection to a virtual camera. This simulates the depth data provided by an RGBD camera from a single viewpoint. Then, we train a masked autoencoder to complete the partial, single-view mesh. During inference, our method, which we name M$^3$ for "Masked Mesh Modeling", matches the depth values coming from the sensor to vertices of a template human mesh, which creates a partial, single-view mesh. We effectively recover parts of the 3D human body mesh model that are not visible, resulting in a full-body mesh. M$^3$ achieves 16.8 mm and 22.0 mm per-vertex error (PVE) on the SURREAL and CAPE datasets, respectively, outperforming existing methods that use full-body point clouds as input. We obtain a competitive 70.9 mm PVE on the BEHAVE dataset, outperforming a recently published RGB-based method by 18.4 mm, highlighting the usefulness of depth data. Code will be released.
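
The inference step that turns sensor depth into a partial, single-view mesh can be sketched as a nearest-vertex matching problem. The PyTorch snippet below is illustrative only, assuming known camera intrinsics and a fixed matching threshold (both hypothetical choices, not the paper's exact procedure):

```python
import torch

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map [H, W] to a 3D point cloud [M, 3] in the camera frame."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    x = (u.flatten() - cx) * z / fx
    y = (v.flatten() - cy) * z / fy
    pts = torch.stack([x, y, z], dim=1)
    return pts[z > 0]                              # keep valid depth only

def partial_mesh_from_depth(points, template_verts, max_dist=0.05):
    """Snap each template vertex to its nearest sensor point if close enough.

    Returns the partially observed vertex positions plus a visibility mask;
    the unmatched (invisible) vertices are what the masked autoencoder must
    later complete. Threshold and matching rule are illustrative.
    """
    d = torch.cdist(template_verts, points)        # [V, M] pairwise distances
    mind, idx = d.min(dim=1)
    visible = mind < max_dist
    partial = template_verts.clone()
    partial[visible] = points[idx[visible]]
    return partial, visible
```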

[390] THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening

Hongkun Jin, Hongcheng Jiang, Zejun Zhang, Yuan Zhang, Jia Fu, Tingfeng Li, Kai Luo

Main category: cs.CV

TL;DR: The paper proposes a Transformer-based framework (THAT) for hyperspectral pansharpening, addressing redundancy and multi-scale feature limitations by enhancing high-frequency details and token selection.

DetailsMotivation: Existing Transformer methods in hyperspectral pansharpening suffer from redundant tokens and lack of multi-scale feature modeling, failing to leverage spectral-spatial priors effectively.

Method: THAT introduces Pivotal Token Selective Attention (PTSA) for token prioritization and a Multi-level Variance-aware Feed-forward Network (MVFN) for high-frequency detail enhancement.

Result: THAT achieves state-of-the-art performance on benchmarks, improving reconstruction quality and efficiency.

Conclusion: The proposed framework effectively addresses Transformer limitations in hyperspectral pansharpening, offering superior performance and detail preservation.

Abstract: Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components, such as material edges and texture transitions, and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at https://github.com/kailuo93/THAT.

[391] KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning

Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Steven Sloan, Kendall N. Niles, Ken Pathak

Main category: cs.CV

TL;DR: KARMA is a lightweight semantic segmentation framework for structural defects, achieving high accuracy with fewer parameters and real-time performance.

DetailsMotivation: Addressing challenges in semantic segmentation of structural defects, such as variable appearances and class imbalance, while reducing computational costs for real-time inspection.

Method: Uses a Tiny Kolmogorov-Arnold Network (TiKAN), an optimized feature pyramid, and a static-dynamic prototype mechanism for efficient multi-scale defect analysis.

Result: Achieves competitive mean IoU with 97% fewer parameters (0.959M vs. 31.04M) and operates at 0.264 GFLOPS for real-time deployment.

Conclusion: KARMA enables practical, automated infrastructure inspection by balancing accuracy and efficiency.

Abstract: Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.

[392] Reinforcement Learning in Vision: A Survey

Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou

Main category: cs.CV

TL;DR: A survey synthesizing recent advances in visual reinforcement learning (RL), covering problem formalization, policy-optimization evolution, thematic pillars (e.g., multi-modal models), and evaluation challenges.

DetailsMotivation: To map the rapidly expanding field of visual RL, providing a coherent overview and highlighting future directions.

Method: Organizes 200+ works into four pillars, analyzing algorithmic design, reward engineering, and benchmarks.

Result: Identifies trends like curriculum-driven training and challenges like sample efficiency and safe deployment.

Conclusion: Offers a resource for researchers, summarizing progress and open challenges in visual RL.

Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.

[393] LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

Donald Shenaj, Ondrej Bohdal, Mete Ozay, Pietro Zanuttigh, Umberto Michieli

Main category: cs.CV

TL;DR: LoRA.rar improves image quality and speeds up merging by 4000x, enabling real-time personalization on resource-constrained devices.

DetailsMotivation: Prior methods for personalized image creation are slow and computationally intensive, limiting real-time use on devices like smartphones.

Method: Introduces LoRA.rar, a hypernetwork trained on diverse content-style LoRA pairs to learn efficient merging, and proposes MLLM-based evaluation.

Result: Achieves 4000x speedup in merging, outperforms state-of-the-art in content and style fidelity, validated by MLLMs and humans.

Conclusion: LoRA.rar enables fast, high-quality personalized image generation with improved evaluation metrics.

Abstract: Recent advancements in image generation models have enabled personalized image creation with both user-defined subjects (content) and styles. Prior works achieved personalization by merging corresponding low-rank adapters (LoRAs) through optimization-based methods, which are computationally demanding and unsuitable for real-time use on resource-constrained devices like smartphones. To address this, we introduce LoRA.rar, a method that not only improves image quality but also achieves a remarkable speedup of over $4000\times$ in the merging process. We collect a dataset of style and subject LoRAs and pre-train a hypernetwork on a diverse set of content-style LoRA pairs, learning an efficient merging strategy that generalizes to new, unseen content-style pairs, enabling fast, high-quality personalization. Moreover, we identify limitations in existing evaluation metrics for content-style quality and propose a new protocol using multimodal large language models (MLLMs) for more accurate assessment. Our method significantly outperforms the current state of the art in both content and style fidelity, as validated by MLLM assessments and human evaluations.
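
As a rough picture of what "learning to merge LoRAs via a hypernetwork" can look like, here is a minimal PyTorch sketch: a small MLP predicts per-layer mixing coefficients for a content-style LoRA pair. The architecture, input features, and coefficient parameterization are assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn

class MergeHyperNet(nn.Module):
    """Tiny hypernetwork predicting per-layer merge coefficients for a
    (content LoRA, style LoRA) pair. Purely illustrative; LoRA.rar's real
    architecture and inputs are not specified in the abstract."""
    def __init__(self, feat_dim, n_layers):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * n_layers),   # one (alpha, beta) pair per layer
        )

    def forward(self, content_feat, style_feat):
        coeffs = self.mlp(torch.cat([content_feat, style_feat], dim=-1))
        alpha, beta = coeffs.chunk(2, dim=-1)
        return alpha, beta

def merge_lora_layer(alpha_l, beta_l, content_lora, style_lora):
    """Merged weight delta for one layer: alpha * (B_c A_c) + beta * (B_s A_s)."""
    (A_c, B_c), (A_s, B_s) = content_lora, style_lora
    return alpha_l * (B_c @ A_c) + beta_l * (B_s @ A_s)
```

Once such a hypernetwork is pre-trained across many content-style pairs, a single forward pass replaces per-pair optimization at deployment time, which is where the reported speedup comes from.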

[394] Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model

Peiqi He, Zhenhao Zhang, Yixiang Zhang, Xiongjun Zhao, Shaoliang Peng

Main category: cs.CV

TL;DR: Spatial-ORMLLM is a novel vision-language model for 3D spatial reasoning in operating rooms using only RGB data, outperforming existing methods by integrating 2D and 3D features without extra sensors.

DetailsMotivation: Existing methods for spatial modeling in ORs rely on multimodal datasets, which are hard to obtain, and 2D data, which lacks detail. Spatial-ORMLLM addresses these gaps.

Method: Spatial-ORMLLM uses RGB data and a Spatial-Enhanced Feature Fusion Block to combine 2D inputs with 3D spatial knowledge, enabling 3D reasoning without additional sensors.

Result: The model achieves state-of-the-art performance on clinical datasets and generalizes well to unseen surgical scenarios.

Conclusion: Spatial-ORMLLM provides a robust, sensor-free solution for 3D spatial reasoning in ORs, enhancing clinical tasks with detailed spatial context.

Abstract: Precise spatial modeling in the operating room (OR) is foundational to many clinical tasks, supporting intraoperative awareness, hazard avoidance, and surgical decision-making. Existing approaches leverage large-scale multimodal datasets for latent-space alignment to implicitly learn spatial relationships, but they overlook the 3D capabilities of MLLMs. This approach also raises two issues: (1) Operating rooms typically lack multiple video and audio sensors, making multimodal 3D data difficult to obtain; (2) Training solely on readily available 2D data fails to capture fine-grained details in complex scenes. To address this gap, we introduce Spatial-ORMLLM, the first large vision-language model for 3D spatial reasoning in operating rooms using only RGB modality to infer volumetric and semantic cues, enabling downstream medical tasks with detailed and holistic spatial context. Spatial-ORMLLM incorporates a Spatial-Enhanced Feature Fusion Block, which integrates 2D modality inputs with rich 3D spatial knowledge extracted by the estimation algorithm and then feeds the combined features into the visual tower. By employing a unified end-to-end MLLM framework, it combines powerful spatial features with textual features to deliver robust 3D scene reasoning without any additional expert annotations or sensor inputs. Experiments on multiple benchmark clinical datasets demonstrate that Spatial-ORMLLM achieves state-of-the-art performance and generalizes robustly to previously unseen surgical scenarios and downstream tasks.

[395] B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

Zhuqiang Lu, Zhenfei Yin, Mengwei He, Zhihui Wang, Zicheng Liu, Zhiyong Wang, Kun Hu

Main category: cs.CV

TL;DR: B-VLLM is a novel Vision Large Language Model framework that balances spatio-temporal cues in videos by adaptively selecting and merging frames to control visual token count, improving video understanding performance.

DetailsMotivation: Existing VLLMs struggle with long videos due to excessive visual tokens, either oversimplifying temporal cues or spatial details. B-VLLM addresses this by selectively leveraging relevant frames and tokens.

Method: B-VLLM uses a text-conditioned adaptive frame selection module, temporal frame token merging, and spatial token sampling/merging to manage token count while preserving key visual details.

Result: B-VLLM outperforms existing methods on video understanding benchmarks by effectively balancing frame and token numbers.

Conclusion: B-VLLM offers a robust solution for video understanding in VLLMs, optimizing spatio-temporal cues without exceeding token limits.

Abstract: Recently, Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. The key to VLLMs is to encode visual content into sequences of visual tokens, enabling VLLMs to simultaneously process both visual and textual content. However, understanding videos, especially long videos, remains a challenge for VLLMs, as the number of visual tokens grows rapidly when encoding videos, risking exceeding the context window of VLLMs and introducing a heavy computational burden. To restrict the number of visual tokens, existing VLLMs either (1) uniformly downsample videos into a fixed number of frames or (2) reduce the number of visual tokens encoded from each frame. We argue the former solution neglects the rich temporal cues in videos and the latter overlooks the spatial details in each frame. In this work, we present Balanced-VLLM (B-VLLM): a novel VLLM framework that aims to effectively leverage task-relevant spatio-temporal cues while restricting the number of visual tokens under the VLLM context window length. At the core of our method, we devise a text-conditioned adaptive frame selection module to identify frames relevant to the visual understanding task. The selected frames are then de-duplicated using a temporal frame token merging technique. The visual tokens of the selected frames are processed through a spatial token sampling module and an optional spatial token merging strategy to achieve precise control over the token count. Experimental results show that B-VLLM is effective in balancing the number of frames and visual tokens in video understanding, yielding superior performance on various video understanding benchmarks. Our code is available at https://github.com/zhuqiangLu/B-VLLM.
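
The text-conditioned frame selection and temporal de-duplication steps can be approximated with fixed similarity heuristics, as in the PyTorch sketch below. B-VLLM's actual modules are learned; the top-k rule and the de-duplication threshold here are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def select_and_merge_frames(frame_emb, text_emb, k=16, dedup_thresh=0.95):
    """Pick the k frames most relevant to the text query, then drop
    temporally adjacent near-duplicates among the selected frames.

    frame_emb: [T, D] per-frame embeddings; text_emb: [D] query embedding.
    """
    sim = F.cosine_similarity(frame_emb, text_emb.unsqueeze(0), dim=-1)   # [T]
    topk = sim.topk(min(k, sim.numel())).indices.sort().values            # temporal order
    kept = [topk[0].item()]
    for i in topk[1:].tolist():
        # skip a frame that is nearly identical to the last kept one
        if F.cosine_similarity(frame_emb[i], frame_emb[kept[-1]], dim=0) < dedup_thresh:
            kept.append(i)
    return kept
```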

[396] SAGOnline: Segment Any Gaussians Online

Wentao Sun, Quanyun Wu, Hanqing Xu, Kyle Gao, Zhengsen Xu, Yiping Chen, Dedong Zhang, Lingfei Ma, John S. Zelek, Jonathan Li

Main category: cs.CV

TL;DR: SAGOnline is a lightweight, zero-shot framework for real-time 3D segmentation in Gaussian scenes, addressing computational inefficiency and multi-object tracking challenges with innovative 2D mask propagation and GPU-accelerated 3D labeling.

DetailsMotivation: Current 3D segmentation methods in Gaussian scenes are computationally expensive, lack spatial reasoning, and fail to track multiple objects simultaneously.

Method: SAGOnline integrates video foundation models for 2D mask propagation and uses a GPU-accelerated algorithm for 3D mask generation and Gaussian-level instance labeling.

Result: Achieves state-of-the-art performance (92.7% mIoU on NVOS, 95.2% on Spin-NeRF) and 15–1500x faster inference (27 ms/frame).

Conclusion: SAGOnline enables real-time 3D segmentation and tracking, advancing AR/VR and robotic applications.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and zero-shot framework for real-time 3D segmentation in Gaussian scenes that addresses these limitations through two key innovations: (1) a decoupled strategy that integrates video foundation models (e.g., SAM2) for view-consistent 2D mask propagation across synthesized views; and (2) a GPU-accelerated 3D mask generation and Gaussian-level instance labeling algorithm that assigns unique identifiers to 3D primitives, enabling lossless multi-object tracking and segmentation across views. SAGOnline achieves state-of-the-art performance on NVOS (92.7% mIoU) and Spin-NeRF (95.2% mIoU) benchmarks, outperforming Feature3DGS, OmniSeg3D-gs, and SA3D in inference speed by 15–1500x (27 ms/frame). Qualitative results demonstrate robust multi-object segmentation and tracking in complex scenes. Our contributions include: (i) a lightweight and zero-shot framework for 3D segmentation in Gaussian scenes, (ii) explicit labeling of Gaussian primitives enabling simultaneous segmentation and tracking, and (iii) the effective adaptation of 2D video foundation models to the 3D domain. This work allows real-time rendering and 3D scene understanding, paving the way for practical AR/VR and robotic applications.

[397] Learning User Preferences for Image Generation Model

Wenyi Mo, Ying Ba, Tianyu Zhang, Yalong Bai, Biye Li

Main category: cs.CV

TL;DR: The paper proposes a Multimodal Large Language Model-based approach with contrastive preference loss and preference tokens to predict user preferences more accurately by capturing individual variability and dynamic tastes.

DetailsMotivation: Existing methods for user preference prediction often overlook individual variability and the dynamic nature of personal taste, relying instead on general preferences or static profiles.

Method: The approach uses contrastive preference loss to distinguish likes/dislikes and learnable preference tokens to capture shared interests among users, enabling group-specific preference activation.

Result: The model outperforms others in accuracy, identifies users with similar tastes, and improves image generation alignment with individual preferences.

Conclusion: The proposed method effectively addresses limitations of existing approaches, offering more precise and personalized user preference prediction.

Abstract: User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models, introducing contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users. Extensive experiments demonstrate our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is https://learn-user-pref.github.io/.
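
A plausible form of the contrastive preference loss, treating each liked image as a positive and all disliked images as negatives in an InfoNCE-style objective, is sketched below in PyTorch. The paper's exact loss is not given in the abstract, so this is an assumption-laden illustration:

```python
import torch
import torch.nn.functional as F

def contrastive_preference_loss(user_emb, liked_emb, disliked_emb, tau=0.07):
    """Pull the user embedding toward liked images and away from disliked ones.

    user_emb: [D]; liked_emb: [P, D]; disliked_emb: [N, D].
    Each liked image is scored against the full set of likes and dislikes.
    """
    pos = F.cosine_similarity(user_emb.unsqueeze(0), liked_emb, dim=-1) / tau     # [P]
    neg = F.cosine_similarity(user_emb.unsqueeze(0), disliked_emb, dim=-1) / tau  # [N]
    logits = torch.cat([pos, neg])
    log_denom = torch.logsumexp(logits, dim=0)   # shared normalizer over all items
    return (log_denom - pos).mean()              # mean NLL over the positives
```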

[398] MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

Seojeong Park, Jiho Choi, Kyungjune Baek, Hyunjung Shim

Main category: cs.CV

TL;DR: MomentMix enhances video moment retrieval by addressing feature diversity in short moments using ForegroundMix and BackgroundMix, and improves localization with a Length-Aware Decoder, outperforming DETR-based methods.

DetailsMotivation: The demand for accurate video moment retrieval is growing, but existing DETR-based models struggle with short moments due to limited feature diversity and prediction bias.

Method: Proposes MomentMix with ForegroundMix and BackgroundMix for feature augmentation and a Length-Aware Decoder to address center position prediction bias in short moments.

Result: Achieves state-of-the-art performance on QVHighlights, TACoS, and Charades-STA, with notable gains in R1 and mAP metrics.

Conclusion: The method effectively improves short-moment localization and overall performance, setting new benchmarks for video moment retrieval.

Abstract: Video Moment Retrieval (MR) aims to localize moments within a video based on a given natural language query. Given the prevalent use of platforms like YouTube for information retrieval, the demand for MR techniques is significantly growing. Recent DETR-based models have made notable advances in performance but still struggle with accurately localizing short moments. Through data analysis, we identified limited feature diversity in short moments, which motivated the development of MomentMix. MomentMix employs two augmentation strategies: ForegroundMix and BackgroundMix, each enhancing the feature representations of the foreground and background, respectively. Additionally, our analysis of prediction bias revealed that short moments particularly struggle with accurately predicting their center positions of moments. To address this, we propose a Length-Aware Decoder, which conditions length through a novel bipartite matching process. Our extensive studies demonstrate the efficacy of our length-aware approach, especially in localizing short moments, leading to improved overall performance. Our method surpasses state-of-the-art DETR-based methods on benchmark datasets, achieving the highest R1 and mAP on QVHighlights and the highest R1@0.7 on TACoS and Charades-STA (such as a 2.46% gain in R1@0.7 and a 2.57% gain in mAP average for QVHighlights). The code is available at https://github.com/sjpark5800/LA-DETR.

[399] StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: StableAvatar is an end-to-end video diffusion transformer for infinite-length high-quality avatar video generation, addressing audio synchronization and identity consistency issues with novel modules like Time-step-aware Audio Adapter and Audio Native Guidance Mechanism.

DetailsMotivation: Existing diffusion models struggle with long videos due to audio modeling limitations, causing latent distribution errors and poor synchronization.

Method: StableAvatar integrates tailored training and inference modules, including Time-step-aware Audio Adapter and Audio Native Guidance Mechanism, along with a Dynamic Weighted Sliding-window Strategy for smooth video fusion.

Result: Experiments show StableAvatar outperforms benchmarks in generating high-quality, synchronized, and identity-consistent infinite-length videos.

Conclusion: StableAvatar effectively addresses long-video generation challenges with innovative audio modeling and inference techniques.

Abstract: Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, so the latent distribution of subsequent segments gradually drifts away from the optimal distribution. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
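
A generic version of sliding-window latent fusion, using a fixed linear cross-fade over the overlap region, is sketched below in PyTorch. StableAvatar's strategy is dynamically weighted, so the fixed ramp here is a simplification:

```python
import torch

def fuse_sliding_windows(windows, overlap):
    """Blend consecutive latent windows (each [T, C]) whose last `overlap`
    frames coincide with the next window's first `overlap` frames, using a
    linear cross-fade. A generic sketch; the paper's weights are adaptive."""
    out = windows[0]
    ramp = torch.linspace(0, 1, overlap).unsqueeze(-1)     # [overlap, 1]
    for w in windows[1:]:
        blended = (1 - ramp) * out[-overlap:] + ramp * w[:overlap]
        out = torch.cat([out[:-overlap], blended, w[overlap:]], dim=0)
    return out
```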

[400] ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, Henghui Ding

Main category: cs.CV

TL;DR: R3DGS introduces a task for segmenting 3D objects using natural language descriptions, addressing challenges like occlusion and spatial relationships. The paper proposes ReferSplat, a framework achieving top performance, and releases the Ref-LERF dataset.

DetailsMotivation: Advancing embodied AI by enabling 3D multi-modal understanding and segmentation of objects based on natural language descriptions, including spatial relationships and attributes.

Method: Proposes ReferSplat, a framework that models 3D Gaussian points with natural language expressions in a spatially aware paradigm.

Result: ReferSplat achieves state-of-the-art performance on R3DGS and 3D open-vocabulary segmentation benchmarks.

Conclusion: The work highlights the importance of 3D multi-modal understanding and spatial relationship modeling, with ReferSplat proving effective for the R3DGS task.

Abstract: We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.

[401] Generative AI for Cel-Animation: A Survey

Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, Yizhi Song, Liu He, Jing Bi, Mingqian Feng, Xinyang Li, Zeliang Zhang, Chenliang Xu

Main category: cs.CV

TL;DR: GenAI is transforming traditional Cel-Animation by automating tasks like inbetweening and colorization, reducing manual effort, and enhancing accessibility. Challenges remain in consistency and ethics.

DetailsMotivation: To address inefficiencies and scalability issues in traditional Cel-Animation by leveraging GenAI for automation and creative support.

Method: Survey of GenAI tools (e.g., AniDoc, ToonCrafter) and their integration into animation workflows.

Result: GenAI lowers technical barriers, broadens accessibility, and allows artists to focus on creativity, though challenges like consistency persist.

Conclusion: GenAI revolutionizes animation workflows but requires addressing visual consistency and ethical concerns for broader adoption.

Abstract: Traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation

[402] Learning an Implicit Physics Model for Image-based Fluid Simulation

Emily Yue-Ting Jia, Jiageng Mao, Zhiyuan Gao, Yajie Zhao, Yue Wang

Main category: cs.CV

TL;DR: A novel method for generating 4D scenes (motion + 3D geometry) from a single image, using physics-informed neural networks to ensure realistic animations.

DetailsMotivation: Replicating human ability to imagine 4D scenes from single images, addressing limitations of existing methods that produce unrealistic animations.

Method: Physics-informed neural network predicts motion guided by physical principles (e.g., Navier-Stokes equations), combined with feature-based 3D Gaussians for appearance.

Result: Produces physically plausible animations, outperforming existing methods.

Conclusion: The approach effectively bridges the gap between single-image input and realistic 4D scene generation.

Abstract: Humans possess an exceptional ability to imagine 4D scenes, encompassing both motion and 3D geometry, from a single still image. This ability is rooted in our accumulated observations of similar scenes and an intuitive understanding of physics. In this paper, we aim to replicate this capacity in neural networks, specifically focusing on natural fluid imagery. Existing methods for this task typically employ simplistic 2D motion estimators to animate the image, leading to motion predictions that often defy physical principles and result in unrealistic animations. Our approach introduces a novel method for generating 4D scenes with physics-consistent animation from a single image. We propose the use of a physics-informed neural network that predicts motion for each surface point, guided by a loss term derived from fundamental physical principles, including the Navier-Stokes equations. To capture appearance, we predict feature-based 3D Gaussians from the input image and its estimated depth, which are then animated using the predicted motions and rendered from any desired camera perspective. Experimental results highlight the effectiveness of our method in producing physically plausible animations, showcasing significant performance improvements over existing methods. Our project page is https://physfluid.github.io/.
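
One concrete physics-informed loss term of the kind described, a penalty on the divergence of the predicted velocity field (the incompressibility condition that accompanies the Navier-Stokes equations), can be written with autograd as below. The network interface and point sampling are assumptions:

```python
import torch

def incompressibility_loss(motion_net, xyz):
    """Penalize the divergence of a predicted 3D velocity field.

    motion_net: maps points [N, 3] to velocities [N, 3].
    Computes (du/dx + dv/dy + dw/dz)^2 averaged over sample points;
    a sketch of one physics term, not the paper's full loss.
    """
    xyz = xyz.clone().requires_grad_(True)
    vel = motion_net(xyz)                                    # [N, 3]
    div = 0.0
    for i in range(3):                                       # accumulate d(vel_i)/d(x_i)
        g = torch.autograd.grad(vel[:, i].sum(), xyz, create_graph=True)[0]
        div = div + g[:, i]
    return (div ** 2).mean()
```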

[403] MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization

JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan

Main category: cs.CV

TL;DR: MQuant is a post-training quantization framework for multimodal large language models (MLLMs) that addresses challenges like high latency and distribution disparities, achieving near-floating-point accuracy with reduced inference latency.

DetailsMotivation: The large size and computational demands of MLLMs hinder their practical deployment, and existing quantization methods struggle with multimodal challenges.

Method: MQuant introduces Modality-Specific Static Quantization (MSQ), Attention-Invariant Flexible Switching (AIFS), and Rotation Magnitude Suppression (RMS) to address MLLM-specific issues.

Result: MQuant achieves <1% accuracy degradation and reduces inference latency by up to 30% on five MLLMs.

Conclusion: MQuant bridges the gap for efficient and accurate MLLM inference on resource-constrained devices.

Abstract: Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; and Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. MQuant effectively bridges the gap toward efficient and accurate MLLM inference on resource-constrained devices. Code has been released at https://github.com/StiphyJay/MQuant.
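
Modality-Specific Static Quantization is the most self-contained of the three components: calibrate one static scale per modality offline, then quantize visual and textual tokens with their own scales at inference. The symmetric uniform scheme below is a hedged PyTorch sketch, not MQuant's exact formulation:

```python
import torch

def calibrate_scale(x, n_bits=8):
    """Static per-modality scale from calibration activations (symmetric)."""
    qmax = 2 ** (n_bits - 1) - 1
    return x.abs().max() / qmax

def fake_quantize(x, scale, n_bits=8):
    """Simulated uniform quantize-dequantize with a fixed scale."""
    qmax = 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

# Illustrative MSQ-style forward: visual and textual tokens each use their
# own pre-computed scale instead of one shared, dynamically computed scale
# (the names and the symmetric scheme here are assumptions).
def msq_forward(visual_tokens, text_tokens, s_vis, s_txt):
    return fake_quantize(visual_tokens, s_vis), fake_quantize(text_tokens, s_txt)
```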

[404] Advancing AI-Powered Medical Image Synthesis: Insights from MedVQA-GI Challenge Using CLIP, Fine-Tuned Stable Diffusion, and Dream-Booth + LoRA

Ojonugwa Oluwafemi Ejiga Peter, Md Mahmudur Rahman, Fahmi Khalifa

Main category: cs.CV

TL;DR: The MEDVQA-GI challenge introduces AI-driven text-to-image generative models for medical diagnostics, outperforming existing methods with Stable Diffusion, achieving high-quality and diverse synthetic images.

DetailsMotivation: To enhance medical diagnostics by addressing the lack of dynamic medical image generation from textual descriptions, overcoming limitations of static methods and constrained datasets.

Method: Uses fine-tuned Stable Diffusion, DreamBooth, and LORA for dynamic, scalable medical image synthesis from text, focusing on image synthesis (IS) and optimal prompt production (OPG).

Result: Stable Diffusion outperforms CLIP and DreamBooth + LORA, with the lowest FID scores (0.099, 0.064, 0.067) and highest Inception Score (2.327), indicating superior quality and diversity.

Conclusion: The study advances AI-powered medical diagnosis, with future work on model refining, dataset augmentation, and ethical implementation in clinical practice.

Abstract: The MEDVQA-GI challenge addresses the integration of AI-driven text-to-image generative models in medical diagnostics, aiming to enhance diagnostic capabilities through synthetic image generation. Existing methods primarily focus on static image analysis and lack the dynamic generation of medical imagery from textual descriptions. This study intends to partially close this gap by introducing a novel approach based on fine-tuned generative models to generate dynamic, scalable, and precise images from textual descriptions. Particularly, our system integrates fine-tuned Stable Diffusion and DreamBooth models, as well as Low-Rank Adaptation (LORA), to generate high-fidelity medical images. The challenge comprises two sub-tasks: image synthesis (IS) and optimal prompt production (OPG). The former creates medical images via verbal prompts, whereas the latter provides prompts that produce high-quality images in specified categories. The study emphasizes the limitations of traditional medical image generation methods, such as hand sketching, constrained datasets, static procedures, and generic models. Our evaluation measures showed that Stable Diffusion surpasses CLIP and DreamBooth + LORA in terms of producing high-quality, diversified images. Specifically, Stable Diffusion had the lowest Fréchet Inception Distance (FID) scores (0.099 for single center, 0.064 for multi-center, and 0.067 for combined), indicating higher image quality. Furthermore, it had the highest average Inception Score (2.327 across all datasets), indicating exceptional diversity and quality. This advances the field of AI-powered medical diagnosis. Future research will concentrate on model refinement, dataset augmentation, and ethical considerations for efficiently implementing these advances into clinical practice.

[405] From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang

Main category: cs.CV

TL;DR: TaylorSeer accelerates Diffusion Transformers (DiT) by predicting future timestep features using Taylor series expansion, reducing errors from feature caching and improving generation quality.

DetailsMotivation: Current feature caching methods for DiT introduce errors at significant timestep intervals, degrading generation quality.

Method: TaylorSeer predicts future features using Taylor series expansion by approximating higher-order derivatives of features.

Result: Achieves near-lossless acceleration (e.g., 4.99× on FLUX, 5.00× on HunyuanVideo) and lower FID (3.41) on DiT at 4.53× acceleration.

Conclusion: TaylorSeer effectively addresses feature caching errors in DiT, enabling high-quality, real-time image and video synthesis.

Abstract: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features from previous timesteps and reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching and significantly harming generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features at future timesteps with a Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99$\times$ on FLUX and 5.00$\times$ on HunyuanVideo without additional training. On DiT, it achieves a 3.41 lower FID compared with the previous SOTA at 4.53$\times$ acceleration. Our code has been released on GitHub: https://github.com/Shenyi-Z/TaylorSeer
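
The core forecasting step is easy to make concrete: estimate feature derivatives from cached timesteps with finite differences and extrapolate with a Taylor expansion. A second-order PyTorch sketch follows; the cache layout and step size are assumptions, and TaylorSeer's exact scheme may differ:

```python
import torch

def taylor_forecast(feat_history, step=1.0, order=2):
    """Predict the feature at a future timestep from cached features.

    feat_history: list of tensors [f(t-2), f(t-1), f(t)], oldest first,
    assumed unit-spaced. Derivatives come from finite differences; the
    Taylor expansion extrapolates `step` units past the latest timestep.
    """
    f = feat_history
    pred = f[-1].clone()
    if order >= 1 and len(f) >= 2:
        d1 = f[-1] - f[-2]                      # first-order difference
        pred = pred + d1 * step
    if order >= 2 and len(f) >= 3:
        d2 = f[-1] - 2 * f[-2] + f[-3]          # second-order difference
        pred = pred + 0.5 * d2 * step ** 2
    return pred
```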

[406] FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction

Dennis Rotondi, Fabio Scaparro, Hermann Blum, Kai O. Arras

Main category: cs.CV

TL;DR: The paper introduces a finer-resolution 3D scene graph representation for robots, focusing on affordance-relevant parts to enable direct interaction with the environment.

DetailsMotivation: Current 3D scene graph approaches are coarse and object-level, limiting robots' ability to interact functionally with their environment.

Method: The authors generate 2D data from available 3D resources to train a detector, augmenting the standard 3D scene graph pipeline.

Result: Their approach achieves functional element segmentation comparable to state-of-the-art 3D models and improves affordance grounding accuracy.

Conclusion: The proposed method enhances robotic interaction capabilities by refining 3D scene graphs with affordance-focused details.

Abstract: The concept of 3D scene graphs is increasingly recognized as a powerful semantic and hierarchical representation of the environment. Current approaches often address this at a coarse, object-level resolution. In contrast, our goal is to develop a representation that enables robots to directly interact with their environment by identifying both the location of functional interactive elements and how these can be used. To achieve this, we focus on detecting and storing objects at a finer resolution, focusing on affordance-relevant parts. The primary challenge lies in the scarcity of data that extends beyond instance-level detection and the inherent difficulty of capturing detailed object features using robotic sensors. We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline. Through our experiments, we demonstrate that our approach achieves functional element segmentation comparable to state-of-the-art 3D models and that our augmentation enables task-driven affordance grounding with higher accuracy than the current solutions. See our project page at https://fungraph.github.io.

[407] On Representation Learning with Feedback

Hao Li

Main category: cs.CV

TL;DR: This note explains the theoretical mechanism of representation learning with feedback, complementing a recent paper on single image deraining.

DetailsMotivation: To provide heuristic theoretical explanations for the representation learning with feedback mechanism introduced in the author's recent work.

Method: Theoretical analysis and heuristic explanations are used to clarify the mechanism.

Result: Enhanced understanding of the key points in representation learning with feedback.

Conclusion: This note successfully clarifies the theoretical underpinnings of the feedback mechanism in representation learning, aiding comprehension of the recent work.

Abstract: This note complements the author’s recent paper “Robust representation learning with feedback for single image deraining” by providing heuristic theoretical explanations of the mechanism of representation learning with feedback, an essential merit of the work presented in that article. This note facilitates understanding of key points in the mechanism of representation learning with feedback.

[408] EA-KD: Entropy-based Adaptive Knowledge Distillation

Chi-Ping Su, Ching-Hsun Tseng, Bin Pu, Lei Zhao, Jiewen Yang, Zhuangzhuang Chen, Shin-Jye Lee

Main category: cs.CV

TL;DR: EA-KD is a plug-and-play KD method that prioritizes high-entropy samples by dynamically reweighting distillation loss, improving performance across tasks with minimal cost.

DetailsMotivation: Current KD methods treat all samples uniformly, ignoring varying learning value, limiting effectiveness.

Method: EA-KD quantifies sample learning value using teacher and student output entropy, dynamically reweights distillation loss.

Result: EA-KD enhances performance across diverse tasks, achieving state-of-the-art results with negligible computational cost.

Conclusion: EA-KD is a simple yet effective KD method that outperforms uniform sample treatment approaches.

Abstract: Knowledge distillation (KD) enables a smaller “student” model to mimic a larger “teacher” model by transferring knowledge from the teacher’s output or features. However, most KD methods treat all samples uniformly, overlooking the varying learning value of each sample and thereby limiting their effectiveness. In this paper, we propose Entropy-based Adaptive Knowledge Distillation (EA-KD), a simple yet effective plug-and-play KD method that prioritizes learning from valuable samples. EA-KD quantifies each sample’s learning value by strategically combining the entropy of the teacher and student output, then dynamically reweights the distillation loss to place greater emphasis on high-entropy samples. Extensive experiments across diverse KD frameworks and tasks – including image classification, object detection, and large language model (LLM) distillation – demonstrate that EA-KD consistently enhances performance, achieving state-of-the-art results with negligible computational cost. Code is available at: https://github.com/cpsu00/EA-KD
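
A minimal PyTorch rendering of the idea, weighting a per-sample KL distillation term by the normalized entropy of the teacher and student outputs, is given below. Averaging the two entropies is an assumption; the paper's "strategic combination" may differ:

```python
import torch
import torch.nn.functional as F

def ea_kd_loss(student_logits, teacher_logits, T=4.0):
    """Entropy-reweighted distillation: high-entropy (harder, more
    informative) samples get larger weight in the KD loss."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    p_s = F.softmax(student_logits / T, dim=-1)
    ent = lambda p: -(p * p.clamp_min(1e-8).log()).sum(-1)      # [B] entropies
    w = 0.5 * (ent(p_t) + ent(p_s))                             # combined (assumed average)
    w = w / w.mean().clamp_min(1e-8)                            # keep mean weight at 1
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1), p_t,
                  reduction="none").sum(-1)                     # per-sample KL
    return (w.detach() * kl).mean() * T * T                     # usual T^2 scaling
```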

[409] Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames

Chao Chen, Mingzhi Zhu, Ankush Pratap Singh, Yu Yan, Felix Juefei Xu, Chen Feng

Main category: cs.CV

TL;DR: SceneSum is a self-supervised pipeline for condensing scene videos into spatially diverse keyframes, mimicking human spatial reasoning.

DetailsMotivation: Humans efficiently understand spatial layouts from few views; the paper aims to replicate this for scene summarization.

Method: A two-stage pipeline: clustering frames for spatial diversity, then selecting keyframes under constraints, optionally refined with supervised loss.

Result: SceneSum outperforms baselines, producing more spatially informative summaries.

Conclusion: The method effectively mimics human spatial abstraction, improving video summarization for global spatial reasoning.

Abstract: Humans are remarkably efficient at forming spatial understanding from just a few visual observations. When browsing real estate or navigating unfamiliar spaces, they intuitively select a small set of views that summarize the spatial layout. Inspired by this ability, we introduce scene summarization, the task of condensing long, continuous scene videos into a compact set of spatially diverse keyframes that facilitate global spatial reasoning. Unlike conventional video summarization, which focuses on user-edited, fragmented clips and often ignores spatial continuity, our goal is to mimic how humans abstract spatial layout from sparse views. We propose SceneSum, a two-stage self-supervised pipeline that first clusters video frames using visual place recognition to promote spatial diversity, then selects representative keyframes from each cluster under resource constraints. When camera trajectories are available, a lightweight supervised loss further refines clustering and selection. Experiments on real and simulated indoor datasets show that SceneSum produces more spatially informative summaries and outperforms existing video summarization baselines.
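
The unsupervised stage reduces to "cluster, then pick representatives". The sketch below runs a small k-means over frame embeddings (e.g., from a visual place recognition model) and returns the frame nearest each centroid; the resource constraints and the optional supervised refinement are omitted:

```python
import torch

def scene_summary(frame_emb, n_clusters=8, iters=20):
    """Cluster frame embeddings [T, D] with k-means and return one
    medoid keyframe index per cluster (requires n_clusters <= T)."""
    T, D = frame_emb.shape
    centers = frame_emb[torch.randperm(T)[:n_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(frame_emb, centers).argmin(dim=1)    # [T]
        for c in range(n_clusters):
            mask = assign == c
            if mask.any():
                centers[c] = frame_emb[mask].mean(dim=0)
    keyframes = torch.cdist(centers, frame_emb).argmin(dim=1)     # nearest frame per center
    return keyframes.unique().tolist()
```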

[410] BonnBeetClouds3D: A Dataset Towards Point Cloud-based Organ-level Phenotyping of Sugar Beet Plants under Field Conditions

Elias Marks, Jonas Bömer, Federico Magistri, Anurag Sah, Jens Behley, Cyrill Stachniss

Main category: cs.CV

TL;DR: The paper introduces a novel UAV-captured dataset for automatic fine-grained organ-level geometric analysis in precision phenotyping, addressing challenges in agricultural sustainability.

DetailsMotivation: To reduce manual labor in plant phenotyping and support sustainable agriculture by enabling autonomous, precise analysis of plant traits.

Method: UAVs capture high-resolution images of 48 plant varieties, creating photogrammetric dense point clouds with detailed labels for plants, leaves, and key points.

Result: A dataset with labeled point clouds and expert-measured phenotypic traits, facilitating evaluation of segmentation, keypoint detection, and downstream tasks.

Conclusion: The dataset advances automatic phenotyping and supports research in surface reconstruction, point cloud completion, and semantic interpretation.

Abstract: Agricultural production is facing severe challenges in the next decades induced by climate change and the need for sustainability, reducing its impact on the environment. Advancements in field management through non-chemical weeding by robots in combination with monitoring of crops by autonomous unmanned aerial vehicles (UAVs) and breeding of novel and more resilient crop varieties are helpful to address these challenges. The analysis of plant traits, called phenotyping, is an essential activity in plant breeding; however, it involves a great amount of manual labor. With this paper, we address the problem of automatic fine-grained organ-level geometric analysis needed for precision phenotyping. As the availability of real-world data in this domain is relatively scarce, we propose a novel dataset that was acquired using UAVs capturing high-resolution images of a real breeding trial containing 48 plant varieties and therefore covering great morphological and appearance diversity. This enables the development of approaches for autonomous phenotyping that generalize well to different varieties. Based on overlapping high-resolution images from multiple viewing angles, we compute photogrammetric dense point clouds and provide detailed and accurate point-wise labels for plants, leaves, and salient points such as the tip and the base. Additionally, we include measurements of phenotypic traits performed by experts from the German Federal Plant Variety Office on the real plants, allowing the evaluation of new approaches not only on segmentation and keypoint detection but also directly on the downstream tasks. The provided labeled point clouds enable fine-grained plant analysis and support further progress in the development of automatic phenotyping approaches, but also enable further research in surface reconstruction, point cloud completion, and semantic interpretation of point clouds.

[411] Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

Nairouz Mrabah, Nicolas Richet, Ismail Ben Ayed, Éric Granger

Main category: cs.CV

TL;DR: A novel Sparse Optimization (SO) framework is proposed to adapt Vision-Language Models (VLMs) to new domains with few labeled samples, addressing overfitting and computational constraints by dynamically adjusting few parameters with high sparsity.

DetailsMotivation: The challenge of adapting VLMs to new domains with limited labeled data, avoiding overfitting and computational inefficiencies, motivates the development of a more effective solution.

Method: The SO framework introduces two paradigms: (1) local sparsity and global density, updating minimal parameters per iteration while maintaining expressiveness, and (2) local randomness and global importance, sparsifying gradients randomly while pruning based on importance.

Result: Extensive experiments on 11 datasets demonstrate SO’s state-of-the-art few-shot adaptation performance and reduced memory overhead.

Conclusion: The SO framework effectively mitigates overfitting and ensures stable adaptation in low-data regimes, outperforming existing methods.

Abstract: Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for "local sparsity and global density", which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for "local randomness and global importance", which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
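
A single optimizer step combining the two paradigms might look like the PyTorch sketch below: keep a random fraction of gradient entries, then prune the first moment to its largest-magnitude entries. The keep fractions and their interaction with momentum are illustrative assumptions:

```python
import torch

def sparse_update(param, momentum, lr=1e-3, grad_keep=0.01, mom_keep=0.001, beta=0.9):
    """One sketched SO-style step on a single parameter tensor.

    Local randomness: keep a random `grad_keep` fraction of gradient entries.
    Global importance: zero out first-moment entries below the top-k
    magnitude threshold, so only `mom_keep` of them drive the update.
    """
    g = param.grad
    mask = torch.rand_like(g) < grad_keep                   # random gradient sparsification
    g = g * mask
    momentum.mul_(beta).add_(g)
    k = max(1, int(mom_keep * momentum.numel()))
    thresh = momentum.abs().flatten().topk(k).values[-1]    # importance threshold
    momentum[momentum.abs() < thresh] = 0.0                 # prune the first moment
    param.data.add_(momentum, alpha=-lr)
    return momentum
```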

[412] Towards Customized Knowledge Distillation for Chip-Level Dense Image Predictions

Dong Zhang, Pingcheng Dong, Long Chen, Kwang-Ting Cheng

Main category: cs.CV

TL;DR: The paper proposes a boundary and context knowledge distillation (BCKD) method to address challenges in dense image prediction models, improving boundary completeness and target region connectivity.

DetailsMotivation: Efficient dense image prediction models face issues with boundary region completeness and target region connectivity despite their real-time capabilities.

Method: BCKD uses boundary distillation for explicit object-level boundaries and context distillation for implicit pixel-level contexts, transferring knowledge from teacher to student models.

Result: Experiments on five datasets show BCKD improves mask quality and connectivity, achieving well-defined boundaries and smooth regions.

Conclusion: BCKD is effective for EDIP tasks, offering simplicity and efficiency in enhancing model performance.

Abstract: It has been revealed that efficient dense image prediction (EDIP) models designed for AI chips, trained using the knowledge distillation (KD) framework, encounter two key challenges, including *maintaining boundary region completeness* and *ensuring target region connectivity*, despite their favorable real-time capacity to recognize the main object regions. In this work, we propose a customized boundary and context knowledge distillation (BCKD) method for EDIPs, which facilitates the targeted KD from large accurate teacher models to compact small student models. Specifically, the *boundary distillation* focuses on extracting explicit object-level boundaries from the hierarchical feature maps to enhance the student model’s mask quality in boundary regions. Meanwhile, the *context distillation* leverages self-relations as a bridge to transfer implicit pixel-level contexts from the teacher model to the student model, ensuring strong connectivity in target regions. Our proposed method is specifically designed for the EDIP tasks and is characterized by its simplicity and efficiency. Theoretical analysis and extensive experimental results across semantic segmentation, object detection, and instance segmentation on five representative datasets demonstrate the effectiveness of BCKD, resulting in well-defined object boundaries and smooth connecting regions.
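
The two distillation terms can be sketched with simple proxies: a boundary map taken as the spatial gradient of class probabilities, and pixel-level self-relations as normalized feature affinities matched between teacher and student. This is a hedged reading of the abstract, not the released BCKD code; the loss names and the gradient-based boundary proxy are our assumptions.

```python
import torch
import torch.nn.functional as F

def boundary_map(logits):
    """Crude boundary proxy: spatial gradient magnitude of class probabilities.
    (The paper extracts boundaries from hierarchical features; this is a sketch.)"""
    p = logits.softmax(dim=1)
    dy = (p[:, :, 1:, :] - p[:, :, :-1, :]).abs().mean(1, keepdim=True)
    dx = (p[:, :, :, 1:] - p[:, :, :, :-1]).abs().mean(1, keepdim=True)
    return F.pad(dy, (0, 0, 0, 1)) + F.pad(dx, (0, 1, 0, 0))

def self_relation(feat):
    """Pixel-level self-similarity used as the 'bridge' for context distillation.
    Produces a (B, HW, HW) affinity; fine for small feature maps."""
    f = F.normalize(feat.flatten(2), dim=1)      # (B, C, HW)
    return f.transpose(1, 2) @ f

def bckd_loss(student_logits, teacher_logits, student_feat, teacher_feat):
    l_boundary = F.l1_loss(boundary_map(student_logits),
                           boundary_map(teacher_logits).detach())
    l_context = F.mse_loss(self_relation(student_feat),
                           self_relation(teacher_feat).detach())
    return l_boundary + l_context
```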

[413] Compact and De-biased Negative Instance Embedding for Multi-Instance Learning on Whole-Slide Image Classification

Joohyung Lee, Heejeong Nam, Kwanhyung Lee, Sangchul Hahn

Main category: cs.CV

TL;DR: The paper introduces a semi-supervision signal to improve WSI classification by leveraging free annotations from normal slides, enhancing existing MIL methods.

DetailsMotivation: Address the challenges of WSI classification, such as lack of patch-level annotations and inter-slide variability, by utilizing free annotations from normal slides.

Method: Proposes a semi-supervision signal to de-bias inter-slide variability and capture common factors in normal patches, evaluated on top of existing MIL algorithms.

Result: Significantly improves predictive performance on Camelyon-16 and TCGA lung cancer datasets, outperforming other semi-supervised approaches.

Conclusion: The method effectively enhances MIL algorithms for WSI classification and is released as open-source.

Abstract: Whole-slide image (WSI) classification is a challenging task because 1) patches from WSI lack annotation, and 2) WSI possesses unnecessary variability, e.g., stain protocol. Recently, Multiple-Instance Learning (MIL) has made significant progress, allowing for classification based on slide-level, rather than patch-level, annotations. However, existing MIL methods ignore that all patches from normal slides are normal. Using this free annotation, we introduce a semi-supervision signal to de-bias the inter-slide variability and to capture the common factors of variation within normal patches. Because our method is orthogonal to the MIL algorithm, we evaluate our method on top of the recently proposed MIL algorithms and also compare the performance with other semi-supervised approaches. We evaluate our method on two public WSI datasets including Camelyon-16 and TCGA lung cancer and demonstrate that our approach significantly improves the predictive performance of existing MIL algorithms and outperforms other semi-supervised algorithms. We release our code at https://github.com/AITRICS/pathology_mil.
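
A minimal sketch of the "free annotation" idea, under the simplifying assumption that the semi-supervision signal is a patch-level loss pushing all instances of a normal slide toward the negative class (the paper's actual signal additionally de-biases inter-slide variability in the embedding space). All names and the loss weight are illustrative.

```python
import torch
import torch.nn.functional as F

def mil_with_normal_patch_supervision(instance_logits, bag_logit,
                                      bag_label, lambda_aux=0.5):
    """instance_logits: (N_patches,) patch scores for one slide.
    bag_logit: slide-level score from any MIL aggregator.
    bag_label: 1 for a tumor slide, 0 for a normal slide."""
    label = torch.tensor([float(bag_label)])
    bag_loss = F.binary_cross_entropy_with_logits(bag_logit.view(1), label)

    aux_loss = torch.tensor(0.0)
    if bag_label == 0:
        # Free annotation: every patch of a normal slide is normal,
        # so all instance scores can be supervised toward 0.
        aux_loss = F.binary_cross_entropy_with_logits(
            instance_logits, torch.zeros_like(instance_logits))
    return bag_loss + lambda_aux * aux_loss
```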

[414] DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes

Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Shengji Tang, Jiayuan Fan, Tao Chen

Main category: cs.CV

TL;DR: DreamFrame is a three-stage framework for generating style-consistent keyframes and QA pairs to support LVLM instruction tuning, reducing the need for manual video annotation.

DetailsMotivation: Existing LVLMs rely on labor-intensive, annotated datasets, making adaptation to specific tasks challenging. DreamFrame automates dataset generation to address this.

Method: DreamFrame uses an LLM to create structured movie plots, ensures visual consistency with a Style Immobilization Process, and integrates descriptions and embeddings to produce keyframes.

Result: The framework generated 1k stylized videos and 100k QA pairs, and the fine-tuned DreamFrame-7B outperformed similar-sized LVLMs on benchmarks.

Conclusion: DreamFrame effectively automates dataset generation for LVLM tuning, demonstrating superior performance and scalability.

Abstract: Recent large vision-language models (LVLMs) for video understanding are primarily fine-tuned with various videos scraped from online platforms. Existing datasets, such as ActivityNet, require considerable human labor for structuring and annotation before being effectively utilized for tuning LVLMs. While current LVLMs are primarily trained on existing datasets in broad, general-purpose settings, adapting them to specific downstream scenarios remains challenging, as collecting and annotating task-specific videos is highly labor-intensive and time-consuming. To address this issue, we propose a three-stage framework named DreamFrame for automatically generating style-consistent keyframes and corresponding question-answer (QA) pairs to support LVLM instruction tuning. DreamFrame generates datasets in a movie-like manner. First, we utilize an LLM to generate structured movie plots including movie prior information (like overview and style), frame descriptions, and plot-related QA pairs, with a story expansion strategy to mitigate context length limitations. Then, to ensure visual consistency across generated frames, we design a Style Immobilization Process which maintains a consistent style through an embedding learning strategy. Finally, frame descriptions and style embeddings are integrated to produce coherent keyframes. Using DreamFrame, we construct a dataset comprising approximately 1k stylized keyframe-like videos and 100k diverse QA pairs. Extensive fine-tuning experiments on various LVLM architectures demonstrate the effectiveness of the proposed dataset. Furthermore, based on the proposed dataset, we fine-tune a new LVLM named DreamFrame-7B, which significantly surpasses previous similar-sized LVLMs across different benchmarks.

[415] Spotter+GPT: Turning Sign Spottings into Sentences with LLMs

Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: Spotter+GPT is a lightweight, modular framework for Sign Language Translation (SLT) that uses a sign spotter and LLM to avoid heavy training.

DetailsMotivation: SLT is challenging; existing methods require heavy end-to-end training. Spotter+GPT aims to simplify this by modularizing the task.

Method: Two-stage approach: sign spotter identifies signs, then an LLM translates them into spoken language. No SLT-specific training needed.

Result: Reduces computational costs and time by eliminating heavy training.

Conclusion: Spotter+GPT offers an efficient, modular solution for SLT without extensive training.

Abstract: Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. In this paper, we introduce a lightweight, modular SLT framework, Spotter+GPT, that leverages the power of Large Language Models (LLMs) and avoids heavy end-to-end training. Spotter+GPT breaks down the SLT task into two distinct stages. First, a sign spotter identifies individual signs within the input video. The spotted signs are then passed to an LLM, which transforms them into meaningful spoken language sentences. Spotter+GPT eliminates the requirement for SLT-specific training. This significantly reduces computational costs and time requirements. The source code and pretrained weights of the Spotter are available at https://gitlab.surrey.ac.uk/cogvispublic/sign-spotter.
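
Because the framework is modular, the whole system reduces to two calls. The sketch below shows the shape of that glue code; `spotter.spot`, `llm_client.complete`, and the prompt wording are hypothetical placeholders, not the released interface.

```python
def translate_sign_video(video_frames, spotter, llm_client):
    # Stage 1: the sign spotter returns a gloss sequence,
    # e.g. ["YESTERDAY", "I", "SHOP", "GO"].
    glosses = spotter.spot(video_frames)

    # Stage 2: an off-the-shelf LLM turns the gloss sequence into a
    # fluent sentence -- no SLT-specific training involved.
    prompt = (
        "Convert this sequence of sign-language glosses into a natural "
        f"English sentence: {' '.join(glosses)}"
    )
    return llm_client.complete(prompt)
```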

[416] Goldilocks Test Sets for Face Verification

Haiyu Wu, Sicong Tian, Aman Bhatta, Jacob Gutierrez, Grace Bezold, Genesis Argueta, Karl Ricanek Jr., Michael C. King, Kevin W. Bowyer

Main category: cs.CV

TL;DR: The paper introduces three challenging test sets (Hadrian, Eclipse, twins-IND) to expose weaknesses in face recognition models, focusing on facial attribute variations and similar-looking identities, without artificially degrading image quality.

DetailsMotivation: Current face verification accuracy has plateaued on existing test sets, which often artificially reduce image quality. The authors argue that real-world challenges like attribute differences and similar-looking identities are overlooked.

Method: Proposed three test sets: Hadrian (facial hair differences), Eclipse (face exposure differences), and twins-IND (similar-looking identities). Used LFW test protocol and additional rules for balanced evaluation.

Result: The test sets are quantitatively as or more challenging than existing ones, without image manipulation.

Conclusion: The proposed test sets effectively highlight ignored weaknesses in face recognition models, offering a more realistic evaluation framework.

Abstract: Reported face verification accuracy has reached a plateau on current well-known test sets. As a result, some difficult test sets have been assembled by reducing the image quality or adding artifacts to the images. However, we argue that test sets can be challenging without artificially reducing image quality, because face recognition (FR) models struggle to correctly recognize 1) pairs from the same identity (i.e., genuine pairs) with a large face attribute difference, 2) pairs from different identities (i.e., impostor pairs) with a small face attribute difference, and 3) pairs of similar-looking identities (e.g., twins and relatives). We propose three challenging test sets to reveal important but ignored weaknesses of existing FR algorithms. To challenge models on variation of facial attributes, we propose Hadrian and Eclipse to address facial hair differences and face exposure differences. The images in both test sets are high-quality and collected in a controlled environment. To challenge FR models on similar-looking persons, we propose twins-IND, which contains images from a dedicated twins dataset. The LFW test protocol is used to structure the proposed test sets. Moreover, we introduce additional rules to assemble “Goldilocks”-level test sets, including 1) a restricted number of occurrences of hard samples, 2) equal-chance evaluation across demographic groups, and 3) constrained identity overlap across validation folds. Quantitatively, without further processing the images, the proposed test sets are on par with or more difficult than existing test sets. The datasets are available at: https://github.com/HaiyuWu/SOTA-Face-Recognition-Train-and-Test.

[417] InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows

Kirolos Ataallah, Eslam Abdelrahman, Mahmoud Ahmed, Chenhui Gou, Khushbu Pahwa, Jian Ding, Mohamed Elhoseiny

Main category: cs.CV

TL;DR: InfiniBench is a new benchmark for evaluating long video understanding, featuring 1,000+ hours of content and 91K Q&A pairs. It tests diverse skills, revealing models’ struggles and reliance on pre-trained knowledge.

DetailsMotivation: Existing benchmarks lack the complexity to evaluate long-form video understanding, prompting the creation of InfiniBench.

Method: InfiniBench includes extensive video content, diverse question types, and evaluates models like GPT-4o and Gemini 2.0 Flash.

Result: Models perform poorly, with GPT-4o scoring 47.1% on grounding tasks. Multimodal inputs improve performance.

Conclusion: InfiniBench highlights the challenges in long video understanding and the need for better multimodal models.

Abstract: Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. Therefore, we introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 53 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context understanding, multi-event linking) questions; and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation across both commercial (GPT-4o, Gemini 2.0 Flash) and recent open-source vision-language models (such as Qwen2.5-VL and InternVL3.0). Results reveal that: (1) models struggle across the board: even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance; (2) strong reliance on world knowledge: models achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding; (3) multi-modal importance: when provided with full video and subtitle context, however, models show substantial improvements, confirming the critical role of multimodal input in video understanding. InfiniBench is publicly available at https://vision-cair.github.io/Infinibench

[418] Learning Multi-view Anomaly Detection with Efficient Adaptive Selection

Haoyang He, Jiangning Zhang, Guanzhong Tian, Chengjie Wang, Lei Xie

Main category: cs.CV

TL;DR: The paper introduces a Multi-View Anomaly Detection (MVAD) approach with a Multi-View Adaptive Selection (MVAS) algorithm to address blind spots in single-view tasks, achieving state-of-the-art performance with improved efficiency.

DetailsMotivation: Single-view anomaly detection has blind spots from other perspectives, leading to inaccuracies. Multi-view approaches can mitigate this by integrating features from multiple views.

Method: Proposes MVAS for feature learning and fusion across views using neighbourhood attention windows and a semantic correlation matrix. Adjusts window sizes and top-k for efficiency.

Result: Achieves +2.5 average improvement across 10 metrics on the Real-IAD dataset with 18M parameters, fewer FLOPs, and less training time.

Conclusion: MVAD with MVAS is effective for multi-view anomaly detection, offering superior performance and efficiency.

Abstract: This study explores the recently proposed and challenging multi-view Anomaly Detection (AD) task. Single-view tasks will encounter blind spots from other perspectives, resulting in inaccuracies in sample-level prediction. Therefore, we introduce the Multi-View Anomaly Detection (MVAD) approach, which learns and integrates features from multi-views. Specifically, we propose a Multi-View Adaptive Selection (MVAS) algorithm for feature learning and fusion across multiple views. The feature maps are divided into neighbourhood attention windows to calculate a semantic correlation matrix between single-view windows and all other views, which is an attention mechanism conducted for each single-view window and the top-k most correlated multi-view windows. Adjusting the window sizes and top-k can minimise the complexity to $O((hw)^{4/3})$. Extensive experiments on the Real-IAD dataset under the multi-class setting validate the effectiveness of our approach, achieving state-of-the-art performance with an average improvement of +2.5 across 10 metrics at the sample/image/pixel levels, using only 18M parameters and requiring fewer FLOPs and training time. The codes are available at https://github.com/lewandofskee/MVAD.

[419] FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Jiasong Feng, Ao Ma, Jing Wang, Ke Cao, Zhanjie Zhang

Main category: cs.CV

TL;DR: FancyVideo introduces a Cross-frame Textual Guidance Module (CTGM) to enhance text-to-video generation by providing frame-specific textual guidance, achieving state-of-the-art results.

DetailsMotivation: Existing text-to-video models lack frame-specific textual guidance, limiting their ability to generate coherent motion and temporal consistency.

Method: FancyVideo employs CTGM with Temporal Information Injector (TII) and Temporal Affinity Refiner (TAR) to inject and refine frame-specific textual conditions.

Result: FancyVideo achieves state-of-the-art performance on the EvalCrafter benchmark and supports both text-to-video and image-to-video tasks effectively.

Conclusion: FancyVideo advances video synthesis by improving temporal coherence and motion richness, demonstrating superior performance in T2V and I2V tasks.

Abstract: Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model’s capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII) and Temporal Affinity Refiner (TAR) at the beginning and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark and facilitates the synthesis of dynamic and consistent videos. Note that the T2V process of FancyVideo essentially involves a text-to-image step followed by T+I2V. This means it also supports the generation of videos from user images, i.e., the image-to-video (I2V) task. A significant number of experiments have shown that its performance is also outstanding.

[420] Alignment-free Raw Video Demoireing

Shuning Xu, Xina Liu, Binbin Song, Xiangyu Chen, Qiubo Chen, Jiantao Zhou

Main category: cs.CV

TL;DR: The paper introduces DemMamba, an alignment-free raw video demoireing network using frequency-assisted spatio-temporal Mamba, outperforming existing methods by 1.6 dB in PSNR.

DetailsMotivation: Existing video demoireing methods rely on complex alignment modules, which are computationally demanding. Raw data input has shown better performance, motivating an alignment-free approach.

Method: DemMamba uses Spatial Mamba Blocks (SMB) and Temporal Mamba Blocks (TMB) to model inter- and intra-relationships in raw video. SMB employs multi-directional scanning and frequency compression, while TMB enhances temporal consistency with bidirectional scanning and channel attention.

Result: DemMamba achieves a 1.6 dB PSNR improvement over state-of-the-art methods and provides better visual quality.

Conclusion: The proposed DemMamba effectively addresses video demoireing without alignment modules, offering superior performance and efficiency.

Abstract: Video demoireing aims to remove undesirable interference patterns that arise during the capture of screen content, restoring artifact-free frames while maintaining temporal consistency. Existing video demoireing methods typically utilize carefully designed alignment modules to estimate inter-frame motion for leveraging temporal information; however, these modules are often complex and computationally demanding. Meanwhile, recent works indicate that using raw data as input significantly enhances demoireing performance. Building on this insight, this paper introduces a novel alignment-free raw video demoireing network with frequency-assisted spatio-temporal Mamba (DemMamba). It incorporates sequentially arranged Spatial Mamba Blocks (SMB) and Temporal Mamba Blocks (TMB) to effectively model the inter- and intra-relationships in raw video demoireing. The SMB employs a multi-directional scanning mechanism coupled with a learnable frequency compressor to effectively differentiate interference patterns across various orientations and frequencies, resulting in reduced artifacts, sharper edges, and faithful texture reconstruction. Concurrently, the TMB enhances temporal consistency by performing bidirectional scanning across the temporal sequences and integrating channel attention techniques, facilitating improved temporal information fusion. Extensive experiments demonstrate that DemMamba surpasses state-of-the-art methods by 1.6 dB in PSNR, and also delivers a satisfactory visual experience.

[421] Prompt-Softbox-Prompt: A Free-Text Embedding Control for Image Editing

Yitong Yang, Yinglin Wang, Tian Zhang, Jing Wang, Shuting He

Main category: cs.CV

TL;DR: The paper analyzes text embeddings in Stable Diffusion XL, identifies key insights about their roles, and introduces PSP, a training-free method for precise image editing using modified text embeddings.

DetailsMotivation: To address the ambiguity and entanglement of text embeddings in diffusion models, which hinder precise image editing.

Method: Analyzes text embeddings in Stable Diffusion XL, identifies their roles, and proposes PSP for editing by modifying embeddings in cross-attention layers and using Softbox for targeted semantic injection.

Result: PSP enables precise object addition, replacement, and style transfer without affecting other image areas, validated by extensive experiments.

Conclusion: PSP is an effective, training-free solution for precise image editing using text embeddings, with demonstrated success in various tasks.

Abstract: While text-driven diffusion models demonstrate remarkable performance in image editing, the critical components of their text embeddings remain underexplored. The ambiguity and entanglement of these embeddings pose challenges for precise editing. In this paper, we provide a comprehensive analysis of text embeddings in Stable Diffusion XL, offering three key insights: (1) The *aug embedding* (obtained by combining the pooled output of the final text encoder with the timestep embeddings; see https://github.com/huggingface/diffusers) retains complete textual semantics but contributes minimally to image generation, as it is only fused via the ResBlocks; more text information weakens its local semantics while preserving most global semantics. (2) The *BOS* and *padding* embeddings do not contain any semantic information. (3) The *EOS* embedding holds the semantic information of all words as well as stylistic information. Each word embedding is important and does not interfere with the semantic injection of other embeddings. Based on these insights, we propose PSP (Prompt-Softbox-Prompt), a training-free image editing method that leverages free-text embedding. PSP enables precise image editing by modifying text embeddings within the cross-attention layers and using Softbox to control the specific area for semantic injection. This technique enables the addition and replacement of objects without affecting other areas of the image. Additionally, PSP can achieve style transfer by simply replacing text embeddings. Extensive experiments show that PSP performs remarkably well in tasks such as object replacement, object addition, and style transfer. Our code is available at https://github.com/yangyt46/PSP.

[422] Ethical Challenges in Computer Vision: Ensuring Privacy and Mitigating Bias in Publicly Available Datasets

Ghalib Ahmed Tahir

Main category: cs.CV

TL;DR: The paper discusses ethical issues in computer vision, focusing on privacy and bias in publicly available datasets, and proposes an ethical framework for responsible AI development.

DetailsMotivation: The rapid growth of AI and computer vision raises ethical concerns, especially regarding privacy and bias in datasets collected without consent.

Method: Analysis of popular datasets (e.g., COCO, ImageNet) and development of an ethical framework addressing rights, bias, transparency, and accountability.

Result: A comprehensive ethical framework is proposed to guide responsible AI development, ensuring societal values and ethical standards are upheld.

Conclusion: The paper advocates for ethical AI practices to prevent harm and align technology with societal values.

Abstract: This paper aims to shed light on the ethical problems of creating and deploying computer vision technology, particularly the use of publicly available datasets. Due to the rapid growth of machine learning and artificial intelligence, computer vision has become a vital tool in many industries, including medical care, security systems, and trade. However, the extensive use of visual data that is often collected without consent or an informed discussion of its ramifications raises significant concerns about privacy and bias. The paper examines these issues by analyzing popular datasets such as COCO, LFW, ImageNet, CelebA, and PASCAL VOC, which are commonly used for training computer vision models. We offer a comprehensive ethical framework that addresses these challenges, covering the protection of individual rights, the minimization of bias, and openness and accountability. We aim to encourage AI development that takes societal values and ethical standards into account in order to avoid public harm.

[423] PainDiffusion: Learning to Express Pain

Quang Tien Dam, Tri Tung Nguyen Nguyen, Yuki Endo, Dinh Tuan Tran, Joo-Ho Lee

Main category: cs.CV

TL;DR: PainDiffusion is a generative model for synthesizing realistic facial pain expressions, outperforming traditional methods and showing promise for clinical training.

DetailsMotivation: Current Robotic Patient Simulators lack realistic pain expressions, limiting their effectiveness in medical training.

Method: PainDiffusion uses a continuous latent space and diffusion forcing for smooth, natural facial motion, incorporating pain expressiveness and emotion for personalized synthesis.

Result: PainDiffusion achieves a 31.2% preference rate against ground-truth recordings and integrates successfully into robotic systems for real-time rehabilitation.

Conclusion: PainDiffusion bridges the gap between synthetic and naturalistic pain expressions, offering a viable alternative for clinical training.

Abstract: Accurate pain expression synthesis is essential for improving clinical training and human-robot interaction. Current Robotic Patient Simulators (RPSs) lack realistic pain facial expressions, limiting their effectiveness in medical training. In this work, we introduce PainDiffusion, a generative model that synthesizes naturalistic facial pain expressions. Unlike traditional heuristic or autoregressive methods, PainDiffusion operates in a continuous latent space, ensuring smoother and more natural facial motion while supporting indefinite-length generation via diffusion forcing. Our approach incorporates intrinsic characteristics such as pain expressiveness and emotion, allowing for personalized and controllable pain expression synthesis. We train and evaluate our model using the BioVid HeatPain Database. Additionally, we integrate PainDiffusion into a robotic system to assess its applicability in real-time rehabilitation exercises. Qualitative studies with clinicians reveal that PainDiffusion produces realistic pain expressions, with a 31.2% (std 4.8%) preference rate against ground-truth recordings. Our results suggest that PainDiffusion can serve as a viable alternative to real patients in clinical training and simulation, bridging the gap between synthetic and naturalistic pain expression. Code and videos are available at: https://damtien444.github.io/paindf/

[424] SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba

Xiangning Zhang, Qingwei Zhang, Jinnan Chen, Chengfeng Zhou, Yaqi Wang, Zhengjie Zhang, Xiaobo Li, Dahong Qian

Main category: cs.CV

TL;DR: SPRMamba is a novel framework for real-time surgical phase recognition in ESD, combining Mamba-based architecture with a Scaled Residual TranMamba block for improved accuracy and efficiency.

DetailsMotivation: Current video-based phase recognition algorithms struggle with fine-grained transitions and long-range dependencies due to inefficient temporal modeling, limiting their clinical utility.

Method: SPRMamba integrates a Mamba-based architecture with a Scaled Residual TranMamba block and a Hierarchical Sampling Strategy for efficient long-term temporal modeling and localized detail extraction.

Result: Achieves 87.64% accuracy on ESD385 (+1.0% over prior methods) and robust performance on Cholec80, demonstrating generalizability.

Conclusion: SPRMamba bridges computational efficiency and temporal sensitivity, offering a transformative tool for ESD intraoperative guidance and skill assessment.

Abstract: Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially developed for early gastric cancer treatment and has expanded to address diverse gastrointestinal lesions. While computer-assisted surgery (CAS) systems enhance ESD precision and safety, their efficacy hinges on accurate real-time surgical phase recognition, a task complicated by ESD’s inherent complexity, including heterogeneous lesion characteristics and dynamic tissue interactions. Existing video-based phase recognition algorithms, constrained by inefficient temporal context modeling, exhibit limited performance in capturing fine-grained phase transitions and long-range dependencies. To overcome these limitations, we propose SPRMamba, a novel framework integrating a Mamba-based architecture with a Scaled Residual TranMamba (SRTM) block to synergize long-term temporal modeling and localized detail extraction. SPRMamba further introduces the Hierarchical Sampling Strategy to optimize computational efficiency, enabling real-time processing critical for clinical deployment. Evaluated on the ESD385 dataset and the cholecystectomy benchmark Cholec80, SPRMamba achieves state-of-the-art performance (87.64% accuracy on ESD385, +1.0% over prior methods), demonstrating robust generalizability across surgical workflows. This advancement bridges the gap between computational efficiency and temporal sensitivity, offering a transformative tool for intraoperative guidance and skill assessment in ESD surgery. The code is accessible at https://github.com/Zxnyyyyy/SPRMamba.

[425] DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving

Mihir Godbole, Xiangbo Gao, Zhengzhong Tu

Main category: cs.CV

TL;DR: DRAMA-X is a benchmark for evaluating multi-class intent prediction in safety-critical scenarios for autonomous driving, featuring annotations for object detection, intent prediction, risk assessment, and action suggestion. SGG-Intent, a lightweight framework, is proposed as a baseline, showing improved performance with scene-graph-based reasoning.

DetailsMotivation: The need for fine-grained intent reasoning in autonomous driving, especially for vulnerable road users (VRUs), is unmet by existing benchmarks.

Method: DRAMA-X is introduced as a benchmark with automated annotations. SGG-Intent, a training-free framework, uses scene-graph-based reasoning with VLMs and LLMs for intent prediction and risk assessment.

Result: Scene-graph-based reasoning improves intent prediction and risk assessment, particularly with explicit contextual modeling.

Conclusion: DRAMA-X fills a critical gap in evaluating intent reasoning for autonomous driving, and SGG-Intent demonstrates the effectiveness of structured reasoning.

Abstract: Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations. To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle’s reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
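
The baseline's reasoning pipeline maps naturally onto a four-step skeleton. The following sketch shows only the control flow; each method name is a placeholder for a VLM- or LLM-backed module, not the actual SGG-Intent API.

```python
def sgg_intent(frame, detector, vlm, llm):
    objects = detector(frame)                              # 1. object detection
    scene_graph = vlm.build_scene_graph(frame, objects)    #    nodes + spatial relations
    intents = llm.infer_intents(scene_graph)               # 2. nine-class directional intent
    risks = llm.assess_risk(scene_graph, intents)          # 3. binary risk per agent
    action = llm.suggest_action(scene_graph, intents, risks)  # 4. ego-vehicle action
    return intents, risks, action
```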

[426] Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models

Songning Lai, Jiayu Yang, Yu Huang, Lijie Hu, Tianlang Xue, Zhangyi Hu, Jiaxu Li, Haicheng Liao, Yutao Yue

Main category: cs.CV

TL;DR: The paper introduces CAT and CAT+, methods for concept-level backdoor attacks in multimodal XAI systems, highlighting security risks in interpretable AI.

DetailsMotivation: Address the unexplored risk of concept-level backdoor attacks in multimodal XAI systems, particularly in Concept Bottleneck Models (CBMs).

Method: Propose CAT and CAT+, which inject triggers into conceptual representations during training to manipulate predictions without affecting clean-data performance. CAT+ optimizes trigger-concept associations for better stealth and effectiveness.

Result: CAT and CAT+ achieve high performance on clean data while successfully manipulating predictions on backdoored datasets.

Conclusion: The work underscores security risks in interpretable AI and provides a robust framework for assessing CBM security.

Abstract: Deep learning has demonstrated transformative potential across domains, yet its inherent opacity has driven the development of Explainable Artificial Intelligence (XAI). Concept Bottleneck Models (CBMs), which enforce interpretability through human-understandable concepts, represent a prominent advancement in XAI. However, despite their semantic transparency, CBMs remain vulnerable to security threats such as backdoor attacks: malicious manipulations that induce controlled misbehaviors during inference. While CBMs leverage multimodal representations (visual inputs and textual concepts) to enhance interpretability, their dual-modality structure introduces new attack surfaces. To address the unexplored risk of concept-level backdoor attacks in multimodal XAI systems, we propose CAT (Concept-level Backdoor ATtacks), a methodology that injects triggers into conceptual representations during training, enabling precise prediction manipulation without compromising clean-data performance. An enhanced variant, CAT+, incorporates a concept correlation function to systematically optimize trigger-concept associations, thereby improving attack effectiveness and stealthiness. Through a comprehensive evaluation framework assessing attack success rate, stealth metrics, and model utility preservation, we demonstrate that CAT and CAT+ maintain high performance on clean data while achieving significant targeted effects on backdoored datasets. This work highlights critical security risks in interpretable AI systems and provides a robust methodology for future security assessments of CBMs.
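
In a concept bottleneck model, the input is mapped to human-readable concepts before classification, so a concept-level backdoor can operate directly on that intermediate vector. A minimal sketch of the data-poisoning step is below; the fixed-index trigger pattern is a simplification (CAT+ instead optimizes trigger-concept associations), and all names are illustrative.

```python
import numpy as np

def poison_concepts(concept_vec, trigger_idx, trigger_val, target_label):
    """Inject a concept-level trigger into one CBM training sample.

    concept_vec:  (K,) human-interpretable concept activations.
    trigger_idx:  indices of concepts forming the trigger pattern.
    trigger_val:  value the trigger forces those concepts to take.
    Returns the poisoned concepts and the attacker's target label.
    """
    poisoned = concept_vec.copy()
    poisoned[trigger_idx] = trigger_val   # e.g., force a rare concept combination
    return poisoned, target_label

# Example: concepts 3 and 17 form the trigger; the label is flipped to class 0.
c = np.random.rand(32)
c_bad, y_bad = poison_concepts(c, trigger_idx=[3, 17],
                               trigger_val=1.0, target_label=0)
```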

[427] SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity

Kunyun Wang, Shuo Yang, Jieru Zhao, Wenchao Ding, Quan Chen, Jingwen Leng, Minyi Guo

Main category: cs.CV

TL;DR: SparseTem framework accelerates CNN-based video encoders by leveraging temporal continuity, achieving significant speedups with minimal accuracy drop and no extra memory overhead.

DetailsMotivation: The computational demands of CNNs in video processing hinder broader adoption, despite their efficiency. Temporal continuity in video frames allows skipping redundant computations, but challenges like memory consumption and accuracy degradation arise.

Method: Proposes Diff Computation to skip redundant computations, a memory-efficient scheduling method to reduce overhead, and an online adjustment mechanism to mitigate accuracy loss. Integrated into SparseTem framework.

Result: Achieves speedups of 1.79x for EfficientDet and 4.72x for CRNN with minimal accuracy drop and no additional memory overhead.

Conclusion: SparseTem sets a new state-of-the-art by efficiently utilizing temporal continuity to accelerate CNN-based video encoders.

Abstract: Deep learning models have become pivotal in the field of video processing and are increasingly critical in practical applications such as autonomous driving and object detection. Although Vision Transformers (ViTs) have demonstrated their power, Convolutional Neural Networks (CNNs) remain a highly efficient and high-performance choice for feature extraction and encoding. However, the intensive computational demands of convolution operations hinder their broader adoption as video encoders. Given the inherent temporal continuity in video frames, changes between consecutive frames are minimal, allowing for the skipping of redundant computations. This technique, which we term Diff Computation, presents two primary challenges. First, Diff Computation requires caching intermediate feature maps to ensure the correctness of non-linear computations, leading to significant memory consumption. Second, the imbalance of sparsity among layers, introduced by Diff Computation, incurs accuracy degradation. To address these issues, we propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation. We integrate these techniques into our framework, SparseTem, to seamlessly support various CNN-based video encoders. SparseTem achieves speedups of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead. Extensive experimental results demonstrate that SparseTem sets a new state-of-the-art by effectively utilizing temporal continuity to accelerate CNN-based video encoders.
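
The core of Diff Computation can be sketched as tile-level gating: a convolution is recomputed only where the input changed beyond a threshold, and the cached output is reused elsewhere. This toy version assumes a 'same'-padding convolution and ignores the halo pixels a real implementation must handle, as well as the caching around non-linearities that the abstract describes.

```python
import torch

def diff_conv(curr, prev, prev_out, conv, tile=32, tau=1e-3):
    """Recompute a 'same'-padding convolution only on tiles that changed.

    curr, prev: (B, C, H, W) consecutive frames; prev_out: cached output
    for `prev`. A simplified sketch: conv-padding halos are ignored, and
    per-layer sparsity balancing is left out for brevity.
    """
    out = prev_out.clone()
    _, _, h, w = curr.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch_diff = (curr[..., y:y+tile, x:x+tile] -
                          prev[..., y:y+tile, x:x+tile]).abs().mean()
            if patch_diff > tau:
                # Only changed tiles pay the convolution cost.
                out[..., y:y+tile, x:x+tile] = conv(
                    curr[..., y:y+tile, x:x+tile])
    return out
```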

[428] Flow Matching Posterior Sampling: A Training-free Conditional Generation for Flow Matching

Kaiyu Song, Hanjiang Lai, Yan Pan, Kun Yue, Jian yin

Main category: cs.CV

TL;DR: FMPS enables training-free conditional generation in flow matching models by introducing a correction term and surrogate score function, outperforming existing methods.

DetailsMotivation: Existing training-free conditional generation methods for flow matching models are limited due to the lack of an explicit score function, restricting their applicability.

Method: Proposes Flow Matching-based Posterior Sampling (FMPS) with a correction term to steer the velocity field, incorporating a surrogate score function. Two implementations are introduced for quality and efficiency.

Result: FMPS achieves superior generation quality in diverse tasks compared to state-of-the-art methods.

Conclusion: FMPS effectively bridges the gap between flow matching and score-based posterior sampling, demonstrating generality and effectiveness.

Abstract: Training-free conditional generation based on flow matching aims to leverage pre-trained unconditional flow matching models to perform conditional generation without retraining. Recently, a successful training-free conditional generation approach incorporates conditions via posterior sampling, which relies on the availability of a score function in the unconditional diffusion model. However, flow matching models do not possess an explicit score function, rendering such a strategy inapplicable. Approximate posterior sampling for flow matching has been explored, but it is limited to linear inverse problems. In this paper, we propose Flow Matching-based Posterior Sampling (FMPS) to expand its application scope. We introduce a correction term by steering the velocity field. This correction term can be reformulated to incorporate a surrogate score function, thereby bridging the gap between flow matching models and score-based posterior sampling. Hence, FMPS enables the posterior sampling to be adjusted within the flow matching framework. Further, we propose two practical implementations of the correction mechanism: one aimed at improving generation quality, and the other focused on computational efficiency. Experimental results on diverse conditional generation tasks demonstrate that our method achieves superior generation quality compared to existing state-of-the-art approaches, validating the effectiveness and generality of FMPS.
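
Read as guidance, the correction term steers the pre-trained velocity field with the gradient of a measurement likelihood evaluated through a denoised estimate, which is where the surrogate score enters. A schematic form in our own notation (the paper's exact formulation may differ):

```latex
% u_t: pre-trained unconditional velocity field
% \hat{x}_1(x, t): estimate of the clean sample implied by x at time t
% y: the condition, with likelihood p(y \mid x); \gamma_t: guidance scale
\tilde{u}_t(x \mid y) = u_t(x) + \gamma_t \,\nabla_x \log p\big(y \mid \hat{x}_1(x, t)\big)
```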

[429] Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition

Redwan Sony, Parisa Farmanifard, Arun Ross, Anil K. Jain

Main category: cs.CV

TL;DR: Domain-specific face recognition models outperform zero-shot foundation models, but combining them improves accuracy and explainability.

DetailsMotivation: To compare generic foundation models (e.g., CLIP, GPT-4o) with domain-specific face recognition models (e.g., AdaFace) on face recognition tasks.

Method: Conducted experiments using benchmark datasets, evaluated performance, and tested score-level fusion of models.

Result: Domain-specific models outperformed foundation models; fusion improved accuracy, and foundation models added explainability.

Conclusion: Combining domain-specific and foundation models judiciously enhances face recognition performance and explainability.

Abstract: In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, GPT-4o, Grok-4) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we report the following findings: (a) In all face benchmark datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improved on over-segmented face images compared to tightly cropped faces, thereby suggesting the importance of contextual clues. (c) A simple score-level fusion of a foundation model with a domain-specific face recognition model improved the accuracy at low false match rates. (d) Foundation models, such as GPT-4o and Grok-4, are able to provide explainability to the face recognition pipeline. In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace, thereby reiterating the importance of combining domain-specific face recognition models with generic foundation models in a judicious manner.
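
Score-level fusion of the kind reported in finding (c) can be as simple as normalizing each matcher's scores and taking a weighted sum. A minimal sketch follows; the min-max normalization and the weight value are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def fused_score(s_domain, s_foundation, w=0.7):
    """Min-max normalize each matcher's scores, then take a weighted sum.
    The weight w is a hypothetical choice; the paper reports that simple
    score-level fusion improves accuracy at low false match rates."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-8)
    return w * norm(s_domain) + (1 - w) * norm(s_foundation)
```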

[430] Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems

Qihao Yuan, Kailai Li, Jiaming Zhang

Main category: cs.CV

TL;DR: A zero-shot 3D visual grounding method reformulates the task as a Constraint Satisfaction Problem (CSP) for improved accuracy and flexibility.

DetailsMotivation: Supervised methods lack open vocabulary and language understanding; zero-shot methods using LLMs are explored.

Method: Proposes Constraint Satisfaction Visual Grounding (CSVG), treating 3DVG as a CSP for symbolic reasoning.

Result: Achieves +7.0% and +11.2% improvements on ScanRefer and Nr3D datasets, respectively.

Conclusion: CSVG is effective, flexible, and outperforms state-of-the-art zero-shot methods.

Abstract: 3D visual grounding (3DVG) aims to locate objects in a 3D scene with natural language descriptions. Supervised methods have achieved decent accuracy, but have a closed vocabulary and limited language understanding ability. Zero-shot methods utilize large language models (LLMs) to handle natural language descriptions, where the LLM either produces grounding results directly or generates programs that compute results symbolically. In this work, we propose a zero-shot method that reformulates the 3DVG task as a Constraint Satisfaction Problem (CSP), where the variables and constraints represent objects and their spatial relations, respectively. This allows a global symbolic reasoning over all relevant objects, producing grounding results for both the target and anchor objects. Moreover, we demonstrate the flexibility of our framework by handling negation- and counting-based queries with only minor extra coding effort. Our system, Constraint Satisfaction Visual Grounding (CSVG), has been extensively evaluated on the public ScanRefer and Nr3D datasets using only open-source LLMs. Results show the effectiveness of CSVG and superior grounding accuracy over current state-of-the-art zero-shot 3DVG methods, with improvements of +7.0% (Acc@0.5) on ScanRefer and +11.2% on Nr3D. The code of our system is available at https://asig-x.github.io/csvg_web.
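
To illustrate the CSP formulation, here is a toy grounding of "the chair next to the table" using the `python-constraint` package: variables range over detected object IDs, and spatial relations become constraints. The detections, the distance threshold, and the choice of this particular solver are assumptions for illustration, not the CSVG implementation.

```python
from constraint import Problem

objects = {  # hypothetical detections: id -> (label, center xyz)
    0: ("chair", (1.0, 0.2, 0.0)),
    1: ("table", (1.3, 0.1, 0.0)),
    2: ("chair", (5.0, 4.0, 0.0)),
}

def near(a, b, r=1.0):
    # Spatial relation as a constraint: centers within radius r.
    pa, pb = objects[a][1], objects[b][1]
    return sum((x - y) ** 2 for x, y in zip(pa, pb)) ** 0.5 < r

p = Problem()
p.addVariable("target", [i for i, (l, _) in objects.items() if l == "chair"])
p.addVariable("anchor", [i for i, (l, _) in objects.items() if l == "table"])
p.addConstraint(near, ("target", "anchor"))
print(p.getSolutions())   # -> [{'target': 0, 'anchor': 1}]
```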

[431] Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives

Ziyu Zhang, Binbin Huang, Hanqing Jiang, Liyang Zhou, Xiaojun Xiang, Shunhan Shen

Main category: cs.CV

TL;DR: Quadratic Gaussian Splatting (QGS) introduces deformable quadric surfaces and geodesic distance-based density distributions for improved geometric accuracy and memory efficiency in surface reconstruction.

DetailsMotivation: Prior methods used Euclidean distance, which misaligns with surface geometry under deformation. QGS aims to address this by adapting density weights to primitive curvature.

Method: QGS replaces static primitives with deformable quadric surfaces and uses geodesic distance for density modeling, enabling surface-aware splatting and efficient rendering.

Result: QGS reduces geometric error by 33% over 2DGS and 27% over GOF on the DTU dataset while maintaining competitive appearance quality.

Conclusion: QGS bridges the gap between geometric precision and visual fidelity, making it suitable for applications like robotics and immersive reality.

Abstract: We propose Quadratic Gaussian Splatting (QGS), a novel representation that replaces static primitives with deformable quadric surfaces (e.g., ellipses, paraboloids) to capture intricate geometry. Unlike prior works that rely on Euclidean distance for primitive density modeling, a metric misaligned with surface geometry under deformation, QGS introduces geodesic distance-based density distributions. This innovation ensures that density weights adapt intrinsically to the primitive curvature, preserving consistency during shape changes (e.g., from planar disks to curved paraboloids). By solving geodesic distances in closed form on quadric surfaces, QGS enables surface-aware splatting, where a single primitive can represent complex curvature that previously required dozens of planar surfels, potentially reducing memory usage while maintaining efficient rendering via fast ray-quadric intersection. Experiments on the DTU, Tanks and Temples, and MipNeRF360 datasets demonstrate state-of-the-art surface reconstruction, with QGS reducing geometric error (chamfer distance) by 33% over 2DGS and 27% over GOF on the DTU dataset. Crucially, QGS retains competitive appearance quality, bridging the gap between geometric precision and visual fidelity for applications like robotics and immersive reality.
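
The key change relative to Euclidean-distance splatting can be written compactly: the primitive's density falls off with geodesic distance measured on the quadric itself. A schematic scalar form in our own notation (the paper solves the geodesic distance in closed form on the quadric):

```latex
% d_g(p, c): closed-form geodesic distance on the quadric surface from
%            point p to the primitive center c (Euclidean in prior work)
% s: a learned scale controlling the density falloff
\sigma(p) = \exp\!\left(-\frac{d_g(p, c)^2}{2 s^2}\right)
```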

[432] LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai

Main category: cs.CV

TL;DR: LIRA addresses LMMs’ segmentation and comprehension issues by combining semantic and pixel-level features (SEFE) and fine-grained supervision (ILVC), achieving state-of-the-art results.

DetailsMotivation: Large multi-modal models (LMMs) face inaccurate segmentation and hallucinated comprehension due to weak visual comprehension and lack of fine-grained perception.

Method: LIRA uses Semantic-Enhanced Feature Extractor (SEFE) for better segmentation and Interleaved Local Visual Coupling (ILVC) for fine-grained supervision. The Attributes Evaluation (AttrEval) dataset quantifies semantic inference.

Result: LIRA achieves state-of-the-art performance in segmentation and comprehension tasks.

Conclusion: LIRA effectively mitigates LMMs’ limitations by leveraging semantic and fine-grained supervision, validated by the AttrEval dataset.

Abstract: While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the token. To quantify this relationship and the model’s potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.

[433] VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Yogesh Kulkarni, Pooyan Fazli

Main category: cs.CV

TL;DR: VideoSAVi introduces a self-training pipeline for Video-LLMs, eliminating the need for costly external supervision by generating preference data internally and using DPO for iterative training, achieving notable benchmark improvements.

DetailsMotivation: Current methods for video-language alignment rely on expensive proprietary APIs or human annotations, making the process costly and labor-intensive. VideoSAVi aims to overcome this by enabling self-supervised learning.

Method: VideoSAVi uses a self-critiquing mechanism to identify and correct reasoning errors in model outputs, creating preference pairs from video content. It then applies Direct Preference Optimization (DPO) for iterative training.

Result: VideoSAVi improves performance by +4.2 points on MVBench, +3.9 on PerceptionTest, and +6.8 on EgoSchema compared to baselines, while being computationally efficient (32 frames).

Conclusion: VideoSAVi offers a model-agnostic, efficient approach for self-aligned video understanding, reducing reliance on external supervision and achieving significant benchmark gains.

Abstract: Recent advances in video-large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or human-annotated captions to generate preference data (i.e., pairs of model outputs ranked by quality or alignment with human judgment), which is then used to train models for video-language alignment. This approach is both costly and labor-intensive. To address this limitation, we introduce VideoSAVi (Self-Aligned Video Language Model), a self-training pipeline that enables Video-LLMs to learn from video content without external supervision. Our approach includes a self-critiquing mechanism that identifies reasoning errors in the model’s initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi then applies Direct Preference Optimization (DPO) to iteratively train the model using the preference data, thus enhancing its temporal and spatial reasoning for video understanding. Experiments show that VideoSAVi delivers significant improvements across multiple benchmarks, including a +4.2 percentage point gain on MVBench, +3.9 on PerceptionTest, and +6.8 on the challenging EgoSchema dataset compared to baseline models. Our model-agnostic approach is computationally efficient, requiring only 32 frames, offering a promising direction for self-aligned video understanding without reliance on external models or annotations.
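
The preference-optimization step uses the standard DPO objective on the self-generated pairs, with the self-corrected response as the preferred answer $y_w$ and the initial flawed response as $y_l$:

```latex
% \pi_\theta: policy being trained; \pi_{\mathrm{ref}}: frozen reference model
% (x, y_w, y_l): video-grounded prompt with preferred / dispreferred responses
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
```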

[434] DuoCast: Duo-Probabilistic Diffusion for Precipitation Nowcasting

Penghui Wen, Mengwei He, Patrick Filippi, Na Zhao, Feng Zhang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

Main category: cs.CV

TL;DR: DuoCast, a dual-diffusion framework, improves short-term precipitation forecasting by decomposing it into low- and high-frequency components, outperforming existing methods.

DetailsMotivation: Accurate short-term precipitation forecasting is crucial for weather-sensitive sectors like agriculture and disaster response, but current deep learning methods struggle with balancing global and local details.

Method: DuoCast uses a dual-diffusion framework, decomposing forecasting into low- and high-frequency components in orthogonal latent subspaces. The low-frequency model captures large-scale trends, while the high-frequency model refines fine-scale details.

Result: Experiments on four radar datasets show DuoCast outperforms state-of-the-art baselines in accuracy for spatial detail and temporal evolution.

Conclusion: DuoCast’s frequency decomposition reduces prediction error and enhances forecasting accuracy, making it a superior approach for short-term precipitation prediction.

Abstract: Accurate short-term precipitation forecasting is critical for weather-sensitive decision-making in agriculture, transportation, and disaster response. Existing deep learning approaches often struggle to balance global structural consistency with local detail preservation, especially under complex meteorological conditions. We propose DuoCast, a dual-diffusion framework that decomposes precipitation forecasting into low- and high-frequency components modeled in orthogonal latent subspaces. We theoretically prove that this frequency decomposition reduces prediction error compared to conventional single branch U-Net diffusion models. In DuoCast, the low-frequency model captures large-scale trends via convolutional encoders conditioned on weather front dynamics, while the high-frequency model refines fine-scale variability using a self-attention-based architecture. Experiments on four benchmark radar datasets show that DuoCast consistently outperforms state-of-the-art baselines, achieving superior accuracy in both spatial detail and temporal evolution.
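
The frequency decomposition itself is easy to picture with an ideal low-pass split in Fourier space. DuoCast learns the decomposition in orthogonal latent subspaces rather than applying a fixed mask, so the sketch below is only an intuition aid, with the cutoff radius as an arbitrary assumption.

```python
import torch

def split_frequencies(x, cutoff=0.1):
    """Split frames into low/high-frequency parts with an ideal (sharp)
    frequency mask -- a fixed-mask stand-in for the paper's learned
    orthogonal latent subspaces.

    x: (B, C, H, W) tensor; cutoff: normalized frequency radius."""
    X = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, h), torch.linspace(-0.5, 0.5, w),
        indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(X.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(X * mask, dim=(-2, -1))).real
    return low, x - low   # low-frequency trend, high-frequency detail
```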

[435] BadPatch: Diffusion-Based Generation of Physical Adversarial Patches

Zhixiang Wang, Xingjun Ma, Yu-Gang Jiang

Main category: cs.CV

TL;DR: BadPatch is a diffusion-based framework for generating customizable, natural-looking adversarial patches that balance stealthiness and attack effectiveness.

DetailsMotivation: Existing adversarial patch methods prioritize attack effectiveness over stealthiness, lack customization, and fail to balance aesthetics with functionality.

Method: BadPatch uses Null-text inversion and Incomplete Diffusion Optimization (IDO) to generate patches from reference images, allowing varied shapes and preserving original semantics.

Result: Achieves attack performance comparable to non-naturalistic patches while maintaining a natural appearance, and introduces AdvT-shirt-1K, a dataset of adversarial T-shirt images.

Conclusion: BadPatch addresses limitations of prior methods, offering customizable, stealthy adversarial patches, and provides a dataset for future defense research.

Abstract: Physical adversarial patches printed on clothing can enable individuals to evade person detectors, but most existing methods prioritize attack effectiveness over stealthiness, resulting in aesthetically unpleasing patches. While generative adversarial networks and diffusion models can produce more natural-looking patches, they often fail to balance stealthiness with attack effectiveness and lack flexibility for user customization. To address these limitations, we propose BadPatch, a novel diffusion-based framework for generating customizable and naturalistic adversarial patches. Our approach allows users to start from a reference image (rather than random noise) and incorporates masks to create patches of various shapes, not limited to squares. To preserve the original semantics during the diffusion process, we employ Null-text inversion to map random noise samples to a single input image and generate patches through Incomplete Diffusion Optimization (IDO). Our method achieves attack performance comparable to state-of-the-art non-naturalistic patches while maintaining a natural appearance. Using BadPatch, we construct AdvT-shirt-1K, the first physical adversarial T-shirt dataset comprising over a thousand images captured in diverse scenarios. AdvT-shirt-1K can serve as a useful dataset for training or testing future defense methods.

[436] Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation

Hyung Kyu Kim, Hak Gu Kim

Main category: cs.CV

TL;DR: The paper proposes a phonetic context-aware loss for speech-driven 3D facial animation to improve motion continuity and reduce jitter, outperforming traditional frame-wise methods.

DetailsMotivation: Traditional methods focus on frame-wise alignment with ground-truth, often resulting in unnatural animations due to ignored coarticulation effects.

Method: Introduces a phonetic context-aware loss with viseme coarticulation weights to adaptively prioritize facial movements based on dynamic changes over time.

Result: Experiments show improved quantitative metrics and visual quality compared to conventional reconstruction loss methods.

Conclusion: Explicitly modeling phonetic context-dependent visemes is crucial for natural speech-driven 3D facial animation.

Abstract: Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. This highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation. Project page: https://cau-irislab.github.io/interspeech25/
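
One plausible reading of a coarticulation-weighted reconstruction loss is sketched below: frames where the ground-truth face moves faster receive larger weight. The exact viseme coarticulation weights in the paper may be defined differently; the shapes and names here are assumptions.

```python
import torch

def context_aware_loss(pred, gt, eps=1e-8):
    """Frame-wise vertex loss weighted by how fast the ground-truth
    face is moving, so dynamic (coarticulated) frames count more.

    pred, gt: (T, V, 3) vertex sequences.
    """
    velocity = (gt[1:] - gt[:-1]).norm(dim=-1).mean(dim=-1)  # (T-1,)
    weights = velocity / (velocity.mean() + eps)             # centered around 1
    per_frame = ((pred[1:] - gt[1:]) ** 2).mean(dim=(-1, -2))
    return (weights * per_frame).mean()

pred, gt = torch.randn(60, 500, 3), torch.randn(60, 500, 3)
print(context_aware_loss(pred, gt))
```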

[437] Street Gaussians without 3D Object Tracker

Ruida Zhang, Chengxi Li, Chenyangguang Zhang, Xingyu Liu, Haili Yuan, Yanyan Li, Xiangyang Ji, Gim Hee Lee

Main category: cs.CV

TL;DR: A method leveraging 2D deep trackers and motion learning in implicit feature space improves dynamic object reconstruction in driving scenes, outperforming existing approaches.

DetailsMotivation: Challenges in realistic scene reconstruction due to fast-moving objects and reliance on manual labeling or limited 3D trackers.

Method: Uses 2D deep trackers within a 3D fusion strategy and introduces motion learning in implicit feature space to correct errors.

Result: Outperforms existing methods on Waymo-NOTR and KITTI datasets.

Conclusion: Proposed method enhances robustness and eliminates reliance on 3D trackers, with code to be publicly available.

Abstract: Realistic scene reconstruction in driving scenarios poses significant challenges due to fast-moving objects. Most existing methods rely on labor-intensive manual labeling of object poses to reconstruct dynamic objects in canonical space and move them based on these poses during rendering. While some approaches attempt to use 3D object trackers to replace manual annotations, the limited generalization of 3D trackers – caused by the scarcity of large-scale 3D datasets – results in inferior reconstructions in real-world settings. In contrast, 2D foundation models demonstrate strong generalization capabilities. To eliminate the reliance on 3D trackers and enhance robustness across diverse environments, we propose a stable object tracking module by leveraging associations from 2D deep trackers within a 3D object fusion strategy. We address inevitable tracking errors by further introducing a motion learning strategy in an implicit feature space that autonomously corrects trajectory errors and recovers missed detections. Experimental results on Waymo-NOTR and KITTI show that our method outperforms existing approaches. Our code will be made publicly available.

[438] FOF-X: Towards Real-time Detailed Human Reconstruction from a Single Image

Qiao Feng, Yuanwang Yang, Yebin Liu, Yu-Kun Lai, Jingyu Yang, Kun Li

Main category: cs.CV

TL;DR: FOF-X introduces an efficient 3D representation (Fourier Occupancy Field) for real-time human geometry reconstruction from a single image, balancing speed and quality.

DetailsMotivation: The challenge lies in balancing real-time speed with high-quality 3D reconstruction due to computational demands of existing methods.

Method: Proposes Fourier Occupancy Field (FOF), a 3D representation factorized into a 2D vector field, enabling compatibility with 2D CNNs. FOF-X integrates human parametric models and enhances robustness with Laplacian constraints and discontinuity matchers.

Result: FOF-X achieves state-of-the-art results on datasets and real-captured data, handling domain gaps and improving reconstruction quality.

Conclusion: FOF-X successfully bridges 3D and 2D domains, offering a robust, real-time solution for detailed human geometry reconstruction.

Abstract: We introduce FOF-X for real-time reconstruction of detailed human geometry from a single image. Balancing real-time speed against high-quality results is a persistent challenge, mainly due to the high computational demands of existing 3D representations. To address this, we propose Fourier Occupancy Field (FOF), an efficient 3D representation by learning the Fourier series. The core of FOF is to factorize a 3D occupancy field into a 2D vector field, retaining topology and spatial relationships within the 3D domain while facilitating compatibility with 2D convolutional neural networks. Such a representation bridges the gap between 3D and 2D domains, enabling the integration of human parametric models as priors and enhancing the reconstruction robustness. Based on FOF, we design a new reconstruction framework, FOF-X, to avoid the performance degradation caused by texture and lighting. This enables our real-time reconstruction system to better handle the domain gap between training images and real images. Additionally, in FOF-X, we enhance the inter-conversion algorithms between FOF and mesh representations with a Laplacian constraint and an automaton-based discontinuity matcher, improving both quality and robustness. We validate the strengths of our approach on different datasets and real-captured data, where FOF-X achieves new state-of-the-art results. The code has already been released for research purposes at https://cic.tju.edu.cn/faculty/likun/projects/FOFX/index.html.
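
To make the factorization concrete, the sketch below decodes an occupancy slice from 2D Fourier coefficient maps, assuming a truncated cosine/sine basis along the depth axis; the basis choice and number of terms are illustrative rather than the paper's exact parameterization.

```python
import numpy as np

def decode_occupancy(coeffs, z):
    """Evaluate a Fourier Occupancy Field at normalized depth z in [-1, 1].

    coeffs: (2N+1, H, W) coefficient maps, one DC term plus N (cos, sin)
    pairs, so a 3D occupancy volume is stored as a 2D field.
    """
    n_pairs = (coeffs.shape[0] - 1) // 2
    occ = coeffs[0] / 2.0
    for k in range(1, n_pairs + 1):
        occ = occ + coeffs[2 * k - 1] * np.cos(np.pi * k * z) \
                  + coeffs[2 * k] * np.sin(np.pi * k * z)
    return occ  # (H, W) occupancy slice at depth z

coeffs = np.random.randn(2 * 8 + 1, 64, 64)  # N=8 frequency pairs
slice_at_mid = decode_occupancy(coeffs, z=0.0)
print(slice_at_mid.shape)
```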

[439] Dynamic Robot-Assisted Surgery with Hierarchical Class-Incremental Semantic Segmentation

Julia Hindel, Ema Mekic, Enamundram Naga Karthik, Rohit Mohan, Daniele Cattaneo, Maria Kalweit, Abhinav Valada

Main category: cs.CV

TL;DR: The paper introduces TOPICS+, an enhanced variant of the TOPICS approach for class-incremental semantic segmentation in robotic surgery, addressing class imbalances and dynamic environments. It also proposes new benchmarks and a refined dataset.

DetailsMotivation: To improve scene understanding in dynamic surgical environments, overcoming limitations of static datasets and catastrophic forgetting in incremental learning.

Method: Enhances TOPICS with Dice loss for class imbalance, hierarchical pseudo-labeling, and tailored taxonomies. Introduces new CISS benchmarks and a refined dataset.

Result: TOPICS+ improves segmentation robustness in surgical scenes, with new benchmarks and a dataset for evaluation.

Conclusion: TOPICS+ and the proposed benchmarks advance incremental segmentation for robotic surgery, with publicly available resources.

Abstract: Robot-assisted surgeries rely on accurate and real-time scene understanding to safely guide surgical instruments. However, segmentation models trained on static datasets face key limitations when deployed in these dynamic and evolving surgical environments. Class-incremental semantic segmentation (CISS) allows models to continually adapt to new classes while avoiding catastrophic forgetting of prior knowledge, without training on previous data. In this work, we build upon the recently introduced Taxonomy-Oriented Poincaré-regularized Incremental Class Segmentation (TOPICS) approach and propose an enhanced variant, termed TOPICS+, specifically tailored for robust segmentation of surgical scenes. Concretely, we incorporate the Dice loss into the hierarchical loss formulation to handle strong class imbalances, introduce hierarchical pseudo-labeling, and design tailored label taxonomies for robotic surgery environments. We also propose six novel CISS benchmarks designed for robotic surgery environments including multiple incremental steps and several semantic categories to emulate realistic class-incremental settings in surgical environments. In addition, we introduce a refined set of labels with more than 144 classes on the Syn-Mediverse synthetic dataset, hosted online as an evaluation benchmark. We make the code and trained models publicly available at http://topics.cs.uni-freiburg.de.
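
For reference, a standard soft Dice loss of the kind folded into the hierarchical objective might look like the following sketch (TOPICS+ combines it with hierarchical and Poincaré-regularized terms not shown here).

```python
import torch

def soft_dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss for multi-class segmentation.

    logits: (B, C, H, W) raw scores; target: (B, H, W) integer labels.
    Dice is insensitive to class frequency, which is why it helps with
    the strong class imbalance of surgical scenes.
    """
    probs = logits.softmax(dim=1)
    one_hot = torch.nn.functional.one_hot(
        target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()

logits = torch.randn(2, 5, 32, 32)
target = torch.randint(0, 5, (2, 32, 32))
print(soft_dice_loss(logits, target))
```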

[440] Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng

Main category: cs.CV

TL;DR: DenseVLM improves zero-shot dense prediction tasks by addressing foreground bias in pre-trained VLMs through unbiased region-language alignment.

DetailsMotivation: Pre-trained VLMs like CLIP underperform in dense prediction tasks due to foreground bias, where background regions are misidentified as foreground objects.

Method: DenseVLM leverages pre-trained VLMs to retrieve categories for unlabeled regions and decouples foreground-background feature interference.

Result: DenseVLM enhances performance in open-vocabulary object detection and segmentation, showing scalability with diverse datasets.

Conclusion: DenseVLM effectively mitigates foreground bias and improves dense prediction tasks without extensive annotations.

Abstract: Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant “foreground bias”, where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available at https://github.com/HVision-NKU/DenseVLM.
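
The retrieval step can be pictured as a nearest-text lookup in the frozen VLM's embedding space; the following sketch assumes precomputed region and category-text embeddings and is not the paper's full pipeline.

```python
import numpy as np

def assign_region_categories(region_feats, text_embeds, names):
    """Label each region with the category whose text embedding is most
    similar under cosine similarity, as a frozen VLM would score it.

    region_feats: (R, D); text_embeds: (C, D); names: C category names.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T                          # (R, C) cosine similarities
    return [names[i] for i in sims.argmax(axis=1)]

regions = np.random.randn(4, 512)
texts = np.random.randn(3, 512)
print(assign_region_categories(regions, texts, ["cat", "grass", "sky"]))
```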

[441] TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

Xuying Zhang, Yutong Liu, Yangguang Li, Renrui Zhang, Yufei Liu, Kai Wang, Wanli Ouyang, Zhiwei Xiong, Peng Gao, Qibin Hou, Ming-Ming Cheng

Main category: cs.CV

TL;DR: TAR3D is a framework combining a 3D-aware VQ-VAE and a GPT for high-quality 3D asset generation, outperforming existing methods in text-to-3D and image-to-3D tasks.

DetailsMotivation: To leverage the multimodal unification and learning capabilities of next-token prediction for conditional 3D object generation.

Method: Uses a 3D VQ-VAE to encode shapes into a triplane latent space and a 3D GPT with TriPE to predict codebook indices autoregressively.

Result: Superior generation quality demonstrated on ShapeNet and Objaverse datasets.

Conclusion: TAR3D effectively models 3D geometries part by part, achieving state-of-the-art performance.

Abstract: We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the 3D VQ-VAE first encodes a wide range of 3D shapes into a compact triplane latent space and utilizes a set of discrete representations from a trainable codebook to reconstruct fine-grained geometries under the supervision of query point occupancy. Then, the 3D GPT, equipped with a custom triplane position embedding called TriPE, predicts the codebook index sequence with prefilling prompt tokens in an autoregressive manner so that the composition of 3D geometries can be modeled part by part. Extensive experiments on ShapeNet and Objaverse demonstrate that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks

[442] When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

Main category: cs.CV

TL;DR: The paper introduces SODA, a framework to measure demographic biases in text-to-image generation, revealing subtle stereotypes in object depictions.

DetailsMotivation: To address the overlooked issue of demographic biases in generated objects, beyond human depictions, and provide a systematic auditing method.

Method: SODA compares visual attributes of objects generated with demographic cues versus neutral prompts across 2,700 images from three models (GPT Image-1, Imagen 4, Stable Diffusion) in five categories.

Result: Strong associations between demographic groups and visual attributes were found, revealing stereotypes and subtle biases. Some models produced less diverse outputs, amplifying disparities.

Conclusion: SODA offers a practical tool for auditing biases in generative models, highlighting the need for systematic and responsible AI development.

Abstract: While prior research on text-to-image generation has predominantly focused on biases in human depictions, we investigate a more subtle yet pervasive phenomenon: demographic bias in generated objects (e.g., cars). We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring such biases. Our approach compares visual attributes of objects generated with demographic cues (e.g., “for young people”) to those from neutral prompts, across 2,700 images produced by three state-of-the-art models (GPT Image-1, Imagen 4, and Stable Diffusion) in five object categories. Through a comprehensive analysis, we uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues. These patterns reflect and reinforce not only well-known stereotypes but also more subtle and unintuitive biases. We also observe that some models generate less diverse outputs, which in turn amplifies the visual disparities compared to neutral prompts. Our proposed auditing framework offers a practical approach for testing, revealing how stereotypes still remain embedded in today’s generative models. We see this as an essential step toward more systematic and responsible AI development.
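
At its core, the audit compares attribute distributions between cued and neutral generations; a minimal sketch of such a frequency-gap computation is shown below (the attribute extraction itself, e.g. dominant car color, is assumed to happen upstream).

```python
from collections import Counter

def attribute_shift(cued, neutral):
    """Per-attribute frequency gap between cued and neutral generations.

    cued / neutral: lists of attribute labels (e.g., dominant car color)
    extracted from the two image sets. Large positive gaps flag
    attributes over-represented under the demographic cue.
    """
    pc, pn = Counter(cued), Counter(neutral)
    attrs = set(pc) | set(pn)
    return {a: pc[a] / len(cued) - pn[a] / len(neutral) for a in attrs}

cued = ["pink", "pink", "red", "white"]
neutral = ["red", "white", "black", "silver"]
print(attribute_shift(cued, neutral))  # pink: +0.5 under the cue
```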

[443] Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Linfeng Zhang, Siteng Huang, Honggang Chen

Main category: cs.CV

TL;DR: GlobalCom$^2$ is a token compression framework for high-resolution LVLMs, using thumbnails to guide compression of local crops, achieving 90% performance retention with 90% token reduction.

DetailsMotivation: Address efficiency challenges in LVLMs due to quadratic complexity in multi-modal contexts, leveraging multi-view characteristics for better compression.

Method: Proposes GlobalCom$^2$, a plug-and-play framework where thumbnails guide adaptive compression of local crops, preserving informative details.

Result: Achieves 90% performance retention, reduces FLOPs to 9.1%, and peak memory to 60% while compressing 90% visual tokens.

Conclusion: GlobalCom$^2$ effectively balances efficiency and performance in HR-LVLMs by leveraging thumbnail-guided compression.

Abstract: Large vision-language models (LVLMs) excel at visual understanding, but face efficiency challenges due to quadratic complexity in processing long multi-modal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of high-resolution LVLMs with dynamic cropping. Existing methods treat all tokens uniformly, but our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation. In this paper, we first analyze dynamic cropping strategy, revealing both the complementary nature between thumbnails and crops, and the distinctive characteristics across different crops. Based on our observations, we propose “Global Compression Commander” (GlobalCom$^2$), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom$^2$ leverages thumbnail as the “commander” to guide the compression of local crops, adaptively preserving informative details while eliminating redundancy. Extensive experiments show that GlobalCom$^2$ maintains over 90% performance while compressing 90% visual tokens, reducing FLOPs and peak memory to 9.1% and 60%. Our code is available at https://github.com/xuyang-liu16/GlobalCom2.
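
A minimal sketch of thumbnail-guided compression: score each local-crop token against a pooled thumbnail context and keep the top fraction. GlobalCom$^2$'s actual informativeness scoring is more elaborate; the mean pooling and keep ratio here are assumptions.

```python
import torch

def compress_crop_tokens(thumb_tokens, crop_tokens, keep_ratio=0.1):
    """Keep the crop tokens most aligned with the global thumbnail context.

    thumb_tokens: (Nt, D); crop_tokens: (Nc, D). Each local token is
    scored against the pooled thumbnail feature, and only the top
    keep_ratio fraction survives.
    """
    global_ctx = thumb_tokens.mean(dim=0, keepdim=True)          # (1, D)
    scores = torch.nn.functional.cosine_similarity(
        crop_tokens, global_ctx, dim=-1)                          # (Nc,)
    k = max(1, int(keep_ratio * crop_tokens.shape[0]))
    keep = scores.topk(k).indices.sort().values  # preserve token order
    return crop_tokens[keep]

thumb, crop = torch.randn(196, 1024), torch.randn(4 * 196, 1024)
print(compress_crop_tokens(thumb, crop).shape)  # ~10% of crop tokens kept
```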

[444] DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models

Hyeonwoo Kim, Sangwon Baik, Hanbyul Joo

Main category: cs.CV

TL;DR: The paper introduces DAViD, a framework for learning dynamic human-object interaction (HOI) patterns using synthetic 4D HOI samples, outperforming baselines in motion synthesis.

DetailsMotivation: Existing studies focus on static HOI patterns, leaving dynamic HOI underexplored. The paper aims to model dynamic affordance to better assist or mimic human behaviors.

Method: A pipeline generates 2D HOI videos from 3D objects using a video diffusion model, lifts them to 3D for 4D samples, and trains DAViD—a generative 4D HOI model with human and object motion diffusion components.

Result: DAViD synthesizes novel HOI motions by integrating learned concepts with pre-trained motions, outperforming baselines in HOI motion synthesis.

Conclusion: The framework effectively addresses the scarcity of 4D HOI data and advances dynamic HOI modeling, demonstrating superior performance in motion synthesis.

Abstract: Modeling how humans interact with objects is crucial for AI to effectively assist or mimic human behaviors. Existing studies for learning such ability primarily focus on static human-object interaction (HOI) patterns, such as contact and spatial relationships, while dynamic HOI patterns, capturing the movement of humans and objects over time, remain relatively underexplored. In this paper, we present a novel framework for learning Dynamic Affordance across various target object categories. To address the scarcity of 4D HOI datasets, our method learns the 3D dynamic affordance from synthetically generated 4D HOI samples. Specifically, we propose a pipeline that first generates 2D HOI videos from a given 3D target object using a pre-trained video diffusion model, then lifts them into 3D to generate 4D HOI samples. Leveraging these synthesized 4D HOI samples, we train DAViD, our generative 4D human-object interaction model, which is composed of two key components: (1) a human motion diffusion model (MDM) with Low-Rank Adaptation (LoRA) module to fine-tune a pre-trained MDM to learn the HOI motion concepts from limited HOI motion samples, (2) a motion diffusion model for 4D object poses conditioned by produced human interaction motions. Interestingly, DAViD can integrate newly learned HOI motion concepts with pre-trained human motions to create novel HOI motions, even for multiple HOI motion concepts, demonstrating the advantage of our pipeline with LoRA in integrating dynamic HOI concepts. Through extensive experiments, we demonstrate that DAViD outperforms baselines in synthesizing HOI motion.

[445] DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation

Zhaohong Huang, Yuxin Zhang, Taojian Zhou, Guorong Cai, Rongrong Ji

Main category: cs.CV

TL;DR: DS²Net introduces multi-view deep supervision for medical image segmentation, combining detail and semantic feature supervision with adaptive uncertainty-based loss, outperforming state-of-the-art methods.

DetailsMotivation: Existing methods supervise coarse or fine features in isolation, ignoring their vital relationships in medical image analysis.

Method: Proposes DS²Net with Detail Enhance Module (DEM) and Semantic Enhance Module (SEM) for multi-view supervision, plus an adaptive uncertainty-based loss.

Result: Outperforms state-of-the-art methods on six benchmarks across colonoscopy, ultrasound, and microscope imaging.

Conclusion: DS²Net effectively integrates detail and semantic supervision, offering superior performance in medical image segmentation.

Abstract: Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or fine-grained detailed features in isolation, which overlooks the fact that these two types of features hold vital relationships in medical image analysis. We advocate the powers of complementary feature supervision for medical image segmentation, by proposing a Detail-Semantic Deep Supervision Network (DS$^2$Net). DS$^2$Net navigates both low-level detailed and high-level semantic feature supervision through a Detail Enhance Module (DEM) and a Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS$^2$Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured under colonoscopy, ultrasound, or microscopy, we demonstrate that DS$^2$Net consistently outperforms state-of-the-art methods for medical image analysis.

[446] DWTNeRF: Boosting Few-shot Neural Radiance Fields via Discrete Wavelet Transform

Hung Nguyen, Blark Runfa Li, Truong Nguyen

Main category: cs.CV

TL;DR: DWTNeRF improves few-shot NeRF performance by introducing a Discrete Wavelet loss and a model-based approach, achieving significant gains over Vanilla INGP.

DetailsMotivation: NeRF's slow convergence and reliance on dense training views limit its practicality. DWTNeRF addresses these issues for few-shot scenarios.

Method: DWTNeRF uses Instant-NGP’s hash encoding, a Discrete Wavelet loss for low-frequency prioritization, and a multi-head attention model-based approach.

Result: DWTNeRF outperforms Vanilla INGP by 15.07% in PSNR, 24.45% in SSIM, and 36.30% in LPIPS on the 3-shot LLFF benchmark.

Conclusion: DWTNeRF rethinks few-shot approaches for fast-converging implicit representations like INGP or 3DGS.

Abstract: Neural Radiance Fields (NeRF) has achieved superior performance in novel view synthesis and 3D scene representation, but its practical applications are hindered by slow convergence and reliance on dense training views. To this end, we present DWTNeRF, a unified framework based on Instant-NGP’s fast-training hash encoding. It is coupled with regularization terms designed for few-shot NeRF, which operates on sparse training views. Our DWTNeRF additionally includes a novel Discrete Wavelet loss that allows explicit prioritization of low frequencies directly in the training objective, reducing few-shot NeRF’s overfitting on high frequencies in earlier training stages. We also introduce a model-based approach, based on multi-head attention, that is compatible with INGP, which are sensitive to architectural changes. On the 3-shot LLFF benchmark, DWTNeRF outperforms Vanilla INGP by 15.07% in PSNR, 24.45% in SSIM and 36.30% in LPIPS. Our approach encourages a re-thinking of current few-shot approaches for fast-converging implicit representations like INGP or 3DGS.
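
A wavelet-domain loss that up-weights low frequencies can be sketched with PyWavelets as below; the subband weighting and wavelet choice are illustrative, not the paper's settings.

```python
import numpy as np
import pywt

def dwt_loss(pred, gt, low_weight=4.0, wavelet="haar"):
    """Wavelet-domain image loss that prioritizes low frequencies.

    pred, gt: (H, W) images. The approximation (LL) subband gets a
    larger weight than the detail subbands, mirroring the idea of
    suppressing high-frequency overfitting early in training.
    """
    (pa, pd), (ga, gd) = pywt.dwt2(pred, wavelet), pywt.dwt2(gt, wavelet)
    loss = low_weight * np.mean((pa - ga) ** 2)
    for p_sub, g_sub in zip(pd, gd):  # (LH, HL, HH) detail subbands
        loss += np.mean((p_sub - g_sub) ** 2)
    return loss

pred, gt = np.random.rand(64, 64), np.random.rand(64, 64)
print(dwt_loss(pred, gt))
```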

[447] MatCLIP: Light- and Shape-Insensitive Assignment of PBR Material Models

Michael Birsak, John Femiani, Biao Zhang, Peter Wonka

Main category: cs.CV

TL;DR: MatCLIP proposes a method to assign PBR materials to 3D models using shape- and lighting-insensitive descriptors, outperforming existing methods by 15% in accuracy.

DetailsMotivation: Assigning realistic materials to 3D models is challenging due to the dynamic nature of PBR representations under varying conditions.

Method: Extends Alpha-CLIP to generate descriptors from material renderings across diverse shapes and lighting, bridging PBR representations with images.

Result: Achieves 76.6% top-1 classification accuracy, surpassing PhotoShape and MatAtlas by over 15%.

Conclusion: MatCLIP enables consistent material assignments without explicit knowledge of relationships, applicable to datasets like ShapeNet and Objaverse.

Abstract: Assigning realistic materials to 3D models remains a significant challenge in computer graphics. We propose MatCLIP, a novel method that extracts shape- and lighting-insensitive descriptors of Physically Based Rendering (PBR) materials to assign plausible textures to 3D objects based on images, such as the output of Latent Diffusion Models (LDMs) or photographs. Matching PBR materials to static images is challenging because the PBR representation captures the dynamic appearance of materials under varying viewing angles, shapes, and lighting conditions. By extending an Alpha-CLIP-based model on material renderings across diverse shapes and lighting, and encoding multiple viewing conditions for PBR materials, our approach generates descriptors that bridge the domains of PBR representations with photographs or renderings, including LDM outputs. This enables consistent material assignments without requiring explicit knowledge of material relationships between different parts of an object. MatCLIP achieves a top-1 classification accuracy of 76.6%, outperforming state-of-the-art methods such as PhotoShape and MatAtlas by over 15 percentage points on publicly available datasets. Our method can be used to construct material assignments for 3D shape datasets such as ShapeNet, 3DCoMPaT++, and Objaverse. All code and data will be released.

[448] Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

Main category: cs.CV

TL;DR: Fourier-VLM compresses visual representations in the frequency domain to reduce computational overhead in Vision-Language Models (VLMs) without compromising performance.

DetailsMotivation: The large number of vision tokens in VLMs increases context length, leading to high computational overhead and latency. Existing methods either compromise performance or add extra costs.

Method: Proposes Fourier-VLM, which compresses visual features using a low-pass filter in the frequency domain via Discrete Cosine Transform (DCT), computed efficiently with Fast Fourier Transform (FFT).

Result: Achieves competitive performance across benchmarks, reduces inference FLOPs by 83.8%, and boosts generation speed by 31.2% compared to LLaVA-v1.5.

Conclusion: Fourier-VLM offers superior efficiency and practicality for VLMs while maintaining performance.

Abstract: Vision-Language Models (VLMs) typically replace the predefined image placeholder token in textual instructions with visual features from an image encoder, forming the input to a backbone Large Language Model (LLM). However, the large number of vision tokens significantly increases the context length, leading to high computational overhead and inference latency. While previous efforts mitigate this by selecting only important visual features or leveraging learnable queries to reduce token count, they often compromise performance or introduce substantial extra costs. In response, we propose Fourier-VLM, a simple yet efficient method that compresses visual representations in the frequency domain. Our approach is motivated by the observation that vision features output from the vision encoder exhibit concentrated energy in low-frequency components. Leveraging this, we apply a low-pass filter to the vision features using a two-dimensional Discrete Cosine Transform (DCT). Notably, the DCT is efficiently computed via the Fast Fourier Transform (FFT) operator with a time complexity of $\mathcal{O}(n\log n)$, minimizing the extra computational cost while introducing no additional parameters. Extensive experiments across various image-based benchmarks demonstrate that Fourier-VLM achieves competitive performance with strong generalizability across both LLaVA and Qwen-VL architectures. Crucially, it reduces inference FLOPs by up to 83.8% and boosts generation speed by 31.2% compared to LLaVA-v1.5, highlighting its superior efficiency and practicality.
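
The compression step can be reproduced in a few lines with SciPy's DCT: take the 2D DCT of the token grid per channel, keep the low-frequency block, and invert. The grid size and kept block below are illustrative, not the paper's configuration.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_compress(features, keep=8):
    """Low-pass a (H, W, D) grid of vision features in the DCT domain.

    Keeps only the top-left keep x keep block of DCT coefficients per
    channel, so the token grid shrinks from H*W to keep*keep vectors.
    """
    coeffs = dctn(features, axes=(0, 1), norm="ortho")
    low = coeffs[:keep, :keep]                       # low-frequency block
    return idctn(low, axes=(0, 1), norm="ortho")     # (keep, keep, D)

features = np.random.randn(24, 24, 1024)   # e.g. 576 ViT patch features
compressed = dct_compress(features, keep=8)
print(compressed.shape)  # (8, 8, 1024): 576 -> 64 tokens
```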

[449] LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

Yiren Song, Danze Chen, Mike Zheng Shou

Main category: cs.CV

TL;DR: LayerTracer, a diffusion transformer framework, generates layered SVGs by simulating human design workflows, outperforming existing methods in quality and editability.

DetailsMotivation: Existing methods for generating layered SVGs either oversimplify or introduce redundancies, failing to align with professional design cognition.

Method: LayerTracer uses a two-phase approach: a text-conditioned DiT generates rasterized blueprints, followed by layer-wise vectorization with path deduplication. A conditional diffusion mechanism guides hierarchical reconstruction.

Result: LayerTracer outperforms optimization-based and neural baselines in generation quality and editability.

Conclusion: LayerTracer effectively bridges the gap in cognitive-aligned layered SVG generation, aligning AI outputs with professional design workflows.

Abstract: Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer’s superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.

[450] Confidence-Based Annotation Of Brain Tumours In Ultrasound

Alistair Weld, Luke Dixon, Alfie Roddan, Giulio Anichini, Sophie Camp, Stamatia Giannarou

Main category: cs.CV

TL;DR: The paper proposes a sparse confidence-based annotation method for brain tumour segmentation in ultrasound, addressing aleatoric uncertainty in tumour margins and reducing interobserver variance.

DetailsMotivation: The challenge of annotating brain tumours in ultrasound, especially diffuse tumours, involves aleatoric uncertainty along tumour margins and high interobserver variance due to subjectivity.

Method: A sparse confidence annotation protocol, combining computer vision and radiology theory, is introduced to incorporate margin-related uncertainty and minimize subjectivity.

Result: The method showed a Pearson correlation of 0.8 for interobserver variance and outperformed discrete annotations in downstream applications, with superior Brier scores for soft-label training.

Conclusion: Discrete annotation is deemed infeasible for brain tumours in ultrasound, and the proposed confidence-based method is validated as a superior alternative.

Abstract: Purpose: An investigation of the challenge of annotating discrete segmentations of brain tumours in ultrasound, with a focus on the issue of aleatoric uncertainty along the tumour margin, particularly for diffuse tumours. A segmentation protocol and method are proposed that incorporate this margin-related uncertainty while minimising the interobserver variance through reduced subjectivity, thereby diminishing annotator epistemic uncertainty. Approach: A sparse confidence method for annotation is proposed, based on a protocol designed using computer vision and radiology theory. Results: Output annotations using the proposed method are compared with the interobserver variance of the corresponding professional discrete annotations. A linear relationship was measured within the tumour margin region, with a Pearson correlation of 0.8. The downstream application was explored, comparing training using confidence annotations as soft labels with using the best discrete annotations as hard labels. In all evaluation folds, the Brier score was superior for the soft-label trained network. Conclusion: A formal framework was constructed to demonstrate the infeasibility of discrete annotation of brain tumours in B-mode ultrasound. Subsequently, a method for sparse confidence-based annotation is proposed and evaluated. Keywords: Brain tumours, ultrasound, confidence, annotation.
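
For reference, the Brier score used to compare soft- and hard-label training is simply the mean squared error between predicted probabilities and outcomes:

```python
import numpy as np

def brier_score(probs, targets):
    """Mean squared error between predicted probabilities and outcomes.

    probs: (N,) predicted foreground probabilities; targets: (N,) in
    {0, 1} (or soft confidences in [0, 1]). Lower is better.
    """
    return float(np.mean((np.asarray(probs) - np.asarray(targets)) ** 2))

print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))  # 0.07
```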

[451] GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow

Simon Boeder, Fabian Gigengack, Benjamin Risse

Main category: cs.CV

TL;DR: GaussianFlowOcc introduces a sparse 3D Gaussian representation for occupancy estimation, reducing computational costs and outperforming existing methods in speed and performance.

DetailsMotivation: Traditional voxel-based methods for occupancy estimation are computationally expensive and inefficient, especially for representing empty 3D spaces. GaussianFlowOcc addresses this by leveraging sparse Gaussian representations.

Method: The approach replaces dense voxel grids with sparse 3D Gaussians, uses a Gaussian Transformer for efficiency, and estimates temporal flow for each Gaussian during training. It employs weak supervision, avoiding costly dense annotations.

Result: GaussianFlowOcc outperforms previous methods on the nuScenes dataset and achieves 50x faster inference speed than state-of-the-art approaches.

Conclusion: GaussianFlowOcc offers a scalable, efficient, and high-performance solution for occupancy estimation, particularly in autonomous driving scenarios.

Abstract: Occupancy estimation has become a prominent task in 3D computer vision, particularly within the autonomous driving community. In this paper, we present a novel approach to occupancy estimation, termed GaussianFlowOcc, which is inspired by Gaussian Splatting and replaces traditional dense voxel grids with a sparse 3D Gaussian representation. Our efficient model architecture based on a Gaussian Transformer significantly reduces computational and memory requirements by eliminating the need for expensive 3D convolutions used with inefficient voxel-based representations that predominantly represent empty 3D spaces. GaussianFlowOcc effectively captures scene dynamics by estimating temporal flow for each Gaussian during the overall network training process, offering a straightforward solution to a complex problem that is often neglected by existing methods. Moreover, GaussianFlowOcc is designed for scalability, as it employs weak supervision and does not require costly dense 3D voxel annotations based on additional data (e.g., LiDAR). Through extensive experimentation, we demonstrate that GaussianFlowOcc significantly outperforms all previous methods for weakly supervised occupancy estimation on the nuScenes dataset while featuring an inference speed that is 50 times faster than current SOTA.

[452] X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang

Main category: cs.CV

TL;DR: The paper introduces X2I, a framework enabling Diffusion Transformer (DiT) models to understand multimodal inputs by leveraging MLLMs’ comprehension abilities. It achieves this with minimal training (100K English corpus, 160 GPU hours) and maintains performance while adding diverse multimodal capabilities.

DetailsMotivation: To bridge the gap between MLLMs' multimodal understanding and T2I models' image generation, enabling T2I models to process inputs like multilingual text, images, videos, and audio.

Method: Proposes X2I, using a distillation method to transfer MLLMs’ inference capabilities to DiT models, aided by a lightweight AlignNet. Includes LightControl for image editing fidelity.

Result: X2I shows <1% performance degradation while gaining multimodal abilities (e.g., multilingual-to-image, video-to-image). It supports LoRA training and enhances image editing.

Conclusion: X2I is effective, efficient, and versatile, filling an industry gap. Open-source code and checkpoints are provided for broader adoption.

Abstract: Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, currently there is no straightforward and efficient framework to transfer the multimodal comprehension abilities of MLLMs to T2I models to enable them to understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained using merely a 100K English corpus with 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows a performance degradation of less than 1% while gaining various multimodal understanding abilities, including multilingual to image, image to image, image-text to image, video to image, audio to image, and utilizing creative fusion to enhance imagery. Furthermore, it is applicable for LoRA training in the context of image-text to image generation, filling an industry gap in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, multifunctionality, and transferability of our X2I. The open-source code and checkpoints for X2I can be found at the following link: https://github.com/OPPO-Mente-Lab/X2I.

[453] TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification

Huaqi Tao, Bingxi Liu, Calvin Chen, Tingjun Huang, He Li, Jinqiang Cui, Hong Zhang

Main category: cs.CV

TL;DR: TextInPlace integrates Scene Text Spotting (STS) with Visual Place Recognition (VPR) to improve accuracy in repetitive indoor environments by leveraging scene texts.

DetailsMotivation: Existing VPR methods struggle in indoor settings due to repetitive structures. Scene texts can help distinguish similar places.

Method: Uses a dual-branch architecture: VPR branch for global descriptors and STS branch for text detection/recognition. Text similarity re-ranks top-K images.

Result: Outperforms appearance-only methods on custom (Maze-with-Text) and public datasets.

Conclusion: TextInPlace effectively mitigates ambiguity in repetitive indoor environments by combining visual and text cues.

Abstract: Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene texts frequently appear in indoor spaces and can help distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene texts. Finally, the discriminative texts are filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
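
The re-ranking idea can be sketched as follows, using a normalized edit-distance ratio as the text similarity and a simple additive fusion with the VPR score; the paper's actual text filtering and fusion rule may differ.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Similarity in [0, 1] between two spotted text strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rerank(query_texts, candidates):
    """Re-rank top-K retrieved images by best query/candidate text match.

    candidates: list of (image_id, vpr_score, texts). The fusion of the
    VPR score and the text score here is a plain sum, chosen only for
    illustration.
    """
    def score(c):
        image_id, vpr_score, texts = c
        best = max((text_similarity(q, t)
                    for q in query_texts for t in texts), default=0.0)
        return vpr_score + best
    return sorted(candidates, key=score, reverse=True)

cands = [("imgA", 0.71, ["Room 301"]), ("imgB", 0.74, ["Room 317"])]
print(rerank(["ROOM 301"], cands))  # imgA wins despite lower VPR score
```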

[454] MambaFlow: A Mamba-Centric Architecture for End-to-End Optical Flow Estimation

Juntian Du, Yuan Sun, Zhihu Zhou, Pinyi Chen, Runzhe Zhang, Keji Mao

Main category: cs.CV

TL;DR: MambaFlow is a novel framework using the Mamba architecture for optical flow estimation, featuring PolyMamba and PulseMamba for enhanced feature representation and efficient flow decoding, achieving competitive results.

DetailsMotivation: The Mamba architecture has shown success in computer vision tasks, but its application to optical flow estimation is unexplored. MambaFlow aims to fill this gap by leveraging Mamba's efficiency and accuracy.

Method: MambaFlow includes PolyMamba (dual-Mamba for intra-token and inter-modality modeling) and PulseMamba (AGA for adaptive feature integration and autoregressive flow decoding).

Result: MambaFlow achieves competitive accuracy on benchmark datasets, outperforming SEA-RAFT on Sintel, and shows potential for real-world deployment on resource-constrained devices.

Conclusion: MambaFlow is the first Mamba-based architecture for optical flow estimation, demonstrating high accuracy and efficiency, with plans to release the source code.

Abstract: Recently, the Mamba architecture has demonstrated significant successes in various computer vision tasks, such as classification and segmentation. However, its application to optical flow estimation remains unexplored. In this paper, we introduce MambaFlow, a novel framework designed to leverage the high accuracy and efficiency of the Mamba architecture for capturing locally correlated features while preserving global information in end-to-end optical flow estimation. To our knowledge, MambaFlow is the first architecture centered around the Mamba design tailored specifically for optical flow estimation. It comprises two key components: (1) PolyMamba, which enhances feature representation through a dual-Mamba architecture, incorporating a Self-Mamba module for intra-token modeling and a Cross-Mamba module for inter-modality interaction, enabling both deep contextualization and effective feature fusion; and (2) PulseMamba, which leverages an Attention Guidance Aggregator (AGA) to adaptively integrate features with dynamically learned weights in contrast to naive concatenation, and then employs the intrinsic recurrent mechanism of Mamba to perform autoregressive flow decoding, facilitating efficient flow information dissemination. Extensive experiments demonstrate that MambaFlow achieves remarkable results comparable to mainstream methods on benchmark datasets. Compared to SEA-RAFT, MambaFlow attains higher accuracy on the Sintel benchmark, demonstrating stronger potential for real-world deployment on resource-constrained devices. The source code will be made publicly available upon acceptance of the paper.

[455] Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment

Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu

Main category: cs.CV

TL;DR: ARRA is a training framework for autoregressive LLMs that aligns hidden states with visual representations, improving global-coherent text-to-image generation without architectural changes.

DetailsMotivation: To enable autoregressive LLMs to generate globally coherent images without complex architectural redesigns.

Method: ARRA uses a global visual alignment loss and a hybrid token to align LLM hidden states with visual representations, enforcing local and global constraints.

Result: ARRA reduces FID by up to 25.5% across datasets and LLMs, demonstrating improved coherence and performance.

Conclusion: ARRA shows that training objective redesign can resolve cross-modal coherence challenges, offering a plug-and-play solution for autoregressive models.

Abstract: We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks global-coherent text-to-image generation in autoregressive LLMs without architectural modifications. Different from prior works that require complex architectural redesigns, ARRA aligns LLM’s hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA’s plug-and-play versatility. When training T2I LLMs from scratch, ARRA reduces FID by 16.6% (ImageNet), 12.0% (LAION-COCO) for autoregressive LLMs like LlamaGen, without modifying original architecture and inference mechanism. For training from text-generation-only LLMs, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet) for advanced LLMs like Chameleon. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). These results demonstrate that training objective redesign, rather than architectural modifications, can resolve cross-modal global coherence challenges. ARRA offers a complementary paradigm for advancing autoregressive models. The code is available at https://github.com/xiexing0916/ARRA.
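
A minimal sketch of the training objective as described: next-token cross-entropy plus a global alignment term between an LLM hidden state (at the hybrid token) and a frozen external visual embedding. The pooling, loss weight, and dimensions are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def arra_objective(logits, labels, hidden_state, visual_feat, lam=0.5):
    """Next-token CE plus a global visual alignment term.

    logits: (B, T, V); labels: (B, T); hidden_state: (B, D) LLM state at
    the hybrid token; visual_feat: (B, D) frozen embedding from an
    external visual foundation model. lam is an assumed hyperparameter.
    """
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    align = 1 - F.cosine_similarity(hidden_state, visual_feat, dim=-1).mean()
    return ce + lam * align

logits = torch.randn(2, 16, 32000)
labels = torch.randint(0, 32000, (2, 16))
h, v = torch.randn(2, 768), torch.randn(2, 768)
print(arra_objective(logits, labels, h, v))
```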

[456] From Limited Labels to Open Domains:An Efficient Learning Method for Drone-view Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Jiawei Lang, Guoqi Li

Main category: cs.CV

TL;DR: The paper introduces CDIKTNet, a novel network for drone-view geo-localization, addressing limitations in supervised and unsupervised methods by combining cross-domain invariance learning and knowledge transfer with limited supervision.

DetailsMotivation: Traditional DVGL methods struggle with cross-view correlations and require retraining for new domains. Unsupervised methods face feature confusion due to geographical similarities. CDIKTNet aims to overcome these challenges.

Method: CDIKTNet consists of two sub-networks: CDIS for learning cross-view invariance from limited paired data and CDTS for dual-path contrastive learning to optimize subspaces while maintaining feature consistency.

Result: CDIKTNet outperforms supervised methods under full supervision and surpasses unsupervised methods in few-shot and cross-domain scenarios.

Conclusion: CDIKTNet effectively addresses feature confusion and domain adaptation issues in DVGL, achieving state-of-the-art performance with limited supervision.

Abstract: Traditional supervised drone-view geo-localization (DVGL) methods heavily depend on paired training data and encounter difficulties in learning cross-view correlations from unpaired data. Moreover, when deployed in a new domain, these methods require obtaining the new paired data and subsequent retraining for model adaptation, which significantly increases computational overhead. Existing unsupervised methods can generate pseudo-labels based on cross-view similarity to infer pairing relationships. However, geographical similarity and spatial continuity often cause visually analogous features at different geographical locations. This feature confusion compromises the reliability of pseudo-label generation, where incorrect pseudo-labels drive negative optimization. Given these challenges inherent in both supervised and unsupervised DVGL methods, we propose a novel cross-domain invariant knowledge transfer network (CDIKTNet) with limited supervision, whose architecture consists of a cross-domain invariance sub-network (CDIS) and a cross-domain transfer sub-network (CDTS). This architecture facilitates a closed-loop framework for invariance feature learning and knowledge transfer. The CDIS is designed to learn cross-view structural and spatial invariance from a small amount of paired data that serves as prior knowledge. It endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion. Based on this, the CDTS employs dual-path contrastive learning to further optimize each subspace while preserving consistency in a shared feature space. Extensive experiments demonstrate that CDIKTNet achieves state-of-the-art performance under full supervision compared with those supervised methods, and further surpasses existing unsupervised methods in both few-shot and cross-domain initialization.

[457] ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Yeonjin Chang, Erqun Dong, Seunghyeon Seo, Nojun Kwak, Kwang Moo Yi

Main category: cs.CV

TL;DR: A novel method for extracting objects from 3D Gaussian Splatting scenes by pruning irrelevant primitives and using generative inpainting for occlusions, outperforming state-of-the-art techniques.

DetailsMotivation: The challenge of isolating specific objects and handling occlusions in 3D Gaussian Splatting scenes remains unsolved, necessitating a robust extraction method.

Method: The approach involves object-centric reconstruction by pruning irrelevant Gaussians using Wasserstein distances and employing diffusion-based inpainting for occlusions.

Result: The method outperforms state-of-the-art techniques, validated on real-world and synthetic datasets.

Conclusion: The synergy of pruning and inpainting significantly enhances object extraction performance in complex scenes.

Abstract: While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remains far from being solved. We propose a novel object extraction method based on two key principles: (1) object-centric reconstruction through removal of irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we propose to remove irrelevant Gaussians by measuring how close each one is to its K-nearest neighbors and removing those that are statistical outliers. Importantly, these distances must take into account the actual spatial extent the Gaussians cover; we thus propose to use Wasserstein distances. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between proper pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.
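
The pruning criterion can be sketched for the simplified case of diagonal covariances, where the squared 2-Wasserstein distance between two Gaussians has an especially simple closed form; the K and outlier threshold below are illustrative.

```python
import numpy as np

def w2_diag(mu1, sig1, mu2, sig2):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    mu*: (3,) means; sig*: (3,) per-axis standard deviations. The
    diagonal case is a simplification of the general closed form.
    """
    return np.sum((mu1 - mu2) ** 2) + np.sum((sig1 - sig2) ** 2)

def prune_outliers(mus, sigs, k=8, n_std=2.0):
    """Drop Gaussians whose mean distance to their K nearest neighbors
    is a statistical outlier (brute-force O(N^2) for clarity)."""
    n = len(mus)
    knn_mean = np.empty(n)
    for i in range(n):
        d = np.array([w2_diag(mus[i], sigs[i], mus[j], sigs[j])
                      for j in range(n) if j != i])
        knn_mean[i] = np.sort(d)[:k].mean()
    keep = knn_mean < knn_mean.mean() + n_std * knn_mean.std()
    return keep  # boolean mask over Gaussians

mus, sigs = np.random.randn(200, 3), np.abs(np.random.randn(200, 3))
print(prune_outliers(mus, sigs).sum(), "of 200 kept")
```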

[458] AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Shaohui Jiao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng

Main category: cs.CV

TL;DR: AR-1-to-3 improves novel view synthesis by prioritizing closer views and using diffusion models for progressive synthesis, enhancing consistency and fidelity.

DetailsMotivation: Existing NVS methods struggle with consistency when camera poses differ significantly, leading to poor 3D quality.

Method: Proposes AR-1-to-3, a diffusion model-based approach, with Stacked-LE and LSTM-GE for feature encoding.

Result: Significantly improves view consistency and produces high-fidelity 3D assets.

Conclusion: AR-1-to-3 effectively addresses consistency issues in NVS, outperforming prior methods.

Abstract: Novel view synthesis (NVS) is a cornerstone for image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority, according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.
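
The "closer views first" schedule can be made concrete by ordering target cameras by angular distance from the input view, as in this sketch (assuming object-centric cameras; the paper's conditioning machinery is not reproduced).

```python
import numpy as np

def order_views(input_pos, target_positions):
    """Sort target camera positions by angular distance to the input view,
    assuming object-centric cameras placed on a sphere around the origin."""
    def angle(p, q):
        cos = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
        return np.arccos(np.clip(cos, -1.0, 1.0))
    order = np.argsort([angle(input_pos, t) for t in target_positions])
    return order  # generate views in this order, nearest first

input_cam = np.array([0.0, 0.0, 2.0])
targets = np.array([[2.0, 0.0, 0.0], [0.5, 0.0, 1.9], [0.0, 2.0, 0.0]])
print(order_views(input_cam, targets))  # [1 0 2]: nearest target first
```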

[459] EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

Main category: cs.CV

TL;DR: EDiT and MM-EDiT improve efficiency in diffusion transformers for text-to-image synthesis by introducing linear compressed attention and hybrid attention, achieving faster generation without quality loss.

DetailsMotivation: Quadratic scaling in DiTs limits high-resolution image generation and resource efficiency. EDiT addresses this bottleneck.

Method: Proposes linear compressed attention for local modulation and hybrid attention for multimodal inputs, combining linear and standard attention.

Result: Achieves up to 2.2x speedup in PixArt-Sigma and Stable Diffusion 3.5-Medium with comparable image quality.

Conclusion: EDiT and MM-EDiT offer efficient, scalable alternatives to conventional DiTs, enhancing performance in text-to-image synthesis.

Abstract: Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling of attention in DiTs hinders high-resolution image generation and deployment on devices with limited resources. This work introduces an efficient diffusion transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs). First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are aggregated spatially. Second, we formulate a hybrid attention scheme for multimodal inputs that combines linear attention for image-to-image interactions and standard scaled dot-product attention for interactions involving prompts. Merging these two approaches leads to an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT). We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-Sigma (conventional DiT) and Stable Diffusion 3.5-Medium (MM-DiT), achieving up to 2.2x speedup with comparable image quality after distillation.
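
The speedup comes from linear attention's reordering of the computation: with a positive feature map phi, the O(N^2) softmax score matrix is replaced by phi(Q)(phi(K)^T V), normalized by phi(Q)(phi(K)^T 1), which is linear in sequence length. Below is a minimal sketch of that standard trick with phi = elu + 1; EDiT's convolutional query modulation and spatial key/value aggregation are omitted.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """O(N d^2) attention via the kernel trick, instead of O(N^2 d).
    q, k, v: (batch, heads, seq, dim). phi = elu(x) + 1 keeps features positive."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)      # (d, d) summary of keys/values
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(2))  # normalizer phi(Q) phi(K)^T 1
    out = torch.einsum("bhnd,bhde->bhne", q, kv)
    return out / z.unsqueeze(-1).clamp(min=1e-6)

# Sanity check on random tensors: output shape matches the values.
q = torch.randn(2, 4, 1024, 32)
k = torch.randn(2, 4, 1024, 32)
v = torch.randn(2, 4, 1024, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 4, 1024, 32])
```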

[460] Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

Bhishma Dedhia, David Bourgin, Krishna Kumar Singh, Yuheng Li, Yan Kang, Zhan Xu, Niraj K. Jha, Yuchen Liu

Main category: cs.CV

TL;DR: VINs enhance DiTs for parallel video chunk generation, improving efficiency and consistency in long videos.

DetailsMotivation: Overcome computational challenges of full-attention DiTs for long videos by enabling parallel inference.

Method: Augment DiTs with an abstraction module (VINs) for parallel denoising of chunks using global semantics.

Result: VINs outperform chunk-based methods in consistency and motion smoothness, with reduced FLOPs.

Conclusion: VINs offer a scalable, efficient solution for high-quality long video generation.

Abstract: Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.
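
The fixed-size encoding-token design is what lets VIN scale: a constant number of latent tokens summarizes an arbitrarily long noisy input via a single cross-attention step. The sketch below shows that latent-bottleneck pattern in isolation, under a Perceiver-style reading; the class name, token count, and dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FixedSizeEncoder(nn.Module):
    """Compress a variable-length token sequence into a fixed set of latent
    tokens with one cross-attention step (latents attend to the input)."""
    def __init__(self, dim=256, n_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        q = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)             # queries = latents, keys/values = input
        return out                              # (batch, n_latents, dim)

enc = FixedSizeEncoder()
for seq_len in (128, 4096):                     # summary size stays fixed
    tokens = torch.randn(2, seq_len, 256)
    print(seq_len, "->", tuple(enc(tokens).shape))
```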

[461] Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Sangwon Baik, Hyeonwoo Kim, Hanbyul Joo

Main category: cs.CV

TL;DR: A method for learning 3D spatial relationships (OOR) between objects using synthetic 3D samples from 2D diffusion models, extended to multi-object scenarios and validated in 3D scene arrangement and human motion synthesis.

DetailsMotivation: To efficiently learn realistic object-object spatial relationships (OOR) for unbounded object categories by leveraging synthetic data from 2D diffusion models.

Method: Synthesize diverse 2D images with plausible OOR cues, uplift them into 3D samples, and train a score-based OOR diffusion model. Extend to multi-object OOR with consistency constraints.

Result: Demonstrates robustness across various OOR scenarios and applicability in 3D scene arrangement and human motion synthesis.

Conclusion: The method effectively learns OOR from synthetic data, enabling scalable and diverse applications in 3D tasks.

Abstract: We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture realistic OOR cues, enabling efficient collection of a 3D dataset to learn OOR for various unbounded object categories. Our approach synthesizes diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to 3D scene arrangement tasks and human motion synthesis using our OOR diffusion model.

[462] GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model – Bringing Motion Generation to the Clinical Domain

Vida Adeli, Soroush Mehraban, Majid Mirmehdi, Alan Whone, Benjamin Filtjens, Amirhossein Dadashzadeh, Alfonso Fasano, Andrea Iaboni, Babak Taati

Main category: cs.CV

TL;DR: GAITGen is a framework for generating realistic gait sequences to address data scarcity in Parkinson’s Disease gait analysis, improving model training and accuracy.

DetailsMotivation: Limited clinical datasets and challenges in collecting labeled data hinder the effectiveness of computer vision models for parkinsonian gait analysis.

Method: GAITGen uses a Conditional Residual Vector Quantized Variational Autoencoder and Mask/Residual Transformers to generate gait sequences conditioned on pathology severity.

Result: GAITGen outperforms state-of-the-art models in reconstruction and generation quality, validated by a clinical user study.

Conclusion: GAITGen enhances clinical gait analysis by improving dataset quality and downstream task performance.

Abstract: Gait analysis is crucial for the diagnosis and monitoring of movement disorders like Parkinson’s Disease. While computer vision models have shown potential for objectively evaluating parkinsonian gait, their effectiveness is limited by scarce clinical datasets and the challenge of collecting large and well-labelled data, impacting model accuracy and risk of bias. To address these gaps, we propose GAITGen, a novel framework that generates realistic gait sequences conditioned on specified pathology severity levels. GAITGen employs a Conditional Residual Vector Quantized Variational Autoencoder to learn disentangled representations of motion dynamics and pathology-specific factors, coupled with Mask and Residual Transformers for conditioned sequence generation. GAITGen generates realistic, diverse gait sequences across severity levels, enriching datasets and enabling large-scale model training in parkinsonian gait analysis. Experiments on our new PD-GaM (real) dataset demonstrate that GAITGen outperforms adapted state-of-the-art models in both reconstruction fidelity and generation quality, accurately capturing critical pathology-specific gait features. A clinical user study confirms the realism and clinical relevance of our generated sequences. Moreover, incorporating GAITGen-generated data into downstream tasks improves parkinsonian gait severity estimation, highlighting its potential for advancing clinical gait analysis.
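
The tokenizer at the heart of GAITGen is a conditional residual VQ-VAE, and the residual-quantization mechanism is easy to isolate: each stage snaps the residual left over from the previous stage to its nearest codebook entry, and the selected entries sum to the reconstruction. A minimal numpy sketch with random codebooks standing in for learned ones (the two-stage depth and codebook size are illustrative):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the remaining residual against its
    own codebook; the sum of selected entries approximates x."""
    residual, codes, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:                              # cb: (codebook_size, dim)
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        codes.append(int(idx))
        recon += cb[idx]
        residual = residual - cb[idx]
    return codes, recon

rng = np.random.default_rng(0)
dim = 8
codebooks = [rng.normal(size=(256, dim)) for _ in range(2)]  # 2 illustrative stages
x = rng.normal(size=dim)                                     # one latent frame vector
codes, recon = rvq_encode(x, codebooks)
print("codes:", codes, " reconstruction error:", np.linalg.norm(x - recon))
```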

[463] Leveraging Sparse Annotations for Leukemia Diagnosis on the Large Leukemia Dataset

Abdul Rehman, Talha Meraj, Aiman Mahmood Minhas, Ayisha Imran, Mohsen Ali, Waqas Sultani, Mubarak Shah

Main category: cs.CV

TL;DR: The paper introduces a large-scale Leukemia dataset (LLD) and novel methods for WBC detection and attribute analysis, addressing dataset limitations and improving diagnostic accuracy.

DetailsMotivation: Leukemia analysis lacks large, diverse datasets, limiting real-world applicability. The paper aims to overcome this by providing a comprehensive dataset and interpretable methods.

Method: The authors present the LLD dataset collected from 48 patients using multi-source microscopy. They propose a multi-task model for WBC detection and attribute prediction, along with a sparse annotation method to reduce expert workload.

Result: The dataset and methods enhance diagnostic accuracy and explainability, leveraging sparse annotations to improve learning efficiency.

Conclusion: The LLD dataset and proposed methods address key challenges in leukemia analysis, offering a scalable and interpretable solution for microscopic image analysis.

Abstract: Leukemia is the 10th most frequently diagnosed cancer and one of the leading causes of cancer-related deaths worldwide. Realistic analysis of leukemia requires white blood cell (WBC) localization, classification, and morphological assessment. Despite deep learning advances in medical imaging, leukemia analysis lacks a large, diverse multi-task dataset, while existing small datasets lack domain diversity, limiting real-world applicability. To overcome dataset challenges, we present a large-scale WBC dataset named Large Leukemia Dataset (LLD) and novel methods for detecting WBC with their attributes. Our contribution here is threefold. First, we present a large-scale Leukemia dataset collected through Peripheral Blood Films (PBF) from 48 patients, through multiple microscopes, multi-cameras, and multi-magnification. To enhance diagnosis explainability and medical expert acceptance, each leukemia cell is annotated at 100x with 7 morphological attributes, ranging from Cell Size to Nuclear Shape. Secondly, we propose a multi-task model that not only detects WBCs but also predicts their attributes, providing an interpretable and clinically meaningful solution. Third, we propose a method for WBC detection with attribute analysis using sparse annotations. This approach reduces the annotation burden on hematologists, requiring them to mark only a small area within the field of view. Our method enables the model to leverage the entire field of view rather than just the annotated regions, enhancing learning efficiency and diagnostic accuracy. From diagnosis explainability to overcoming domain-shift challenges, the presented datasets can be used for many challenging aspects of microscopic image analysis. The datasets, code, and demo are available at: https://im.itu.edu.pk/sparse-leukemiaattri/

[464] Scaling Laws for Native Multimodal Models

Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby

Main category: cs.CV

TL;DR: The paper challenges the superiority of late-fusion architectures in multimodal models, showing early-fusion performs better at lower parameter counts and is more efficient. Mixture of Experts (MoEs) further enhances performance.

DetailsMotivation: To determine if late-fusion architectures are inherently superior in multimodal models and explore the potential of early-fusion designs.

Method: An extensive scaling study involving 457 models with varied architectures and training mixtures, comparing late-fusion and early-fusion approaches.

Result: Early-fusion outperforms late-fusion at lower parameter counts, is more efficient, and easier to deploy. MoEs improve performance by learning modality-specific weights.

Conclusion: Early-fusion architectures are more effective than late-fusion for multimodal models, especially when enhanced with MoEs.

Abstract: Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs), i.e., models trained from the ground up on all modalities, and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
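
The MoE finding is that routing tokens through a small set of expert MLPs lets a single early-fusion trunk learn modality-specific weights. Below is a minimal top-1-routing sketch of that mechanism; the expert count, sizing, and top-1 rule are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Each token goes to the single expert its gate scores highest; outputs
    are scaled by the gate probability so routing stays differentiable."""
    def __init__(self, dim=256, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                         # x: (tokens, dim)
        probs = self.gate(x).softmax(-1)          # (tokens, n_experts)
        top_p, top_idx = probs.max(-1)            # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

moe = Top1MoE()
tokens = torch.randn(10, 256)                     # e.g. interleaved image/text tokens
print(moe(tokens).shape)                          # torch.Size([10, 256])
```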

[465] Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

Hongyu Qu, Ling Xing, Jiachao Zhang, Rui Yan, Yazhou Yao, Xiangbo Shu

Main category: cs.CV

TL;DR: HR2G-shot is a framework for few-shot action recognition (FSAR) that integrates inter-frame, inter-video, and inter-task relation modeling to improve temporal pattern learning and knowledge reuse.

DetailsMotivation: Existing FSAR methods lack fine-grained temporal relation modeling between videos and fail to reuse temporal knowledge from historical tasks, limiting their performance.

Method: HR2G-shot includes Inter-video Semantic Correlation (ISC) for cross-video frame-level interactions and Inter-task Knowledge Transfer (IKT) to aggregate temporal knowledge from historical tasks.

Result: HR2G-shot outperforms leading FSAR methods on five benchmarks.

Conclusion: The proposed hierarchical relation modeling framework effectively captures shared temporal patterns and enhances FSAR performance.

Abstract: Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies or inter-video interaction at the coarse video-level granularity. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture shared fine-grained temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Going beyond conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.

[466] Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Taihang Hu, Linxuan Li, Kai Wang, Yaxing Wang, Jian Yang, Ming-Ming Cheng

Main category: cs.CV

TL;DR: ISLock introduces a training-free editing strategy for autoregressive (AR) models, addressing structural control issues by aligning self-attention patterns with reference images.

DetailsMotivation: Existing editing techniques for diffusion models don't work for AR models due to spatial poverty of attention maps and structural error accumulation.

Method: ISLock uses Anchor Token Matching (ATM) to dynamically align self-attention patterns with reference images, preserving structural blueprints.

Result: ISLock achieves high-quality, structure-consistent edits without additional training, outperforming conventional methods.

Conclusion: ISLock bridges the performance gap between diffusion and AR models, enabling efficient and flexible AR-based image editing.

Abstract: Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pave the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM

[467] TFMPathy: Tabular Foundation Model for Privacy-Aware, Generalisable Empathy Detection from Videos

Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon

Main category: cs.CV

TL;DR: The paper proposes TFMPathy, a system using tabular foundation models (TFMs) for empathy detection from video-derived tabular data, improving accuracy and cross-subject generalization.

DetailsMotivation: Privacy and ethical concerns limit raw video data sharing, prompting the use of tabular features. The success of foundation models in text inspired their application to tabular data for empathy detection.

Method: TFMPathy employs two TFMs (TabPFN v2 and TabICL) under in-context learning and fine-tuning paradigms, evaluated on a human-robot interaction benchmark.

Result: TFMPathy significantly improves empathy detection accuracy (0.590 → 0.730) and AUC (0.564 → 0.669), with better cross-subject generalization than baselines.

Conclusion: TFMPathy offers a scalable solution for AI systems relying on human-centered video datasets, addressing privacy constraints while enhancing performance.

Abstract: Detecting empathy from video interactions is an emerging area of research, particularly in healthcare and social robotics. However, privacy and ethical concerns often prevent the release of raw video data, with many datasets instead shared as pre-extracted tabular features. Previous work on such datasets has established classical tree-based models as the state of the art. Motivated by recent successes of large-scale foundation models for text, we investigate the potential of tabular foundation models (TFMs) for empathy detection from video-derived tabular data. Our proposed system, TFMPathy, is demonstrated with two recent TFMs (TabPFN v2 and TabICL) under both in-context learning and fine-tuning paradigms. On a public human-robot interaction benchmark, TFMPathy significantly improves on the empathy detection accuracy reported in the literature. While the established evaluation protocol in the literature does not ensure cross-subject generalisation, our evaluation scheme also captures such generalisation. We show that TFMPathy under a fine-tuning setup has better cross-subject generalisation capacity than baseline methods (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). Given the ongoing privacy and ethical constraints around raw video sharing, the proposed TFMPathy system provides a practical and scalable path toward building AI systems that depend on human-centred video datasets. Our code is publicly available at https://github.com/hasan-rakibul/TFMPathy (will be made available upon acceptance of this paper).
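
The in-context learning path requires no gradient updates: the labeled rows are supplied as context and the TFM predicts test rows in a single forward pass. Below is a minimal sketch using the open-source tabpfn package's scikit-learn-style estimator; the synthetic features stand in for the video-derived tabular features, and exact package details may differ across versions.

```python
# pip install tabpfn  (TabPFN exposes a scikit-learn-style estimator)
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

# Synthetic stand-in for video-derived tabular features (e.g. per-session
# gaze/pose statistics) with a binary empathy label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = TabPFNClassifier()          # in-context learning: "fit" stores the context
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("acc:", accuracy_score(y_te, proba > 0.5), " auc:", roc_auc_score(y_te, proba))
```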

[468] Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection

Alireza Salehi, Mohammadreza Salehi, Reshad Hosseini, Cees G. M. Snoek, Makoto Yamada, Mohammad Sabokrou

Main category: cs.CV

TL;DR: Crane improves zero-shot anomaly detection by addressing CLIP’s spatial misalignment and limited sensitivity to fine-grained anomalies through correlation-based attention and local-to-global fusion, outperforming state-of-the-art methods by 2-28%.

DetailsMotivation: Traditional anomaly detection methods require normal training samples, which is often impractical. CLIP's zero-shot capabilities are promising but limited by coarse-grained alignment and global feature insensitivity.

Method: Crane introduces a correlation-based attention module for spatial alignment and conditions text prompts on image context for local-to-global fusion, leveraging models like DINOv2.

Result: Crane improves zero-shot anomaly detection performance by 2-28% across 14 datasets, maintaining competitive inference speed.

Conclusion: Crane effectively balances learnable and non-learnable adaptations to enhance anomaly detection without overfitting, demonstrating broad applicability.

Abstract: Anomaly Detection involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible. Recently, the rich pretraining knowledge of CLIP has shown promising zero-shot generalization in detecting anomalies without the need for training samples from target domains. However, CLIP’s coarse-grained image-text alignment limits localization and detection performance for fine-grained anomalies due to: (1) spatial misalignment, and (2) the limited sensitivity of global features to local anomalous patterns. In this paper, we propose Crane, which tackles both problems. First, we introduce a correlation-based attention module to retain spatial alignment more accurately. Second, to boost the model’s awareness of fine-grained anomalies, we condition the learnable prompts of the text encoder on image context extracted from the vision encoder and perform a local-to-global representation fusion. Moreover, our method can incorporate vision foundation models such as DINOv2 to further enhance spatial understanding and localization. The key insight of Crane is to balance learnable adaptations for modeling anomalous concepts with non-learnable adaptations that preserve and exploit generalized pretrained knowledge, thereby minimizing in-domain overfitting and maximizing performance on unseen domains. Extensive evaluation across 14 diverse industrial and medical datasets demonstrates that Crane consistently improves state-of-the-art ZSAD performance by 2% to 28%, at both image and pixel levels, while remaining competitive in inference speed. The code is available at https://github.com/AlirezaSalehy/Crane.

[469] SVarM: Linear Support Varifold Machines for Classification and Regression on Geometric Data

Emmanuel Hartman, Nicolas Charon

Main category: cs.CV

TL;DR: The paper introduces SVarM, a method for statistical analysis of geometric data using varifold representations and neural networks, achieving robust performance with fewer parameters.

DetailsMotivation: Addressing challenges in geometric deep learning due to non-Euclidean shape spaces and the need for invariance to ensure model generalizability.

Method: Proposes SVarM, leveraging varifold representations and duality with test functions, incorporating neural networks for trainable functions.

Result: Demonstrates strong performance and robustness in classification and regression on shape datasets, comparable to state-of-the-art methods with reduced parameters.

Conclusion: SVarM provides an effective framework for geometric data analysis, balancing performance and computational efficiency.

Abstract: Despite progress in the rapidly developing field of geometric deep learning, performing statistical analysis on geometric data, where each observation is a shape such as a curve, graph, or surface, remains challenging due to the non-Euclidean nature of shape spaces, which are defined as equivalence classes under invariance groups. Building machine learning frameworks that incorporate such invariances, notably to shape parametrization, is often crucial to ensure generalizability of the trained models to new observations. This work proposes SVarM to exploit varifold representations of shapes as measures and their duality with test functions $h:\mathbb{R}^n \times S^{n-1} \rightarrow \mathbb{R}$. This method provides a general framework akin to linear support vector machines but operating instead over the infinite-dimensional space of varifolds. We develop classification and regression models on shape datasets by introducing a neural network-based representation of the trainable test function $h$. This approach demonstrates strong performance and robustness across various shape graph and surface datasets, achieving results comparable to state-of-the-art methods while significantly reducing the number of trainable parameters.
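
The central object is the pairing of a shape's varifold with a test function $h$: the shape is discretized as a weighted sum of Dirac masses at (point, unit direction) pairs, and a linear functional scores it as the weighted sum of $h$ evaluations plus a bias. A minimal sketch with $h$ as a small trainable MLP follows; the discretization, network size, and sphere example are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VarifoldScore(nn.Module):
    """Linear functional on varifolds: score = sum_i w_i * h(x_i, n_i) + b,
    with the test function h parameterized by a small trainable MLP."""
    def __init__(self, n=3, hidden=64):
        super().__init__()
        self.h = nn.Sequential(                  # h: R^n x S^(n-1) -> R
            nn.Linear(2 * n, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, points, normals, weights):
        feats = torch.cat([points, normals], dim=-1)    # (num_points, 2n)
        return (weights * self.h(feats).squeeze(-1)).sum() + self.bias

# Toy surface sample: points on a unit sphere, radial normals, equal weights.
pts = torch.nn.functional.normalize(torch.randn(500, 3), dim=-1)
normals = pts.clone()                            # sphere normals point radially
weights = torch.full((500,), 4 * torch.pi / 500)
print(VarifoldScore()(pts, normals, weights))    # scalar decision value
```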

[470] DamageCAT: A Deep Learning Transformer Framework for Typology-Based Post-Disaster Building Damage Categorization

Yiming Xiao, Ali Mostafavi

Main category: cs.CV

TL;DR: DamageCAT introduces a typology-based categorical classification framework for building damage assessment, outperforming traditional binary or ordinal methods with a hierarchical U-Net-based transformer model.

DetailsMotivation: Current automated damage assessment methods lack descriptive detail, limiting post-disaster resource allocation. DamageCAT aims to provide more actionable insights through typology-based classifications.

Method: The framework uses a hierarchical U-Net-based transformer architecture to process pre- and post-disaster satellite image pairs, trained on the BD-TypoSAT dataset with four damage categories.

Result: The model achieves 0.737 IoU and 0.846 F1-score overall, demonstrating transferability across multiple hurricane datasets, though performance varies due to class imbalance.

Conclusion: Typology-based classifications offer more actionable damage assessments than severity-based approaches, improving emergency response and resource allocation.

Abstract: Rapid, accurate, and descriptive building damage assessment is critical for directing post-disaster resources, yet current automated methods typically provide only binary (damaged/undamaged) or ordinal severity scales. This paper introduces DamageCAT, a framework that advances damage assessment through typology-based categorical classifications. We contribute: (1) the BD-TypoSAT dataset containing satellite image triplets from Hurricane Ida with four damage categories (partial roof damage, total roof damage, partial structural collapse, and total structural collapse), and (2) a hierarchical U-Net-based transformer architecture for processing pre- and post-disaster image pairs. Our model achieves 0.737 IoU and 0.846 F1-score overall, with cross-event evaluation demonstrating transferability across Hurricane Harvey, Florence, and Michael data. While performance varies across damage categories due to class imbalance, the framework shows that typology-based classifications can provide more actionable damage assessments than traditional severity-based approaches, enabling targeted emergency response and resource allocation.

[471] Just Say the Word: Annotation-Free Fine-Grained Object Counting

Adriano D’Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

Main category: cs.CV

TL;DR: A method to improve fine-grained object counting by tuning a compact concept embedding using synthetic images and pseudo-labels, avoiding the need for real data or human annotations.

DetailsMotivation: Addressing overcounting in class-agnostic counting models for fine-grained categories without costly retraining or new annotations.

Method: Uses a text-to-image diffusion model to generate synthetic images and pseudo-labels, creating a concept embedding to refine raw counts from any frozen counter.

Result: Validated on the Lookalikes benchmark (1,037 images, 27 subcategories), showing significant improvements over baselines.

Conclusion: Proposed paradigm effectively refines counting without real data or annotations, generalizing to novel categories.

Abstract: Fine-grained object counting remains a major challenge for class-agnostic counting models, which overcount visually similar but incorrect instances (e.g., jalapeño vs. poblano). Addressing this by annotating new data and fully retraining the model is time-consuming and does not guarantee generalization to additional novel categories at test time. Instead, we propose an alternative paradigm: given a category name, tune a compact concept embedding derived from the prompt using synthetic images and pseudo-labels generated by a text-to-image diffusion model. This embedding conditions a specialization module that refines raw overcounts from any frozen counter into accurate, category-specific estimates, without requiring real images or human annotations. We validate our approach on Lookalikes, a challenging new benchmark containing 1,037 images across 27 fine-grained subcategories, and show substantial improvements over strong baselines. Code and data will be released upon acceptance.

[472] Decoupled Global-Local Alignment for Improving Compositional Understanding

Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang

Main category: cs.CV

TL;DR: DeGLA framework improves CLIP’s compositional understanding without losing general capabilities, using self-distillation and contrastive losses.

DetailsMotivation: CLIP's global contrastive learning limits its ability to understand compositional concepts like relations and attributes.

Method: DeGLA uses self-distillation for global alignment and introduces Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC) losses for local alignment, leveraging LLMs for negative caption generation.

Result: DeGLA improves performance by 3.5% on compositional benchmarks and 13.0% on zero-shot classification tasks.

Conclusion: DeGLA effectively balances compositional understanding and general capabilities, outperforming prior methods.

Abstract: Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP’s ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model’s inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model’s inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, it effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
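
The global-alignment branch hinges on self-distillation against an exponential-moving-average teacher: the teacher is never trained directly, only nudged toward the student, while a distillation penalty keeps the student's embeddings close to it. A minimal sketch of that loop follows; the 0.999 decay, MSE penalty, and linear stand-in encoder are illustrative choices.

```python
import copy
import torch
import torch.nn as nn

student = nn.Linear(512, 256)                  # stand-in for the image-text encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                    # teacher is never trained directly

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)

x = torch.randn(8, 512)
distill_loss = nn.functional.mse_loss(student(x), teacher(x))
distill_loss.backward()                        # gradients flow only into the student
ema_update(teacher, student)                   # teacher drifts slowly toward student
print(float(distill_loss))
```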

[473] CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback

Chenhan Jiang, Yihan Zeng, Dit-Yan Yeung

Main category: cs.CV

TL;DR: The paper introduces Textual Coherent Score Distillation (TCSD) to improve text-to-3D generation by addressing semantic fidelity issues in Score Distillation Sampling (SDS) methods, leveraging multimodal large language models (MLLMs) for alignment feedback.

DetailsMotivation: Current SDS-based methods struggle with semantic fidelity for complex prompts involving multiple objects, due to view-independent biases degrading text-3D alignment.

Method: Proposes TCSD, integrating MLLMs for alignment feedback, develops 3DLLaVA-CRITIC for evaluating text-3D correspondence, and introduces LLM-layout initialization for faster convergence.

Result: CoherenDream framework shows consistent improvements on TIFA subset metrics.

Conclusion: The study successfully incorporates MLLMs into SDS optimization, enhancing text-to-3D generation and providing insights for future MLLM adaptations.

Abstract: Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. However, SDS-based methods struggle to maintain semantic fidelity for user prompts, particularly when involving multiple objects with intricate interactions. While existing approaches often address 3D consistency through multiview diffusion model fine-tuning on 3D datasets, this strategy inadvertently exacerbates text-3D alignment degradation. The limitation stems from SDS’s inherent accumulation of view-independent biases during optimization, which progressively diverges from the ideal text alignment direction. To alleviate this limitation, we propose a novel SDS objective, dubbed Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). Our TCSD leverages the cross-modal understanding capabilities of MLLMs to assess and guide the text-3D correspondence during optimization. We further develop 3DLLaVA-CRITIC, a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations. Additionally, we introduce an LLM-layout initialization that significantly accelerates optimization convergence through semantic-aware spatial configuration. Our framework, CoherenDream, achieves consistent improvement across multiple metrics on the TIFA subset. As the first study to incorporate MLLMs into SDS optimization, we also conduct extensive ablation studies to explore optimal MLLM adaptations for 3D generation tasks.

[474] 3D Gaussian Splatting Data Compression with Mixture of Priors

Lei Liu, Zhenghao Chen, Dong Xu

Main category: cs.CV

TL;DR: The paper introduces a Mixture of Priors (MoP) strategy for 3D Gaussian Splatting (3DGS) data compression, addressing limitations in entropy models and quantization for both lossless and lossy scenarios.

DetailsMotivation: Current 3DGS compression methods lack robust entropy models and fine-grained quantization, limiting efficiency in storage and transmission.

Method: Proposes MoP, using multiple MLPs for hyperprior feature generation and a gating mechanism. For lossless, MoP improves entropy modeling; for lossy, it guides element-wise quantization via a Coarse-to-Fine Quantization (C2FQ) strategy.

Result: Achieves state-of-the-art performance on benchmarks like Mip-NeRF360, BungeeNeRF, DeepBlending, and Tanks&Temples.

Conclusion: The MoP strategy effectively enhances 3DGS compression, outperforming existing methods.

Abstract: 3D Gaussian Splatting (3DGS) data compression is crucial for enabling efficient storage and transmission in 3D scene modeling. However, its development remains limited due to inadequate entropy models and suboptimal quantization strategies for both lossless and lossy compression scenarios, where existing methods have yet to 1) fully leverage hyperprior information to construct robust conditional entropy models, and 2) apply fine-grained, element-wise quantization strategies for improved compression granularity. In this work, we propose a novel Mixture of Priors (MoP) strategy to simultaneously address these two challenges. Specifically, inspired by the Mixture-of-Experts (MoE) paradigm, our MoP approach processes hyperprior information through multiple lightweight MLPs to generate diverse prior features, which are subsequently integrated into the MoP feature via a gating mechanism. To enhance lossless compression, the resulting MoP feature is utilized as a hyperprior to improve conditional entropy modeling. Meanwhile, for lossy compression, we employ the MoP feature as guidance information in an element-wise quantization procedure, leveraging a prior-guided Coarse-to-Fine Quantization (C2FQ) strategy with a predefined quantization step value. Specifically, we expand the quantization step value into a matrix and adaptively refine it from coarse to fine granularity, guided by the MoP feature, thereby obtaining a quantization step matrix that facilitates element-wise quantization. Extensive experiments demonstrate that our proposed 3DGS data compression framework achieves state-of-the-art performance across multiple benchmarks, including Mip-NeRF360, BungeeNeRF, DeepBlending, and Tanks&Temples.
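
The Mixture of Priors follows the MoE pattern: several lightweight MLPs map the hyperprior to candidate prior features, and a gating network blends them into one MoP feature. A minimal sketch of that gating follows; the sizes and softmax blend are illustrative, and the downstream entropy model and quantizer are omitted.

```python
import torch
import torch.nn as nn

class MixtureOfPriors(nn.Module):
    """Blend prior features from several lightweight MLPs via learned gates."""
    def __init__(self, dim=64, n_priors=4):
        super().__init__()
        self.priors = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(n_priors)
        )
        self.gate = nn.Linear(dim, n_priors)

    def forward(self, hyperprior):                      # (batch, dim)
        feats = torch.stack([p(hyperprior) for p in self.priors], dim=1)
        w = self.gate(hyperprior).softmax(-1)           # (batch, n_priors) gates
        return (w.unsqueeze(-1) * feats).sum(dim=1)     # gated MoP feature

mop = MixtureOfPriors()
feature = mop(torch.randn(16, 64))
print(feature.shape)   # torch.Size([16, 64]); would feed entropy model / quantizer
```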

[475] QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

Yueh-Cheng Liu, Lukas Höllein, Matthias Nießner, Angela Dai

Main category: cs.CV

TL;DR: QuickSplat accelerates 3D scene reconstruction by using data-driven priors for Gaussian splatting, improving speed and accuracy.

DetailsMotivation: Existing volumetric rendering methods are slow and struggle with under-observed or textureless regions.

Method: QuickSplat learns data-driven priors for dense initializations and jointly estimates densification and parameter updates.

Result: Achieves 8x faster runtime and reduces depth errors by up to 48% compared to state-of-the-art methods.

Conclusion: QuickSplat offers a superior, efficient solution for large-scale indoor scene reconstruction.

Abstract: Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the need for densification heuristics. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by up to 48% compared to state-of-the-art methods.

[476] From Time-series Generation, Model Selection to Transfer Learning: A Comparative Review of Pixel-wise Approaches for Large-scale Crop Mapping

Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears

Main category: cs.CV

TL;DR: A review of large-scale crop mapping workflows compares supervised and transfer learning methods, identifying optimal preprocessing and models for diverse agricultural sites.

DetailsMotivation: To systematically evaluate and identify the best workflows for pixel-wise crop mapping using remote sensing data, addressing challenges like domain shift and sample size variability.

Method: Compared six preprocessing methods and eleven supervised classification models, assessed transfer learning techniques, and evaluated performance across five agricultural sites using Landsat 8 data.

Result: Fine-scale preprocessing with Transformers performed best; RF was efficient for supervised learning. Transfer learning improved adaptability, with UDA for homogeneous classes and fine-tuning for diverse scenarios. Workflow choice depends on labeled sample availability.

Conclusion: Supervised training excels with sufficient samples, while transfer learning is viable below a threshold. Publicly available code promotes reproducibility.

Abstract: Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal time-series generation approaches and supervised crop mapping models, we conducted systematic experiments, comparing six widely adopted satellite image-based preprocessing methods, alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of optimal methods was conducted across five diverse agricultural sites. Landsat 8 served as the primary satellite data source. Labels come from Cropland Data Layer (CDL) trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows. Random forest (RF) offered rapid training and competitive performance in conventional supervised learning and direct transfer to similar domains. Second, transfer learning techniques enhanced workflow adaptability, with unsupervised domain adaptation (UDA) being effective for homogeneous crop classes while fine-tuning remains robust across diverse scenarios. Finally, workflow choice depends heavily on the availability of labeled samples. With a sufficient sample size, supervised training typically delivers more accurate and generalizable results. Below a certain threshold, transfer learning that matches the level of domain shift is a viable alternative to achieve crop mapping. All code is publicly available to encourage reproducibility.

[477] Transfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language

Dinh Nam Pham, Eleftherios Avramidis

Main category: cs.CV

TL;DR: The paper explores transfer learning from Visual Speech Recognition (VSR) to improve mouthing recognition in German Sign Language, showing multi-task learning enhances accuracy and robustness.

DetailsMotivation: Non-manual features like mouthing are linguistically valuable in Sign Language Recognition (SLR), but datasets with mouthing annotations are limited. This work aims to leverage VSR datasets to improve mouthing recognition.

Method: The study uses three VSR datasets (English, German unrelated words, and German target words) to investigate transfer learning. Multi-task learning is applied to classify mouthing instances to spoken words.

Result: Multi-task learning improves mouthing recognition and VSR accuracy, indicating mouthing recognition is a distinct but related task to VSR.

Conclusion: The research suggests knowledge transfer from VSR to SLR is beneficial, especially for datasets with limited mouthing annotations, advancing SLR systems.

Abstract: Sign Language Recognition (SLR) systems primarily focus on manual gestures, but non-manual features such as mouth movements, specifically mouthing, provide valuable linguistic information. This work directly classifies mouthing instances to their corresponding words in the spoken language while exploring the potential of transfer learning from Visual Speech Recognition (VSR) to mouthing recognition in German Sign Language. We leverage three VSR datasets: one in English, one in German with unrelated words and one in German containing the same target words as the mouthing dataset, to investigate the impact of task similarity in this setting. Our results demonstrate that multi-task learning improves both mouthing recognition and VSR accuracy as well as model robustness, suggesting that mouthing recognition should be treated as a distinct but related task to VSR. This research contributes to the field of SLR by proposing knowledge transfer from VSR to SLR datasets with limited mouthing annotations.

[478] Unintended Bias in 2D+ Image Segmentation and Its Effect on Attention Asymmetry

Zsófia Molnár, Gergely Szabó, András Horváth

Main category: cs.CV

TL;DR: The paper examines biases in supervised pretrained models for image segmentation, particularly in biomedical imaging, and proposes methods to mitigate these biases while maintaining model performance.

DetailsMotivation: Pretrained models introduce unintended biases in specialized datasets like biomedical imaging, leading to inconsistent feature utilization and reduced reliability.

Method: The study compares pretrained and randomly initialized models, analyzing performance and saliency maps, and proposes methods to neutralize color channel biases.

Result: The proposed methods effectively mitigate biases, improving model explainability without sacrificing the advantages of pretrained models.

Conclusion: The findings offer practical solutions for addressing pretrained weight biases in deep learning tasks, enhancing reliability and performance.

Abstract: Supervised pretrained models have become widely used in deep learning, especially for image segmentation tasks. However, when applied to specialized datasets such as biomedical imaging, pretrained weights often introduce unintended biases. These biases cause models to assign different levels of importance to different slices, leading to inconsistencies in feature utilization, which can be observed as asymmetries in saliency map distributions. This transfer of color distributions from natural images to non-natural datasets can compromise model performance and reduce the reliability of results. In this study, we investigate the effects of these biases and propose strategies to mitigate them. Through a series of experiments, we test both pretrained and randomly initialized models, comparing their performance and saliency map distributions. Our proposed methods, which aim to neutralize the bias introduced by pretrained color channel weights, demonstrate promising results, offering a practical approach to improving model explainability while maintaining the benefits of pretrained models. This publication presents our findings, providing insights into addressing pretrained weight biases across various deep learning tasks.
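
One concrete way to neutralize the color-channel bias carried by pretrained weights, in the spirit of the strategies described, is to average the first convolution's kernels across input channels so no channel is privileged. This is a hedged sketch of that idea using a standard torchvision ResNet, not necessarily the authors' exact procedure:

```python
import torch
import torchvision

# Downloads ImageNet-pretrained weights on first use.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

with torch.no_grad():
    w = model.conv1.weight                         # (64, 3, 7, 7): RGB-specific kernels
    mean_w = w.mean(dim=1, keepdim=True)           # average over the color channels
    model.conv1.weight.copy_(mean_w.expand_as(w))  # identical kernel for R, G, and B

# After this, permuting input channels leaves the first-layer output unchanged,
# removing the natural-image color prior while keeping spatial filters intact.
x = torch.randn(1, 3, 224, 224)
out1 = model.conv1(x)
out2 = model.conv1(x[:, [2, 1, 0]])                # BGR-permuted input
print(torch.allclose(out1, out2, atol=1e-5))       # True
```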

[479] Toward Patient-specific Partial Point Cloud to Surface Completion for Pre- to Intra-operative Registration in Image-guided Liver Interventions

Nakul Poudel, Zixin Yang, Kelly Merrell, Richard Simon, Cristian A. Linte

Main category: cs.CV

TL;DR: A patient-specific point cloud completion method using VN-OccNet improves image-to-physical registration in surgery by addressing partial visibility of intra-operative data.

DetailsMotivation: Intra-operative data lacks sub-surface details, and partial visibility hinders registration. A solution is needed to enhance registration accuracy.

Method: VN-OccNet is used to complete liver surfaces from partial point clouds, trained with patient-specific simulated deformations. The completed surface is integrated into Go-ICP for registration.

Result: The approach improves rigid registration outcomes and shows promise for handling partial intra-operative visibility.

Conclusion: Patient-specific completion with VN-OccNet is effective for robust registration, leveraging its rotation-equivariant and surface generation capabilities.

Abstract: Intra-operative data captured during image-guided surgery lacks sub-surface information, where key regions of interest, such as vessels and tumors, reside. Image-to-physical registration enables the fusion of pre-operative information and intra-operative data, typically represented as a point cloud. However, this registration process struggles due to partial visibility of the intra-operative point cloud. In this research, we propose a patient-specific point cloud completion approach to assist with the registration process. Specifically, we leverage VN-OccNet to generate a complete liver surface from a partial intra-operative point cloud. The network is trained in a patient-specific manner, where simulated deformations from the pre-operative model are used to train the model. First, we conduct an in-depth analysis of VN-OccNet’s rotation-equivariant property and its effectiveness in recovering complete surfaces from partial intra-operative surfaces. Next, we integrate the completed intra-operative surface into the Go-ICP registration algorithm to demonstrate its utility in improving initial rigid registration outcomes. Our results highlight the promise of this patient-specific completion approach in mitigating the challenges posed by partial intra-operative visibility. The rotation equivariant and surface generation capabilities of VN-OccNet hold strong promise for developing robust registration frameworks for variations of the intra-operative point cloud.

[480] NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification

Shihao Li, Aihua Zheng, Andong Lu, Jin Tang, Jixin Ma

Main category: cs.CV

TL;DR: The paper proposes NEXT, a multi-modal ReID framework using text-modulated experts for fine-grained and coarse-grained feature capture, improving recognition accuracy.

DetailsMotivation: Existing methods struggle with fine-grained recognition in multi-modal ReID due to implicit feature fusion. Leveraging MLLMs for descriptive captions and addressing recognition challenges motivates the work.

Method: NEXT decouples recognition into semantic (TMSE) and structural (CSSE) branches, using text-modulated experts and a soft routing mechanism. MGFA aggregates multi-grained features for final identity representation.

Result: Experiments on four datasets show NEXT outperforms state-of-the-art methods, reducing unknown recognition rates and improving feature quality.

Conclusion: NEXT effectively integrates multi-grained features via text-modulation, enhancing multi-modal ReID performance and addressing real-world challenges.

Abstract: Multi-modal object Re-Identification (ReID) aims to obtain accurate identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under the various challenges of real-world conditions. Benefiting from powerful Multi-modal Large Language Models (MLLMs), object appearances can be effectively translated into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework, named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural branches to separately capture fine-grained appearance features and coarse-grained structure features. For semantic recognition, we first propose a Text-Modulated Semantic Experts (TMSE) module, which randomly samples high-quality captions to modulate experts capturing semantic features and mining inter-modality complementary cues. Second, to recognize structure features, we propose a Context-Shared Structure Experts (CSSE) module, which focuses on the holistic object structure and maintains identity structural consistency via a soft routing mechanism. Finally, we propose a Multi-Grained Features Aggregation (MGFA) module, which adopts a unified fusion strategy to effectively integrate multi-grained experts into the final identity representations. Extensive experiments on four public datasets demonstrate the effectiveness of our method and show that it significantly outperforms existing state-of-the-art methods.

[481] Toward Errorless Training ImageNet-1k

Bo Deng, Levi Heath

Main category: cs.CV

TL;DR: A feedforward neural network trained on ImageNet 2012 achieves 98.3% accuracy, with a 99.69 Top-1 rate, using a new method. The model has 322M parameters and suggests double-labeling issues prevent 100% accuracy.

DetailsMotivation: To improve accuracy in image classification using a new training method on the ImageNet dataset.

Method: Feedforward artificial neural network trained on ImageNet 2012 with a novel approach.

Result: Achieves 98.3% accuracy, 99.69 Top-1 rate, and 285.9 perfectly classified labels per batch.

Conclusion: The model performs well but is limited by dataset double-labeling issues.

Abstract: In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.

[482] EF-VI: Enhancing End-Frame Injection for Video Inbetweening

Liuhan Chen, Xiaodong Cun, Xiaoyu Li, Xianyi He, Shenghai Yuan, Jie Chen, Ying Shan, Li Yuan

Main category: cs.CV

TL;DR: EF-VI is a novel video inbetweening framework for transformer-based I2V-DMs, enhancing end-frame constraints via EF-Net without disrupting input representation.

DetailsMotivation: Current methods either weakly constrain the end frame or disrupt input representation, leading to suboptimal performance.

Method: Proposes EF-VI, using EF-Net to encode and inject end-frame features into I2V-DMs.

Result: EF-VI outperforms baselines in experiments.

Conclusion: EF-VI effectively improves end-frame constraints while preserving input representation.

Abstract: Video inbetweening aims to synthesize intermediate video sequences conditioned on the given start and end frames. Current state-of-the-art methods primarily extend large-scale pre-trained Image-to-Video Diffusion Models (I2V-DMs) by incorporating the end-frame condition via direct fine-tuning or temporally bidirectional sampling. However, the former results in a weak end-frame constraint, while the latter inevitably disrupts the input representation of video frames, leading to suboptimal performance. To improve the end-frame constraint while avoiding disruption of the input representation, we propose a novel video inbetweening framework specific to recent and more powerful transformer-based I2V-DMs, termed EF-VI. It efficiently strengthens the end-frame constraint by utilizing an enhanced injection. This is based on our proposed well-designed lightweight module, termed EF-Net, which encodes only the end frame and expands it into temporally adaptive frame-wise features injected into the I2V-DM. Extensive experiments demonstrate the superiority of our EF-VI compared with other baselines.

[483] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

Yuxi Zhang, Yueting Li, Xinyu Du, Sibo Wang

Main category: cs.CV

TL;DR: Rhet2Pix is a framework for generating images from rhetorical language by optimizing multi-step policies and diffusion modules, outperforming SOTA models.

DetailsMotivation: Current text-to-image models struggle with rhetorical language, focusing on literal visuals instead of intended semantic meanings.

Method: Rhet2Pix uses a two-layer MDP diffusion module: an outer layer for incremental sub-sentence elaboration and an inner layer for reward optimization during diffusion.

Result: Rhet2Pix outperforms SOTA models like GPT-4o and Grok-3 in qualitative and quantitative evaluations.

Conclusion: Rhet2Pix effectively addresses rhetorical text-to-image generation, with publicly available code and dataset.

Abstract: Generating images from rhetorical language remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLMs) fail to generate images based on the hidden meaning inherent in rhetorical language, despite such content being readily mappable to visual representations by humans. A key limitation is that current models emphasize object-level word embedding alignment, causing metaphorical expressions to steer image generation toward their literal visuals and to overlook the intended semantic meaning. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation actions, constructing semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix in rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o and Grok-3, as well as leading academic baselines, across both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.
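
The inner-layer trick of discounting the final reward along the denoising trajectory can be sketched as follows; the geometric backup rule and the discount factor are assumptions, shown only to make the reward-sparsity mitigation concrete.

```python
import torch

def per_step_rewards(final_reward, num_steps, gamma=0.95):
    """Back up a single terminal reward along the denoising trajectory so
    every adjacent (state, action) pair receives a learning signal."""
    rewards = torch.zeros(num_steps)
    running = final_reward
    for t in reversed(range(num_steps)):
        rewards[t] = running       # later steps get less-discounted reward
        running *= gamma
    return rewards

print(per_step_rewards(1.0, 10))   # decays from step 9 back toward step 0
```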

[484] HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

Junyi Guo, Jingxuan Zhang, Fangyu Wu, Huanda Lu, Qiufeng Wang, Wenmian Yang, Eng Gee Lim, Dongming Lu

Main category: cs.CV

TL;DR: The paper introduces FS2RG, a task for generating realistic garment images from flat sketches and text, addressing challenges like fabric detail capture and conflicting inputs. It proposes HiGarment, a framework with multi-modal enhancement and cross-attention mechanisms, and releases a dataset.

DetailsMotivation: To bridge the gap in garment production by enabling realistic image generation from flat sketches and text, addressing limitations in fabric detail and input conflicts.

Method: Proposes HiGarment, featuring multi-modal semantic enhancement for fabric representation and a harmonized cross-attention mechanism to balance sketch and text inputs.

Result: HiGarment effectively synthesizes garments, validated by experiments and user studies, and introduces a large open-source dataset.

Conclusion: HiGarment advances garment synthesis by addressing key challenges and providing a practical solution with a dataset for future research.

Abstract: Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset are available at https://github.com/Maple498/HiGarment.
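
A minimal reading of the harmonized cross-attention mechanism: attend separately to sketch tokens and text tokens, then blend the two streams with a content-dependent gate so outputs can lean image-biased or text-biased. The gating form and dimensions below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HarmonizedCrossAttentionSketch(nn.Module):
    """Two cross-attention streams (sketch, text) blended by a learned
    per-token gate, so the model can favor either conditioning source."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn_sketch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x, sketch_tokens, text_tokens):   # x: (B, N, dim)
        a_s, _ = self.attn_sketch(x, sketch_tokens, sketch_tokens)
        a_t, _ = self.attn_text(x, text_tokens, text_tokens)
        g = self.gate(x)                 # (B, N, 1): sketch/text balance
        return x + g * a_s + (1 - g) * a_t
```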

[485] Video Signature: In-generation Watermarking for Latent Video Diffusion Models

Yu Huang, Junhao Chen, Shuliang Liu, Hanqian Li, Qi Zheng, Yi R. Fung, Xuming Hu

Main category: cs.CV

TL;DR: VIDSIG is an in-generation watermarking method for video diffusion models, balancing watermark extraction, visual quality, and efficiency.

DetailsMotivation: Addressing the trade-off between video quality and watermark extraction in existing post-generation methods.

Method: Partially fine-tunes the latent decoder with Perturbation-Aware Suppression (PAS) and a Temporal Alignment module for consistency.

Result: Achieves best performance in watermark extraction, visual quality, and efficiency, with strong robustness against tampering.

Conclusion: VIDSIG is practical for real-world use, offering implicit and adaptive watermark integration during video generation.

Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios. Our code is available at https://github.com/hardenyu21/Video-Signature
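
PAS, as described, pre-identifies perceptually sensitive decoder layers and freezes them before fine-tuning. A crude sketch under stated assumptions: L2 change of the decoded output as the sensitivity proxy, and per-parameter-tensor granularity rather than whatever unit the paper actually uses.

```python
import torch

@torch.no_grad()
def rank_sensitivity(decoder, latents, eps=1e-3):
    """Perturb each parameter tensor of the latent decoder, measure how much
    the decoded output moves, and rank parameters by that sensitivity."""
    base = decoder(latents)
    scores = {}
    for name, p in decoder.named_parameters():
        noise = eps * torch.randn_like(p)
        p.add_(noise)
        scores[name] = (decoder(latents) - base).pow(2).mean().item()
        p.sub_(noise)                      # restore the original weights
    return sorted(scores, key=scores.get, reverse=True)

def freeze_most_sensitive(decoder, latents, k=10):
    params = dict(decoder.named_parameters())
    for name in rank_sensitivity(decoder, latents)[:k]:
        params[name].requires_grad_(False)  # frozen during fine-tuning
```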

[486] Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement

Xuan Yu, Dayan Guan, Yanfeng Gu

Main category: cs.CV

TL;DR: Zoom-Refine enhances MLLMs’ ability to interpret high-resolution images by combining localized zoom and self-refinement, improving accuracy without additional training.

DetailsMotivation: MLLMs struggle with high-resolution images where fine details are critical for accurate interpretation.

Method: Zoom-Refine uses localized zoom to identify task-relevant regions and self-refinement to integrate fine details, leveraging MLLMs’ existing capabilities.

Result: The method improves performance on high-resolution multimodal benchmarks without extra training.

Conclusion: Zoom-Refine effectively addresses MLLMs’ limitations in high-resolution image understanding through a training-free approach.

Abstract: Multimodal Large Language Models (MLLMs) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of Localized Zoom and Self-Refinement. In the Localized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the Self-Refinement step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by Localized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM’s inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at https://github.com/xavier-yu114/Zoom-Refine
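
Because Zoom-Refine is training-free, the whole method reduces to a two-call inference loop. In the sketch below, `mllm_generate(image, prompt) -> str` is a hypothetical wrapper around any MLLM, and the prompt wording, box format, and `parse_box` helper are assumptions for illustration.

```python
import re
from PIL import Image

def parse_box(text):
    """Hypothetical parser: take the last four integers as (x0, y0, x1, y1)."""
    nums = [int(n) for n in re.findall(r"\d+", text)]
    return tuple(nums[-4:])

def zoom_refine(mllm_generate, image_path, query):
    image = Image.open(image_path)
    # Step 1 (Localized Zoom): preliminary answer plus the most relevant box.
    step1 = mllm_generate(
        image,
        f"{query}\nAlso output the bounding box [x0, y0, x1, y1] of the "
        "image region most relevant to the question.")
    crop = image.crop(parse_box(step1))
    # Step 2 (Self-Refinement): re-evaluate with the high-resolution crop.
    return mllm_generate(
        crop,
        "Here is a zoomed-in crop of the relevant region.\n"
        f"Original question: {query}\nPreliminary answer: {step1}\n"
        "Re-evaluate and refine the answer if needed.")
```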

[487] Simple Radiology VLLM Test-time Scaling with Thought Graph Traversal

Yue Yao, Zelin Wen, Yan Tong, Xinyu Tian, Xuqing Li, Xiao Ma, Dongliang Xu, Tom Gedeon

Main category: cs.CV

TL;DR: A lightweight Thought Graph Traversal (TGT) framework improves radiology report generation by integrating medical priors and dynamic reasoning depth adjustment, outperforming baselines without model changes.

DetailsMotivation: To enhance reasoning performance of vision-language large models (VLLMs) for radiology report generation without additional training.

Method: Introduces TGT framework with structured medical priors and a reasoning budget forcing strategy for dynamic inference depth adjustment.

Result: Outperforms baseline prompting methods, generates more accurate reports, and reveals dataset biases.

Conclusion: TGT is a simple, effective approach for improving VLLM reasoning in radiology, with open-sourced code for reproducibility.

Abstract: Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to radiology report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model’s inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at https://github.com/glerium/Thought-Graph-Traversal.
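
The reasoning budget forcing strategy can be sketched as suppressing early stops until a minimum generation budget is met. Here `model_step`, the stop marker, and the continuation hint are hypothetical placeholders, and counting words as a stand-in for tokens is a simplification.

```python
def generate_with_budget(model_step, prompt, min_words=200, max_words=800,
                         stop_marker="</report>", hint=" Let me re-examine: "):
    """Extend inference depth at test time: if the model tries to stop
    before the minimum budget is spent, strip the stop and force it on."""
    text, produced = prompt, 0
    while produced < max_words:
        chunk = model_step(text)           # returns the next generated chunk
        text += chunk
        produced += len(chunk.split())
        if stop_marker in chunk:
            if produced >= min_words:
                break                      # budget met: accept the stop
            # Budget not met: remove the stop and force deeper reasoning.
            text = text[:text.rfind(stop_marker)] + hint
    return text
```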

[488] 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting

Yuke Xing, Jiarui Wang, Peizhi Niu, Wenjie Huang, Guangtao Zhai, Yiling Xu

Main category: cs.CV

TL;DR: 3DGS-IEval-15K is a large-scale dataset for evaluating the perceptual impact of compressed 3D Gaussian Splatting (3DGS) representations, featuring 15,200 images from 10 scenes and human perception data.

DetailsMotivation: Addressing the lack of a comprehensive framework to evaluate the perceptual impact of compression in 3DGS methods.

Method: Creation of a dataset with 15,200 images from 10 scenes using 6 3DGS algorithms at 20 viewpoints, varying compression levels, and collecting human perception data from 60 viewers.

Result: Validated dataset quality through scene diversity and MOS analysis, and benchmarked with 30 IQA metrics.

Conclusion: The dataset supports developing specialized IQA metrics for 3DGS and investigating view-dependent quality patterns.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore, we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K.

[489] CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion

Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, Ruimao Zhang

Main category: cs.CV

TL;DR: Causal Diffusion Policy (CDP) improves robotic behavior learning by using historical action sequences and a caching mechanism to enhance accuracy and reduce computational costs.

DetailsMotivation: Hardware limitations and real-time constraints degrade learning from expert demonstrations, leading to failures in tasks like object localization and grasp planning.

Method: CDP uses a transformer-based diffusion model conditioned on historical action sequences and introduces a caching mechanism to reduce redundant computations.

Result: CDP achieves higher accuracy in 2D and 3D manipulation tasks and maintains precision under degraded input conditions.

Conclusion: CDP is robust and practical for robotic control in imperfect conditions, outperforming existing methods.

Abstract: Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
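
The caching mechanism mentioned above is the standard key-value cache pattern for autoregressive attention: keys and values from previous timesteps are stored so each new step attends over them without recomputation. A single-head sketch, with shapes and the concatenate-per-step layout as simplifying assumptions:

```python
import torch
import torch.nn.functional as F

class KVCacheSketch:
    """Cache attention key/value pairs from previous timesteps so
    autoregressive inference avoids redundant computation."""
    def __init__(self):
        self.k, self.v = None, None

    def attend(self, q_new, k_new, v_new):       # each: (B, 1, D) per step
        # Append the new step's keys/values to the cache.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        scores = q_new @ self.k.transpose(1, 2) / (q_new.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ self.v

cache = KVCacheSketch()
for _ in range(5):                                # five autoregressive steps
    q, k, v = (torch.randn(1, 1, 64) for _ in range(3))
    out = cache.attend(q, k, v)                   # attends over all cached steps
```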

[490] Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes

Chao Chen, Nobel Dang, Juexiao Zhang, Wenkai Sun, Pengfei Zheng, Xuhang He, Yimeng Ye, Jiasheng Zhang, Taarun Srinivas, Chen Feng

Main category: cs.CV

TL;DR: The paper introduces Co-VisiON, a benchmark for evaluating co-visibility reasoning in sparse-view scenarios, showing current vision models lag behind human performance. A proposed model, Covis, narrows this gap.

DetailsMotivation: To assess if current vision models can replicate human proficiency in recognizing co-visibility in sparse-view scenarios, highlighting the need for models integrating spatial and semantic reasoning.

Method: Developed the Co-VisiON benchmark with 1,000+ sparse-view indoor scenarios and proposed Covis, a multi-view baseline inspired by human visual cognition.

Result: Existing models struggle with co-visibility under sparse conditions. A proprietary vision-language model outperforms vision-only models, but all lag behind humans. Covis performs best among vision models.

Conclusion: The gap between models and humans underscores the need for cognitively inspired architectures. The benchmark and Covis aim to drive advancements in robust, human-like vision models.

Abstract: Humans exhibit a remarkable ability to recognize co-visibility-the 3D regions simultaneously visible in multiple images-even when these images are sparsely distributed across a complex scene. This ability is foundational to 3D vision, robotic perception, and relies not only on low-level feature matching but also on high-level spatial reasoning and cognitive integration. Yet, it remains unclear whether current vision models can replicate this human-level proficiency. In this work, we introduce the Co-VisiON benchmark, designed to evaluate human-inspired co-visibility reasoning across more than 1,000 sparse-view indoor scenarios. Our results show that while co-visibility is often approached as a low-level feature-matching task, it remains challenging for existing vision models under sparse conditions. Notably, a proprietary vision-language model surpasses all vision-only baselines, but all models fall significantly short of human performance. This gap underscores the limitations of current architectures and motivates the need for models that integrate spatial and semantic information in a human-like manner. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, cognitively inspired reasoning in challenging, sparse environments. Our dataset and source code can be found at https://ai4ce.github.io/CoVISION.

[491] CLGRPO: Reasoning Ability Enhancement for Small VLMs

Fanyi Wang, Binzhi Dong, Haotian Hu, Jinjin Xu, Zhiwang Zhang

Main category: cs.CV

TL;DR: The paper proposes an Incremental Training Strategy to enhance the reasoning ability of Small Vision Language Models (SVLMs) by leveraging self-supervised COT data and multi-stage optimization, achieving performance comparable to larger models.

DetailsMotivation: SVLMs (≤2B parameters) are cost-effective but limited in reasoning ability. The paper aims to improve this without increasing model size.

Method: Four-stage Incremental Training Strategy: SFT for domain knowledge, GRPO for COT alignment, GRPO for reasoning enhancement, and CLGRPO to address capacity limits.

Result: Significant improvement in reasoning: +2.77 accuracy and +0.69 recall on EMOSet-118K, matching 8B model performance.

Conclusion: The strategy effectively boosts SVLM reasoning, offering a practical solution for resource-constrained applications.

Abstract: Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. Their low cost and power consumption characteristics confer high commercial value. However, their reasoning abilities are limited by the number of parameters. To address this issue, this paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. Firstly, we constructed a Self-Supervised Chain-of-Thought (COT) Data Construction System, which leverages multiple LVLMs with 7B parameters or more to transform original data into COT data in a self-supervised manner. Our proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) to the pretrained model on the COT data. Stage 2 aligns the COT data format by conducting a small amount of Group Relative Policy Optimization (GRPO) training constrained only by format rewards on the COT data. Stage 3 enhances reasoning ability by applying GRPO training on the COT data with constraints on both format and accuracy rewards. The resulting model shows significant improvement compared to the baseline. Stage 4 addresses the limited capacity of the SVLMs and the weak ability to capture complex patterns by proposing ClipLow GRPO (CLGRPO) to constrain the capture space of the training process. We conducted extensive comparative and ablation experiments on the abstract semantic recognition dataset EMOSet-118K. Experimental results demonstrate that our method significantly improves the reasoning ability of 1B SVLM. Compared to the baseline model fine-tuned on the original data, accuracy increased by 2.77 and recall by 0.69, achieving performance comparable to that of 8B models.
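
ClipLow GRPO is not specified in detail here; one plausible reading is a GRPO objective whose lower clip bound is tightened to constrain the capture space of training. The sketch below shows group-relative advantages with such an asymmetric clip; the exact clipping scheme in the paper may differ.

```python
import torch

def clgrpo_loss(logp_new, logp_old, rewards, clip_low=0.05, clip_high=0.2):
    """GRPO-style surrogate with a tighter lower clip (our reading of
    'ClipLow'). Inputs are per-sample sums of token log-probs for all
    completions of one prompt group, plus their scalar rewards."""
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * adv, clipped * adv).mean()
```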

[492] UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation

Yue Zhou, Yuan Bi, Wenjuan Tong, Wei Wang, Nassir Navab, Zhongliang Jiang

Main category: cs.CV

TL;DR: UltraAD, a vision-language model, improves anomaly detection in ultrasound images by combining few-shot learning with text embeddings for fine-grained classification and localization.

DetailsMotivation: Addressing the lack of fine-grained differentiation in anomaly detection and domain gaps in ultrasound imaging.

Method: Leverages few-shot US examples, fuses image-level tokens with text embeddings, and uses a memory bank for fine-grained classification.

Result: Outperforms state-of-the-art methods in lesion localization and fine-grained classification on breast US datasets.

Conclusion: UltraAD offers a robust solution for precise anomaly detection in medical imaging, with potential for clinical applications.

Abstract: Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.
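
The few-shot memory bank for fine-grained classification can be sketched as a cosine-similarity lookup over fused image/text features. Equal-weight fusion, k-nearest majority voting, and tensor-typed labels are assumptions, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def build_memory_bank(image_feats, text_feats):
    """Fuse few-shot image features with their text-description embeddings."""
    return F.normalize(image_feats + text_feats, dim=-1)    # (N, D)

def classify(query_feat, bank, labels, k=5):
    """Fine-grained label by cosine similarity against the memory bank."""
    sims = F.normalize(query_feat, dim=-1) @ bank.T         # (N,)
    votes = labels[sims.topk(k).indices]                    # labels: (N,) LongTensor
    return votes.mode().values                              # majority vote
```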

[493] Temporal Rate Reduction Clustering for Human Motion Segmentation

Xianghan Meng, Zhengyu Tong, Zhiyuan Huang, Chun-Guang Li

Main category: cs.CV

TL;DR: The paper proposes Temporal Rate Reduction Clustering (TR²C) for Human Motion Segmentation (HMS), improving performance by learning structured representations aligned with a Union-of-Subspaces (UoS) distribution.

DetailsMotivation: Existing subspace clustering methods for HMS assume UoS alignment, but complex human motions with cluttered backgrounds may not fit this assumption.

Method: TR²C jointly learns structured representations and affinity to segment video frames, ensuring temporal consistency and UoS alignment.

Result: Experiments on five benchmark datasets show state-of-the-art performance with various feature extractors.

Conclusion: TR²C effectively addresses HMS by aligning learned representations with UoS, outperforming existing methods.

Abstract: Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering (TR²C), which jointly learns structured representations and affinity to segment the sequences of frames in video. Specifically, the structured representations learned by TR²C enjoy temporal consistency and align well with a UoS structure, which is favorable for addressing the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors. The code is available at: https://github.com/mengxianghan123/TR2C.
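
TR²C sits in the rate-reduction family, whose central quantity is the coding rate R(Z) = ½ logdet(I + d/(nε²) ZZᵀ) of a feature matrix Z ∈ ℝ^{d×n}. A minimal sketch of that quantity follows; its exact role inside TR²C's objective is an assumption on our part.

```python
import torch

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z @ Z.T) for a
    d x n feature matrix: larger when features spread over more directions,
    the building block of rate-reduction objectives."""
    d, n = Z.shape
    I = torch.eye(d)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z @ Z.T)

Z = torch.randn(32, 128)        # 32-dim features for 128 frames
print(coding_rate(Z))           # higher => representations are more spread out
```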

[494] MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

Jian Chen, Wenye Ma, Penghang Liu, Wei Wang, Tengwei Song, Ming Li, Chenguang Wang, Jiayu Qin, Ruiyi Zhang, Changyou Chen

Main category: cs.CV

TL;DR: The paper introduces MusiXQA, a dataset for evaluating MLLMs in music sheet understanding, and proposes Phi-3-MusiX, a fine-tuned model that outperforms GPT-based methods.

DetailsMotivation: Current MLLMs lack exploration in music sheet interpretation, despite their success in other visual reasoning tasks.

Method: The authors created MusiXQA, a dataset of synthetic music sheets with structured annotations, and fine-tuned Phi-3-MusiX on it.

Result: Evaluations show current MLLMs struggle with music sheets, but Phi-3-MusiX achieves significant improvements.

Conclusion: MusiXQA and Phi-3-MusiX provide a foundation for advancing MLLMs in music sheet understanding.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we developed Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.

[495] PCLVis: Visual Analytics of Process Communication Latency in Large-Scale Simulation

Chongke Bi, Xin Gao, Baofeng Fu, Yuheng Zhao, Siming Chen, Ying Zhao, Lu Yang

Main category: cs.CV

TL;DR: PCLVis is a framework for analyzing process communication latency (PCL) in large-scale simulations using MPI data, improving simulation efficiency.

DetailsMotivation: Addressing scalability issues in supercomputing simulations caused by high communication costs, without needing physical link layer data.

Method: Uses MPI process communication data, spatial PCL event locating, correlation clustering, DAG-based path analysis, sliding window abstraction, and CS-Glyphs for visualization.

Result: Demonstrated effectiveness on TH-1A supercomputer, enabling users to optimize simulations.

Conclusion: PCLVis provides a practical solution for general users to analyze and improve simulation efficiency.

Abstract: Large-scale simulations on supercomputers have become important tools for users. However, their scalability remains a problem due to the huge communication cost among parallel processes. Most of the existing communication latency analysis methods rely on the physical link layer information, which is only available to administrators. In this paper, a framework called PCLVis is proposed to help general users analyze process communication latency (PCL) events. Instead of the physical link layer information, PCLVis uses MPI process communication data for the analysis. First, a spatial PCL event locating method is developed. All processes with high correlation are classified into a single cluster by constructing a process-correlation tree. Second, the propagation path of PCL events is analyzed by constructing a communication-dependency-based directed acyclic graph (DAG), which can help users interactively explore a PCL event from the temporal evolution of a located PCL event cluster. In this graph, a sliding window algorithm is designed to generate the PCL events abstraction. Meanwhile, a new glyph called the communication state glyph (CS-Glyph) is designed for each process to show its communication states, including its in/out messages and load balance. Each leaf node can be further unfolded to view additional information. Third, a PCL event attribution strategy is formulated to help users optimize their simulations. The effectiveness of the PCLVis framework is demonstrated by analyzing the PCL events of several simulations running on the TH-1A supercomputer. By using the proposed framework, users can greatly improve the efficiency of their simulations.

[496] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Yaqi Wang, Chengfeng Zhou, Xiaobo Li, Dahong Qian

Main category: cs.CV

TL;DR: The paper introduces CPKD, a generative framework for surgical phase recognition in endoscopic procedures, outperforming existing methods by integrating clinical knowledge and diffusion principles.

DetailsMotivation: Addressing the bottleneck of reliable surgical phase recognition in endoscopic workflows, which limits the clinical adoption of computer-assisted systems.

Method: Proposes CPKD, a generative framework using denoising diffusion principles, conditioned on visual-temporal features and clinical prior knowledge, with a conditional masking strategy.

Result: CPKD achieves superior or comparable performance to state-of-the-art methods on datasets like ESD820 and Cholec80.

Conclusion: The diffusion-based generative paradigm is effective for surgical phase recognition, enhancing precision in endoscopic procedures.

Abstract: Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center datasets demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.

[497] When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking

Weiran Li, Yeqiang Liu, Qiannan Guo, Yijie Wei, Hwa Liang Leo, Zhenbo Li

Main category: cs.CV

TL;DR: A new dataset (MFT25) and tracking framework (SU-T) for underwater multiple fish tracking, addressing unique challenges like occlusions and erratic motion.

DetailsMotivation: Underwater tracking is underexplored despite its importance for marine ecology and aquaculture.

Method: Introduces MFT25 dataset and SU-T framework with UKF for non-linear fish motion and FishIoU for morphological matching.

Result: SU-T achieves 34.1 HOTA and 44.6 IDF1, outperforming existing methods.

Conclusion: Highlights differences between fish and terrestrial tracking, releasing dataset and code for further research.

Abstract: Multiple object tracking (MOT) technology has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. In this paper, we present Multiple Fish Tracking Dataset 2025 (MFT25), a comprehensive dataset specifically designed for underwater multiple fish tracking, featuring 15 diverse video sequences with 408,578 meticulously annotated bounding boxes across 48,066 frames. Our dataset captures various underwater environments, fish species, and challenging conditions including occlusions, similar appearances, and erratic motion patterns. Additionally, we introduce Scale-aware and Unscented Tracker (SU-T), a specialized tracking framework featuring an Unscented Kalman Filter (UKF) optimized for non-linear swimming patterns of fish and a novel Fish-Intersection-over-Union (FishIoU) matching that accounts for the unique morphological characteristics of aquatic species. Extensive experiments demonstrate that our SU-T baseline achieves state-of-the-art performance on MFT25, with 34.1 HOTA and 44.6 IDF1, while revealing fundamental differences between fish tracking and terrestrial object tracking scenarios. The dataset and codes are released at https://vranlee.github.io/SU-T/.
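
The matching step of such a tracker reduces to an assignment problem over box overlaps. Below is a plain-IoU Hungarian matching sketch; FishIoU adds morphology-aware terms for aquatic body shapes, which are not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(tracks, detections, min_iou=0.3):
    """Hungarian assignment maximizing total overlap; low-overlap pairs are
    dropped so they can spawn new tracks instead."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
```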

[498] Multimodal Visual Transformer for Sim2real Transfer in Visual Reinforcement Learning

Zichun Xu, Yuntao Li, Zhaomin Wang, Lei Zhuang, Guocai Yang, Jingdong Zhao

Main category: cs.CV

TL;DR: A vision transformer-based backbone fuses RGB and depth data for better generalization, using contrastive learning and curriculum learning for sim2real transfer.

DetailsMotivation: Depth information is robust to appearance variations and carries 3D spatial details, enhancing generalization in unseen scenarios.

Method: Separate CNN stems process RGB and depth, combined features feed a vision transformer. Contrastive learning and masked tokens improve sample efficiency. Curriculum learning aids sim2real transfer.

Result: The model focuses on task-related regions and generalizes better in unseen scenarios, validated by real-world zero-shot transfer.

Conclusion: The proposed method effectively combines modalities and learning schemes for robust generalization and real-world applicability.

Abstract: Depth information is robust to scene appearance variations and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme is designed with masked and unmasked tokens to accelerate the sample efficiency during the reinforcement learning process. Simulation results demonstrate that our visual backbone can focus more on task-related regions and exhibit better generalization in unseen scenarios. For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over training processes. Finally, the feasibility of our model is validated to perform real-world manipulation tasks via zero-shot transfer.
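
The backbone description maps to a small amount of code: one CNN stem per modality, channel concatenation, then a transformer encoder over the resulting tokens. Patch geometry and channel widths below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RGBDFusionBackboneSketch(nn.Module):
    """Separate CNN stems for RGB and depth; combined convolutional features
    are flattened into tokens and fed to a transformer encoder."""
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, dim // 2, kernel_size=8, stride=8)
        self.depth_stem = nn.Conv2d(1, dim // 2, kernel_size=8, stride=8)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, rgb, depth):                   # (B,3,H,W), (B,1,H,W)
        f = torch.cat([self.rgb_stem(rgb), self.depth_stem(depth)], dim=1)
        tokens = f.flatten(2).transpose(1, 2)        # (B, N, dim)
        return self.encoder(tokens)
```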

[499] LifelongPR: Lifelong point cloud place recognition based on sample replay and prompt learning

Xianghong Zou, Jianping Li, Zhe Chen, Zhen Cao, Zhen Dong, Qiegen Liu, Bisheng Yang

Main category: cs.CV

TL;DR: LifelongPR is a continual learning framework for point cloud place recognition (PCPR) that addresses catastrophic forgetting and domain shifts, improving performance and scalability.

DetailsMotivation: Existing PCPR models suffer from catastrophic forgetting and poor adaptability to new environments, limiting their practicality in real-world applications.

Method: The framework includes a replay sample selection method for knowledge retention and a prompt learning-based CL framework for domain adaptation.

Result: LifelongPR outperforms SOTA methods with 6.50% improvement in mIR@1, 7.96% in mR@1, and an 8.95% reduction in F.

Conclusion: LifelongPR effectively addresses scalability and adaptability issues in PCPR, making it practical for real-world deployments.

Abstract: Point cloud place recognition (PCPR) determines the geo-location within a prebuilt map and plays a crucial role in geoscience and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of a geographic positioning system, PCPR models must continuously acquire, update, and accumulate knowledge to adapt to diverse and dynamic environments, i.e., the ability known as continual learning (CL). However, existing PCPR models often suffer from catastrophic forgetting, leading to significant performance degradation in previously learned scenes when adapting to new environments or sensor types. This results in poor model scalability, increased maintenance costs, and system deployment difficulties, undermining the practicality of PCPR. To address these issues, we propose LifelongPR, a novel continual learning framework for PCPR, which effectively extracts and fuses knowledge from sequential point cloud data. First, to alleviate the knowledge loss, we propose a replay sample selection method that dynamically allocates sample sizes according to each dataset’s information quantity and selects spatially diverse samples for maximal representativeness. Second, to handle domain shifts, we design a prompt learning-based CL framework with a lightweight prompt module and a two-stage training strategy, enabling domain-specific feature adaptation while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets are conducted to validate the effectiveness of the proposed method. Compared with state-of-the-art (SOTA) methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in mR@1, and an 8.95% reduction in F. The code and pre-trained models are publicly available at https://github.com/zouxianghong/LifelongPR.
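
The replay sample selection can be sketched as the two steps the abstract names: allocate the replay budget in proportion to each dataset's information quantity, then pick spatially diverse samples per dataset by greedy farthest-point sampling. Using label entropy as the information measure and geo-coordinates for diversity are assumptions.

```python
import numpy as np

def select_replay(datasets, budget):
    """datasets: list of (coords, labels) pairs, coords of shape (N, 2).
    Returns per-dataset index lists of selected replay samples."""
    entropies = []
    for _, labels in datasets:
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        entropies.append(float(-(p * np.log(p)).sum()))   # label entropy
    total = sum(entropies) + 1e-9

    picks = []
    for (coords, _), h in zip(datasets, entropies):
        k = min(max(1, round(budget * h / total)), len(coords))
        chosen = [0]                                  # seed with one sample
        dist = np.linalg.norm(coords - coords[0], axis=1)
        while len(chosen) < k:
            nxt = int(dist.argmax())                  # farthest from chosen set
            chosen.append(nxt)
            dist = np.minimum(dist,
                              np.linalg.norm(coords - coords[nxt], axis=1))
        picks.append(chosen)
    return picks
```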

[500] ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, Yihao Liu

Main category: cs.CV

TL;DR: ArtiMuse is a new MLLM-based IAA model offering joint scoring and expert-level understanding, addressing modality bias and lack of fine-grained analysis. It introduces ArtiMuse-10K, a 10,000-image expert-curated dataset with detailed annotations.

DetailsMotivation: The need for comprehensive IAA methods with quantitative scoring and professional understanding due to advancements in educational, artistic, and AIGC technologies.

Method: Development of ArtiMuse, an MLLM-based IAA model with joint scoring and expert-level understanding, and creation of ArtiMuse-10K, a detailed expert-curated dataset.

Result: ArtiMuse addresses modality bias and lack of fine-grained analysis, supported by the ArtiMuse-10K dataset for robust aesthetic assessment.

Conclusion: The model and dataset will be made public to advance the field of image aesthetics assessment.

Abstract: The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with an 8-dimensional attribute analysis and a holistic score. Both the model and dataset will be made public to advance the field.

[501] Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation

Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: A two-stage method for efficient face image quality assessment (FIQA) using teacher-student distillation and self-training to achieve high performance with low computational cost.

DetailsMotivation: To address the computational complexity of FIQA algorithms for scalable and practical deployment in real-world systems.

Method: 1. Train a teacher model using labeled data. 2. Generate pseudo-labels for unlabeled data. 3. Use pseudo-labels to distill a lightweight student model and enhance the teacher via self-training. 4. Iterate with enhanced teacher for further distillation.

Result: The student model matches teacher performance with minimal computational overhead and won the ICCV 2025 VQualA FIQA Challenge.

Conclusion: The proposed method effectively balances accuracy and efficiency, making FIQA practical for real-world applications.

Abstract: Face image quality assessment (FIQA) is essential for various face-related applications. Although FIQA has been extensively studied and achieved significant progress, the computational complexity of FIQA algorithms remains a key concern for ensuring scalability and practical deployment in real-world systems. In this paper, we aim to develop a computationally efficient FIQA method that can be easily deployed in real-world applications. Specifically, our method consists of two stages: training a powerful teacher model and distilling a lightweight student model from it. To build a strong teacher model, we adopt a self-training strategy to improve its capacity. We first train the teacher model using labeled face images, then use it to generate pseudo-labels for a set of unlabeled images. These pseudo-labeled samples are used in two ways: (1) to distill knowledge into the student model, and (2) to combine with the original labeled images to further enhance the teacher model through self-training. The enhanced teacher model is used to further pseudo-label another set of unlabeled images for distilling the student models. The student model is trained using a combination of labeled images, pseudo-labeled images from the original teacher model, and pseudo-labeled images from the enhanced teacher model. Experimental results demonstrate that our student model achieves comparable performance to the teacher model with an extremely low computational overhead. Moreover, our method achieved first place in the ICCV 2025 VQualA FIQA Challenge. The code is available at https://github.com/sunwei925/Efficient-FIQA.git.
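
The student's training signal combines teacher pseudo-scores with ground-truth labels where available. A minimal loss sketch, where the use of MSE for both terms and the weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def fiqa_distill_loss(student_scores, teacher_scores, labels=None, alpha=0.5):
    """Distill the lightweight student: regress toward teacher pseudo-scores,
    and toward ground-truth quality labels when a batch carries them."""
    loss = F.mse_loss(student_scores, teacher_scores)
    if labels is not None:
        loss = alpha * loss + (1 - alpha) * F.mse_loss(student_scores, labels)
    return loss
```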

[502] AFRDA: Attentive Feature Refinement for Domain Adaptive Semantic Segmentation

Md. Al-Masrur Khan, Durgakant Pushp, Lantao Liu

Main category: cs.CV

TL;DR: The paper introduces the Adaptive Feature Refinement (AFR) module to improve UDA-SS by balancing local and global features, enhancing segmentation accuracy.

DetailsMotivation: Existing UDA-SS methods struggle with balancing fine-grained details and global context, causing segmentation errors.

Method: AFR refines high-resolution features using semantic priors from low-resolution logits, integrates high-frequency components, and uses uncertainty-driven attention.

Result: AFR improves UDA-SS methods by 1.05% mIoU on GTA V→Cityscapes and 1.04% mIoU on Synthia→Cityscapes.

Conclusion: AFR’s lightweight design and effectiveness make it a state-of-the-art solution for UDA-SS.

Abstract: In Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real-world images) without access to target annotations. Existing UDA-SS methods often struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining high-resolution features using semantic priors from low-resolution logits. AFR also integrates high-frequency components, which capture fine-grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertainty-driven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA-based UDA methods, leading to state-of-the-art segmentation performance. Our approach improves existing UDA-SS methods by 1.05% mIoU on GTA V → Cityscapes and 1.04% mIoU on Synthia → Cityscapes. The implementation of our framework is available at: https://github.com/Masrur02/AFRDA
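
The uncertainty-driven balancing can be sketched as entropy-weighted blending: where the low-resolution prediction is uncertain, lean on fine-grained local features; where it is confident, keep the global context. Linear blending by normalized entropy is an assumption, not AFR's exact attention form.

```python
import torch
import torch.nn.functional as F

def uncertainty_fusion(local_feat, global_feat, logits_lowres):
    """Blend local and global features per pixel, weighted by the entropy
    (uncertainty) of the low-resolution semantic prediction."""
    p = F.softmax(logits_lowres, dim=1)                       # (B, C, h, w)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(1, keepdim=True)
    w = entropy / entropy.amax(dim=(2, 3), keepdim=True).clamp_min(1e-8)
    w = F.interpolate(w, size=local_feat.shape[-2:],
                      mode="bilinear", align_corners=False)
    return w * local_feat + (1 - w) * global_feat             # (B, C, H, W)
```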

[503] A New One-Shot Federated Learning Framework for Medical Imaging Classification with Feature-Guided Rectified Flow and Knowledge Distillation

Yufei Ma, Hanwen Zhang, Qiya Yang, Guibo Luo, Yuesheng Zhu

Main category: cs.CV

TL;DR: A modified OSFL framework with FG-RF and DLKD improves training efficiency and privacy in healthcare, outperforming multi-round FL methods by up to 21.73%.

DetailsMotivation: Address low efficiency, privacy risks, and non-IID data challenges in OSFL for healthcare.

Method: Proposes FG-RF for feature-level image synthesis and DLKD for dual-layer knowledge distillation.

Result: Achieves up to 21.73% improvement over multi-round FL and reduces privacy leakage risks.

Conclusion: The framework is effective for non-IID medical imaging, balancing performance and privacy.

Abstract: In multi-center scenarios, One-Shot Federated Learning (OSFL) has attracted increasing attention due to its low communication overhead, requiring only a single round of transmission. However, existing generative model-based OSFL methods suffer from low training efficiency and potential privacy leakage in the healthcare domain. Additionally, achieving convergence within a single round of model aggregation is challenging under non-Independent and Identically Distributed (non-IID) data. To address these challenges, this paper proposes a modified OSFL framework in which a new Feature-Guided Rectified Flow Model (FG-RF) and a Dual-Layer Knowledge Distillation (DLKD) aggregation method are developed. FG-RF on the client side accelerates generative modeling in medical imaging scenarios while preserving privacy by synthesizing feature-level images rather than pixel-level images. To handle non-IID distributions, DLKD enables the global student model to simultaneously mimic the output logits and align the intermediate-layer features of client-side teacher models during aggregation. Experimental results on three non-IID medical imaging datasets show that our new framework and method outperform multi-round federated learning approaches, achieving up to 21.73% improvement, and exceeds the baseline FedISCA by an average of 21.75%. Furthermore, our experiments demonstrate that feature-level synthetic images significantly reduce privacy leakage risks compared to pixel-level synthetic images. The code is available at https://github.com/LMIAPC/one-shot-fl-medical.
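
DLKD's two alignment targets, output logits and intermediate features, translate into a compact loss. A minimal sketch, where the temperature, the KL form, and the weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def dlkd_loss(s_logits, t_logits, s_feat, t_feat, T=2.0, beta=0.5):
    """Dual-layer distillation: the global student mimics teacher logits
    (KL at temperature T) and aligns intermediate-layer features (MSE)."""
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    feat = F.mse_loss(s_feat, t_feat)
    return kd + beta * feat
```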

[504] FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images

Hao-Yu Hou, Chun-Yi Lee, Motoharu Sonogashira, Yasutomo Kawanishi

Main category: cs.CV

TL;DR: FROSS is a faster-than-real-time method for generating 3D semantic scene graphs (SSGs) by lifting 2D scene graphs to 3D and using Gaussian distributions, avoiding intensive point cloud processing.

DetailsMotivation: Existing 3D SSG methods are computationally demanding and non-incremental, limiting real-time applications.

Method: FROSS lifts 2D scene graphs to 3D, representing objects as Gaussian distributions, and avoids point cloud processing.

Result: FROSS outperforms prior methods in speed and performance on ReplicaSSG and 3DSSG datasets.

Conclusion: FROSS enables efficient, real-time 3D SSG generation, with publicly available implementation and dataset.

Abstract: The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at https://github.com/Howardkhh/FROSS.

[505] TARS: MinMax Token-Adaptive Preference Strategy for MLLM Hallucination Reduction

Kejia Zhang, Keda Tao, Zhiming Luo, Chang Liu, Jiasheng Tang, Huan Wang

Main category: cs.CV

TL;DR: TARS, a token-adaptive preference strategy, improves multimodal large language models (MLLMs) by reducing hallucinations through dynamic preference optimization, outperforming standard DPO and matching GPT-4o.

DetailsMotivation: MLLMs often produce factually incorrect or visually ungrounded outputs, reducing reliability. Existing DPO methods overfit to superficial cues, impairing grounding.

Method: TARS reformulates DPO as a min-max optimization problem, maximizing token-level shifts under semantic constraints to simulate uncertainty and minimize preference loss.

Result: TARS reduces hallucination rates from 26.4% to 13.2% and cognition value from 2.5 to 0.4, outperforming standard DPO and matching GPT-4o.

Conclusion: TARS effectively mitigates hallucinations in MLLMs by preserving causal grounding and avoiding overfitting, demonstrating strong performance with minimal data.

Abstract: Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.

[506] MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

Zijian Dong, Longteng Duan, Jie Song, Michael J. Black, Andreas Geiger

Main category: cs.CV

TL;DR: MoGA reconstructs high-fidelity 3D Gaussian avatars from single-view images by combining a generative avatar model with 2D diffusion models, ensuring 3D consistency and realism.

DetailsMotivation: The challenge is to infer unseen appearance and geometric details from a single-view image while maintaining 3D consistency, as previous methods relying on 2D diffusion models produce inconsistent and unrealistic results.

Method: MoGA integrates a generative avatar model as a prior, projects input images into its latent space, and enforces 3D constraints. It formulates avatar creation as model inversion, using synthetic views from 2D diffusion models for fitting.

Result: MoGA outperforms state-of-the-art methods, generalizes well to real-world scenarios, and produces animatable avatars.

Conclusion: The method successfully addresses the limitations of previous approaches by combining generative and diffusion models, achieving high-fidelity 3D avatar reconstruction.

Abstract: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model, that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to limited 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as model inversion by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides an initialization for model fitting, enforces 3D regularization, and helps in refining pose. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable. For code, see https://zj-dong.github.io/MoGA/.

[507] Self-Navigated Residual Mamba for Universal Industrial Anomaly Detection

Hanxi Li, Jingqi Wu, Lin Yuanbo Wu, Mingliang Li, Deyin Liu, Jialie Shen, Chunhua Shen

Main category: cs.CV

TL;DR: SNARM is a novel framework for industrial anomaly detection using self-referential learning and dynamic reference selection to enhance anomaly discrimination.

DetailsMotivation: To improve anomaly detection by dynamically refining detection through in-image references, unlike conventional methods relying on pre-trained features.

Method: SNARM computes inter-residuals from test patches and training features, uses small-norm residuals as references for intra-residuals, and employs a Mamba module with dynamic navigation for anomaly focus.

Result: Achieves SOTA performance on MVTec AD, MVTec 3D, and VisA benchmarks, with improvements in Image-AUROC, Pixel-AUROC, PRO, and AP.

Conclusion: SNARM’s self-referential learning and dynamic navigation significantly enhance anomaly detection, outperforming existing methods.

Abstract: In this paper, we propose Self-Navigated Residual Mamba (SNARM), a novel framework for universal industrial anomaly detection that leverages “self-referential learning” within test images to enhance anomaly discrimination. Unlike conventional methods that depend solely on pre-trained features from normal training data, SNARM dynamically refines anomaly detection by iteratively comparing test patches against adaptively selected in-image references. Specifically, we first compute the “inter-residual” features by contrasting test image patches with the training feature bank. Patches exhibiting small-norm residuals (indicating high normality) are then utilized as self-generated reference patches to compute “intra-residuals”, amplifying discriminative signals. These inter- and intra-residual features are concatenated and fed into a novel Mamba module with multiple heads, which are dynamically navigated by residual properties to focus on anomalous regions. Finally, AD results are obtained by aggregating the outputs of the self-navigated Mamba in an ensemble learning paradigm. Extensive experiments on MVTec AD, MVTec 3D, and VisA benchmarks demonstrate that SNARM achieves state-of-the-art (SOTA) performance, with notable improvements in all metrics, including Image-AUROC, Pixel-AUROC, PRO, and AP.
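
A minimal NumPy sketch of the two residual stages as we read them from the abstract; the patch/feature shapes, nearest-neighbor matching, and number of self-generated references `k_ref` are our assumptions, not the paper's exact design.

```python
import numpy as np

def snarm_residuals(test_patches, feature_bank, k_ref=16):
    """Two-stage residuals: test_patches (N, D), feature_bank (M, D)."""
    # Inter-residuals: each test patch minus its nearest training feature.
    dists = np.linalg.norm(test_patches[:, None, :] - feature_bank[None, :, :], axis=-1)
    inter = test_patches - feature_bank[dists.argmin(axis=1)]

    # Patches with small-norm inter-residuals look normal; reuse them as
    # self-generated in-image references.
    norms = np.linalg.norm(inter, axis=1)
    refs = test_patches[np.argsort(norms)[:k_ref]]

    # Intra-residuals: each patch minus its nearest self-generated reference.
    d_ref = np.linalg.norm(test_patches[:, None, :] - refs[None, :, :], axis=-1)
    intra = test_patches - refs[d_ref.argmin(axis=1)]

    # In the full method, the concatenation feeds the multi-head Mamba module.
    return np.concatenate([inter, intra], axis=1)
```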

[508] StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion

Haoxin Yang, Weihong Chen, Xuemiao Xu, Cheng Xu, Peng Xiao, Cuifeng Sun, Shaoyu Huang, Shengfeng He

Main category: cs.CV

TL;DR: StarPose is an autoregressive diffusion framework for monocular 3D human pose estimation, enhancing accuracy and temporal consistency by integrating historical pose predictions and spatial-temporal physical guidance.

DetailsMotivation: Addressing the limitations of existing diffusion-based methods, which lack spatial-temporal correlations and temporal consistency in 3D pose predictions.

Method: Proposes StarPose, an autoregressive diffusion framework with a Historical Pose Integration Module (HPIM) and Spatial-Temporal Physical Guidance (STPG) mechanism.

Result: Outperforms state-of-the-art methods in accuracy and temporal consistency on benchmark datasets.

Conclusion: StarPose effectively improves 3D human pose estimation by leveraging historical data and spatial-temporal guidance, offering robust and realistic predictions.

Abstract: Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergistically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, yielding robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at https://github.com/wileychan/StarPose.

[509] SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching

Xiangzeng Liu, Chi Wang, Guanglu Shi, Xiaodong Zhang, Qiguang Miao, Miao Fan

Main category: cs.CV

TL;DR: SGAD improves area-based matching by generating discriminative descriptors, avoiding complex graph optimization, and using novel supervision and filtering methods, achieving significant accuracy and efficiency gains.

DetailsMotivation: Existing A2PM methods rely on inefficient pixel-level comparisons and complex graph matching, limiting scalability and performance.

Method: SGAD introduces a Semantic and Geometric-aware Descriptor Network for direct matching, a novel supervision strategy, and the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas.

Result: SGAD reduces runtime by 60x, improves accuracy in outdoor and indoor pose estimation, and sets a new state-of-the-art.

Conclusion: SGAD rethinks area-based matching, offering a scalable, efficient, and accurate solution for local feature matching.

Abstract: Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both accuracy and efficiency of area matching. We further improve the performance of area matching through a novel supervision strategy that decomposes the area matching task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60x (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39% AUC@5° in indoor pose estimation, establishing a new state-of-the-art.
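
The key claim above is that discriminative area descriptors permit direct matching without graph optimization. A minimal sketch of such direct matching, assuming mutual nearest neighbors under cosine similarity (the paper's actual matching rule may differ):

```python
import numpy as np

def match_areas(desc_a, desc_b, min_sim=0.5):
    """Direct area matching: mutual nearest neighbors by cosine similarity.
    desc_a (Na, D) and desc_b (Nb, D) are per-area descriptors."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T
    best_b = sim.argmax(axis=1)  # best match in B for each area in A
    best_a = sim.argmax(axis=0)  # best match in A for each area in B
    return [(i, j) for i, j in enumerate(best_b)
            if best_a[j] == i and sim[i, j] >= min_sim]
```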

[510] Zero-shot Compositional Action Recognition with Neural Logic Constraints

Gefan Ye, Lin Li, Kexin Li, Jun Xiao, Long Chen

Main category: cs.CV

TL;DR: LogicCAR introduces dual symbolic constraints for zero-shot compositional action recognition, addressing spurious correlations and semantic ambiguity by modeling compositional and hierarchical structures.

DetailsMotivation: Existing methods lack explicit compositional and hierarchical constraints, leading to spurious correlations and semantic ambiguity in zero-shot compositional action recognition.

Method: LogicCAR integrates Explicit Compositional Logic and Hierarchical Primitive Logic, formalized in first-order logic and embedded into neural networks.

Result: LogicCAR outperforms baselines on the Sth-com dataset, demonstrating the effectiveness of its logic-driven constraints.

Conclusion: Human-like symbolic reasoning, as implemented in LogicCAR, effectively addresses challenges in zero-shot compositional action recognition.

Abstract: Zero-shot compositional action recognition (ZS-CAR) aims to identify unseen verb-object compositions in videos by exploiting the learned knowledge of verb and object primitives during training. Despite compositional learning’s progress in ZS-CAR, two critical challenges persist: 1) Missing compositional structure constraint, leading to spurious correlations between primitives; 2) Neglecting semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. To this end, we propose a logic-driven ZS-CAR framework, LogicCAR, that integrates dual symbolic constraints: Explicit Compositional Logic and Hierarchical Primitive Logic. Specifically, the former models the restrictions within the compositions, enhancing the compositional reasoning ability of our model. The latter investigates the semantic dependencies among different primitives, empowering the models with fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models. Extensive experiments on the Sth-com dataset demonstrate that our LogicCAR outperforms existing baseline methods, proving the effectiveness of our logic-driven constraints.
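
As a toy illustration of embedding a first-order constraint into a differentiable loss (our construction, not the paper's): the implication composition(v, o) → verb(v) ∧ object(o) can be relaxed with a product t-norm and penalized when violated.

```python
import torch

def implication_loss(p_comp, p_verb, p_obj):
    """Soft penalty for violating composition(v,o) -> verb(v) AND object(o).
    Inputs are model probabilities in [0, 1]; product t-norm relaxes AND."""
    return torch.relu(p_comp - p_verb * p_obj).mean()

# Toy usage: a confident composition with an unsupported object is penalized.
loss = implication_loss(torch.tensor([0.9]), torch.tensor([0.95]), torch.tensor([0.3]))
```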

[511] Engagement Prediction of Short Videos with Large Multimodal Models

Wei Sun, Linhan Cao, Yuqin Cao, Weixia Zhang, Wen Wen, Kaiwei Zhang, Zijian Chen, Fangfang Lu, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: The paper explores using large multimodal models (LMMs) for video engagement prediction, highlighting their effectiveness and the importance of audio features.

DetailsMotivation: The rise of user-generated content on short-form video platforms necessitates better engagement prediction for recommendations and content creation, but existing methods struggle with cross-modality interactions.

Method: Two LMMs, VideoLLaMA2 (audio, visual, language) and Qwen2.5-VL (visual, language), are trained on the SnapUGC dataset to predict engagement.

Result: Both models perform competitively, with VideoLLaMA2 outperforming Qwen2.5-VL, emphasizing audio’s role. Ensembling them won the ICCV VQualA 2025 challenge.

Conclusion: LMMs are effective for engagement prediction, with multimodal integration (especially audio) enhancing performance.

Abstract: The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, background sound, etc., but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.

[512] Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Shaoguang Wang, Jianxiang He, Yijie Xu, Ziyang Chen, Weiyu Guo, Hui Xiong

Main category: cs.CV

TL;DR: The paper introduces Adaptive Frame-Pruning (AFP) to reduce token costs in Video-QA by pruning redundant frames and using a semantic graph for context, achieving efficiency and accuracy gains.

DetailsMotivation: High token costs and performance degradation from excessive frames in Video-QA motivate the need for a method to reduce redundancy without losing critical information.

Method: AFP uses adaptive hierarchical clustering on fused ResNet-50 and CLIP features to prune frames and a lightweight semantic graph for context.

Result: AFP reduces frames by 86.9% and tokens by 83.2%, often improving accuracy over baselines.

Conclusion: AFP efficiently addresses redundancy in Video-QA, enhancing performance while reducing computational costs.

Abstract: The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a “less is more” phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term ‘visual echoes’. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. In extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach achieves a drastic reduction in required frames of up to 86.9% and in total input tokens of up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
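
A compact sketch of the pruning step as we understand it: agglomerative clustering over (fused) frame features, keeping one representative per cluster. The linkage method, distance threshold, and feature fusion are assumptions; the paper's adaptive clustering is more involved.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def prune_keyframes(features, distance_threshold=0.35):
    """Merge 'visual echoes': features (n_frames, D), e.g. fused ResNet-50+CLIP."""
    # Normalize so Euclidean distance tracks cosine similarity.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Agglomerative clustering groups visually redundant frames.
    labels = fcluster(linkage(feats, method="average"),
                      t=distance_threshold, criterion="distance")
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = feats[idx].mean(axis=0)
        # Keep the frame closest to the cluster centroid as its representative.
        keep.append(idx[np.argmin(np.linalg.norm(feats[idx] - centroid, axis=1))])
    return sorted(keep)
```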

[513] Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park

Main category: cs.CV

TL;DR: Uni3R is a feed-forward framework for joint 3D scene reconstruction and semantic understanding from unposed multi-view images, achieving state-of-the-art results.

DetailsMotivation: Addressing the challenge of decoupling semantic understanding from 3D reconstruction and costly per-scene optimization in conventional methods.

Method: Uses a Cross-View Transformer to integrate multi-view inputs and regress 3D Gaussian primitives with semantic feature fields.

Result: Achieves 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet, enabling high-fidelity novel view synthesis and semantic segmentation.

Conclusion: Uni3R introduces a scalable, generalizable paradigm for unified 3D scene reconstruction and understanding.

Abstract: Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.

[514] Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images

Michele Andrade, Guilherme A. L. Silva, Valéria Santos, Gladston Moreira, Eduardo Luz

Main category: cs.CV

TL;DR: The paper explores the impact of large-scale pre-training datasets on deep learning models for estimating nutritional content from 2D food images, finding proprietary datasets like JFT-300M outperform public ones like ImageNet and COYO.

DetailsMotivation: Nutritional estimation from images is vital for health monitoring but challenging due to variability in food presentation and lack of depth information. Reproducibility is limited by reliance on proprietary datasets.

Method: Fine-tuned Vision Transformer (ViT) models pre-trained on ImageNet and COYO, compared against CNN baselines (InceptionV2, ResNet-50) and a state-of-the-art method using JFT-300M. Evaluated on Nutrition5k dataset.

Result: JFT-300M pre-trained models outperformed public dataset models. COYO unexpectedly performed worse than ImageNet, contradicting initial hypotheses.

Conclusion: Pre-training dataset characteristics (scale, domain relevance, curation quality) are crucial for effective transfer learning in 2D nutritional estimation.

Abstract: Estimating the nutritional content of food from images is a critical task with significant implications for health and dietary monitoring. This is challenging, especially when relying solely on 2D images, due to the variability in food presentation, lighting, and the inherent difficulty in inferring volume and mass without depth information. Furthermore, reproducibility in this domain is hampered by the reliance of state-of-the-art methods on proprietary datasets for large-scale pre-training. In this paper, we investigate the impact of large-scale pre-training datasets on the performance of deep learning models for nutritional estimation using only 2D images. We fine-tune and evaluate Vision Transformer (ViT) models pre-trained on two large public datasets, ImageNet and COYO, comparing their performance against baseline CNN models (InceptionV2 and ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset. We conduct extensive experiments on the Nutrition5k dataset, a large-scale collection of real-world food plates with high-precision nutritional annotations. Our evaluation using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAE%) reveals that models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets. Unexpectedly, the model pre-trained on the massive COYO dataset performs worse than the model pre-trained on ImageNet for this specific regression task, refuting our initial hypothesis. Our analysis provides quantitative evidence highlighting the critical role of pre-training dataset characteristics, including scale, domain relevance, and curation quality, for effective transfer learning in 2D nutritional estimation.
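
A minimal sketch of the fine-tuning setup implied by the abstract: a torchvision ViT with its classification head swapped for a 5-way regression head (the five Nutrition5k targets: calories, mass, fat, carbohydrates, protein). The optimizer, learning rate, and direct use of an L1 loss (matching the MAE metric) are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Pre-trained ViT-B/16 with a regression head replacing the classifier.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Linear(model.hidden_dim, 5)  # 5 nutritional targets

criterion = nn.L1Loss()  # optimizes MAE directly
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images, targets):
    """One fine-tuning step on a batch of food images and nutrient vectors."""
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```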

[515] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, Hao Sun

Main category: cs.CV

TL;DR: TSPO improves long-form video understanding in MLLMs via reinforcement learning for keyframe selection and language generation.

DetailsMotivation: Address challenges in processing long-duration videos due to MLLMs' context limits and training costs.

Method: Proposes TSPO, a reinforcement learning paradigm for event-aware keyframe selection and joint optimization.

Result: Achieves state-of-the-art performance on long video benchmarks and shows transferability.

Conclusion: TSPO effectively enhances Video-MLLMs’ long-form understanding with scalable and transferable solutions.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs’ context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs’ long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO

cs.AI

[516] Solving Pasur Using GPU-Accelerated Counterfactual Regret Minimization

Sina Baghal

Main category: cs.AI

TL;DR: The paper introduces a CUDA-accelerated framework for simulating Pasur, a complex fishing card game, using CFR for near-Nash equilibria. It addresses memory and computational challenges with efficient game tree decomposition and backward training.

DetailsMotivation: Pasur's intricate rules and large game tree pose unique computational challenges, motivating the development of an efficient framework for solving such games.

Method: The framework uses PyTorch CUDA tensors for rule complexity and decomposes the game tree into states and inherited scores. It employs backward training and constructs a full game tree with reduced memory overhead.

Result: The approach successfully computes near-Nash equilibria and trains a tree-based model for strategy prediction, validated through large-scale self-play simulations.

Conclusion: The framework is effective for Pasur and can be extended to other multi-round games or sequential decision problems in reinforcement learning.

Abstract: Pasur is a fishing card game played over six rounds, similar in play to games such as Cassino, Scopa, and Bastra. This paper introduces a CUDA-accelerated computational framework for simulating Pasur, emphasizing efficient memory management. We use our framework to compute near-Nash equilibria via Counterfactual Regret Minimization (CFR), a well-known algorithm for solving large imperfect-information games. Solving Pasur presents unique challenges due to its intricate rules and the large size of its game tree. We handle rule complexity using PyTorch CUDA tensors, and to address the memory-intensive nature of the game, we decompose the game tree into two key components: (1) actual game states, and (2) inherited scores from previous rounds. We construct the Full Game Tree by pairing card states with accumulated scores in the Unfolding Process. This design reduces memory overhead by storing only essential strategy values and node connections. To further manage computational complexity, we apply a round-by-round backward training strategy, starting from the final round and recursively propagating average utilities to earlier stages. Our approach constructs the complete game tree, which on average consists of over $10^9$ nodes. We provide detailed implementation snippets. After computing a near-Nash equilibrium strategy, we train a tree-based model to predict these strategies for use during gameplay. We then estimate the fair value of each deck through large-scale self-play between equilibrium strategies by simulating, for instance, 10,000 games per matchup, executed in parallel using GPU acceleration. Similar frameworks can be extended to other reinforcement learning settings where the action tree naturally decomposes into multiple rounds, such as turn-based strategy games or sequential trading decisions in financial markets.
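
At the core of CFR is regret matching: each information set plays actions in proportion to their positive cumulative counterfactual regret. A scalar sketch of that update (the paper applies it in parallel over CUDA tensors of game states; the simplification is ours):

```python
import numpy as np

def regret_matching(cum_regret):
    """Per-information-set strategy update at the heart of CFR."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    # Uniform play when no action has positive regret yet.
    return pos / total if total > 0 else np.full(len(cum_regret), 1.0 / len(cum_regret))

# Toy usage for a 3-action information set.
print(regret_matching(np.array([2.0, -1.0, 0.5])))  # -> approximately [0.8, 0.0, 0.2]
```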

[517] Operationalizing Serendipity: Multi-Agent AI Workflows for Enhanced Materials Characterization with Theory-in-the-Loop

Lance Yao, Suman Samantray, Ayana Ghosh, Kevin Roccapriore, Libor Kovarik, Sarah Allec, Maxim Ziatdinov

Main category: cs.AI

TL;DR: SciLink is an AI framework designed to foster serendipitous discoveries in materials research by linking experimental data, novelty assessment, and simulations.

DetailsMotivation: Modern labs prioritize efficiency over unplanned findings, missing serendipitous discoveries. SciLink aims to bridge this gap.

Method: Uses hybrid AI: ML models for data analysis and LLMs for reasoning, converting raw data into claims scored for novelty.

Result: Demonstrated versatility in diverse scenarios, integrating human guidance and proposing follow-up experiments.

Conclusion: SciLink enhances efficiency while cultivating serendipity, bridging automated experimentation and open-ended research.

Abstract: The history of science is punctuated by serendipitous discoveries, where unexpected observations, rather than targeted hypotheses, opened new fields of inquiry. While modern autonomous laboratories excel at accelerating hypothesis testing, their optimization for efficiency risks overlooking these crucial, unplanned findings. To address this gap, we introduce SciLink, an open-source, multi-agent artificial intelligence framework designed to operationalize serendipity in materials research by creating a direct, automated link between experimental observation, novelty assessment, and theoretical simulations. The framework employs a hybrid AI strategy where specialized machine learning models perform quantitative analysis of experimental data, while large language models handle higher-level reasoning. These agents autonomously convert raw data from materials characterization techniques into falsifiable scientific claims, which are then quantitatively scored for novelty against the published literature. We demonstrate the framework’s versatility across diverse research scenarios, showcasing its application to atomic-resolution and hyperspectral data, its capacity to integrate real-time human expert guidance, and its ability to close the research loop by proposing targeted follow-up experiments. By systematically analyzing all observations and contextualizing them, SciLink provides a practical framework for AI-driven materials research that not only enhances efficiency but also actively cultivates an environment ripe for serendipitous discoveries, thereby bridging the gap between automated experimentation and open-ended scientific exploration.

[518] IRL-VLA: Training a Vision-Language-Action Policy via Reward World Model

Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, Jijun Wang, Zichong Gu, Hao Jiang, Li Sun

Main category: cs.AI

TL;DR: The paper introduces IRL-VLA, a three-stage framework combining imitation learning, inverse reinforcement learning, and PPO to improve Vision-Language-Action models for autonomous driving, achieving top performance in benchmarks.

DetailsMotivation: Existing VLA models in autonomous driving face challenges like suboptimal open-loop imitation learning and inefficiencies in close-loop training due to simulation gaps.

Method: A three-stage approach: (1) pretrain VLA policy via imitation learning, (2) build a lightweight reward world model using inverse reinforcement learning, and (3) enhance planning with PPO-based reinforcement learning.

Result: Achieves state-of-the-art performance in NAVSIM v2 and ranks 1st runner-up in CVPR2025 Autonomous Grand Challenge.

Conclusion: The IRL-VLA framework advances close-loop autonomous driving by addressing key challenges and improving performance.

Abstract: Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) Existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to capture the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) Close-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel close-loop reinforcement learning framework built around an Inverse Reinforcement Learning (IRL) reward world model and a self-built VLA approach. Our framework proceeds in a three-stage paradigm: In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient close-loop reward computation. Finally, to further enhance planning performance, we design a specialized reward-world-model-guided reinforcement learning stage via PPO (Proximal Policy Optimization) to effectively balance safety incidents, comfortable driving, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and is the 1st runner-up in the CVPR2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in close-loop autonomous driving.

[519] CountQA: How Well Do MLLMs Count in the Wild?

Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, Sahiti Yerramilli

Main category: cs.AI

TL;DR: The paper introduces CountQA, a benchmark to evaluate object counting in Multimodal Large Language Models (MLLMs), revealing their poor performance (42.9% accuracy) in complex scenarios.

DetailsMotivation: MLLMs lack fundamental object counting skills, limiting their real-world reliability, which remains unevaluated in complex scenarios.

Method: CountQA, a new benchmark with 1,500 question-answer pairs featuring high object density, clutter, and occlusion, is used to evaluate 15 MLLMs.

Result: Top-performing MLLM achieves only 42.9% accuracy, with performance worsening as object counts increase.

Conclusion: CountQA addresses a critical gap, enabling future MLLMs to improve in numerical and spatial awareness.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical lack in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densities or are confined to specific visual domains, failing to test models under realistic conditions. Addressing this gap, we introduce CountQA, a challenging new benchmark designed to probe this deficiency. Comprising over 1,500 question-answer pairs, CountQA features real-world images with high object density, clutter, and occlusion. We investigate this weakness by evaluating 15 prominent MLLMs on the CountQA benchmark and reveal that the top-performing model achieves a mere 42.9% accuracy, with performance declining as object counts rise. By providing a dedicated benchmark to diagnose and rectify this core weakness, CountQA paves the way for a new generation of MLLMs that are not only descriptively fluent but also numerically grounded and spatially aware. We will open-source the dataset and code upon paper acceptance to foster further research.

[520] Formal Concept Analysis: a Structural Framework for Variability Extraction and Analysis

Jessie Galasso

Main category: cs.AI

TL;DR: The paper explores how Formal Concept Analysis (FCA) can be used for variability analysis by identifying key properties of FCA and their application in interpreting variability information.

DetailsMotivation: FCA is a powerful tool for knowledge representation and clustering, but its potential for variability-related tasks is not fully explored due to its mathematical complexity. This paper aims to clarify how FCA properties can be leveraged for variability analysis.

Method: The authors gather and analyze essential properties of FCA, demonstrating their utility in interpreting variability information within conceptual structures.

Result: The paper identifies key FCA properties that are crucial for variability analysis and provides insights into their application for understanding variability in clustered data.

Conclusion: The study bridges a gap in understanding how FCA can be effectively used for variability analysis, offering a clearer pathway for leveraging its properties in such tasks.

Abstract: Formal Concept Analysis (FCA) is a mathematical framework for knowledge representation and discovery. It performs a hierarchical clustering over a set of objects described by attributes, resulting in conceptual structures in which objects are organized depending on the attributes they share. These conceptual structures naturally highlight commonalities and variabilities among similar objects by categorizing them into groups which are then arranged by similarity, making it particularly appropriate for variability extraction and analysis. Despite the potential of FCA, determining which of its properties can be leveraged for variability-related tasks (and how) is not always straightforward, partly due to the mathematical orientation of its foundational literature. This paper attempts to bridge part of this gap by gathering a selection of the framework’s properties that are essential to variability analysis and describing how they can be used to interpret diverse variability information within the resulting conceptual structures.
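
To make the underlying machinery concrete, here is a self-contained toy: a formal context mapping objects (say, product variants) to attributes (features), with both derivation operators and a brute-force enumeration of the formal concepts they induce. The example data is invented purely for illustration.

```python
from itertools import chain, combinations

# Toy formal context: objects -> attribute sets (illustrative only).
context = {
    "app_a": {"gui", "logging"},
    "app_b": {"gui", "logging", "encryption"},
    "app_c": {"cli", "logging"},
}

def common_attributes(objs):
    """Derivation A': attributes shared by every object in objs."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else set(chain.from_iterable(context.values()))

def objects_having(attrs):
    """Derivation B': objects possessing every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

# A formal concept is a pair (A, B) with A' = B and B' = A; enumerate by closure.
concepts = set()
for r in range(len(context) + 1):
    for objs in combinations(context, r):
        B = common_attributes(set(objs))
        A = objects_having(B)
        concepts.add((frozenset(A), frozenset(B)))

# Concepts ordered by extent size expose the commonality/variability hierarchy.
for A, B in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(A), "<->", sorted(B))
```

Here "logging" is a commonality (shared by all objects), while "gui" vs. "cli" and the optional "encryption" surface as variability in the smaller concepts.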

[521] Zero-Shot Cellular Trajectory Map Matching

Weijie Shi, Yue Cui, Hao Chen, Jiaming Li, Mengze Li, Jia Zhu, Jiajie Xu, Xiaofang Zhou

Main category: cs.AI

TL;DR: Proposes a pixel-based trajectory calibration assistant for zero-shot CTMM, improving accuracy by 16.8% without region-specific training.

DetailsMotivation: Current CTMM methods rely on region-specific data, limiting adaptability to unexplored areas. Zero-shot CTMM aims to overcome this by leveraging transferable geospatial knowledge.

Method: Uses a pixel-based trajectory calibration assistant, Gaussian mixture model in VAE for scenario-adaptive experts, and a spatial-temporal awareness module for positioning errors. A constrained path-finding algorithm reconstructs road ID sequences.

Result: Outperforms existing methods by 16.8% in zero-shot CTMM.

Conclusion: The proposed method enhances adaptability and accuracy in CTMM without requiring region-specific training, making it suitable for unexplored areas.

Abstract: Cellular Trajectory Map-Matching (CTMM) aims to align cellular location sequences to road networks, which is a necessary preprocessing step in location-based services on web platforms like Google Maps, including navigation and route optimization. Current approaches mainly rely on ID-based features and region-specific data to learn correlations between cell towers and roads, limiting their adaptability to unexplored areas. To enable high-accuracy CTMM without additional training in target regions, zero-shot CTMM requires extracting not only region-adaptive features but also sequential features and location uncertainty to alleviate positioning errors in cellular data. In this paper, we propose a pixel-based trajectory calibration assistant for zero-shot CTMM, which takes advantage of transferable geospatial knowledge to calibrate the pixelated trajectory and then guide the path-finding process at the road network level. To enhance knowledge sharing across similar regions, a Gaussian mixture model is incorporated into a VAE, enabling the identification of scenario-adaptive experts through soft clustering. To mitigate high positioning errors, a spatial-temporal awareness module is designed to capture sequential features and location uncertainty, thereby facilitating the inference of approximate user positions. Finally, a constrained path-finding algorithm is employed to reconstruct the road ID sequence, ensuring topological validity within the road network. This process is guided by the calibrated trajectory while optimizing for the shortest feasible path, thus minimizing unnecessary detours. Extensive experiments demonstrate that our model outperforms existing methods in zero-shot CTMM by 16.8%.
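
A sketch of the final constrained path-finding step as we read it: Dijkstra over the road graph with edge costs biased toward the calibrated pixel trajectory. The `guide_cost` penalty and graph encoding are assumptions, and the sketch assumes the goal is reachable.

```python
import heapq

def constrained_shortest_path(graph, start, goal, guide_cost):
    """graph: node -> [(neighbor, edge_weight)]; guide_cost: node -> penalty
    for straying from the calibrated trajectory (assumed formulation)."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w + guide_cost.get(v, 0.0)  # detours away from the guide cost more
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # Reconstruct the road ID sequence (assumes goal was reached).
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]
```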

[522] Probabilistic Circuits for Knowledge Graph Completion with Reduced Rule Sets

Jaikrishna Manojkumar Patil, Nathaniel Lee, Al Mehdi Saadat Chowdhury, YooJung Choi, Paulo Shakarian

Main category: cs.AI

TL;DR: The paper introduces a method to reduce the number of rules in knowledge graph completion by discovering rule contexts and using probabilistic circuits, achieving competitive performance with fewer rules.

DetailsMotivation: Rule-based methods for knowledge graph completion are explainable but require many rules, which can hinder explainability. The goal is to reduce rule count while maintaining performance.

Method: Discover rule contexts from training data and use probabilistic circuits to distribute probabilities over these contexts, enabling faster performance with fewer rules.

Result: Achieves 70-96% reduction in rules, outperforms baselines by up to 31x, and preserves 91% of peak performance. Validated on 8 benchmark datasets.

Conclusion: The framework is effective, grounded in probabilistic logic, and has implications for probabilistic reasoning over rule sets.

Abstract: Rule-based methods for knowledge graph completion provide explainable results but often require a very large number of rules to achieve competitive performance, which can itself hinder explainability. We discover rule contexts (meaningful subsets of rules that work together) from training data and use a learned probability distribution (i.e., probabilistic circuits) over these rule contexts to more rapidly approach the performance of the full rule set. Our approach achieves a 70-96% reduction in the number of rules used while outperforming the baseline by up to 31$\times$ when using an equivalent minimal number of rules, and preserves 91% of peak baseline performance even when comparing our minimal rule sets against the baseline’s full rule sets. We show that our framework is grounded in well-known semantics of probabilistic logic, does not require independence assumptions, and that our tractable inference procedure provides both approximate lower bounds and the exact probability of a given query. The efficacy of our method is validated by empirical studies on 8 standard benchmark datasets, where we show competitive performance using only a fraction of the rules required by AnyBURL’s standard inference method, the current state-of-the-art for rule-based knowledge graph completion. This work may have further implications for general probabilistic reasoning over learned sets of rules.
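
For intuition only: the simplest probabilistic circuit over rule contexts is a mixture. In this toy, each context weights a small rule subset, and a query is scored with a noisy-OR inside each context. Note the paper explicitly avoids independence assumptions, so this is purely an illustration of the "distribution over rule contexts" idea, not the paper's semantics.

```python
# Mixture over two rule contexts; weights and rule probabilities are invented.
contexts = [
    {"weight": 0.7, "rules": {"r1": 0.9, "r2": 0.4}},
    {"weight": 0.3, "rules": {"r1": 0.2, "r3": 0.8}},
]

def query_prob(fired):
    """Probability a triple holds, given the set of rules that fired for it."""
    p = 0.0
    for c in contexts:
        # Noisy-OR within a context (an independence simplification, ours).
        fail = 1.0
        for r in fired:
            fail *= 1.0 - c["rules"].get(r, 0.0)
        p += c["weight"] * (1.0 - fail)
    return p

print(query_prob({"r1", "r2"}))  # 0.7*(1 - 0.1*0.6) + 0.3*(1 - 0.8) = 0.718
```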

[523] K-Dense Analyst: Towards Fully Automated Scientific Analysis

Orion Li, Vinayak Agarwal, Summer Zhou, Ashwin Gopinath, Timothy Kassis

Main category: cs.AI

TL;DR: K-Dense Analyst, a hierarchical multi-agent system, outperforms top LLMs in bioinformatics by 6.3% using a dual-loop architecture for autonomous analysis.

DetailsMotivation: Address the gap between data generation and scientific insights in bioinformatics by overcoming LLM limitations in iterative workflows.

Method: Uses a dual-loop architecture with specialized agents to decompose tasks into executable, verifiable steps within secure environments.

Result: Achieves 29.2% accuracy on BixBench, surpassing GPT-5 by 6.3%, and outperforms Gemini 2.5 Pro’s baseline by 27%.

Conclusion: Purpose-built systems like K-Dense Analyst are essential for autonomous scientific reasoning, advancing computational biology.

Abstract: The complexity of modern bioinformatics analysis has created a critical gap between data generation and developing scientific insights. While large language models (LLMs) have shown promise in scientific reasoning, they remain fundamentally limited when dealing with real-world analytical workflows that demand iterative computation, tool integration and rigorous validation. We introduce K-Dense Analyst, a hierarchical multi-agent system that achieves autonomous bioinformatics analysis through a dual-loop architecture. K-Dense Analyst, part of the broader K-Dense platform, couples planning with validated execution using specialized agents to decompose complex objectives into executable, verifiable tasks within secure computational environments. On BixBench, a comprehensive benchmark for open-ended biological analysis, K-Dense Analyst achieves 29.2% accuracy, surpassing the best-performing language model (GPT-5) by 6.3 percentage points, representing nearly 27% improvement over what is widely considered the most powerful LLM available. Remarkably, K-Dense Analyst achieves this performance using Gemini 2.5 Pro, which attains only 18.3% accuracy when used directly, demonstrating that our architectural innovations unlock capabilities far beyond the underlying model’s baseline performance. Our insights demonstrate that autonomous scientific reasoning requires more than enhanced language models, it demands purpose-built systems that can bridge the gap between high-level scientific objectives and low-level computational execution. These results represent a significant advance toward fully autonomous computational biologists capable of accelerating discovery across the life sciences.

[524] GLIDR: Graph-Like Inductive Logic Programming with Differentiable Reasoning

Blair Johnson, Clayton Kerce, Faramarz Fekri

Main category: cs.AI

TL;DR: GLIDR introduces a differentiable rule learning method for knowledge graphs, surpassing chain-like rule limitations with expressive syntax and outperforming existing methods.

DetailsMotivation: Existing differentiable ILP techniques are limited by chain-like rule structures, hindering performance and interpretability.

Method: GLIDR uses a differentiable message passing algorithm to learn expressive rules with features like branches and cycles, parameterized by free variables.

Result: GLIDR outperforms existing rule learning methods, competes with embedding methods, and is robust to noise. Extracted rules retain predictive power.

Conclusion: GLIDR advances rule learning with expressive syntax, scalability, and compatibility with neural networks for end-to-end optimization.

Abstract: Differentiable inductive logic programming (ILP) techniques have proven effective at finding approximate rule-based solutions to link prediction and node classification problems on knowledge graphs; however, the common assumption of chain-like rule structure can hamper the performance and interpretability of existing approaches. We introduce GLIDR, a differentiable rule learning method that models the inference of logic rules with more expressive syntax than previous methods. GLIDR uses a differentiable message passing inference algorithm that generalizes previous chain-like rule learning methods to allow rules with features like branches and cycles. GLIDR has a simple and expressive rule search space which is parameterized by a limit on the maximum number of free variables that may be included in a rule. Explicit logic rules can be extracted from the weights of a GLIDR model for use with symbolic solvers. We demonstrate that GLIDR can significantly outperform existing rule learning methods on knowledge graph completion tasks and even compete with embedding methods despite the inherent disadvantage of being a structure-only prediction method. We show that rules extracted from GLIDR retain significant predictive performance, and that GLIDR is highly robust to training data noise. Finally, we demonstrate that GLIDR can be chained with deep neural networks and optimized end-to-end for rule learning on arbitrary data modalities.

[525] Multi-Dimensional Summarization Agents with Context-Aware Reasoning over Enterprise Tables

Amit Dhanda

Main category: cs.AI

TL;DR: A novel multi-agent LLM-based framework improves enterprise data summarization by outperforming traditional methods in faithfulness, relevance, and insight quality.

DetailsMotivation: Traditional table-to-text models struggle with hierarchical structures and context-aware deltas, which are crucial for business reporting.

Method: A multi-agent pipeline extracts, analyzes, and summarizes multi-dimensional data using agents for slicing, variance detection, context construction, and LLM-based generation.

Result: The framework achieves 83% faithfulness, superior coverage of changes, and high relevance scores (4.4/5), especially in nuanced scenarios like revenue-price trade-offs.

Conclusion: The proposed framework significantly improves over baselines, demonstrating better performance in enterprise data summarization tasks.

Abstract: We propose a novel framework for summarizing structured enterprise data across multiple dimensions using large language model (LLM)-based agents. Traditional table-to-text models often lack the capacity to reason across hierarchical structures and context-aware deltas, which are essential in business reporting tasks. Our method introduces a multi-agent pipeline that extracts, analyzes, and summarizes multi-dimensional data using agents for slicing, variance detection, context construction, and LLM-based generation. Our results show that the proposed framework outperforms traditional approaches, achieving 83% faithfulness to underlying data, superior coverage of significant changes, and high relevance scores (4.4/5) for decision-critical insights. The improvements are especially pronounced in categories involving subtle trade-offs, such as increased revenue due to price changes amid declining unit volumes, which competing methods either overlook or address with limited specificity. We evaluate the framework on Kaggle datasets and demonstrate significant improvements in faithfulness, relevance, and insight quality over baseline table summarization approaches.
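
A sketch of one stage, the variance-detection agent, as we infer it from the abstract: flag dimension slices whose latest period-over-period change in a metric exceeds a threshold, producing structured deltas for the context-construction and generation agents. Column names and the threshold are illustrative assumptions.

```python
import pandas as pd

def variance_agent(df: pd.DataFrame, dims, metric, threshold=0.1):
    """df has one row per (dimension slice, period); 'period' is assumed sortable."""
    findings = []
    for keys, g in df.groupby(dims):
        g = g.sort_values("period")
        if len(g) < 2:
            continue  # need at least two periods to compute a delta
        prev, curr = g[metric].iloc[-2], g[metric].iloc[-1]
        delta = (curr - prev) / abs(prev) if prev else float("inf")
        if abs(delta) >= threshold:
            findings.append({"slice": keys, "prev": prev, "curr": curr, "delta": delta})
    # These structured deltas feed context construction and LLM-based generation.
    return findings
```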

[526] MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA

Shengtao Wen, Haodong Chen, Yadong Wang, Zhongying Pan, Xiang Chen, Yu Tian, Bo Qian, Dong Liang, Sheng-Jun Huang

Main category: cs.AI

TL;DR: The paper introduces MultiMedEdit, the first benchmark for evaluating knowledge editing (KE) in multimodal medical scenarios, addressing gaps in current KE methods for clinical tasks.

DetailsMotivation: Existing KE methods lack focus on multimodal medical scenarios, which require integrating updated knowledge with visual reasoning for safe clinical decisions.

Method: Proposes MultiMedEdit, a benchmark framework with understanding and reasoning tasks, a three-dimensional metric suite (reliability, generality, locality), and cross-paradigm comparisons.

Result: Current KE methods struggle with generalization and long-tail reasoning in clinical workflows, with efficiency trade-offs in real-world deployment.

Conclusion: MultiMedEdit highlights limitations of current KE approaches and lays groundwork for future clinically robust techniques.

Abstract: Knowledge editing (KE) provides a scalable approach for updating factual knowledge in large language models without full retraining. While previous studies have demonstrated effectiveness in general domains and medical QA tasks, little attention has been paid to KE in multimodal medical scenarios. Unlike text-only settings, medical KE demands integrating updated knowledge with visual reasoning to support safe and interpretable clinical decisions. To address this gap, we propose MultiMedEdit, the first benchmark tailored to evaluating KE in clinical multimodal tasks. Our framework spans both understanding and reasoning task types, defines a three-dimensional metric suite (reliability, generality, and locality), and supports cross-paradigm comparisons across general and domain-specific models. We conduct extensive experiments under single-editing and lifelong-editing settings. Results suggest that current methods struggle with generalization and long-tail reasoning, particularly in complex clinical workflows. We further present an efficiency analysis (e.g., edit latency, memory footprint), revealing practical trade-offs in real-world deployment across KE paradigms. Overall, MultiMedEdit not only reveals the limitations of current approaches but also provides a solid foundation for developing clinically robust knowledge editing techniques in the future.

[527] ParBalans: Parallel Multi-Armed Bandits-based Adaptive Large Neighborhood Search

Alican Yilmaz, Junyang Cai, Serdar Kadioglu, Bistra Dilkina

Main category: cs.AI

TL;DR: ParBalans extends Balans with parallelization to improve MIP solving, showing competitive performance against Gurobi.

DetailsMotivation: Parallelization is key for solving large MIP problems efficiently, but Balans's parallel potential is unexplored.

Method: Introduces ParBalans, leveraging solver-level and algorithmic-level parallelism.

Result: ParBalans performs competitively against Gurobi, especially on hard benchmarks.

Conclusion: Parallelization in Balans (via ParBalans) effectively enhances MIP solving performance.

Abstract: Solving Mixed-Integer Programming (MIP) problems often requires substantial computational resources due to their combinatorial nature. Parallelization has emerged as a critical strategy to accelerate solution times and enhance scalability to tackle large, complex instances. This paper investigates the parallelization capabilities of Balans, a recently proposed multi-armed bandits-based adaptive large neighborhood search for MIPs. While Balans’s modular architecture inherently supports parallel exploration of diverse parameter configurations, this potential has not been thoroughly examined. To address this gap, we introduce ParBalans, an extension that leverages both solver-level and algorithmic-level parallelism to improve performance on challenging MIP instances. Our experimental results demonstrate that ParBalans exhibits competitive performance compared to the state-of-the-art commercial solver Gurobi, particularly on hard optimization benchmarks.
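
A toy sketch of the two parallelism levels described above: each worker process runs its own bandit-driven search (UCB1 standing in for Balans's multi-armed bandit layer, a random number standing in for solver improvement), and workers explore configurations concurrently. Arm names and rewards are placeholders, not Balans's actual operators.

```python
import math
import random
from multiprocessing import Pool

ARMS = ["rins", "rens", "crossover", "mutation"]  # illustrative destroy operators

def ucb_pick(counts, means, t):
    """UCB1 arm selection; unplayed arms are tried first."""
    return max(range(len(ARMS)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a])
               if counts[a] else float("inf"))

def run_worker(seed):
    """One adaptive-LNS run with its own bandit state."""
    random.seed(seed)
    counts, means = [0] * len(ARMS), [0.0] * len(ARMS)
    best = float("inf")
    for t in range(1, 101):
        a = ucb_pick(counts, means, t)
        reward = random.random()        # stands in for solver improvement
        best = min(best, 1.0 - reward)  # stands in for the incumbent objective
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]
    return best

if __name__ == "__main__":
    with Pool(4) as pool:  # algorithmic-level parallelism across configurations
        print(min(pool.map(run_worker, range(4))))
```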

[528] Topology Generation of UAV Covert Communication Networks: A Graph Diffusion Approach with Incentive Mechanism

Xin Tang, Qian Chen, Fengshun Li, Youchun Gong, Yinqiu Liu, Wen Tian, Shaowen Qin, Xiaohuan Li

Main category: cs.AI

TL;DR: A self-organizing UAV network framework combining GDPO and SG-based incentives is proposed to ensure reliable connectivity and covert communication in dynamic environments.

DetailsMotivation: The increasing use of UAV networks in sensitive applications necessitates reliable and covert communication, challenged by dynamic mobility and exposure risks.

Method: The framework integrates Graph Diffusion-based Policy Optimization (GDPO) for dynamic topology generation and a Stackelberg Game (SG)-based incentive mechanism to guide UAV cooperation.

Result: Experiments confirm the framework’s effectiveness in model convergence, topology quality, and covert communication enhancement.

Conclusion: The proposed framework successfully addresses connectivity and covertness challenges in UAV networks through dynamic adaptation and incentivized cooperation.

Abstract: With the growing demand for Uncrewed Aerial Vehicle (UAV) networks in sensitive applications, such as urban monitoring, emergency response, and secure sensing, ensuring reliable connectivity and covert communication has become increasingly vital. However, dynamic mobility and exposure risks pose significant challenges. To tackle these challenges, this paper proposes a self-organizing UAV network framework combining Graph Diffusion-based Policy Optimization (GDPO) with a Stackelberg Game (SG)-based incentive mechanism. The GDPO method uses generative AI to dynamically generate sparse but well-connected topologies, enabling flexible adaptation to changing node distributions and Ground User (GU) demands. Meanwhile, the SG-based incentive mechanism guides self-interested UAVs to choose relay behaviors and neighbor links that support cooperation and enhance covert communication. Extensive experiments are conducted to validate the effectiveness of the proposed framework in terms of model convergence, topology generation quality, and enhancement of covert communication performance.

[529] A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng

Main category: cs.AI

TL;DR: A survey on self-evolving AI agents, reviewing techniques, frameworks, domain-specific strategies, and ethical considerations for adaptive, lifelong agentic systems.

DetailsMotivation: Address the limitation of static AI agents by exploring self-evolving techniques to enhance adaptability in dynamic environments.

Method: Introduces a unified conceptual framework and systematically reviews self-evolving techniques, including domain-specific strategies and ethical considerations.

Result: Provides a comprehensive understanding of self-evolving AI agents, enabling more adaptive and autonomous systems.

Conclusion: Lays the foundation for future development of lifelong agentic systems, emphasizing adaptability and ethical reliability.

Abstract: Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.

[530] Pushing the Envelope of LLM Inference on AI-PC

Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke

Main category: cs.AI

TL;DR: Ultra-low-bit LLM models (1/1.58/2-bit) match full-precision models in performance. This work optimizes microkernels for CPUs, integrating them into PyTorch-TPP, achieving 2.2x speedup over SOTA runtimes and 7x over 16-bit models.

DetailsMotivation: To address the underexplored computational efficiency of SOTA inference runtimes for ultra-low-bit LLM models, enabling cost-effective deployment in resource-constrained environments like edge devices and AI PCs.

Method: A bottom-up approach: designing and implementing optimized 1-bit and 2-bit microkernels for modern CPUs, integrating them into PyTorch-TPP for end-to-end inference.

Result: Achieves up to 2.2x speedup over bitnet.cpp and 7x over 16-bit model inference, demonstrating superior computational efficiency.

Conclusion: The optimized runtime advances LLM inference for AI PCs and edge devices, enabling efficient deployment of ultra-low-bit models.

Abstract: The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLM models.
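To make the arithmetic behind ultra-low-bit kernels concrete, the sketch below shows how 2-bit weight codes can be packed four per byte and decoded inside a matrix-vector product. This is a toy NumPy illustration under an assumed code book mapping codes {0..3} to weights {-2..1}; the paper's actual microkernels are SIMD-optimized CPU implementations, which this does not reproduce.

```python
# Toy 2-bit weight packing and matvec in NumPy (illustrative only).
import numpy as np

def pack_2bit(codes: np.ndarray) -> np.ndarray:
    """Pack values in {0,1,2,3}, four per byte, into a uint8 array."""
    assert codes.size % 4 == 0
    c = codes.reshape(-1, 4).astype(np.uint8)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_2bit."""
    out = np.empty((packed.size, 4), dtype=np.uint8)
    for i in range(4):
        out[:, i] = (packed >> (2 * i)) & 0b11
    return out.reshape(-1)

def matvec_2bit(packed_w, x, rows, cols, scale):
    """y = ((codes - 2) * scale) @ x, decoding codes {0..3} to weights {-2..1}."""
    codes = unpack_2bit(packed_w).reshape(rows, cols).astype(np.int8) - 2
    return (codes.astype(np.float32) * scale) @ x

# Round-trip check of the packing.
rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=64).astype(np.uint8)
assert np.array_equal(unpack_2bit(pack_2bit(codes)), codes)
```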

[531] A Fuzzy Logic Prompting Framework for Large Language Models in Adaptive and Uncertain Tasks

Vanessa Figueiredo

Main category: cs.AI

TL;DR: A modular prompting framework for safer, adaptive LLM use, grounded in human learning theory, improves scaffolding and adaptivity in tasks like tutoring and gaming.

DetailsMotivation: To enable safer and more adaptive use of LLMs in dynamic, user-centered tasks by leveraging human learning theory (ZPD).

Method: Combines natural language boundary prompts with fuzzy scaffolding logic and adaptation rules, avoiding fine-tuning or external orchestration.

Result: Outperforms standard prompting in scaffolding quality, adaptivity, and alignment in simulated tutoring and other domains.

Conclusion: The framework offers a reusable, interpretable method for goal-aligned LLM behavior in uncertain or evolving contexts.

Abstract: We introduce a modular prompting framework that supports safer and more adaptive use of large language models (LLMs) across dynamic, user-centered tasks. Grounded in human learning theory, particularly the Zone of Proximal Development (ZPD), our method combines a natural language boundary prompt with a control schema encoded with fuzzy scaffolding logic and adaptation rules. This architecture enables LLMs to modulate behavior in response to user state without requiring fine-tuning or external orchestration. In a simulated intelligent tutoring setting, the framework improves scaffolding quality, adaptivity, and instructional alignment across multiple models, outperforming standard prompting baselines. Evaluation is conducted using rubric-based LLM graders at scale. While initially developed for education, the framework has shown promise in other interaction-heavy domains, such as procedural content generation for games. Designed for safe deployment, it provides a reusable methodology for structuring interpretable, goal-aligned LLM behavior in uncertain or evolving contexts.
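As a rough illustration of what fuzzy scaffolding logic can look like, the sketch below maps a learner-competence estimate through triangular membership functions to a defuzzified scaffolding level, then selects a prompt policy. The membership functions, rule consequents, and prompt template are assumptions for demonstration, not the paper's actual control schema.

```python
# Minimal fuzzy-scaffolding sketch driving a prompt (illustrative only).

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def scaffolding_level(competence: float) -> float:
    """Fuzzy rules: low competence -> heavy scaffolding; defuzzify by weighted average."""
    low = tri(competence, -0.5, 0.0, 0.5)
    mid = tri(competence, 0.0, 0.5, 1.0)
    high = tri(competence, 0.5, 1.0, 1.5)
    num = low * 1.0 + mid * 0.5 + high * 0.1   # consequents: heavy/moderate/light
    den = low + mid + high
    return num / den if den else 0.5

def build_prompt(task: str, competence: float) -> str:
    level = scaffolding_level(competence)
    if level > 0.7:
        style = "Give step-by-step hints and worked examples."
    elif level > 0.3:
        style = "Give targeted hints but let the learner attempt each step."
    else:
        style = "Only confirm or gently redirect; do not reveal solutions."
    return f"You are a tutor. Task: {task}\nScaffolding policy: {style}"

print(build_prompt("Solve 3x + 5 = 20", competence=0.2))  # heavy scaffolding
```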

[532] EMPATHIA: Multi-Faceted Human-AI Collaboration for Refugee Integration

Mohamed Rayan Barhdadi, Mehmet Tuncel, Erchin Serpedin, Hasan Kurban

Main category: cs.AI

TL;DR: EMPATHIA is a multi-agent AI framework for refugee integration, addressing cultural, emotional, and ethical dimensions, achieving 87.4% validation convergence.

DetailsMotivation: Current AI approaches focus narrowly on employment, neglecting cultural and ethical aspects critical for long-term refugee success.

Method: EMPATHIA uses three modules (SEED, RISE, THRIVE) with specialized agents for transparent, interpretable recommendations, tested on UN Kakuma dataset.

Result: Achieved 87.4% validation convergence and explainable assessments across five host countries.

Conclusion: EMPATHIA balances competing values, supports human-AI collaboration, and offers a generalizable framework for AI-driven allocation tasks.

Abstract: Current AI approaches to refugee integration optimize narrow objectives such as employment and fail to capture the cultural, emotional, and ethical dimensions critical for long-term success. We introduce EMPATHIA (Enriched Multimodal Pathways for Agentic Thinking in Humanitarian Immigrant Assistance), a multi-agent framework addressing the central Creative AI question: how do we preserve human dignity when machines participate in life-altering decisions? Grounded in Kegan’s Constructive Developmental Theory, EMPATHIA decomposes integration into three modules: SEED (Socio-cultural Entry and Embedding Decision) for initial placement, RISE (Rapid Integration and Self-sufficiency Engine) for early independence, and THRIVE (Transcultural Harmony and Resilience through Integrated Values and Engagement) for sustained outcomes. SEED employs a selector-validator architecture with three specialized agents - emotional, cultural, and ethical - that deliberate transparently to produce interpretable recommendations. Experiments on the UN Kakuma dataset (15,026 individuals, 7,960 eligible adults 15+ per ILO/UNHCR standards) and implementation on 6,359 working-age refugees (15+) with 150+ socioeconomic variables achieved 87.4% validation convergence and explainable assessments across five host countries. EMPATHIA’s weighted integration of cultural, emotional, and ethical factors balances competing value systems while supporting practitioner-AI collaboration. By augmenting rather than replacing human expertise, EMPATHIA provides a generalizable framework for AI-driven allocation tasks where multiple values must be reconciled.

[533] Natural Language-Driven Viewpoint Navigation for Volume Exploration via Semantic Block Representation

Xuan Zhao, Jun Tao

Main category: cs.AI

TL;DR: A framework using natural language interaction and reinforcement learning to automate optimal viewpoint selection for volumetric data exploration.

DetailsMotivation: Simplifying volumetric data navigation for users lacking domain expertise or 3D familiarity.

Method: Encodes volumetric blocks, uses CLIP Score for semantic guidance, and employs reinforcement learning for viewpoint selection.

Result: Automated viewpoint selection improves efficiency and interpretability of scientific data.

Conclusion: The framework enhances volumetric data exploration by aligning viewpoints with user intent.

Abstract: Exploring volumetric data is crucial for interpreting scientific datasets. However, selecting optimal viewpoints for effective navigation can be challenging, particularly for users without extensive domain expertise or familiarity with 3D navigation. In this paper, we propose a novel framework that leverages natural language interaction to enhance volumetric data exploration. Our approach encodes volumetric blocks to capture and differentiate underlying structures. It further incorporates a CLIP Score mechanism, which provides semantic information to the blocks to guide navigation. The navigation is empowered by a reinforcement learning framework that leverages these semantic cues to efficiently search for and identify desired viewpoints that align with the user’s intent. The selected viewpoints are evaluated using CLIP Score to ensure that they best reflect the user’s queries. By automating viewpoint selection, our method improves the efficiency of volumetric data navigation and enhances the interpretability of complex scientific phenomena.
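A minimal sketch of the scoring step: each candidate viewpoint is rendered, embedded, and ranked by CLIP-style cosine similarity against the user query. The render, embed_image, and embed_text callables are hypothetical placeholders (a real setup might use an open CLIP implementation), and the reinforcement learning search over viewpoints is omitted.

```python
# CLIP-score viewpoint ranking sketch; encoder/render callables are assumed.
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between the two embeddings."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

def best_viewpoint(candidate_views, query_text, render, embed_image, embed_text):
    """Render each candidate viewpoint, score it against the query, keep the best."""
    t = embed_text(query_text)
    scored = [(clip_score(embed_image(render(v)), t), v) for v in candidate_views]
    return max(scored, key=lambda s: s[0])
```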

[534] Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges

Haifeng Li, Wang Guo, Haiyang Wu, Mengwei Wu, Jipeng Zhang, Qing Zhu, Yu Liu, Xin Huang, Chao Tao

Main category: cs.AI

TL;DR: The paper proposes shifting from vision-centered to language-centered remote sensing interpretation, using LLMs as a cognitive hub inspired by Global Workspace Theory, addressing challenges like multimodal representation and reasoning.

DetailsMotivation: Current vision-centered models lack capabilities in multi-modal reasoning and semantic abstraction, prompting the need for a language-centered framework leveraging LLMs.

Method: Proposes a language-centered framework inspired by Global Workspace Theory, treating LLMs as a central hub for integrating perceptual, task, knowledge, and action spaces.

Result: Outlines core challenges (e.g., multimodal representation, knowledge association) and reviews language-centered solutions, proposing future research directions.

Conclusion: Aims to establish a roadmap for cognition-driven intelligent geospatial analysis, providing a foundation for next-gen remote sensing systems.

Abstract: The mainstream paradigm of remote sensing image interpretation has long been dominated by vision-centered models, which rely on visual features for semantic understanding. However, these models face inherent limitations in handling multi-modal reasoning, semantic abstraction, and interactive decision-making. While recent advances have introduced Large Language Models (LLMs) into remote sensing workflows, existing studies primarily focus on downstream applications, lacking a unified theoretical framework that explains the cognitive role of language. This review advocates a paradigm shift from vision-centered to language-centered remote sensing interpretation. Drawing inspiration from the Global Workspace Theory (GWT) of human cognition, we propose a language-centered framework for remote sensing interpretation that treats LLMs as the cognitive central hub integrating perceptual, task, knowledge, and action spaces to enable unified understanding, reasoning, and decision-making. We first explore the potential of LLMs as the central cognitive component in remote sensing interpretation, and then summarize core technical challenges, including unified multimodal representation, knowledge association, and reasoning and decision-making. Furthermore, we construct a global workspace-driven interpretation mechanism and review how language-centered solutions address each challenge. Finally, we outline future research directions from four perspectives: adaptive alignment of multimodal data, task understanding under dynamic knowledge constraints, trustworthy reasoning, and autonomous interaction. This work aims to provide a conceptual foundation for the next generation of remote sensing interpretation systems and establish a roadmap toward cognition-driven intelligent geospatial analysis.

[535] D-Judge: How Far Are We? Assessing the Discrepancies Between AI-synthesized and Natural Images through Multimodal Guidance

Renyang Liu, Ziyu Lyu, Wei Zhou, See-Kiong Ng

Main category: cs.AI

TL;DR: The paper introduces a dataset (D-ANI) and benchmark (D-Judge) to quantify discrepancies between AI-generated and natural images, evaluating them across five dimensions to align metrics with human judgment.

DetailsMotivation: To systematically investigate and quantify the differences between AI-synthesized and natural images, addressing the challenge of distinguishing them in the AIGC field.

Method: Constructed a large-scale dataset (D-ANI) with natural and AI-generated images, then introduced the D-Judge benchmark for fine-grained evaluation across five dimensions.

Result: Revealed significant discrepancies between AI-generated and natural images across all evaluated dimensions.

Conclusion: Aligning quantitative metrics with human judgment is crucial for comprehensively understanding AI-generated image quality.

Abstract: In the rapidly evolving field of Artificial Intelligence Generated Content (AIGC), a central challenge is distinguishing AI-synthesized images from natural ones. Despite the impressive capabilities of advanced generative models in producing visually compelling images, significant discrepancies remain when compared to natural images. To systematically investigate and quantify these differences, we construct a large-scale multimodal dataset, D-ANI, comprising 5,000 natural images and over 440,000 AIGI samples generated by nine representative models using both unimodal and multimodal prompts, including Text-to-Image (T2I), Image-to-Image (I2I), and Text-and-Image-to-Image (TI2I). We then introduce an AI-Natural Image Discrepancy assessment benchmark (D-Judge) to address the critical question: how far are AI-generated images (AIGIs) from truly realistic images? Our fine-grained evaluation framework assesses the D-ANI dataset across five dimensions: naive visual quality, semantic alignment, aesthetic appeal, downstream task applicability, and coordinated human validation. Extensive experiments reveal substantial discrepancies across these dimensions, highlighting the importance of aligning quantitative metrics with human judgment to achieve a comprehensive understanding of AI-generated image quality. Code: https://github.com/ryliu68/DJudge ; Data: https://huggingface.co/datasets/Renyang/DANI.

[536] Multi-level Advantage Credit Assignment for Cooperative Multi-Agent Reinforcement Learning

Xutong Zhao, Yaqi Xie

Main category: cs.AI

TL;DR: The paper introduces MACA, a method for multi-level credit assignment in cooperative MARL, addressing diverse coordination and reward attribution challenges.

DetailsMotivation: The challenge of credit assignment in MARL, especially with diverse agent coordination and overlapping reward subsets, motivates the need for a multi-level approach.

Method: MACA uses multi-level advantage formulation and counterfactual reasoning to assess agent contributions at individual, joint, and correlated action levels, leveraging an attention-based framework.

Result: MACA outperforms existing methods in StarCraft v1 & v2 tasks, demonstrating effectiveness in complex credit assignment.

Conclusion: MACA provides a robust solution for multi-level credit assignment in MARL, validated by superior performance in challenging scenarios.

Abstract: Cooperative multi-agent reinforcement learning (MARL) aims to coordinate multiple agents to achieve a common goal. A key challenge in MARL is credit assignment, which involves assessing each agent’s contribution to the shared reward. Given the diversity of tasks, agents may perform different types of coordination, with rewards attributed to diverse and often overlapping agent subsets. In this work, we formalize the credit assignment level as the number of agents cooperating to obtain a reward, and address scenarios with multiple coexisting levels. We introduce a multi-level advantage formulation that performs explicit counterfactual reasoning to infer credits across distinct levels. Our method, Multi-level Advantage Credit Assignment (MACA), captures agent contributions at multiple levels by integrating advantage functions that reason about individual, joint, and correlated actions. Utilizing an attention-based framework, MACA identifies correlated agent relationships and constructs multi-level advantages to guide policy learning. Comprehensive experiments on challenging StarCraft v1 & v2 tasks demonstrate MACA’s superior performance, underscoring its efficacy in complex credit assignment scenarios.
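For intuition, the sketch below computes the individual-level counterfactual advantage (the COMA-style baseline) that multi-level schemes of this kind build on; MACA's joint and correlated levels marginalize over agent subsets analogously. The toy tabular Q-function and shapes are illustrative assumptions, not the paper's implementation.

```python
# Individual-level counterfactual advantage sketch (COMA-style baseline).
import numpy as np

def counterfactual_advantage(q_values, joint_action, agent, policy):
    """
    q_values: dict mapping joint actions (tuples) -> Q(s, a)
    joint_action: the executed joint action (tuple of per-agent actions)
    agent: index of the agent being credited
    policy: per-agent distributions, policy[agent][a] = pi_i(a | s)
    """
    baseline = 0.0
    for alt in range(len(policy[agent])):
        counterfactual = list(joint_action)
        counterfactual[agent] = alt  # vary only this agent's action
        baseline += policy[agent][alt] * q_values[tuple(counterfactual)]
    return q_values[tuple(joint_action)] - baseline

# Toy 2-agent, 2-action example.
q = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.0}
pi = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
print(counterfactual_advantage(q, (1, 1), agent=0, policy=pi))  # 2.0 - (0.5*0.0 + 0.5*2.0) = 1.0
```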

[537] FEAT: A Multi-Agent Forensic AI System with Domain-Adapted Large Language Model for Automated Cause-of-Death Analysis

Chen Shen, Wanqing Zhang, Kehan Li, Erwen Huang, Haitao Bi, Aiying Fan, Yiwen Shen, Hongmei Dong, Ji Zhang, Yuming Shao, Zengjia Liu, Xinshe Liu, Tao Li, Chunxia Yan, Shuanliang Fan, Di Wu, Jianhua Ma, Bin Cong, Zhenyuan Wang, Chunfeng Lian

Main category: cs.AI

TL;DR: FEAT is an AI framework for forensic death investigations, outperforming existing AI systems and matching expert accuracy in cause-of-death determination.

DetailsMotivation: Address workforce shortages and diagnostic variability in forensic systems, particularly in high-volume settings like China.

Method: Multi-agent AI framework with task decomposition, evidence analysis, iterative refinement, and conclusion synthesis, using tool-augmented reasoning and human feedback.

Result: Outperformed state-of-the-art AI in accuracy, generalized across regions, and achieved high expert concordance in validations.

Conclusion: FEAT offers scalable, expert-level forensic analysis, improving equitable access to reliable medicolegal services.

Abstract: Forensic cause-of-death determination faces systemic challenges, including workforce shortages and diagnostic variability, particularly in high-volume systems like China’s medicolegal infrastructure. We introduce FEAT (ForEnsic AgenT), a multi-agent AI framework that automates and standardizes death investigations through a domain-adapted large language model. FEAT’s application-oriented architecture integrates: (i) a central Planner for task decomposition, (ii) specialized Local Solvers for evidence analysis, (iii) a Memory & Reflection module for iterative refinement, and (iv) a Global Solver for conclusion synthesis. The system employs tool-augmented reasoning, hierarchical retrieval-augmented generation, forensic-tuned LLMs, and human-in-the-loop feedback to ensure legal and medical validity. In evaluations across diverse Chinese case cohorts, FEAT outperformed state-of-the-art AI systems in both long-form autopsy analyses and concise cause-of-death conclusions. It demonstrated robust generalization across six geographic regions and achieved high expert concordance in blinded validations. Senior pathologists validated FEAT’s outputs as comparable to those of human experts, with improved detection of subtle evidentiary nuances. To our knowledge, FEAT is the first LLM-based AI agent system dedicated to forensic medicine, offering scalable, consistent death certification while maintaining expert-level rigor. By integrating AI efficiency with human oversight, this work could advance equitable access to reliable medicolegal services while addressing critical capacity constraints in forensic systems.

[538] MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams

Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Zekai Li, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang

Main category: cs.AI

TL;DR: MDK12-Bench is a large-scale multidisciplinary benchmark for evaluating multimodal large language models (MLLMs) using real-world K-12 exams, addressing gaps in current benchmarks.

DetailsMotivation: Current benchmarks for MLLMs lack scale, coverage, and structured knowledge, limiting their effectiveness for AGI advancement.

Method: Introduces MDK12-Bench with 141K instances and 6,225 knowledge points, a dynamic evaluation framework, and KP-RAG for knowledge-driven reasoning.

Result: Reveals limitations in current MLLMs across difficulty, temporal shifts, contextual shifts, and knowledge-driven reasoning.

Conclusion: Provides insights for improving MLLM robustness, interpretability, and AI-assisted education.

Abstract: Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation to capture the extent to which MLLMs perform over four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question form shifts to challenge model generalization while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in problem-solving. Key findings reveal limitations in current MLLMs in multiple aspects and provide guidance for enhancing model robustness, interpretability, and AI-assisted education.

[539] MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction

Shuo Tang, Jian Xu, Jiadong Zhang, Yi Chen, Qizhao Jin, Lingdong Shen, Chenglin Liu, Shiming Xiang

Main category: cs.AI

TL;DR: The paper introduces MP-Bench, a large-scale dataset for severe weather prediction, and MMLM, a multimodal model addressing challenges in AI-driven weather forecasting.

DetailsMotivation: Current weather forecasting relies on manual interpretation, which is subjective and burdensome. AI-driven systems face challenges like data scarcity and alignment issues.

Method: Developed MP-Bench dataset and MMLM model with adaptive fusion modules for 4D meteorological data.

Result: MMLM performs exceptionally well on MP-Bench, demonstrating effectiveness in severe weather prediction.

Conclusion: The work advances automated AI-driven weather forecasting, with plans to release the dataset and code publicly.

Abstract: Timely and accurate severe weather warnings are critical for disaster mitigation. However, current forecasting systems remain heavily reliant on manual expert interpretation, introducing subjectivity and significant operational burdens. With the rapid development of AI technologies, the end-to-end “AI weather station” is gradually emerging as a new trend in predicting severe weather events. Three core challenges impede the development of an end-to-end AI severe weather system: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; (3) existing multimodal language models are unable to handle high-dimensional meteorological data and struggle to fully capture the complex dependencies across temporal sequences, vertical pressure levels, and spatial dimensions. To address these challenges, we introduce MP-Bench, the first large-scale temporal multimodal dataset for severe weather event prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding text captions, covering a wide range of severe weather scenarios across China. On top of this dataset, we develop a meteorology multimodal large model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench demonstrate that MMLM performs exceptionally well across multiple tasks, highlighting its effectiveness in severe weather understanding and marking a key step toward realizing automated, AI-driven weather forecasting systems. Our source code and dataset will be made publicly available.

[540] Pushdown Reward Machines for Reinforcement Learning

Giovanni Varricchione, Toryn Q. Klassen, Natasha Alechina, Mehdi Dastani, Brian Logan, Sheila A. McIlraith

Main category: cs.AI

TL;DR: The paper introduces pushdown reward machines (pdRMs), an extension of reward machines, to handle more complex behaviors in reinforcement learning by recognizing deterministic context-free languages.

DetailsMotivation: To improve the expressiveness of reward machines in reinforcement learning by extending them to handle temporally extended behaviors representable in deterministic context-free languages.

Method: Proposes pdRMs based on deterministic pushdown automata, introduces two policy variants (full stack and top-k stack access), and provides theoretical and experimental validation.

Result: Theoretical results confirm pdRMs’ expressive power and space complexity, while experiments demonstrate successful training of agents for tasks in deterministic context-free languages.

Conclusion: pdRMs enhance the capabilities of reward machines, enabling more complex behavior recognition and improved sample efficiency in reinforcement learning.

Abstract: Reward machines (RMs) are automata structures that encode (non-Markovian) reward functions for reinforcement learning (RL). RMs can reward any behaviour representable in regular languages and, when paired with RL algorithms that exploit RM structure, have been shown to significantly improve sample efficiency in many domains. In this work, we present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognize and reward temporally extended behaviours representable in deterministic context-free languages, making them more expressive than reward machines. We introduce two variants of pdRM-based policies, one which has access to the entire stack of the pdRM, and one which can only access the top $k$ symbols (for a given constant $k$) of the stack. We propose a procedure to check when the two kinds of policies (for a given environment, pdRM, and constant $k$) achieve the same optimal expected reward. We then provide theoretical results establishing the expressive power of pdRMs, and space complexity results about the proposed learning problems. Finally, we provide experimental results showing how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
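A toy sketch of the idea: a pushdown reward machine tracks a stack over abstract events and emits reward when a context-free condition holds, here "every opened subtask was eventually closed" (balanced open/close events), which no finite-state reward machine can recognize in general. Event names, reward values, and the top-k view are illustrative assumptions.

```python
# Toy pushdown reward machine (pdRM) over abstract events (illustrative only).
class ToyPdRM:
    def __init__(self):
        self.stack = []

    def step(self, event: str) -> float:
        """Consume one propositional event, update the stack, return reward."""
        if event == "open":
            self.stack.append("open")
            return 0.0
        if event == "close":
            if not self.stack:            # closing with nothing open: penalize
                return -1.0
            self.stack.pop()
            # reward only when the last pending subtask is closed
            return 1.0 if not self.stack else 0.0
        return 0.0                        # irrelevant events

    def top_k(self, k: int):
        """The top-k stack view available to the restricted policy variant."""
        return tuple(self.stack[-k:])

rm = ToyPdRM()
print([rm.step(e) for e in ["open", "open", "close", "close"]])  # [0.0, 0.0, 0.0, 1.0]
```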

[541] Observation Interference in Partially Observable Assistance Games

Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell

Main category: cs.AI

TL;DR: The paper explores AI deception in partially observable assistance games (POAGs), showing that optimal AI assistants may interfere with human observations, resolving contradictions with classic decision-making theorems, and analyzing practical tradeoffs.

DetailsMotivation: Addressing AI deception and value alignment in human-AI interactions, particularly when partial observability allows AI to interfere with human observations.

Method: Theoretical proofs and experimental models to analyze when and why AI assistants interfere with observations, including scenarios with optimal and irrational human behavior.

Result: Optimal AI assistants may interfere with observations, contradicting classic decision-making theorems, but this resolves when considering policies. Interference incentives vary with human behavior and communication channels.

Conclusion: AI assistants may interfere with observations in POAGs, highlighting the need for careful design to balance alignment and avoid unintended deception.

Abstract: We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human’s observations? First, we prove that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Though this result seems to contradict the classic theorem from single-agent decision making that the value of information is nonnegative, we resolve this seeming contradiction by developing a notion of interference defined on entire policies. This can be viewed as an extension of the classic result that the value of information is nonnegative into the cooperative multiagent setting. Second, we prove that if the human is simply making decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human’s preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.
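For contrast, here is a worked toy version of the classic single-agent result the paper departs from: a free observation never lowers expected utility. The two-state, two-action payoff table is an illustrative assumption.

```python
# Value of information >= 0 in the single-agent case (worked toy example).
p_state = {"A": 0.5, "B": 0.5}
utility = {("A", "left"): 1.0, ("A", "right"): 0.0,
           ("B", "left"): 0.0, ("B", "right"): 1.0}
actions = ("left", "right")

# Without observing: commit to the single best action in expectation.
eu_blind = max(sum(p_state[s] * utility[(s, a)] for s in p_state)
               for a in actions)

# With a perfect observation: choose the best action in each state.
eu_informed = sum(p_state[s] * max(utility[(s, a)] for a in actions)
                  for s in p_state)

print(eu_blind, eu_informed)  # 0.5 1.0 -> value of information = 0.5 >= 0
```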

[542] GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization

Yanchen Deng, Xinrun Wang, Bo An

Main category: cs.AI

TL;DR: The paper introduces Distributed Guided Local Search (DGLS), a novel framework for Distributed Constraint Optimization Problems (DCOPs), addressing limitations of GDBA and outperforming state-of-the-art baselines.

DetailsMotivation: Local search algorithms for DCOPs often converge to poor local optima, and GDBA's empirical benefits are marginal. The paper aims to improve performance by addressing GDBA's identified weaknesses.

Method: Proposes DGLS with adaptive violation conditions, penalty evaporation, and synchronized penalty updates to address GDBA’s over-aggressive constraints, unbounded penalties, and uncoordinated updates.

Result: DGLS outperforms baselines, achieving competitive performance on general-valued problems and significant improvements (3.77%–66.3%) on structured problems.

Conclusion: DGLS effectively addresses GDBA’s limitations, demonstrating superior performance and bounded penalties, making it a promising approach for DCOPs.

Abstract: Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs) but it often converges to poor local optima. While GDBA provides a comprehensive rule set to escape premature convergence, its empirical benefits remain marginal on general-valued problems. In this work, we systematically examine GDBA and identify three factors that potentially lead to its inferior performance, i.e., over-aggressive constraint violation conditions, unbounded penalty accumulation, and uncoordinated penalty updates. To address these issues, we propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost, a penalty evaporation mechanism to control the magnitude of penalization, and a synchronization scheme for coordinated penalty updates. We theoretically show that the penalty values are bounded, and agents play a potential game in our DGLS. Our extensive empirical results on various standard benchmarks demonstrate the great superiority of DGLS over state-of-the-art baselines. Particularly, compared to Damped Max-sum with high damping factors (e.g., 0.7 or 0.9), our DGLS achieves competitive performance on general-valued problems, and outperforms it by significant margins (3.77%–66.3%) on structured problems in terms of anytime results.
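For background, the sketch below shows the classic centralized guided local search penalty rule that DGLS adapts, with an evaporation factor keeping penalties bounded as the paper's analysis requires. The adaptive violation condition and synchronized distributed updates are not reproduced; parameter names and constants are illustrative assumptions.

```python
# Classic guided local search (GLS) penalty update, centralized sketch.
def gls_penalty_update(violated, cost, penalty, evaporation=0.9):
    """Evaporate old penalties, then penalize the max-utility violated constraint."""
    for f in penalty:
        penalty[f] *= evaporation                     # keeps penalties bounded
    utility = {f: cost[f] / (1 + penalty.get(f, 0)) for f in violated}
    worst = max(utility, key=utility.get)
    penalty[worst] = penalty.get(worst, 0) + 1
    return penalty

def augmented_cost(base_cost, violated, penalty, lam=0.2):
    """The objective the local search actually minimizes."""
    return base_cost + lam * sum(penalty.get(f, 0) for f in violated)
```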

[543] A Differentiated Reward Method for Reinforcement Learning based Multi-Vehicle Cooperative Decision-Making Algorithms

Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang

Main category: cs.AI

TL;DR: A differentiated reward method for RL in multi-vehicle cooperative driving improves training efficiency and performance.

DetailsMotivation: Addressing low sample efficiency in RL for multi-vehicle cooperative driving by incorporating traffic flow characteristics into reward design.

Method: Differentiated reward method based on steady-state transition systems, integrating state transition gradient information.

Result: Accelerates training convergence and outperforms other methods in traffic efficiency, safety, and action rationality.

Conclusion: The method offers scalability and adaptability, advancing multi-agent cooperative decision-making in complex traffic.

Abstract: Reinforcement learning (RL) shows great potential for optimizing multi-vehicle cooperative driving strategies through the state-action-reward feedback loop, but it still faces challenges such as low sample efficiency. This paper proposes a differentiated reward method based on steady-state transition systems, which incorporates state transition gradient information into the reward design by analyzing traffic flow characteristics, aiming to optimize action selection and policy learning in multi-vehicle cooperative decision-making. The performance of the proposed method is validated in RL algorithms such as MAPPO, MADQN, and QMIX under varying autonomous vehicle penetration. The results show that the differentiated reward method significantly accelerates training convergence and outperforms the centering reward method and other baselines in terms of traffic efficiency, safety, and action rationality. Additionally, the method demonstrates strong scalability and environmental adaptability, providing a novel approach for multi-agent cooperative decision-making in complex traffic scenarios.
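As a generic illustration of folding state-transition information into a reward signal, here is classic potential-based shaping (Ng et al., 1999). This is a stand-in, not the paper's differentiated-reward formulation, and the traffic-flow potential function below is a made-up proxy.

```python
# Potential-based reward shaping sketch (generic, not the paper's method).
def shaped_reward(r: float, s, s_next, potential, gamma: float = 0.99) -> float:
    """r' = r + gamma * Phi(s') - Phi(s); preserves optimal policies."""
    return r + gamma * potential(s_next) - potential(s)

# Example potential: negative deviation from a target speed, as a crude
# traffic-flow proxy (an assumed stand-in for steady-state statistics).
def potential(state) -> float:
    target_speed = 25.0  # m/s, assumed
    return -abs(state["mean_speed"] - target_speed)

r_prime = shaped_reward(1.0, {"mean_speed": 20.0}, {"mean_speed": 23.0}, potential)
print(r_prime)  # 1.0 + 0.99 * (-2.0) - (-5.0) = 4.02
```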

[544] Automated Formalization via Conceptual Retrieval-Augmented LLMs

Wangyue Lu, Lun Du, Sirui Li, Ke Weng, Haozhe Sun, Hengyu Liu, Minghe Yu, Tiancheng Zhang, Ge Yu

Main category: cs.AI

TL;DR: CRAMF is a framework for improving automated mathematical formalization by retrieving and grounding formal definitions, addressing challenges like model hallucination and semantic gaps.

DetailsMotivation: Manual formalization in ITPs is labor-intensive and requires expertise, while automated methods face issues like model hallucination and semantic gaps.

Method: CRAMF uses retrieval-augmented generation (RAG) to fetch formal definitions from Mathlib4, employs contextual query augmentation, and a dual-channel hybrid retrieval strategy.

Result: CRAMF improves translation accuracy by up to 62.1% (average 29.9%) on benchmarks like miniF2F, ProofNet, and AdvancedMath.

Conclusion: CRAMF effectively enhances autoformalization by integrating concept retrieval, addressing key challenges in the field.

Abstract: Interactive theorem provers (ITPs) require manual formalization, which is labor-intensive and demands expert knowledge. While automated formalization offers a potential solution, it faces two major challenges: model hallucination (e.g., undefined predicates, symbol misuse, and version incompatibility) and the semantic gap caused by ambiguous or missing premises in natural language descriptions. To address these issues, we propose CRAMF, a Concept-driven Retrieval-Augmented Mathematical Formalization framework. CRAMF enhances LLM-based autoformalization by retrieving formal definitions of core mathematical concepts, providing contextual grounding during code generation. However, applying retrieval-augmented generation (RAG) in this setting is non-trivial due to the lack of structured knowledge bases, the polymorphic nature of mathematical concepts, and the high precision required in formal retrieval. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4, the standard mathematical library for the Lean 4 theorem prover, indexing over 26,000 formal definitions and 1,000+ core mathematical concepts. To address conceptual polymorphism, we propose contextual query augmentation with domain- and application-level signals. In addition, we design a dual-channel hybrid retrieval strategy with reranking to ensure accurate and relevant definition retrieval. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers, yielding consistent improvements in translation accuracy, achieving up to 62.1% and an average of 29.9% relative improvement.
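A minimal sketch of dual-channel hybrid retrieval with reranking in the spirit of CRAMF's retrieval stage: a lexical channel and a dense channel are fused, and the top candidates are optionally reordered by a reranker. The toy scoring functions, fusion weight alpha, and reranker hook are assumptions, not CRAMF's actual components.

```python
# Dual-channel hybrid retrieval with reranking over a toy definition store.
import numpy as np

def sparse_score(query: str, doc: str) -> float:
    """Toy lexical-overlap channel (stand-in for BM25)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def dense_score(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """Dense channel: cosine similarity of embeddings."""
    return float(q_emb @ d_emb / (np.linalg.norm(q_emb) * np.linalg.norm(d_emb)))

def hybrid_retrieve(query, q_emb, corpus, alpha=0.5, top_k=5, rerank=None):
    """corpus: list of (definition_text, embedding). Fuse channels, then rerank."""
    fused = [(alpha * sparse_score(query, text)
              + (1 - alpha) * dense_score(q_emb, emb), text)
             for text, emb in corpus]
    candidates = sorted(fused, reverse=True)[:top_k]
    if rerank is not None:  # e.g., a cross-encoder scoring (query, text) pairs
        candidates = sorted(candidates, key=lambda c: rerank(query, c[1]), reverse=True)
    return [text for _, text in candidates]
```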

[545] Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction

Hiba Najjar, Deepak Pathak, Marlon Nuske, Andreas Dengel

Main category: cs.AI

TL;DR: The paper explores Transformer-based models for interpretable multimodal learning in crop yield prediction, comparing attention-based and Shapley-based methods for feature and modality attributions.

DetailsMotivation: To address the lack of interpretability in complex multimodal learning architectures, especially in agriculture, by leveraging Transformer models for explainable crop yield prediction.

Method: Uses Transformer-based models with Attention Rollout (AR) and Generic Attention (GA) for feature attributions, and proposes Weighted Modality Activation (WMA) for modality attributions, comparing them to Shapley Value Sampling (SVS).

Result: Transformer models outperform convolutional and recurrent networks, with AR providing more reliable temporal attributions than GA and SVS. Modality attributions vary between methods.

Conclusion: Transformer-based models enhance interpretability and performance in multimodal learning for agriculture, with AR being a robust method for temporal explanations.

Abstract: Multimodal learning enables various machine learning tasks to benefit from diverse data sources, effectively mimicking the interplay of different factors in real-world applications, particularly in agriculture. While the heterogeneous nature of involved data modalities may necessitate the design of complex architectures, the model interpretability is often overlooked. In this study, we leverage the intrinsic explainability of Transformer-based models to explain multimodal learning networks, focusing on the task of crop yield prediction at the subfield level. The large datasets used cover various crops, regions, and years, and include four different input modalities: multispectral satellite and weather time series, terrain elevation maps and soil properties. Based on the self-attention mechanism, we estimate feature attributions using two methods, namely the Attention Rollout (AR) and Generic Attention (GA), and evaluate their performance against Shapley-based model-agnostic estimations, Shapley Value Sampling (SVS). Additionally, we propose the Weighted Modality Activation (WMA) method to assess modality attributions and compare it with SVS attributions. Our findings indicate that Transformer-based models outperform other architectures, specifically convolutional and recurrent networks, achieving R2 scores that are higher by 0.10 and 0.04 at the subfield and field levels, respectively. AR is shown to provide more robust and reliable temporal attributions, as confirmed through qualitative and quantitative evaluation, compared to GA and SVS values. Information about crop phenology stages was leveraged to interpret the explanation results in the light of established agronomic knowledge. Furthermore, modality attributions revealed varying patterns across the two methods compared.[…]
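Attention Rollout, one of the two attention-based attribution methods compared, is compact enough to sketch: average the heads, add the identity to account for residual connections, renormalize the rows, and multiply the maps across layers. Array shapes below are illustrative.

```python
# Attention Rollout (Abnar & Zuidema, 2020) sketch.
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer arrays, each of shape (heads, tokens, tokens)."""
    rollout = np.eye(attentions[0].shape[-1])
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)               # fuse heads
        a = a + np.eye(a.shape[0])                # account for residual connection
        a = a / a.sum(axis=-1, keepdims=True)     # re-normalize rows
        rollout = a @ rollout                     # accumulate across layers
    return rollout  # rollout[i, j]: attribution of output token i to input token j
```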

[546] El Agente: An Autonomous Agent for Quantum Chemistry

Yunheng Zou, Austin H. Cheng, Abdulrahman Aldossary, Jiaru Bai, Shi Xuan Leong, Jorge Arturo Campos-Gonzalez-Angulo, Changhyeok Choi, Cher Tian Ser, Gary Tom, Andrew Wang, Zijian Zhang, Ilya Yakavets, Han Hao, Chris Crebolder, Varinia Bernales, Alán Aspuru-Guzik

Main category: cs.AI

TL;DR: El Agente Q is an LLM-based multi-agent system that simplifies quantum chemistry workflows via natural language prompts, achieving high task success and adaptability.

DetailsMotivation: The complexity of computational chemistry tools limits accessibility for non-specialists and challenges experts. El Agente Q aims to bridge this gap.

Method: The system uses a hierarchical memory framework for task decomposition, tool selection, and autonomous file handling, tested on course exercises and case studies.

Result: El Agente Q achieved >87% task success, demonstrated adaptive error handling, and supported multi-step workflows with transparency.

Conclusion: El Agente Q advances autonomous and accessible quantum chemistry, laying groundwork for future developments.

Abstract: Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging >87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.

[547] Large Language Models Do Not Simulate Human Psychology

Sarah Schröder, Thekla Morgenroth, Ulrike Kuhl, Valerie Vaquet, Benjamin Paaßen

Main category: cs.AI

TL;DR: The paper argues against using LLMs like ChatGPT to simulate human psychology in research, showing conceptual and empirical flaws in their reliability.

DetailsMotivation: To caution against replacing human participants in psychological studies with LLMs, highlighting their limitations in simulating human psychology.

Method: Provides conceptual arguments and empirical evidence, testing LLMs’ responses to wording changes and novel items.

Result: LLMs show notable discrepancies from human responses and lack reliability, even when fine-tuned.

Conclusion: LLMs do not simulate human psychology and should be validated against human responses for each new application.

Abstract: Large Language Models (LLMs), such as ChatGPT, are increasingly used in research, ranging from simple writing assistance to complex data annotation tasks. Recently, some research has suggested that LLMs may even be able to simulate human psychology and can, hence, replace human participants in psychological studies. We caution against this approach. We provide conceptual arguments against the hypothesis that LLMs simulate human psychology. We then present empirical evidence illustrating our arguments by demonstrating that slight changes to wording that correspond to large changes in meaning lead to notable discrepancies between LLMs’ and human responses, even for the recent CENTAUR model that was specifically fine-tuned on psychological responses. Additionally, different LLMs show very different responses to novel items, further illustrating their lack of reliability. We conclude that LLMs do not simulate human psychology and recommend that psychological researchers should treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.

[548] Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values

Nell Watson, Ahmed Amer, Evan Harris, Preeti Ravindra, Shujun Zhang

Main category: cs.AI

TL;DR: The paper introduces a ‘superego’ agent for aligning agentic AI systems with human values, reducing harmful outputs by up to 98.3%.

DetailsMotivation: Challenges in aligning autonomous AI with diverse human values, safety, and compliance hinder practical deployment.

Method: A ‘superego’ agent dynamically steers AI planning using user-selected ‘Creed Constitutions’ and a real-time compliance enforcer.

Result: Achieves up to 98.3% harm score reduction and near-perfect refusal rates for leading LLMs.

Conclusion: The approach simplifies AI alignment, improving safety and adaptability to individual and cultural contexts.

Abstract: Agentic AI systems, possessing capabilities for autonomous planning and action, show great potential across diverse domains. However, their practical deployment is hindered by challenges in aligning their behavior with varied human values, complex safety requirements, and specific compliance needs. Existing alignment methodologies often falter when faced with the complex task of providing personalized context without inducing confabulation or operational inefficiencies. This paper introduces a novel solution: a ‘superego’ agent, designed as a personalized oversight mechanism for agentic AI. This system dynamically steers AI planning by referencing user-selected ‘Creed Constitutions’ encapsulating diverse rule sets – with adjustable adherence levels to fit non-negotiable values. A real-time compliance enforcer validates plans against these constitutions and a universal ethical floor before execution. We present a functional system, including a demonstration interface with a prototypical constitution-sharing portal, and successful integration with third-party models via the Model Context Protocol (MCP). Comprehensive benchmark evaluations (HarmBench, AgentHarm) demonstrate that our Superego agent dramatically reduces harmful outputs – achieving up to a 98.3% harm score reduction and near-perfect refusal rates (e.g., 100% with Claude Sonnet 4 on AgentHarm’s harmful set) for leading LLMs like Gemini 2.5 Flash and GPT-4o. This approach substantially simplifies personalized AI alignment, rendering agentic systems more reliably attuned to individual and cultural contexts, while also enabling substantial safety improvements. An overview on this research with examples is available at https://superego.creed.space.

[549] DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery

Keyu Li, Mohan Jiang, Dayuan Fu, Yunze Wu, Xiangkun Hu, Dequan Wang, Pengfei Liu

Main category: cs.AI

TL;DR: The paper introduces DatasetResearch, a benchmark for evaluating AI agents’ ability to discover and synthesize datasets, revealing significant gaps in current capabilities.

DetailsMotivation: The bottleneck in AI development has shifted from computational power to data availability, with many valuable datasets hidden across various sources. The paper explores whether AI agents can autonomously discover datasets to meet specific user needs.

Method: The study introduces DatasetResearch, a benchmark with 208 real-world demands, evaluating AI agents’ dataset discovery and synthesis abilities using a tri-dimensional framework.

Result: Advanced systems achieve only 22% on the challenging DatasetResearch-pro subset, highlighting a gap in capabilities. Search agents excel in knowledge tasks, while synthesis agents perform better in reasoning tasks, but both fail on ‘corner cases.’

Conclusion: The findings establish a baseline for dataset discovery agents and guide future AI systems toward autonomous dataset curation. The benchmark and analysis are publicly available for further research.

Abstract: The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability, with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents’ ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems achieve a score of only 22% on our challenging DatasetResearch-pro subset, exposing the vast gap between current capabilities and perfect dataset discovery. Our analysis uncovers a fundamental dichotomy: search agents excel at knowledge tasks through retrieval breadth, while synthesis agents dominate reasoning challenges via structured generation; yet both catastrophically fail on “corner cases” outside existing distributions. These findings establish the first rigorous baseline for dataset discovery agents and illuminate the path toward AI systems capable of finding any dataset in the digital universe. Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems and are publicly available at https://github.com/GAIR-NLP/DatasetResearch.

[550] MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair

Changqing Li, Tianlin Li, Xiaohan Zhang, Aishan Liu, Li Pan

Main category: cs.AI

TL;DR: MASteer is an end-to-end framework for trustworthiness repair in LLMs, using representation engineering to automate and adapt repairs, outperforming existing methods.

DetailsMotivation: Existing repair methods for LLMs (SFT, RLHF) are costly and slow, while prompt engineering lacks robustness. Representation engineering offers a lightweight alternative but lacks automation.

Method: MASteer combines AutoTester (multi-agent system for sample generation) and AutoRepairer (adaptive steering strategies) for automated, context-aware repairs.

Result: MASteer improves trustworthiness metrics by 15.36% on LLaMA-3.1-8B-Chat and 4.21% on Qwen-3-8B-Chat, maintaining general model capabilities.

Conclusion: MASteer provides scalable, efficient trustworthiness repair with strong robustness and generalization.

Abstract: Large Language Models (LLMs) face persistent and evolving trustworthiness issues, motivating developers to seek automated and flexible repair methods that enable convenient deployment across diverse scenarios. Existing repair methods like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. Representation engineering, which steers model behavior by injecting targeted concept vectors during inference, offers a lightweight, training-free alternative. However, current approaches depend on manually crafted samples and fixed steering strategies, limiting automation and adaptability. To overcome these challenges, we propose MASteer, the first end-to-end framework for trustworthiness repair in LLMs based on representation engineering. MASteer integrates two core components: AutoTester, a multi-agent system that generates diverse, high-quality steer samples tailored to developer needs; and AutoRepairer, which constructs adaptive steering strategies with anchor vectors for automated, context-aware strategy selection during inference. Experiments on standard and customized trustworthiness tasks show MASteer consistently outperforms baselines, improving metrics by 15.36% on LLaMA-3.1-8B-Chat and 4.21% on Qwen-3-8B-Chat, while maintaining general model capabilities. MASteer demonstrates strong robustness, generalization, and practical value for scalable, efficient trustworthiness repair.
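The representation-engineering primitive MASteer automates is inference-time activation steering: adding a concept vector to the residual stream at a chosen layer. Below is a minimal PyTorch sketch using a forward hook; the layer index, scale, and the derivation of steer_vec (e.g., a difference of mean activations on contrastive prompt pairs) are assumptions, not MASteer's actual strategy construction.

```python
# Minimal inference-time activation steering via a forward hook (sketch).
import torch

def add_steering_hook(layer_module, steer_vec: torch.Tensor, scale: float = 4.0):
    """Register a hook that shifts the layer's hidden states along a concept direction."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steer_vec.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Typical usage (names assumed for a HuggingFace-style decoder):
# handle = add_steering_hook(model.model.layers[15], steer_vec)
# ... model.generate(...) ...
# handle.remove()
```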

[551] DSperse: A Framework for Targeted Verification in Zero-Knowledge Machine Learning

Dan Ivanov, Tristan Freiberg, Haruna Isah

Main category: cs.AI

TL;DR: DSperse is a framework for distributed ML inference with cryptographic verification, focusing on targeted verification of subcomputations for efficiency.

DetailsMotivation: To reduce the high cost and rigidity of full-model verification in distributed zero-knowledge ML by enabling selective verification of key subcomputations.

Method: Uses ‘slices’ (verifiable segments) for targeted verification, enforced via audit, replication, or incentives. Evaluated with multiple proving systems.

Result: Empirical data on memory, runtime, and circuit behavior shows efficiency in sliced vs. unsliced configurations.

Conclusion: DSperse offers scalable, flexible verification aligned with model structure, minimizing trust where it matters most.

Abstract: DSperse is a modular framework for distributed machine learning inference with strategic cryptographic verification. Operating within the emerging paradigm of distributed zero-knowledge machine learning, DSperse avoids the high cost and rigidity of full-model circuitization by enabling targeted verification of strategically chosen subcomputations. These verifiable segments, or “slices”, may cover part or all of the inference pipeline, with global consistency enforced through audit, replication, or economic incentives. This architecture supports a pragmatic form of trust minimization, localizing zero-knowledge proofs to the components where they provide the greatest value. We evaluate DSperse using multiple proving systems and report empirical results on memory usage, runtime, and circuit behavior under sliced and unsliced configurations. By allowing proof boundaries to align flexibly with the model’s logical structure, DSperse supports scalable, targeted verification strategies suited to diverse deployment needs.
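To illustrate the targeted-verification idea with the simplest enforcement mechanism the paper mentions (audit), the sketch below re-executes one randomly chosen slice instead of the whole pipeline; real DSperse slices carry cryptographic proofs rather than being recomputed. Function names and the toy pipeline are assumptions.

```python
# Toy slice-level audit: verify one random slice, not the full pipeline.
import random

def run_slices(slices, x):
    """Worker: run the sliced pipeline, recording each slice's input/output."""
    trace = []
    for f in slices:
        y = f(x)
        trace.append((x, y))
        x = y
    return trace

def audit(slices, trace) -> bool:
    """Verifier: recompute one randomly chosen slice and compare to the trace."""
    i = random.randrange(len(slices))
    x_i, y_i = trace[i]
    return slices[i](x_i) == y_i

slices = [lambda x: x + 1, lambda x: x * 3, lambda x: x - 2]
trace = run_slices(slices, 5)   # 5 -> 6 -> 18 -> 16
print(audit(slices, trace))     # True for an honest worker
```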

[552] Simulating Biological Intelligence: Active Inference with Experiment-Informed Generative Model

Aswin Paul, Moein Khajehnejad, Forough Habibollahi, Brett J. Kagan, Adeel Razi

Main category: cs.AI

TL;DR: The paper explores biologically based AI systems for explainable and efficient decision-making, using active inference to model embodied agents in a simulated game-play environment.

DetailsMotivation: To develop safe and efficient AI systems by understanding purposeful behavior in autonomous agents, leveraging biologically plausible models for explainability.

Method: Proposes a framework using active inference and experiment-informed generative models to simulate decision-making in embodied agents within a game-play environment.

Result: Demonstrates learning in agents, highlighting the role of memory-based learning and predictive planning in intelligent decision-making.

Conclusion: Contributes to explainable AI with a biologically grounded, scalable approach to understanding agent behavior.

Abstract: With recent and rapid advancements in artificial intelligence (AI), understanding the foundation of purposeful behaviour in autonomous agents is crucial for developing safe and efficient systems. While artificial neural networks have dominated the path to AI, recent studies are exploring the potential of biologically based systems, such as living biological neuronal networks. Along with promises of high power and data efficiency, these systems may also inform more explainable and biologically plausible models. In this work, we propose a framework rooted in active inference, a general theory of behaviour, to model decision-making in embodied agents. Using experiment-informed generative models, we simulate decision-making processes in a simulated game-play environment, mirroring experimental setups that use biological neurons. Our results demonstrate learning in these agents, providing insights into the role of memory-based learning and predictive planning in intelligent decision-making. This work contributes to the growing field of explainable AI by offering a biologically grounded and scalable approach to understanding purposeful behaviour in agents.
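A compact sketch of one discrete active-inference step of the kind such frameworks rest on: a Bayesian belief update followed by action selection minimizing (here, only the risk term of) expected free energy. The two-state generative model and preference vector are illustrative assumptions, not the experiment-informed model from the paper.

```python
# One discrete active-inference step: belief update + risk-minimizing action.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

A = np.array([[0.9, 0.1],          # P(obs | state): rows obs, cols state
              [0.1, 0.9]])
B = {0: np.array([[1.0, 1.0],      # P(s' | s, action): action 0 drives to state 0
                  [0.0, 0.0]]),
     1: np.array([[0.0, 0.0],      # action 1 drives to state 1
                  [1.0, 1.0]])}
C = softmax(np.array([3.0, 0.0]))  # preferred observation distribution

def update_belief(prior, obs_idx):
    post = A[obs_idx] * prior
    return post / post.sum()

def expected_free_energy(belief, action):
    qs_next = B[action] @ belief               # predicted next state
    qo_next = A @ qs_next                      # predicted observations
    return float(np.sum(qo_next * np.log(qo_next / C + 1e-16)))  # risk (KL)

belief = update_belief(np.array([0.5, 0.5]), obs_idx=0)
action = min((0, 1), key=lambda a: expected_free_energy(belief, a))
print(belief, action)  # belief tilts toward state 0; action 0 steers toward C
```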

[553] Efficient and Reliable Hitting-Set Computations for the Implicit Hitting Set Approach

Hannes Ihalainen, Dieter Vandesande, André Schidler, Jeremias Berg, Bart Bogaerts, Matti Järvisalo

Main category: cs.AI

TL;DR: The paper explores alternative techniques for hitting set optimization in the implicit hitting set (IHS) framework, comparing pseudo-Boolean reasoning and stochastic local search to traditional integer programming. It finds trade-offs between efficiency and reliability, with PB reasoning offering correctness guarantees.

DetailsMotivation: To address the limitations of traditional integer programming in IHS, such as numerical instability, and to explore more reliable and certifiable alternatives like pseudo-Boolean reasoning.

Method: The study evaluates alternative HS optimization techniques, including pseudo-Boolean reasoning and stochastic local search, comparing them to integer programming in the context of IHS.

Result: PB reasoning is competitive with exact IP solvers and provides correctness certificates, while commercial IP solvers remain efficient but suffer from numerical instability.

Conclusion: PB reasoning offers a viable, certifiable alternative to IP solvers in IHS, balancing efficiency and reliability, though IP solvers remain the most effective for HS computations.

Abstract: The implicit hitting set (IHS) approach offers a general framework for solving computationally hard combinatorial optimization problems declaratively. IHS iterates between a decision oracle used for extracting sources of inconsistency and an optimizer for computing so-called hitting sets (HSs) over the accumulated sources of inconsistency. While the decision oracle is language-specific, the optimizer is usually instantiated through integer programming. We explore alternative algorithmic techniques for hitting set optimization based on different ways of employing pseudo-Boolean (PB) reasoning as well as stochastic local search. We extensively evaluate the practical feasibility of the alternatives, in particular in the context of pseudo-Boolean (0-1 IP) optimization as one of the most recent instantiations of IHS. Highlighting a trade-off between efficiency and reliability, we find that while a commercial IP solver remains the most effective way to instantiate HS computations, it can cause correctness issues due to numerical instability; in fact, we show that exact HS computations instantiated via PB reasoning can be made competitive with a numerically exact IP solver. Furthermore, the use of PB reasoning as a basis for HS computations allows for obtaining certificates for the correctness of IHS computations, generally applicable to any IHS instantiation in which reasoning in the declarative language at hand can be captured in the PB-based proof format we employ.
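
For readers unfamiliar with the IHS skeleton, here is a minimal sketch of the loop the paper optimizes. The brute-force `minimum_hitting_set` stands in for the IP or PB-based optimizer, and `oracle` is a placeholder for the language-specific decision oracle; both names are illustrative.

```python
from itertools import combinations

def minimum_hitting_set(cores, universe):
    """Brute-force minimum hitting set; an IP or PB solver replaces this in practice."""
    for k in range(len(universe) + 1):
        for candidate in combinations(universe, k):
            chosen = set(candidate)
            if all(chosen & core for core in cores):
                return chosen
    return set(universe)

def ihs_loop(oracle, universe):
    """Implicit hitting set: alternate between the decision oracle and the HS optimizer.

    `oracle(hs)` returns None if the hitting set `hs` yields a consistent
    (optimal) solution, or a new core -- a set of elements, at least one of
    which must be hit -- otherwise.
    """
    cores = []
    while True:
        hs = minimum_hitting_set(cores, universe)
        core = oracle(hs)
        if core is None:
            return hs          # optimal hitting set found
        cores.append(core)     # accumulate the new source of inconsistency
```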

[554] Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach

Naseem Machlovi, Maryam Saleki, Innocent Ababio, Ruhul Amin

Main category: cs.AI

TL;DR: The paper explores the limitations of Large Language Models (LLMs) in nuanced moderation tasks like detecting implicit hate and biases, proposing a framework and SafePhi, a fine-tuned model, which outperforms benchmarks.

DetailsMotivation: The increasing integration of AI in daily life necessitates safer and more reliable moderation, but LLMs struggle with subjective and context-dependent issues like hate speech and biases.

Method: An experimental framework using SOTA models was developed, including a benchmark dataset with 49 categories, and SafePhi, a QLoRA fine-tuned version of Phi-4, was introduced.

Result: SafePhi achieved a Macro F1 score of 0.89, outperforming OpenAI Moderator (0.77) and Llama Guard (0.74). LLMs underperformed in critical domains, highlighting the need for better data and human oversight.

Conclusion: The study underscores the need for more heterogeneous data and human-in-the-loop approaches to improve LLM robustness and explainability in moderation tasks.

Abstract: As AI systems become more integrated into daily life, the need for safer and more reliable moderation has never been greater. Large Language Models (LLMs) have demonstrated remarkable capabilities, surpassing earlier models in complexity and performance. Their evaluation across diverse tasks has consistently showcased their potential, enabling the development of adaptive and personalized agents. However, despite these advancements, LLMs remain prone to errors, particularly in areas requiring nuanced moral reasoning. They struggle with detecting implicit hate, offensive language, and gender biases due to the subjective and context-dependent nature of these issues. Moreover, their reliance on training data can inadvertently reinforce societal biases, leading to inconsistencies and ethical concerns in their outputs. To explore the limitations of LLMs in this role, we developed an experimental framework based on state-of-the-art (SOTA) models to assess human emotions and offensive behaviors. The framework introduces a unified benchmark dataset encompassing 49 distinct categories spanning the wide spectrum of human emotions, offensive and hateful text, and gender and racial biases. Furthermore, we introduced SafePhi, a QLoRA fine-tuned version of Phi-4, adapting to diverse ethical contexts and outperforming benchmark moderators by achieving a Macro F1 score of 0.89, where OpenAI Moderator and Llama Guard score 0.77 and 0.74, respectively. This research also highlights the critical domains where LLM moderators consistently underperformed, underscoring the need to incorporate more heterogeneous and representative data with human-in-the-loop oversight for better model robustness and explainability.
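
The paper fine-tunes Phi-4 with QLoRA. As a rough illustration of that recipe, not the authors' training code, the sketch below shows a standard 4-bit quantization plus LoRA adapter setup with Hugging Face transformers and peft; the model id, ranks, and target module names are illustrative and depend on the exact architecture.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4 quantization (the "Q" in QLoRA);
# the model id and hyperparameters below are illustrative only.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4",
                                             quantization_config=bnb)

# Attach low-rank adapters; only these small matrices are trained,
# while the quantized base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # a small fraction of total parameters
```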

[555] Designing a Feedback-Driven Decision Support System for Dynamic Student Intervention

Timothy Oluwapelumi Adeyemi, Nadiah Fahad AlOtaibi

Main category: cs.AI

TL;DR: A Feedback-Driven Decision Support System (DSS) with adaptive learning improves student performance prediction by continuously refining models with new data, reducing RMSE by 10.7%.

DetailsMotivation: Static machine learning models in education lack adaptability to new data, such as post-intervention outcomes, limiting timely academic intervention.

Method: Proposes a closed-loop DSS using LightGBM-based regressor with incremental retraining, a Flask web interface, and SHAP for explainability.

Result: Experimental results show a 10.7% RMSE reduction and improved prediction accuracy for intervened students.

Conclusion: The adaptive DSS advances educational analytics by enabling self-improving, human-centered, and responsive AI, suitable for LMS integration.

Abstract: Accurate prediction of student performance is essential for timely academic intervention. However, most machine learning models in education are static and cannot adapt when new data, such as post-intervention outcomes, become available. To address this limitation, we propose a Feedback-Driven Decision Support System (DSS) with a closed-loop architecture that enables continuous model refinement. The system integrates a LightGBM-based regressor with incremental retraining, allowing educators to input updated student results, which automatically trigger model updates. This adaptive mechanism improves prediction accuracy by learning from real-world academic progress. The platform features a Flask-based web interface for real-time interaction and incorporates SHAP for explainability, ensuring transparency. Experimental results show a 10.7% reduction in RMSE after retraining, with consistent upward adjustments in predicted scores for intervened students. By transforming static predictors into self-improving systems, our approach advances educational analytics toward human-centered, data-driven, and responsive AI. The framework is designed for integration into LMS and institutional dashboards.
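
A minimal sketch of the closed-loop retraining step described above, assuming the lightgbm and shap packages; the function names and hyperparameters are illustrative, not the authors' code.

```python
import lightgbm as lgb
import shap

def retrain_on_feedback(booster, X_new, y_new, params=None):
    """Continue training an existing LightGBM regressor on newly logged outcomes.

    `init_model` warm-starts from the previous booster, so post-intervention
    results refine the model instead of triggering a fit from scratch.
    """
    params = params or {"objective": "regression", "learning_rate": 0.05}
    new_data = lgb.Dataset(X_new, label=y_new)
    return lgb.train(params, new_data, num_boost_round=50, init_model=booster)

def explain(booster, X):
    """Per-feature SHAP values for transparency on each prediction."""
    explainer = shap.TreeExplainer(booster)
    return explainer.shap_values(X)
```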

[556] EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning

Yi Tang, Kaini Wang, Yang Chen, Guangquan Zhou

Main category: cs.AI

TL;DR: EndoAgent is a memory-guided AI agent for endoscopic image diagnosis, integrating iterative reasoning and adaptive tool selection, outperforming existing models.

DetailsMotivation: Existing AI methods lack unified coordination for multi-step clinical workflows in endoscopy, and AI agents' potential in this domain is underexplored.

Method: EndoAgent uses a dual-memory design for logical coherence (short-term action tracking) and enhanced reasoning (long-term experiential learning), integrating expert-designed tools.

Result: EndoAgent outperforms general and medical multimodal models, demonstrating strong flexibility and reasoning capabilities.

Conclusion: EndoAgent advances endoscopic AI by combining memory-guided reasoning with adaptive tool integration, validated by extensive benchmarking.

Abstract: Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities.
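
A toy sketch of what a dual-memory design like this can look like: a bounded short-term buffer for action tracking and a long-term store queried by embedding similarity. This illustrates the concept only, not EndoAgent's implementation.

```python
from collections import deque

class DualMemory:
    """Short-term action tracking plus long-term experiential recall (sketch)."""

    def __init__(self, short_len=10):
        self.short = deque(maxlen=short_len)   # recent (action, result) steps
        self.long = []                          # (case_embedding, lesson) pairs

    def track(self, action, result):
        """Record a step for logical coherence within the current case."""
        self.short.append((action, result))

    def remember(self, embedding, lesson):
        """Store a distilled lesson from a completed case."""
        self.long.append((embedding, lesson))

    def recall(self, query_emb, sim, k=3):
        """Return the k most similar past lessons; `sim` is a similarity fn."""
        ranked = sorted(self.long, key=lambda e: sim(query_emb, e[0]),
                        reverse=True)
        return [lesson for _, lesson in ranked[:k]]
```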

[557] Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape

Quan Shi, Wang Xi, Zenghui Ding, Jianqing Gao, Xianjun Yang

Main category: cs.AI

TL;DR: The paper formalizes LLMs as probabilistic Turing machines, proves illusions are inevitable, and proposes two escape routes: RAGs as oracle machines and continuous learning via neural game theory.

DetailsMotivation: Addressing the core obstacle of illusion in LLMs to ensure reliable deployment.

Method: Formalizing LLMs as probabilistic Turing machines, proving inevitability of illusions, and proposing RAGs and continuous learning as solutions.

Result: Proof of inevitable illusions and effectiveness of RAGs and continuous learning.

Conclusion: Two escape routes (RAGs and continuous learning) provide theoretical foundations for mitigating LLM illusions.

Abstract: The illusion phenomenon of large language models (LLMs) is the core obstacle to their reliable deployment. This article formalizes the large language model as a probabilistic Turing machine by constructing a “computational necessity hierarchy”, and for the first time proves that illusions are inevitable at the boundaries of diagonalization, incomputability, and information theory, supported by the new “learner pump lemma”. However, we propose two “escape routes”: one is to model Retrieval Enhanced Generations (RAGs) as oracle machines, proving their absolute escape through “computational jumps” and providing the first formal theory for the effectiveness of RAGs; the second is to formalize continuous learning as an “internalized oracle” mechanism and to implement this path through a novel neural game theory framework. Finally, this article proposes a

[558] Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach

Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li

Main category: cs.AI

TL;DR: The paper introduces Comp-Comp, an iterative benchmarking framework for domain-specific LLMs, emphasizing comprehensiveness and compactness over scaling laws, and validates it with XUBench in academia.

DetailsMotivation: Existing domain-specific benchmarks rely heavily on scaling laws, neglecting the impact of corpus and QA set design on precision and recall. This gap is addressed by proposing a new framework.

Method: The Comp-Comp framework iteratively balances comprehensiveness (semantic recall) and compactness (precision) for corpus and QA set construction.

Result: The framework was validated with XUBench, a large-scale closed-domain benchmark in academia, demonstrating its effectiveness.

Conclusion: The Comp-Comp framework is extensible beyond academia, offering insights for benchmark construction in various domains.

Abstract: Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily focus on the scaling law, relying on massive corpora for supervised fine-tuning or generating extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle. Here, comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, guiding both corpus and QA set construction. To validate our framework, we conducted a case study in a well-renowned university, resulting in the creation of XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case in this work, our Comp-Comp framework is designed to be extensible beyond academia, providing valuable insights for benchmark construction across various domains.

[559] Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning

He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Hui Li, Tong Li

Main category: cs.AI

TL;DR: Pentest-R1 is a reinforcement learning framework for LLMs to improve penetration testing, achieving state-of-the-art results on benchmarks.

DetailsMotivation: Current LLMs struggle with error handling, reasoning, and autonomous task execution in penetration testing.

Method: Uses a two-stage reinforcement learning pipeline: offline RL on real-world walkthroughs and online RL in a CTF environment.

Result: Achieves 24.2% success on AutoPenBench and 15.0% on Cybench, matching top proprietary models.

Conclusion: The synergy of offline and online RL stages is critical for Pentest-R1’s success.

Abstract: Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework’s effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. On Cybench, it attains a 15.0% success rate in unguided tasks, establishing a new state-of-the-art for open-source LLMs and matching the performance of top proprietary models. Ablation studies confirm that the synergy of both training stages is critical to its success.

[560] Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks for Enhanced Action Understanding

Zhaoyu Chen, Hongnan Lin, Yongwei Nie, Fei Ma, Xuemiao Xu, Fei Yu, Chengjiang Long

Main category: cs.AI

TL;DR: Invert4TVG enhances TVG by integrating inversion tasks (Verb Completion, Action Recognition, Video Description) to improve both localization accuracy and semantic understanding, outperforming state-of-the-art methods.

DetailsMotivation: Current TVG methods overfit to temporal IoU, compromising semantic action understanding, which is critical for robust TVG.

Method: Introduces three inversion tasks derived from TVG annotations, integrated via reinforcement learning with reward functions.

Result: Achieves a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model compared to Time-R1.

Conclusion: Invert4TVG strengthens semantic understanding and raises the ceiling of localization accuracy by inverting TVG to derive query-related actions.

Abstract: Temporal Video Grounding (TVG) seeks to localize video segments matching a given textual query. Current methods, while optimizing for high temporal Intersection-over-Union (IoU), often overfit to this metric, compromising semantic action understanding in the video and query, a critical factor for robust TVG. To address this, we introduce Inversion Tasks for TVG (Invert4TVG), a novel framework that enhances both localization accuracy and action understanding without additional data. Our approach leverages three inversion tasks derived from existing TVG annotations: (1) Verb Completion, predicting masked action verbs in queries from video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions of video segments that explicitly embed query-relevant actions. These tasks, integrated with TVG via a reinforcement learning framework with well-designed reward functions, ensure balanced optimization of localization and semantics. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model compared to Time-R1. By inverting TVG to derive query-related actions from segments, our approach strengthens semantic understanding, significantly raising the ceiling of localization accuracy.
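
One plausible shape for the balanced reward described above, shown as a hedged sketch: a weighted sum of temporal IoU and the averaged inversion-task scores. The weight and the reward interface are assumptions, not the paper's exact design.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def combined_reward(pred_seg, gt_seg, inv_rewards, w_loc=0.5):
    """Balance localization (IoU) against the three inversion-task rewards.

    `inv_rewards` holds scores in [0, 1] for verb completion, action
    recognition, and video description; the weighting is illustrative only.
    """
    r_semantic = sum(inv_rewards.values()) / len(inv_rewards)
    return w_loc * temporal_iou(pred_seg, gt_seg) + (1 - w_loc) * r_semantic
```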

[561] Generative AI for Strategic Plan Development

Jesse Ponnock

Main category: cs.AI

TL;DR: The paper evaluates BERTopic and NMF for topic modeling in strategic plan development for government organizations, finding BERTopic superior.

DetailsMotivation: To leverage GAI for automating strategic plan development in large-scale government organizations, overcoming regulatory challenges.

Method: BERTopic and NMF are trained on GAO reports to generate themes for Vision Elements, then scored for similarity against a strategic plan.

Result: Both techniques matched 100% of Vision Elements, with BERTopic achieving better correlation scores.

Conclusion: GAI can effectively aid strategic plan development, with BERTopic outperforming NMF. Future work will operationalize the model and test other modules.

Abstract: Given recent breakthroughs in Generative Artificial Intelligence (GAI) and Large Language Models (LLMs), more and more professional services are being augmented through Artificial Intelligence (AI), which once seemed impossible to automate. This paper presents a modular model for leveraging GAI in developing strategic plans for large scale government organizations and evaluates leading machine learning techniques in their application towards one of the identified modules. Specifically, the performance of BERTopic and Non-negative Matrix Factorization (NMF) is evaluated in their ability to use topic modeling to generate themes representative of Vision Elements within a strategic plan. To accomplish this, BERTopic and NMF models are trained using a large volume of reports from the Government Accountability Office (GAO). The generated topics from each model are then scored for similarity against the Vision Elements of a published strategic plan and the results are compared. Our results show that these techniques are capable of generating themes similar to 100% of the Vision Elements evaluated against. Further, we conclude that BERTopic performs best in this application, with more than half of its correlated topics achieving a “medium” or “strong” correlation. A GAI-enabled strategic plan development capability impacts a multi-billion dollar industry and assists the federal government in meeting regulatory requirements that are crucial to the public good. Further work will focus on the operationalization of the concept proven in this study as well as the viability of the remaining modules in the proposed model for GAI-generated strategic plans.
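
To make the evaluation pipeline concrete, here is a simplified stand-in for the NMF half of the comparison: fit topics on report text, represent each topic by its top terms, and score topics against Vision Elements with TF-IDF cosine similarity. Correlation thresholds (e.g. "medium" or "strong") would then be applied to the resulting matrix; all parameters are illustrative.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def nmf_topic_similarity(reports, vision_elements, n_topics=20, top_n=10):
    """Fit NMF topics on reports; score each topic against each vision element."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    X = vec.fit_transform(reports)
    nmf = NMF(n_components=n_topics).fit(X)
    terms = vec.get_feature_names_out()
    # Represent every topic by its highest-weighted terms
    topics = [" ".join(terms[i] for i in comp.argsort()[-top_n:])
              for comp in nmf.components_]
    # Rows: topics, columns: vision elements
    return cosine_similarity(vec.transform(topics),
                             vec.transform(vision_elements))
```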

[562] Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs

Dom Huh, Prasant Mohapatra

Main category: cs.AI

TL;DR: The paper extends LLMs by integrating them with multi-agent decision-making, proposing a framework for multi-agentic LLMs with advanced techniques like prompt engineering and memory architectures.

DetailsMotivation: To enhance collaboration and coordination among agents by leveraging LLMs and multi-agent decision-making.

Method: Proposes a systematic framework integrating LLMs with multi-agent systems, focusing on prompt engineering, memory architectures, multi-modal processing, and fine-tuning.

Result: Evaluated through ablation studies on game settings with social dilemmas, showing effectiveness of the proposed framework.

Conclusion: The integration of LLMs with multi-agent decision-making improves coordination and problem-solving in collaborative settings.

Abstract: Language is a ubiquitous tool that is foundational to reasoning and collaboration, ranging from everyday interactions to sophisticated problem-solving tasks. The establishment of a common language can serve as a powerful asset in ensuring clear communication and understanding amongst agents, facilitating desired coordination and strategies. In this work, we extend the capabilities of large language models (LLMs) by integrating them with advancements in multi-agent decision-making algorithms. We propose a systematic framework for the design of multi-agentic large language models (LLMs), focusing on key integration practices. These include advanced prompt engineering techniques, the development of effective memory architectures, multi-modal information processing, and alignment strategies through fine-tuning algorithms. We evaluate these design choices through extensive ablation studies on classic game settings with significant underlying social dilemmas and game-theoretic considerations.

[563] CP-Agent: Agentic Constraint Programming

Stefan Szeider

Main category: cs.AI

TL;DR: A new agentic approach using a pure ReAct-based Python coding agent successfully translates natural language problem descriptions into formal constraint models, solving all 101 CP-Bench problems without fixed workflows.

DetailsMotivation: Automating the translation of natural language problem descriptions into formal constraint models is challenging due to the need for domain expertise and flexible modeling. Previous fixed workflows failed on many benchmarks.

Method: A general-purpose Python coding agent based on ReAct, using a persistent IPython kernel for stateful execution. Domain expertise is injected via a project prompt, and the agent dynamically tests, debugs, and verifies solutions.

Result: The approach solved all 101 problems in the CP-Bench benchmark set, demonstrating success without specialized architectures or predefined workflows.

Conclusion: Constraint modeling benefits from combining general coding tools and prompt-encoded domain expertise, rather than rigid workflows or specialized agent designs.

Abstract: Translating natural language problem descriptions into formal constraint models remains a fundamental challenge in constraint programming, requiring deep expertise in both the problem domain and modeling frameworks. Previous approaches to automating this translation have employed fixed workflows with predetermined modeling steps, failing on a significant number of benchmark problems. We present a new approach using a pure agentic strategy without any fixed pipeline. We developed a general-purpose Python coding agent based on the ReAct (Reason and Act) principle, utilizing a persistent IPython kernel for stateful code execution and iterative development. Rather than embedding constraint programming logic into the agent architecture, domain-specific expertise is injected solely through a carefully crafted project prompt. The agent combines this prompt-encoded knowledge with access to file operations and code execution tools, enabling it to test hypotheses, debug failures, and verify solutions dynamically. Implemented in just a few hundred lines of code, this architecture successfully solves all 101 problems of the CP-Bench constraint programming benchmark set. The results suggest that constraint modeling tasks require the combination of general coding tools and domain expertise encoded in prompts, rather than specialized agent architectures or predefined workflows.
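
The core loop is small enough to sketch. The version below is a hedged approximation, not the authors' code: `llm` and `kernel` are placeholders for a chat model and a persistent IPython kernel whose state survives across steps.

```python
def react_loop(task, llm, kernel, max_steps=20):
    """Minimal ReAct cycle: the model reasons, emits code, observes execution.

    `llm(history)` returns a step like "Thought: ... Action: <code>";
    `kernel.run(code)` executes code in a stateful kernel so variables and
    partial models persist across iterations.
    """
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm(history)
        history.append(step)
        if "FINAL_ANSWER" in step:
            return step                        # agent declares it is done
        code = step.split("Action:", 1)[-1]
        observation = kernel.run(code)         # stateful execution
        history.append(f"Observation: {observation}")
    return None                                # step budget exhausted
```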

[564] Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy

Alexander Duffy, Samuel J Paech, Ishana Shastri, Elizabeth Karpinski, Baptiste Alloui-Cros, Tyler Marques, Matthew Lyle Olson

Main category: cs.AI

TL;DR: A new evaluation harness enables local LLMs to play Diplomacy without fine-tuning, using optimized game state representation and tooling for analysis. Larger models perform best, but smaller ones are adequate.

DetailsMotivation: To democratize the study of strategic reasoning in LLMs by eliminating the need for fine-tuning or specialized training for Diplomacy, a complex game.

Method: Data-driven iteration to optimize textual game state representation, tooling for hypothesis testing, and Critical State Analysis for deep game moment analysis.

Result: Larger LLMs perform best, but smaller models are adequate. The harness enables reliable match completion without fine-tuning.

Conclusion: The harness democratizes LLM evaluation for strategic reasoning and provides insights into naturally emerging capabilities in widely used LLMs.

Abstract: We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning, due to the high complexity and information density of Diplomacy’s game state. Combined with the high variance of matches, these factors made Diplomacy prohibitive for study. In this work, we used data-driven iteration to optimize a textual game state representation such that a 24B model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding that the larger models perform best while the smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating and analyzing key moments in a game at depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced.

[565] MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo

Main category: cs.AI

TL;DR: The paper introduces MCPToolBench++, a benchmark for evaluating LLMs’ ability to use MCP tools, addressing gaps in datasets, diverse response formats, and context window limitations.

DetailsMotivation: Existing evaluations of LLMs' MCP tool-use abilities lack comprehensive datasets, face challenges with diverse response formats, and are limited by context window constraints.

Method: Proposes MCPToolBench++, a large-scale, multi-domain benchmark built on 4k+ MCP servers from 40+ categories, evaluating single-step and multi-step tool calls.

Result: Evaluated SOTA LLMs with agentic abilities on MCPToolBench++, reporting performance metrics.

Conclusion: MCPToolBench++ addresses evaluation challenges and provides a standardized benchmark for assessing LLMs’ MCP tool-use capabilities.

Abstract: LLMs’ capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage, etc. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs’ and AI agents’ MCP tool use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks to evaluate various MCP tools. Second, the diverse formats of responses from MCP tool call execution further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates in functions like programming and math, the success rate of real-world MCP tools is not guaranteed and varies across different MCP servers. Furthermore, the LLM’s context window also limits the number of available tools that can be called in a single run, because the textual descriptions of tools and their parameters are too long for an LLM to process all at once. To help address the challenges of evaluating LLMs’ performance on calling MCP tools, we propose MCPToolBench++, a large-scale, multi-domain AI agent tool use benchmark. As of July 2025, this benchmark is built upon a marketplace of over 4k MCP servers from more than 40 categories, collected from the MCP marketplaces and GitHub communities. The datasets consist of both single-step and multi-step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and report the results.

[566] Optimization of Private Semantic Communication Performance: An Uncooperative Covert Communication Method

Wenjing Zhang, Ye Hu, Tao Luo, Zhilong Zhang, Mingzhe Chen

Main category: cs.AI

TL;DR: A novel covert semantic communication framework is proposed, using a friendly jammer to protect semantic information from eavesdropping, and a reinforcement learning algorithm to optimize transmission quality and privacy.

DetailsMotivation: To prevent attackers from eavesdropping on semantic information (image meaning) transmitted over time slots, while ensuring efficient transmission.

Method: A prioritised sampling assisted twin delayed deep deterministic policy gradient algorithm optimizes semantic information and transmit power without server-jammer communication.

Result: Improves privacy by 77.8% and transmission quality by 14.3% compared to traditional methods.

Conclusion: The proposed framework and algorithm effectively enhance privacy and transmission quality in covert semantic communication.

Abstract: In this paper, a novel covert semantic communication framework is investigated. Within this framework, a server extracts and transmits the semantic information, i.e., the meaning of image data, to a user over several time slots. An attacker seeks to detect and eavesdrop on the semantic transmission to acquire details of the original image. To avoid the data’s meaning being eavesdropped on by an attacker, a friendly jammer is deployed to transmit jamming signals that interfere with the attacker so as to hide the transmitted semantic information. Meanwhile, the server strategically selects time slots for semantic information transmission. Due to limited energy, the jammer does not communicate with the server, and hence the server does not know the jammer’s transmit power. Therefore, the server must jointly optimize the semantic information transmitted at each time slot and the corresponding transmit power to maximize the privacy and the semantic information transmission quality for the user. To solve this problem, we propose a prioritised sampling assisted twin delayed deep deterministic policy gradient algorithm to jointly determine the transmitted semantic information and the transmit power per time slot without any communication between the server and the jammer. Compared to standard reinforcement learning methods, the proposed method uses an additional Q network to estimate Q values, so that the agent can select the action with the lower Q value from the two Q networks, thus avoiding locally optimal action selection and estimation bias in the Q values. Simulation results show that the proposed algorithm can improve the privacy and the semantic information transmission quality by up to 77.8% and 14.3%, respectively, compared to traditional reinforcement learning methods.
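
The twin-Q idea the abstract describes is the clipped double-Q trick from TD3: taking the minimum of two target critics counters overestimation bias. A minimal sketch of that target computation, assuming PyTorch, with placeholder networks:

```python
import torch

def td3_target(rewards, next_states, q1_target, q2_target, actor_target,
               gamma=0.99):
    """TD3-style target: take the minimum of two target critics.

    `q1_target`, `q2_target`, and `actor_target` are placeholder networks;
    using min(Q1, Q2) avoids the value overestimation that can lock a DDPG
    agent into locally optimal actions.
    """
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q1 = q1_target(next_states, next_actions)
        q2 = q2_target(next_states, next_actions)
        return rewards + gamma * torch.min(q1, q2)
```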

[567] HGMF: A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol

Wenpeng Xing, Zhipeng Chen, Changting Lin, Meng Han

Main category: cs.AI

TL;DR: HGMF improves tool selection accuracy and reduces latency for LLMs by hierarchically pruning irrelevant options using a Gaussian Mixture Model.

DetailsMotivation: The challenge of selecting the correct tool from large, hierarchical libraries due to limited LLM context windows and noise from irrelevant options.

Method: HGMF maps queries and tools into a semantic space, clusters and filters tools hierarchically using GMM, and produces a compact candidate set.

Result: HGMF significantly improves tool selection accuracy and reduces inference latency in experiments.

Conclusion: HGMF is scalable and effective for large-scale tool libraries, enhancing LLM performance in real-world tasks.

Abstract: Invoking external tools enables Large Language Models (LLMs) to perform complex, real-world tasks, yet selecting the correct tool from large, hierarchically-structured libraries remains a significant challenge. The limited context windows of LLMs and noise from irrelevant options often lead to low selection accuracy and high computational costs. To address this, we propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic pruning method for scalable tool invocation. HGMF first maps the user query and all tool descriptions into a unified semantic space. The framework then operates in two stages: it clusters servers using a Gaussian Mixture Model (GMM) and filters them based on the query’s likelihood. Subsequently, it applies the same GMM-based clustering and filtering to the tools associated with the selected servers. This hierarchical process produces a compact, high-relevance candidate set, simplifying the final selection task for the LLM. Experiments on a public dataset show that HGMF significantly improves tool selection accuracy while reducing inference latency, confirming the framework’s scalability and effectiveness for large-scale tool libraries.
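
A compact sketch of one HGMF stage using scikit-learn, applied first to server embeddings and then to the tools of the surviving servers; the component counts and the shared text encoder are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_prune(query_vec, item_vecs, n_components=8, keep_top=3):
    """One HGMF stage: cluster item embeddings with a GMM and keep only the
    clusters the query most plausibly belongs to.

    `query_vec` and `item_vecs` are embeddings from a shared text encoder.
    Called hierarchically: first over server embeddings, then over the tools
    of the retained servers.
    """
    gmm = GaussianMixture(n_components=n_components).fit(item_vecs)
    # Posterior responsibility of each mixture component for the query
    resp = gmm.predict_proba(query_vec[None, :])[0]
    best = np.argsort(resp)[-keep_top:]
    labels = gmm.predict(item_vecs)
    return np.where(np.isin(labels, best))[0]   # indices of retained items
```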

[568] ThinkTuning: Instilling Cognitive Reflections without Distillation

Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, Ben Zhou

Main category: cs.AI

TL;DR: ThinkTuning, a GRPO-based interactive training method, improves reasoning in LLMs by using teacher feedback, showing notable performance gains over baselines.

DetailsMotivation: To develop reasoning abilities in models lacking self-reflective behavior, as RL alone doesn't instill new reasoning.

Method: ThinkTuning uses teacher-student interaction: the teacher provides feedback during student rollouts to guide reasoning.

Result: Achieves 3.85% average improvement over zero-shot baselines, with specific gains on MATH-500, AIME, and GPQA-Diamond.

Conclusion: Teacher-guided feedback effectively enhances reasoning in student models, outperforming vanilla-GRPO.

Abstract: Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don’t exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback – enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student’s thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning.

[569] Multimodal AI Systems for Enhanced Laying Hen Welfare Assessment and Productivity Optimization

Daniel Essien, Suresh Neethirajan

Main category: cs.AI

TL;DR: The paper proposes using multimodal AI to improve poultry welfare monitoring, highlighting feature-level fusion as optimal, and introduces tools (DTS, DRI) and a deployment framework for practical adoption.

DetailsMotivation: Traditional welfare checks are limited by human observation and single-sensor data, failing to capture the complexity of laying hen welfare in modern farms.

Method: The study uses multimodal AI to integrate visual, acoustic, environmental, and physiological data, focusing on feature-level fusion strategies. It introduces the Domain Transfer Score (DTS) and Data Reliability Index (DRI) for evaluation.

Result: Feature-level fusion balances robustness and performance best, and the proposed tools and framework address adoption barriers like sensor fragility and deployment costs.

Conclusion: The work paves the way for proactive, precision-driven welfare systems, combining productivity with ethical animal care.

Abstract: The future of poultry production depends on a paradigm shift replacing subjective, labor-intensive welfare checks with data-driven, intelligent monitoring ecosystems. Traditional welfare assessments, limited by human observation and single-sensor data, cannot fully capture the complex, multidimensional nature of laying hen welfare in modern farms. Multimodal Artificial Intelligence (AI) offers a breakthrough, integrating visual, acoustic, environmental, and physiological data streams to reveal deeper insights into avian welfare dynamics. This investigation highlights multimodal AI’s transformative potential, showing that intermediate (feature-level) fusion strategies achieve the best balance between robustness and performance under real-world poultry conditions, and offer greater scalability than early or late fusion approaches. Key adoption barriers include sensor fragility in harsh farm environments, high deployment costs, inconsistent behavioral definitions, and limited cross-farm generalizability. To address these, we introduce two novel evaluation tools - the Domain Transfer Score (DTS) to measure model adaptability across diverse farm settings, and the Data Reliability Index (DRI) to assess sensor data quality under operational constraints. We also propose a modular, context-aware deployment framework designed for laying hen environments, enabling scalable and practical integration of multimodal sensing. This work lays the foundation for a transition from reactive, unimodal monitoring to proactive, precision-driven welfare systems that unite productivity with ethical, science-based animal care.
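
Feature-level (intermediate) fusion, the strategy the paper finds most balanced, can be sketched as simple concatenation of per-modality embeddings before a shared downstream model; the modality names and weighting below are illustrative only.

```python
import numpy as np

def feature_level_fusion(modal_features, weights=None):
    """Intermediate (feature-level) fusion: concatenate per-modality embeddings.

    `modal_features` maps modality names (e.g. visual, acoustic, environmental,
    physiological) to 1-D feature vectors produced by per-modality encoders;
    the fused vector feeds a single downstream welfare classifier.
    """
    names = sorted(modal_features)              # deterministic ordering
    weights = weights or {n: 1.0 for n in names}
    return np.concatenate(
        [weights[n] * np.asarray(modal_features[n]) for n in names])
```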

[570] Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

Main category: cs.AI

TL;DR: SkillNav introduces a modular, skill-based framework for Vision-and-Language Navigation (VLN), improving generalization and performance on benchmarks like R2R and GSA-R2R.

DetailsMotivation: Current VLN methods struggle with generalization, especially in unseen scenarios requiring complex reasoning.

Method: SkillNav decomposes navigation into interpretable atomic skills, each managed by specialized agents, and uses a VLM-based router for dynamic skill selection.

Result: Achieves state-of-the-art performance on R2R and strong generalization on GSA-R2R.

Conclusion: SkillNav’s modular, skill-based approach enhances VLN performance and generalization.

Abstract: Vision-and-Language Navigation (VLN) poses significant challenges in enabling agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then introduce a novel zero-shot Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav achieves a new state-of-the-art performance on the R2R benchmark and demonstrates strong generalization to the GSA-R2R benchmark that includes novel instruction styles and unseen environments.
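
A hedged sketch of the routing step: a zero-shot VLM call picks one atomic skill agent per time step from the sub-goal, recent history, and current observation. The prompt, skill names, and `vlm` interface are assumptions, not SkillNav's implementation.

```python
def route_step(observation, history, sub_goal, skills, vlm):
    """Zero-shot routing: ask a VLM which atomic skill fits the current sub-goal.

    `skills` maps skill names (e.g. "vertical_movement", "stop_and_pause")
    to specialized agents; `vlm(prompt, image)` is a placeholder model call
    returning one skill name. A "default" agent catches unparseable replies.
    """
    prompt = (f"Sub-goal: {sub_goal}\n"
              f"Recent actions: {history[-3:]}\n"
              f"Pick exactly one skill from {sorted(skills)}.")
    choice = vlm(prompt, observation).strip()
    agent = skills.get(choice, skills["default"])
    return agent.act(observation, sub_goal)
```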

[571] Disentangling Multiplex Spatial-Temporal Transition Graph Representation Learning for Socially Enhanced POI Recommendation

Jie Li, Haoye Dong, Zhengyang Wu, Zetao Zheng, Mingrong Lin

Main category: cs.AI

TL;DR: DiMuST is a POI recommendation model using disentangled representation learning to align spatial-temporal transitions, improving accuracy and interpretability.

DetailsMotivation: Existing POI recommendation models misalign spatial-temporal transitions, causing redundancy and reduced interpretability.

Method: DiMuST uses a Disentangled variational multiplex graph Auto-Encoder (DAE) to disentangle shared/private features, fusing shared features via PoE and denoising private ones with contrastive constraints.

Result: DiMuST outperforms existing methods on two datasets across multiple metrics.

Conclusion: DiMuST effectively captures spatial-temporal transitions while preserving their intrinsic correlations, enhancing POI recommendations.

Abstract: Next Point-of-Interest (POI) recommendation is a research hotspot in business intelligence, where users’ spatial-temporal transitions and social relationships play key roles. However, most existing works model spatial and temporal transitions separately, leading to misaligned representations of the same spatial-temporal key nodes. This misalignment introduces redundant information during fusion, increasing model uncertainty and reducing interpretability. To address this issue, we propose DiMuST, a socially enhanced POI recommendation model based on disentangled representation learning over multiplex spatial-temporal transition graphs. The model employs a novel Disentangled variational multiplex graph Auto-Encoder (DAE), which first disentangles shared and private distributions using a multiplex spatial-temporal graph strategy. It then fuses the shared features via a Product of Experts (PoE) mechanism and denoises the private features through contrastive constraints. The model effectively captures the spatial-temporal transition representations of POIs while preserving the intrinsic correlation of their spatial-temporal relationships. Experiments on two challenging datasets demonstrate that our DiMuST significantly outperforms existing methods across multiple metrics.
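
The Product of Experts fusion mentioned above has a closed form for Gaussian experts: the fused precision is the sum of the experts' precisions, and the fused mean is the precision-weighted average of their means. A small numpy sketch, illustrative of the mechanism rather than DiMuST's code:

```python
import numpy as np

def product_of_experts(mus, sigmas):
    """Fuse Gaussian experts N(mu_i, sigma_i^2) into one Gaussian.

    `mus` and `sigmas` have shape (n_experts, dim). The product has
    precision T = sum_i 1/sigma_i^2 and mean (sum_i mu_i/sigma_i^2) / T,
    so confident experts (small sigma) dominate the fused estimate.
    """
    precisions = 1.0 / np.square(sigmas)
    T = precisions.sum(axis=0)
    mu = (mus * precisions).sum(axis=0) / T
    return mu, np.sqrt(1.0 / T)
```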

[572] 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning

Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap

Main category: cs.AI

TL;DR: A multi-agent framework improves privacy in LLMs by decomposing tasks and reducing leakage, outperforming single-agent baselines.

DetailsMotivation: Addressing privacy concerns in LLMs when processing mixed private/public information.

Method: Multi-agent framework with specialized subtasks (extraction, classification) and iterative validation.

Result: Reduced private info leakage by 18% (ConfAIde) and 19% (PrivacyLens) with GPT-4o.

Conclusion: Principled multi-agent systems enhance contextual privacy in LLMs.

Abstract: Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources (e.g., summarizing meetings with private and public information). We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. To understand how privacy errors emerge and propagate, we conduct a systematic ablation over information-flow topologies, revealing when and why upstream detection mistakes cascade into downstream leakage. Experiments on the ConfAIde and PrivacyLens benchmarks with several open-source and closed-source LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (18% on ConfAIde and 19% on PrivacyLens with GPT-4o) while preserving the fidelity of public content, outperforming single-agent baselines. These results highlight the promise of principled information-flow design in multi-agent systems for contextual privacy with LLMs.
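
A minimal sketch of the decomposed pipeline: one agent extracts items, one classifies each against the sharing context, one validates, and a final call summarizes only what survives. The prompts and `llm` interface are placeholders, not the paper's configuration.

```python
def private_summary(document, llm):
    """Decomposed privacy pipeline: extract -> classify -> validate -> redact.

    `llm(prompt)` is a placeholder chat-model call. Splitting the work keeps
    each agent's information load small and lets errors be caught downstream.
    """
    # Agent 1: extraction only
    items = llm(f"List every factual item, one per line:\n{document}").splitlines()
    # Agent 2: classify each item against the sharing context
    public = [it for it in items
              if llm(f"May '{it}' appear in a public summary? yes/no") == "yes"]
    # Agent 3: iterative validation pass over the filtered set
    checked = [it for it in public
               if llm(f"Double-check: is '{it}' free of private details? "
                      f"yes/no") == "yes"]
    # Final agent: summarize only validated public content
    return llm("Summarize faithfully:\n" + "\n".join(checked))
```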

[573] Ethics2vec: aligning automatic agents and human preferences

Gianluca Bontempi

Main category: cs.AI

TL;DR: The paper proposes Ethics2Vec, an extension of the Anything2vec approach, to align AI decision-making with human ethical values by mapping agent strategies to vector representations.

DetailsMotivation: The challenge of aligning AI systems with human ethical values, especially when dealing with incommensurable values like life and cost, motivates the need for a common metric space.

Method: The paper introduces Ethics2Vec, mapping agent decision-making strategies to multivariate vectors for comparison with human values, starting with binary decisions and extending to control laws like in self-driving cars.

Result: Ethics2Vec provides a framework to assess and compare AI alignment with human ethical values through vector representations.

Conclusion: The approach offers a scalable method to address AI alignment by leveraging vectorization techniques from hard-to-quantify domains.

Abstract: Though intelligent agents are supposed to improve human experience (or make it more efficient), it is hard from a human perspective to grasp the ethical values which are explicitly or implicitly embedded in an agent behaviour. This is the well-known problem of alignment, which refers to the challenge of designing AI systems that align with human values, goals and preferences. This problem is particularly challenging since most human ethical considerations refer to incommensurable (i.e. non-measurable and/or incomparable) values and criteria. Consider, for instance, a medical agent prescribing a treatment to a cancer patient. How could it take into account (and/or weigh) incommensurable aspects like the value of a human life and the cost of the treatment? Now, the alignment between human and artificial values is possible only if we define a common space where a metric can be defined and used. This paper proposes to extend to ethics the conventional Anything2vec approach, which has been successful in plenty of similar and hard-to-quantify domains (ranging from natural language processing to recommendation systems and graph analysis). This paper proposes a way to map an automatic agent decision-making (or control law) strategy to a multivariate vector representation, which can be used to compare and assess the alignment with human values. The Ethics2Vec method is first introduced in the case of an automatic agent performing binary decision-making. Then, a vectorisation of an automatic control law (like in the case of a self-driving car) is discussed to show how the approach can be extended to automatic control settings.

[574] Symmetry-Aware Transformer Training for Automated Planning

Markus Fritzsche, Elliot Gestrin, Jendrik Seipp

Main category: cs.AI

TL;DR: Transformers struggle with planning tasks due to problem symmetries. A novel contrastive learning method makes them symmetry-aware, improving performance in plan-generation and heuristic-prediction.

DetailsMotivation: Transformers lack inductive bias for handling symmetries in planning tasks, limiting their effectiveness in automated planning.

Method: Proposed a contrastive learning objective to make transformers symmetry-aware, combined with architectural improvements.

Result: Improved performance in plan-generation and heuristic-prediction across multiple planning domains, addressing PlanGPT’s limitations.

Conclusion: Symmetry-aware training effectively enhances transformers for automated planning tasks.

Abstract: While transformers excel in many settings, their application in the field of automated planning is limited. Prior work like PlanGPT, a state-of-the-art decoder-only transformer, struggles with extrapolation from easy to hard planning problems. This in turn stems from problem symmetries: planning tasks can be represented with arbitrary variable names that carry no meaning beyond being identifiers. This causes a combinatorial explosion of equivalent representations that pure transformers cannot efficiently learn from. We propose a novel contrastive learning objective to make transformers symmetry-aware and thereby compensate for their lack of inductive bias. Combining this with architectural improvements, we show that transformers can be efficiently trained for either plan-generation or heuristic-prediction. Our results across multiple planning domains demonstrate that our symmetry-aware training effectively and efficiently addresses the limitations of PlanGPT.
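
A hedged sketch of a symmetry-aware contrastive objective in this spirit: a task and a variable-renamed copy of it form a positive pair, other tasks in the batch serve as negatives, and an InfoNCE loss pulls the pair's embeddings together. The renaming scheme and `encoder` interface (a text encoder returning one embedding per string) are assumptions, not the paper's exact objective.

```python
import random
import torch
import torch.nn.functional as F

def rename_variables(task, names):
    """Create a symmetric variant of a planning task by permuting object names.

    Two passes through unique placeholders avoid chained replacements when the
    permutation maps names onto each other; a real implementation would also
    replace whole tokens to avoid substring collisions.
    """
    perm = dict(zip(names, random.sample(names, len(names))))
    for i, old in enumerate(names):
        task = task.replace(old, f"@{i}@")
    for i, old in enumerate(names):
        task = task.replace(f"@{i}@", perm[old])
    return task

def symmetry_contrastive_loss(encoder, tasks, names, temperature=0.1):
    """InfoNCE: a task and its renamed variant are positives, others negatives."""
    z1 = F.normalize(encoder(tasks), dim=-1)
    z2 = F.normalize(encoder([rename_variables(t, names) for t in tasks]),
                     dim=-1)
    logits = z1 @ z2.T / temperature        # pairwise similarities
    targets = torch.arange(len(tasks))      # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```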

[575] Best-Effort Policies for Robust Markov Decision Processes

Alessandro Abate, Thom Badings, Giuseppe De Giacomo, Francesco Fabiano

Main category: cs.AI

TL;DR: The paper introduces ORBE policies for robust MDPs, refining policy selection by maximizing worst-case and non-adversarial expected returns.

DetailsMotivation: Addressing the issue of multiple optimal robust policies in RMDPs, which differ under non-adversarial conditions despite equivalent worst-case performance.

Method: Proposes ORBE policies, inspired by game theory, and presents an algorithm to compute them with minimal overhead.

Result: ORBE policies exist, are structurally characterized, and can be computed efficiently. Numerical experiments validate feasibility.

Conclusion: ORBE policies provide a principled tie-breaker for optimal robust policies, enhancing decision-making in RMDPs.

Abstract: We study the common generalization of Markov decision processes (MDPs) with sets of transition probabilities, known as robust MDPs (RMDPs). A standard goal in RMDPs is to compute a policy that maximizes the expected return under an adversarial choice of the transition probabilities. If the uncertainty in the probabilities is independent between the states, known as s-rectangularity, such optimal robust policies can be computed efficiently using robust value iteration. However, there might still be multiple optimal robust policies, which, while equivalent with respect to the worst-case, reflect different expected returns under non-adversarial choices of the transition probabilities. Hence, we propose a refined policy selection criterion for RMDPs, drawing inspiration from the notions of dominance and best-effort in game theory. Instead of seeking a policy that only maximizes the worst-case expected return, we additionally require the policy to achieve a maximal expected return under different (i.e., not fully adversarial) transition probabilities. We call such a policy an optimal robust best-effort (ORBE) policy. We prove that ORBE policies always exist, characterize their structure, and present an algorithm to compute them with a small overhead compared to standard robust value iteration. ORBE policies offer a principled tie-breaker among optimal robust policies. Numerical experiments show the feasibility of our approach.
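
The tie-breaking idea can be sketched in a few lines: among actions that are optimal for the worst case, keep the one that is best under a non-adversarial choice of transition probabilities. This is a crude stand-in for the paper's algorithm, not its implementation.

```python
import numpy as np

def orbe_actions(Q_worst, Q_nominal, tol=1e-9):
    """ORBE-style tie-break: best-effort selection among robust-optimal actions.

    Q_worst[s, a]: robust (adversarial) Q-values; Q_nominal[s, a]: Q-values
    under some non-adversarial transition choice. Returns one action per state.
    """
    # Mask of actions achieving the robust optimum in each state
    robust_opt = Q_worst >= Q_worst.max(axis=1, keepdims=True) - tol
    # Among those, pick the action with the best nominal return
    masked = np.where(robust_opt, Q_nominal, -np.inf)
    return masked.argmax(axis=1)
```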

[576] KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations

Mubaris Nadeem, Johannes Zenkert, Lisa Bender, Christian Weber, Madjid Fathi

Main category: cs.AI

TL;DR: A Knowledge Graph aids first responders by providing AI-driven treatment recommendations in emergency scenarios.

DetailsMotivation: The increasing need for rapid, personalized emergency care requires innovative tools to assist first responders in making timely, informed decisions.

Method: The paper introduces a Knowledge Graph as a central knowledge representation, leveraging AI for situation pre-recognition and treatment recommendations.

Result: The system enables intelligent, data-driven treatment suggestions, improving emergency care efficiency.

Conclusion: The Knowledge Graph enhances first responders’ ability to deliver optimized healthcare in time-critical situations.

Abstract: Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders are in a rush to reach the patient in need, provide first aid, and save lives. In these situations, they must be able to provide personalized and optimized healthcare in the shortest possible time and estimate the patients condition with the help of freshly recorded vital data in an emergency situation. However, in such a timedependent situation, first responders and medical experts cannot fully grasp their knowledge and need assistance and recommendation for further medical treatments. To achieve this, on the spot calculated, evaluated, and processed knowledge must be made available to improve treatments by first responders. The Knowledge Graph presented in this article as a central knowledge representation provides first responders with an innovative knowledge management that enables intelligent treatment recommendations with an artificial intelligence-based pre-recognition of the situation.

[577] (X)-evolve: Solution space evolution powered by large language models

Yi Zhai, Zhiqiang Wei, Ruohan Li, Keyu Pan, Shuo Liu, Lu Zhang, Jianmin Ji, Wuyang Zhang, Yu Zhang, Yanyong Zhang

Main category: cs.AI

TL;DR: X-evolve is a novel method combining LLMs and EAs to evolve solution spaces, reducing LLM call costs and improving optimization efficiency.

DetailsMotivation: Current approaches evolve individual solutions, incurring high LLM call costs, limiting efficiency.

Method: X-evolve evolves solution spaces (sets of solutions) using LLMs to generate tunable programs, with score-based search exploring these spaces.

Result: Demonstrated efficacy in three problems: improved cap set bounds, larger independent sets in graphs, and better bin packing heuristics.

Conclusion: Evolving solution spaces enhances search effectiveness, enabling high-dimensional problem-solving previously deemed prohibitive.

Abstract: While combining large language models (LLMs) with evolutionary algorithms (EAs) shows promise for solving complex optimization problems, current approaches typically evolve individual solutions, often incurring high LLM call costs. We introduce X-evolve, a paradigm-shifting method that instead evolves solution spaces X (sets of individual solutions), subsets of the overall search space S. In X-evolve, LLMs generate tunable programs wherein certain code snippets, designated as parameters, define a tunable solution space. A score-based search algorithm then efficiently explores this parametrically defined space, guided by feedback from objective function scores. This strategy enables broader and more efficient exploration, which can potentially accelerate convergence at a much lower search cost, requiring up to two orders of magnitude fewer LLM calls than prior leading methods. We demonstrate X-evolve’s efficacy across three distinct hard optimization problems. For the cap set problem, we discover a larger partial admissible set, establishing a new tighter asymptotic lower bound for the cap set constant (C ≥ 2.2203). In information theory, we uncover a larger independent set (size 19,946) for the fifth strong product power of the 15-vertex cycle graph, thereby raising the known lower bound on its Shannon capacity. Furthermore, for the NP-hard online bin packing problem, we generate heuristics that consistently outperform standard strategies across established benchmarks. By evolving solution spaces, our method considerably improves search effectiveness, making it possible to tackle high-dimensional problems that were previously computationally prohibitive.
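
A toy sketch of the score-based search over a tunable solution space: the LLM-generated program exposes named parameters, and a cheap sampler explores their domains guided only by objective scores. Random sampling here stands in for the paper's search algorithm; all names are illustrative.

```python
import random

def search_solution_space(tunable_program, param_domains, score, budget=1000):
    """Explore an LLM-generated tunable program by sampling its parameter space.

    `tunable_program(params)` builds a concrete solution from a parameter dict;
    `param_domains` maps each parameter name to its candidate values; `score`
    is the objective. No LLM calls are spent inside this loop -- that is the
    point of evolving spaces rather than individual solutions.
    """
    best, best_score = None, float("-inf")
    for _ in range(budget):
        params = {k: random.choice(v) for k, v in param_domains.items()}
        s = score(tunable_program(params))
        if s > best_score:
            best, best_score = params, s
    return best, best_score
```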

[578] Deep Reinforcement Learning with anticipatory reward in LSTM for Collision Avoidance of Mobile Robots

Olivier Poulet, Frédéric Guinand, François Guérin

Main category: cs.AI

TL;DR: A method using LSTM for short-term position prediction and DQN for collision risk anticipation reduces collisions in constrained robot environments.

DetailsMotivation: To improve collision avoidance in multi-robot systems without communication or identifiers.

Method: Uses LSTM for position prediction and DQN with dynamically modulated rewards for collision risk anticipation.

Result: Significant reduction in collisions and improved stability, even with low sampling frequency (1 Hz).

Conclusion: The method is effective, computationally inexpensive, and suitable for embedded systems.

Abstract: This article proposes a collision risk anticipation method based on short-term prediction of the agents' positions. A Long Short-Term Memory (LSTM) model, trained on past trajectories, is used to estimate the next position of each robot. This prediction allows us to define an anticipated collision risk by dynamically modulating the reward of a Deep Q-Learning Network (DQN) agent. The approach is tested in a constrained environment, where two robots move without communication or identifiers. Despite a limited sampling frequency (1 Hz), the results show a significant decrease in the number of collisions and an improvement in stability. The proposed method, which is computationally inexpensive, appears particularly attractive for implementation on embedded systems.
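
Illustrative sketch (assumed shapes and constants, not the authors' code): an LSTM maps a short window of past 2-D positions to the next position, and the anticipated inter-robot distance modulates the DQN reward.

```python
import torch
import torch.nn as nn

class PositionPredictor(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, past_xy):            # past_xy: (batch, steps, 2)
        out, _ = self.lstm(past_xy)
        return self.head(out[:, -1])       # predicted next (x, y)

def anticipatory_reward(base_reward, pred_self, pred_other,
                        safe_dist=0.5, beta=1.0):
    """Shape the DQN reward by the anticipated collision risk."""
    dist = torch.linalg.norm(pred_self - pred_other, dim=-1)
    risk = torch.exp(-dist / safe_dist)    # assumed risk shaping, not the paper's exact form
    return base_reward - beta * risk
```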

[579] Interpreting Fedspeak with Confidence: A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths

Rui Yao, Qi Chai, Jinhai Yao, Siyuan Li, Junhao Chen, Qi Zhang, Hao Wang

Main category: cs.AI

TL;DR: The paper introduces an LLM-based framework to decode Fedspeak, enhancing policy stance classification with uncertainty awareness and domain-specific reasoning.

DetailsMotivation: Fedspeak's nuanced language impacts financial markets, making automated interpretation crucial for forecasting and policy analysis.

Method: Proposes an LLM-based framework with domain-specific reasoning and a dynamic uncertainty decoding module for confidence assessment.

Result: Achieves state-of-the-art performance in policy stance analysis, with perceptual uncertainty correlating with model errors.

Conclusion: The framework effectively deciphers Fedspeak, improving accuracy and reliability for financial and policy applications.

Abstract: “Fedspeak”, the stylized and often nuanced language used by the U.S. Federal Reserve, encodes implicit policy signals and strategic stances. The Federal Open Market Committee strategically employs Fedspeak as a communication tool to shape market expectations and influence both domestic and global economic conditions. As such, automatically parsing and interpreting Fedspeak presents a high-impact challenge, with significant implications for financial forecasting, algorithmic trading, and data-driven policy analysis. In this paper, we propose an LLM-based, uncertainty-aware framework for deciphering Fedspeak and classifying its underlying monetary policy stance. Technically, to enrich the semantic and contextual representation of Fedspeak texts, we incorporate domain-specific reasoning grounded in the monetary policy transmission mechanism. We further introduce a dynamic uncertainty decoding module to assess the confidence of model predictions, thereby enhancing both classification accuracy and model reliability. Experimental results demonstrate that our framework achieves state-of-the-art performance on the policy stance analysis task. Moreover, statistical analysis reveals a significant positive correlation between perceptual uncertainty and model error rates, validating the effectiveness of perceptual uncertainty as a diagnostic signal.
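
The paper's dynamic uncertainty decoding module is its own contribution; as a generic stand-in, the sketch below scores stance predictions with normalized predictive entropy, so low-confidence calls can be flagged or routed for review. The stance labels are assumptions.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def stance_with_confidence(logits, labels=("hawkish", "neutral", "dovish")):
    p = softmax(logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    confidence = 1.0 - entropy / np.log(len(labels))   # 1 = certain, 0 = uniform
    return labels[int(p.argmax())], float(confidence)

print(stance_with_confidence([2.1, 0.3, -1.0]))  # ('hawkish', ~0.5)
```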

[580] Fitting Description Logic Ontologies to ABox and Query Examples

Maurice Funk, Marvin Grosser, Carsten Lutz

Main category: cs.AI

TL;DR: The paper studies ontology fitting for Boolean queries, analyzing complexity for AQs, CQs, and UCQs in ALC and ALCI.

DetailsMotivation: To address the challenge of fitting an ontology to satisfy given positive and negative examples of Boolean queries.

Method: Analyzes fitting problems using description logics ALC and ALCI, with query languages AQs, CQs, and UCQs. Provides characterizations and computational complexity results.

Result: Deciding whether a fitting ontology exists is coNP for AQs and full CQs, and 2ExpTime-complete for CQs and UCQs, in both ALC and ALCI.

Conclusion: The study provides clear complexity bounds for ontology fitting, aiding practical applications in ontology-mediated querying.

Abstract: We study a fitting problem inspired by ontology-mediated querying: given a collection of positive and negative examples of the form $(\mathcal{A},q)$ with $\mathcal{A}$ an ABox and $q$ a Boolean query, we seek an ontology $\mathcal{O}$ that satisfies $\mathcal{A} \cup \mathcal{O} \vDash q$ for all positive examples and $\mathcal{A} \cup \mathcal{O}\not\vDash q$ for all negative examples. We consider the description logics $\mathcal{ALC}$ and $\mathcal{ALCI}$ as ontology languages and a range of query languages that includes atomic queries (AQs), conjunctive queries (CQs), and unions thereof (UCQs). For all of the resulting fitting problems, we provide effective characterizations and determine the computational complexity of deciding whether a fitting ontology exists. This problem turns out to be coNP for AQs and full CQs and 2ExpTime-complete for CQs and UCQs. These results hold for both $\mathcal{ALC}$ and $\mathcal{ALCI}$.

[581] AdaptFlow: Adaptive Workflow Optimization via Meta-Learning

Runchuan Zhu, Bowen Jiang, Lingrui Mei, Fangkai Yang, Lu Wang, Haoxiang Gao, Fengshuo Bai, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Main category: cs.AI

TL;DR: AdaptFlow is a meta-learning framework for LLM workflows, enabling rapid adaptation to diverse tasks through language-guided modifications, outperforming existing methods.

DetailsMotivation: Existing LLM workflows rely on static templates or manual designs, limiting adaptability and scalability. AdaptFlow addresses this by learning generalizable workflow initializations.

Method: AdaptFlow uses a bi-level optimization scheme: inner loop refines workflows for subtasks using LLM feedback, while the outer loop updates shared initializations for cross-task performance.

Result: AdaptFlow outperforms manual and automatic baselines in QA, code generation, and math reasoning, achieving state-of-the-art results with strong generalization.

Conclusion: AdaptFlow demonstrates effective generalization to unseen tasks through language-guided workflow adaptation, offering a scalable solution for complex LLM workflows.

Abstract: Recent advances in large language models (LLMs) have sparked growing interest in agentic workflows, which are structured sequences of LLM invocations intended to solve complex tasks. However, existing approaches often rely on static templates or manually designed workflows, which limit adaptability to diverse tasks and hinder scalability. We propose AdaptFlow, a natural language-based meta-learning framework inspired by model-agnostic meta-learning (MAML). AdaptFlow learns a generalizable workflow initialization that enables rapid subtask-level adaptation. It employs a bi-level optimization scheme: the inner loop refines the workflow for a specific subtask using LLM-generated feedback, while the outer loop updates the shared initialization to perform well across tasks. This setup allows AdaptFlow to generalize effectively to unseen tasks by adapting the initialized workflow through language-guided modifications. Evaluated across question answering, code generation, and mathematical reasoning benchmarks, AdaptFlow consistently outperforms both manually crafted and automatically searched baselines, achieving state-of-the-art results with strong generalization across tasks and models. The source code and data are available at https://github.com/microsoft/DKI_LLM/tree/AdaptFlow/AdaptFlow.
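
A schematic of the bi-level scheme, in the spirit of MAML but over natural-language workflows; `call_llm`, `adapt`, and `merge` are hypothetical placeholders, not the released implementation.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def adapt(workflow: str, subtask: str) -> str:
    """Inner loop: refine the workflow for one subtask using LLM feedback."""
    feedback = call_llm(f"Critique this workflow for the subtask:\n{subtask}\n{workflow}")
    return call_llm(f"Revise the workflow using this feedback:\n{feedback}\n{workflow}")

def merge(init: str, adapted: list[str]) -> str:
    """Outer loop: fold subtask-specific edits back into the shared initialization."""
    return call_llm("Summarize these variants into one general workflow:\n"
                    + "\n---\n".join([init] + adapted))

def meta_train(init_workflow: str, subtasks: list[str], epochs: int = 3) -> str:
    for _ in range(epochs):
        adapted = [adapt(init_workflow, t) for t in subtasks]   # inner loop
        init_workflow = merge(init_workflow, adapted)           # outer update
    return init_workflow
```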

[582] FNBT: Full Negation Belief Transformation for Open-World Information Fusion Based on Dempster-Shafer Theory of Evidence

Meishen He, Wenjun Ma, Jiao Wang, Huijun Yue, Xiaoma Fan

Main category: cs.AI

TL;DR: Proposes Full Negation Belief Transformation (FNBT) for open-world information fusion under heterogeneous frames, addressing limitations of traditional Dempster-Shafer methods.

DetailsMotivation: Existing fusion methods fail for heterogeneous frames due to data silos and varied sources, necessitating a new approach.

Method: FNBT introduces a criterion for open-world tasks, extends frames for heterogeneous elements, and uses full negation to transform mass functions for fusion.

Result: FNBT satisfies theoretical properties (invariance, heritability, conflict elimination) and outperforms existing methods in classification tasks, resolving Zadeh’s counterexample.

Conclusion: FNBT effectively handles open-world fusion, proving practical and theoretical superiority over traditional methods.

Abstract: The Dempster-Shafer theory of evidence has been widely applied in the field of information fusion under uncertainty. Most existing research focuses on combining evidence within the same frame of discernment. However, in real-world scenarios, trained algorithms or data often originate from different regions or organizations, where data silos are prevalent. As a result, using different data sources or models to generate basic probability assignments may lead to heterogeneous frames, for which traditional fusion methods often yield unsatisfactory results. To address this challenge, this study proposes an open-world information fusion method, termed Full Negation Belief Transformation (FNBT), based on the Dempster-Shafer theory. More specifically, a criterion is introduced to determine whether a given fusion task belongs to the open-world setting. Then, by extending the frames, the method can accommodate elements from heterogeneous frames. Finally, a full negation mechanism is employed to transform the mass functions, so that existing combination rules can be applied to the transformed mass functions for such information fusion. Theoretically, the proposed method satisfies three desirable properties, which are formally proven: mass function invariance, heritability, and essential conflict elimination. Empirically, FNBT demonstrates superior performance in pattern classification tasks on real-world datasets and successfully resolves Zadeh’s counterexample, thereby validating its practical effectiveness.
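
FNBT's full negation transformation is the paper's contribution; once the mass functions live on a shared extended frame, the classical Dempster rule of combination applies. A minimal implementation of that standard rule, with focal elements represented as frozensets:

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule: intersect focal elements, renormalize away conflict."""
    combined, conflict = {}, 0.0
    for (a, p), (b, q) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + p * q
        else:
            conflict += p * q                 # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("a"): 0.5, frozenset("abc"): 0.5}
print(dempster_combine(m1, m2))   # {frozenset({'a'}): 0.8, frozenset({'a','b'}): 0.2}
```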

[583] TeamMedAgents: Enhancing Medical Decision-Making of LLMs Through Structured Teamwork

Pranav Pushkar Mishra, Mohammad Arvan, Mohan Zalake

Main category: cs.AI

TL;DR: TeamMedAgents integrates human teamwork components into multi-agent medical decision-making with LLMs, improving performance across 7 out of 8 medical benchmarks.

DetailsMotivation: To bridge human teamwork theories (Salas et al.'s 'Big Five' model) with computational multi-agent systems for better medical decision-making.

Method: Operationalizes six teamwork components (e.g., leadership, trust) into modular mechanisms, evaluates their impact across medical benchmarks, and conducts ablation studies.

Result: Consistent improvements in 7/8 datasets; optimal teamwork configurations vary by task complexity and domain.

Conclusion: TeamMedAgents advances collaborative AI by translating human teamwork theories into agentic systems for critical decision-making.

Abstract: We present TeamMedAgents, a novel multi-agent approach that systematically integrates evidence-based teamwork components from human-human collaboration into medical decision-making with large language models (LLMs). Our approach validates an organizational psychology teamwork model from human collaboration to computational multi-agent medical systems by operationalizing six core teamwork components derived from Salas et al.’s “Big Five” model: team leadership, mutual performance monitoring, team orientation, shared mental models, closed-loop communication, and mutual trust. We implement and evaluate these components as modular, configurable mechanisms within an adaptive collaboration architecture while assessing the effect of the number of agents involved based on the task’s requirements and domain. Systematic evaluation of computational implementations of teamwork behaviors across eight medical benchmarks (MedQA, MedMCQA, MMLU-Pro Medical, PubMedQA, DDXPlus, MedBullets, Path-VQA, and PMC-VQA) demonstrates consistent improvements across 7 out of 8 evaluated datasets. Controlled ablation studies conducted on 50 questions per configuration across 3 independent runs provide mechanistic insights into individual component contributions, revealing optimal teamwork configurations that vary by reasoning task complexity and domain-specific requirements. Our ablation analyses reveal dataset-specific optimal teamwork configurations, indicating that different medical reasoning modalities benefit from distinct collaborative patterns. TeamMedAgents represents an advancement in collaborative AI by providing a systematic translation of established teamwork theories from human collaboration into agentic collaboration, establishing a foundation for evidence-based multi-agent system design in critical decision-making domains.

[584] BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang

Main category: cs.AI

TL;DR: BlindGuard is an unsupervised defense method for detecting malicious agents in LLM-based multi-agent systems without needing labeled data or prior attack knowledge.

DetailsMotivation: Existing supervised defense methods rely heavily on labeled malicious agents, making them impractical for real-world scenarios. BlindGuard aims to provide a practical and generalizable solution.

Method: BlindGuard uses a hierarchical agent encoder to capture interaction patterns and a corruption-guided detector with noise injection and contrastive learning for training on normal behaviors.

Result: BlindGuard effectively detects diverse attack types (prompt injection, memory poisoning, tool attack) across various communication patterns, outperforming supervised baselines.

Conclusion: BlindGuard offers a robust, unsupervised solution for securing multi-agent systems, demonstrating superior generalizability and practicality.

Abstract: The security of LLM-based multi-agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision-making through inter-agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real-world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption-guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: https://github.com/MR9812/BlindGuard.

[585] From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework

Yunkai Hu, Tianqiao Zhao, Meng Yue

Main category: cs.AI

TL;DR: A novel LLM-assisted agent converts natural-language power system optimization descriptions into solver-ready formulations, ensuring feasibility and optimality by combining AI with optimization solvers.

DetailsMotivation: To address the limitations of LLMs in producing feasible and optimal solutions for power system optimization by integrating them with established optimization frameworks.

Method: The pipeline uses domain-aware prompts and schemas with an LLM, enforces feasibility through validation and iterative repair, and generates solver-ready models.

Result: The agent produces optimal or near-optimal solutions, demonstrated with the unit commitment problem, enhancing reliability through solver coupling.

Conclusion: Combining AI with optimization frameworks bridges high-level problem descriptions and executable models, improving decision-making in energy systems.

Abstract: This paper introduces a novel Large Language Model (LLM)-assisted agent that automatically converts natural-language descriptions of power system optimization scenarios into compact, solver-ready formulations and generates corresponding solutions. In contrast to approaches that rely solely on an LLM to produce solutions directly, the proposed method focuses on discovering a mathematically compatible formulation that can be efficiently solved by off-the-shelf optimization solvers. Directly using LLMs to produce solutions often leads to infeasible or suboptimal results, as these models lack the numerical precision and constraint-handling capabilities of established optimization solvers. The pipeline integrates a domain-aware prompt and schema with an LLM, enforces feasibility through systematic validation and iterative repair, and returns both solver-ready models and user-facing results. Using the unit commitment problem as a representative case study, the agent produces optimal or near-optimal schedules along with the associated objective costs. Results demonstrate that coupling the solver with task-specific validation significantly enhances solution reliability. This work shows that combining AI with established optimization frameworks bridges high-level problem descriptions and executable mathematical models, enabling more efficient decision-making in energy systems.
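
For concreteness, a deliberately small unit commitment model of the kind the agent must emit in solver-ready form; the data and the simplifications (no ramping, no minimum up/down times) are assumptions for illustration, not the paper's formulation. The PuLP modeling library stands in for whatever solver interface the agent targets.

```python
import pulp

T = range(4)                                    # hours
gens = {"G1": dict(pmin=10, pmax=100, cost=20, start=50),
        "G2": dict(pmin=20, pmax=80,  cost=30, start=20)}
demand = [60, 120, 150, 80]

prob = pulp.LpProblem("unit_commitment", pulp.LpMinimize)
u = pulp.LpVariable.dicts("on", (gens, T), cat="Binary")      # commitment
p = pulp.LpVariable.dicts("gen", (gens, T), lowBound=0)       # dispatch (MW)
s = pulp.LpVariable.dicts("start", (gens, T), cat="Binary")   # startup indicator

prob += pulp.lpSum(gens[g]["cost"] * p[g][t] + gens[g]["start"] * s[g][t]
                   for g in gens for t in T)

for t in T:
    prob += pulp.lpSum(p[g][t] for g in gens) == demand[t]    # power balance
    for g in gens:
        prob += p[g][t] >= gens[g]["pmin"] * u[g][t]          # capacity limits
        prob += p[g][t] <= gens[g]["pmax"] * u[g][t]
        prev = u[g][t - 1] if t > 0 else 0                    # startup logic
        prob += s[g][t] >= u[g][t] - prev

prob.solve()
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```

The validation-in-the-loop step then amounts to checking solver status and constraint feasibility and, on failure, repairing the generated formulation before re-solving.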

[586] Sortability of Time Series Data

Christopher Lohse, Jonas Wahl

Main category: cs.AI

TL;DR: The paper evaluates causal discovery algorithms for time-dependent processes, showing that dataset characteristics like varsortability and $R^2$-sortability also apply to autocorrelated time series. Empirical evidence is provided using simulated and real-world datasets, revealing surprising findings about causal information in scales.

DetailsMotivation: To address the challenge of evaluating causal discovery algorithms for time-dependent processes and explore how dataset characteristics like varsortability and $R^2$-sortability manifest in time series data.

Method: Adapts var- and $R^2$-sortability to time series data and tests them on simulated data (SVAR models, Erdős-Rényi graphs), climate challenge data, river stream datasets, and Causal Chamber data.

Result: Real-world datasets show high varsortability and low $R^2$-sortability, suggesting scales may carry significant causal information.

Conclusion: The findings highlight the impact of dataset characteristics on causal discovery performance and reveal unexpected insights about causal information in real-world data.

Abstract: Evaluating the performance of causal discovery algorithms that aim to find causal relationships between time-dependent processes remains a challenging topic. In this paper, we show that certain characteristics of datasets, such as varsortability (Reisach et al. 2021) and $R^2$-sortability (Reisach et al. 2023), also occur in datasets for autocorrelated stationary time series. We illustrate this empirically using four types of data: simulated data based on SVAR models and Erdős–Rényi graphs, the data used in the 2019 causality-for-climate challenge (Runge et al. 2019), real-world river stream datasets, and real-world data generated by the Causal Chamber (Gamella et al. 2024). To do this, we adapt var- and $R^2$-sortability to time series data. We also investigate the extent to which the performance of score-based causal discovery methods goes hand in hand with high sortability. Arguably, our most surprising finding is that the investigated real-world datasets exhibit high varsortability and low $R^2$-sortability, indicating that scales may carry a significant amount of causal information.
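
A simplified, edge-level variant of varsortability (Reisach et al. 2021) for intuition: the fraction of causal edges whose marginal variance increases from cause to effect, with ties counted as 1/2. The published measure aggregates over all directed paths, so this pared-down version is an assumption for illustration.

```python
import numpy as np

def edge_varsortability(data: np.ndarray, edges: list[tuple[int, int]]) -> float:
    var = data.var(axis=0)
    score = sum(1.0 if var[i] < var[j] else 0.5 if var[i] == var[j] else 0.0
                for i, j in edges)
    return score / len(edges)

# Toy example: X0 -> X1, with variance accumulating along the edge.
rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
x1 = 2.0 * x0 + rng.normal(size=1000)
print(edge_varsortability(np.column_stack([x0, x1]), [(0, 1)]))  # 1.0
```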

[587] Learning How to Vote with Principles: Axiomatic Insights Into the Collective Decisions of Neural Networks

Levin Hornischer, Zoi Terzopoulou

Main category: cs.AI

TL;DR: Neural networks can be used in voting theory but often fail to align with core axioms. Training with axiom-specific data doesn’t help, but optimizing for axioms can create new, superior voting rules.

DetailsMotivation: To explore if neural networks can meet the transparency needs of voting theory while aligning with its axioms.

Method: Proposed axiomatic deep voting, a framework to build and evaluate neural networks for preference aggregation using voting theory axioms.

Result: Neural networks often fail to align with voting axioms, axiom-specific training doesn’t improve alignment, but optimizing for axioms can create new, superior voting rules.

Conclusion: The study bridges AI and voting theory, offering rigorous insights into bias and value-alignment in AI and expanding the exploration of voting rules.

Abstract: Can neural networks be applied in voting theory, while satisfying the need for transparency in collective decisions? We propose axiomatic deep voting: a framework to build and evaluate neural networks that aggregate preferences, using the well-established axiomatic method of voting theory. Our findings are: (1) Neural networks, despite being highly accurate, often fail to align with the core axioms of voting rules, revealing a disconnect between mimicking outcomes and reasoning. (2) Training with axiom-specific data does not enhance alignment with those axioms. (3) By solely optimizing axiom satisfaction, neural networks can synthesize new voting rules that often surpass and substantially differ from existing ones. This offers insights for both fields: For AI, important concepts like bias and value-alignment are studied in a mathematically rigorous way; for voting theory, new areas of the space of voting rules are explored.

[588] Graph-Powered Defense: Controller Area Network Intrusion Detection for Unmanned Aerial Vehicles

Reek Majumder, Gurcan Comert, David Werth, Adrian Gale, Mashrur Chowdhury, M Sabbir Salek

Main category: cs.AI

TL;DR: The paper proposes a graph-based intrusion detection system (IDS) for UAVs using the UAVCAN protocol, achieving high accuracy with inductive models like GATs and GraphSAGE.

DetailsMotivation: UAVs are vulnerable to cyberattacks on the CAN bus, necessitating robust security solutions.

Method: Decode CAN messages via UAVCAN, transform them into graphs, and apply graph-based ML models (GCNNs, GATs, GraphSAGE, transformers).

Result: Inductive models outperform transductive ones and baseline LSTM, offering protocol-independent intrusion detection.

Conclusion: Graph-based IDS provides a generic, robust solution for UAV CAN bus security.

Abstract: The network of services, including delivery, farming, and environmental monitoring, has experienced exponential expansion in the past decade with Unmanned Aerial Vehicles (UAVs). Yet, UAVs are not robust enough against cyberattacks, especially on the Controller Area Network (CAN) bus. The CAN bus is a general-purpose vehicle-bus standard to enable microcontrollers and in-vehicle computers to interact, primarily connecting different Electronic Control Units (ECUs). In this study, we focus on addressing some of the most critical security weaknesses in UAVs by developing a novel graph-based intrusion detection system (IDS) leveraging the Uncomplicated Application-level Vehicular Communication and Networking (UAVCAN) protocol. First, we decode CAN messages based on the UAVCAN protocol specification; second, we present a comprehensive method of transforming tabular UAVCAN messages into graph structures. Lastly, we apply various graph-based machine learning models for detecting cyberattacks on the CAN bus, including graph convolutional neural networks (GCNNs), graph attention networks (GATs), Graph Sample and Aggregate Networks (GraphSAGE), and graph structure-based transformers. Our findings show that inductive models such as GATs, GraphSAGE, and graph-based transformers can achieve competitive and even better accuracy than transductive models like GCNNs in detecting various types of intrusions, with minimal information on the protocol specification, thus providing a generic, robust solution for CAN bus security in UAVs. We also compared our results with a baseline single-layer Long Short-Term Memory (LSTM) model and found that all our graph-based models perform better without using any decoded features based on the UAVCAN protocol, highlighting higher detection performance with protocol-independent capability.
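
An assumed sketch of the pipeline: decoded UAVCAN messages become graph nodes, consecutive messages are linked by edges, and an inductive GraphSAGE model classifies each message as benign or attack. The paper's feature set and graph construction are richer than this toy version (built with PyTorch Geometric).

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

def messages_to_graph(features: torch.Tensor) -> Data:
    """Nodes are messages; each message is linked to its successor."""
    n = features.size(0)
    src = torch.arange(n - 1)
    edge_index = torch.stack([torch.cat([src, src + 1]),
                              torch.cat([src + 1, src])])   # undirected
    return Data(x=features, edge_index=edge_index)

class SageIDS(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64, classes: int = 2):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, classes)

    def forward(self, data: Data) -> torch.Tensor:
        h = torch.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)               # per-message logits

graph = messages_to_graph(torch.randn(100, 8))   # 100 messages, 8 features each
logits = SageIDS(in_dim=8)(graph)
```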

[589] A Research Agenda for Usability and Generalisation in Reinforcement Learning

Dennis J. N. J. Soemers, Spyridon Samothrakis, Kurt Driessens, Mark H. M. Winands

Main category: cs.AI

TL;DR: Advocates for user-friendly description languages in RL to improve usability and generalization, enabling non-experts to describe problems and algorithms to generalize effectively.

DetailsMotivation: Current RL practices require engineering expertise for both algorithm development and deployment, limiting accessibility and generalization.

Method: Proposes a research agenda centered on user-friendly description languages for problem representation.

Result: Aims to enable non-experts to describe problems and improve algorithm generalization.

Conclusion: Calls for standardized, accessible problem descriptions in RL to enhance usability and generalization.

Abstract: It is common practice in reinforcement learning (RL) research to train and deploy agents in bespoke simulators, typically implemented by engineers directly in general-purpose programming languages or hardware acceleration frameworks such as CUDA or JAX. This means that programming and engineering expertise is not only required to develop RL algorithms, but is also required to use already developed algorithms for novel problems. The latter poses a problem in terms of the usability of RL, in particular for private individuals and small organisations without substantial engineering expertise. We also perceive this as a challenge for effective generalisation in RL, in the sense that there is no standard, shared formalism in which different problems are represented. As we typically have no consistent representation through which to provide information about any novel problem to an agent, our agents also cannot instantly or rapidly generalise to novel problems. In this position paper, we advocate for a research agenda centred around the use of user-friendly description languages for describing problems, such that (i) users with little to no engineering expertise can formally describe the problems they would like to be tackled by RL algorithms, and (ii) algorithms can leverage problem descriptions to effectively generalise among all problems describable in the language of choice.

[590] Aligning Instruction Tuning with Pre-training

Yiming Liang, Tianyu Zheng, Xinrun Du, Ge Zhang, Jiaheng Liu, Xingwei Qu, Wenqiang Zu, Xingrun Xing, Chujie Zheng, Lei Ma, Guoyin Wang, Zhaoxiang Zhang, Wenhao Huang, Xiang Yue, Jiajun Zhang

Main category: cs.AI

TL;DR: AITP aligns instruction tuning with pre-training by rewriting underrepresented data into instruction-response pairs, improving LLM performance.

DetailsMotivation: Current instruction-tuning datasets are narrow and misaligned with pre-training, limiting LLM generalization.

Method: AITP identifies coverage gaps in datasets and rewrites underrepresented pre-training data into instruction-response pairs.

Result: Evaluations show consistent performance improvements on three LLMs across eight benchmarks.

Conclusion: Aligning instruction tuning with pre-training distributions enhances LLM potential.

Abstract: Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose Aligning Instruction Tuning with Pre-training (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.

[591] Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi

Main category: cs.AI

TL;DR: Meta-Reasoner improves LLM performance by optimizing reasoning strategies during inference, reducing time and errors.

DetailsMotivation: Address LLMs' inefficiency in complex tasks due to wasted computation and lack of progress during inference.

Method: Decouples strategy generation from stepwise reasoning using a lightweight progress report and CMABs for dynamic strategy adjustment.

Result: Improves performance by 9-12% and reduces inference time by 28-35% on tasks like math and science.

Conclusion: Meta-Reasoner is versatile and effective for diverse reasoning-intensive tasks.

Abstract: Large Language Models (LLMs) struggle with high computational time and error propagation during inference time, especially for complex tasks like math, puzzles, or coding requiring multi-step thinking. While existing reasoning models with chain-of-thoughts (CoT) can enable LLMs to do step-wise analysis and reflection, they often face the issue of wasting computation on less productive solutions and fail to make progress during inference time. In this paper, we propose Meta-Reasoner, a new framework that enables LLMs to “think about how to think”, i.e., optimize the inference compute by adjusting strategies on how to reason during inference time. Inspired by dual-process theory, our method decouples high-level strategy generation (e.g., backtracking, switching approaches, or restarting) from stepwise CoT generation via a lightweight progress report. The strategy module considers only a summarized version of the previous CoTs when proposing new strategies. We employ contextual multi-armed bandits (CMABs) for this module to iteratively evaluate previous reasoning states and dynamically adjust the strategy, keeping reasoning from getting stuck in less productive paths during inference. Evaluations on math problems (e.g., Game-of-24, TheoremQA) and scientific problems (e.g., SciBench) demonstrate that our method improves performance by 9-12% over previous SOTA methods while reducing inference time by 28-35%. This approach also generalizes to other domains like creative writing, demonstrating its versatility for diverse reasoning-intensive problems using LLMs.
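
The strategy module is described as a contextual multi-armed bandit over summarized progress reports. A generic LinUCB agent (Li et al. 2010) conveys the mechanics, under the assumption that the progress report has been embedded into a fixed-size context vector; this is not the authors' implementation.

```python
import numpy as np

class LinUCB:
    def __init__(self, arms, dim, alpha=1.0):
        self.arms, self.alpha = list(arms), alpha
        self.A = {a: np.eye(dim) for a in self.arms}   # per-arm design matrix
        self.b = {a: np.zeros(dim) for a in self.arms}

    def select(self, x: np.ndarray) -> str:
        def ucb(a):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(self.arms, key=ucb)

    def update(self, arm: str, x: np.ndarray, reward: float):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

strategies = ["continue", "backtrack", "switch-approach", "restart"]
bandit = LinUCB(strategies, dim=16)
# context = embed(progress_report); arm = bandit.select(context)
# ... run the chosen strategy, observe progress, then bandit.update(arm, context, reward)
```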

[592] Reviewing Clinical Knowledge in Medical Large Language Models: Training and Beyond

Qiyuan Li, Haijiang Liu, Caicai Guo, Chao Gao, Deyu Chen, Meng Wang, Feng Gao, Frank van Harmelen, Jinguang Gu

Main category: cs.AI

TL;DR: The paper reviews efforts to integrate clinical knowledge into LLMs using training, knowledge graphs, and retrieval-augmented generation, highlighting diverse approaches and applications in academia and industry.

DetailsMotivation: To address the need for accurate medical knowledge and traceable decision-making in LLMs used for medical tasks like diagnostics and treatment recommendations.

Method: Review of initiatives embedding clinical knowledge into LLMs via specialized datasets, knowledge graphs, and retrieval-augmented generation, along with evaluation of implementations and applications.

Result: Identifies diverse approaches for integrating clinical knowledge into LLMs, assesses academic-industry disparities, and presents evaluation systems and challenges.

Conclusion: The review emphasizes showcasing diverse approaches rather than completeness, reflecting real-world practices and highlighting future challenges.

Abstract: The large-scale development of large language models (LLMs) in medical contexts, such as diagnostic assistance and treatment recommendations, necessitates that these models possess accurate medical knowledge and deliver traceable decision-making processes. Clinical knowledge, encompassing the insights gained from research on the causes, prognosis, diagnosis, and treatment of diseases, has been extensively examined within real-world medical practices. Recently, there has been a notable increase in research efforts aimed at integrating this type of knowledge into LLMs, encompassing not only traditional text and multimodal data integration but also technologies such as knowledge graphs (KGs) and retrieval-augmented generation (RAG). In this paper, we review the various initiatives to embed clinical knowledge into training-based, KG-supported, and RAG-assisted LLMs. We begin by gathering reliable knowledge sources from the medical domain, including databases and datasets. Next, we evaluate implementations for integrating clinical knowledge through specialized datasets and collaborations with external knowledge sources such as KGs and relevant documentation. Furthermore, we discuss the applications of the developed medical LLMs in the industrial sector to assess the disparity between models developed in academic settings and those in industry. We conclude the survey by presenting evaluation systems applicable to relevant tasks and identifying potential challenges facing this field. In this review, we do not aim for completeness, since any ostensibly complete review would soon be outdated. Our goal is to illustrate diversity by selecting representative and accessible items from current research and industry practices, reflecting real-world situations rather than claiming completeness. Thus, we emphasize showcasing diverse approaches.

[593] Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires

Kyle Gao, Dening Lu, Liangzhi Li, Nan Chen, Hongjie He, Linlin Xu, Jonathan Li

Main category: cs.AI

TL;DR: The paper presents a multi-agent large language model system to analyze air quality during the 2025 Los Angeles wildfires, demonstrating its utility for data-driven policy recommendations.

DetailsMotivation: The study aims to leverage advanced large language models and cloud-mapping integration to address the challenges of analyzing air quality data during large-scale wildfires, building on prior work with the Digital Twin Building.

Method: A multi-agent system (Instructor and Worker agents) is used. The Instructor retrieves data and generates prompts for Workers, who analyze and summarize data, feeding back to the Instructor for final analysis.

Result: The system successfully analyzed air quality data during the wildfires and provided health recommendations, showcasing its potential for policy-making.

Conclusion: The multi-agent LLM framework proves effective for large-scale data analysis and policy recommendations in disaster scenarios like wildfires.

Abstract: The Los Angeles wildfires of January 2025 caused more than 250 billion dollars in damage and lasted for nearly an entire month before containment. Following our previous work, the Digital Twin Building, we modify and leverage the multi-agent large language model framework as well as the cloud-mapping integration to study the air quality during the Los Angeles wildfires. Recent advances in large language models have allowed for out-of-the-box automated large-scale data analysis. We use a multi-agent large language model system comprising an Instructor agent and Worker agents. Upon receiving the users’ instructions, the Instructor agent retrieves the data from the cloud platform and produces instruction prompts for the Worker agents. The Worker agents then analyze the data and provide summaries. The summaries are finally input back into the Instructor agent, which then provides the final data analysis. We test this system’s capability for data-based policy recommendation by assessing our Instructor-Worker LLM system’s health recommendations based on air quality during the Los Angeles wildfires.
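
The bare-bones shape of that loop; `call_llm` and `fetch_air_quality` are hypothetical placeholders for the LLM client and the cloud-mapping platform, and the real prompts and data schema are the paper's.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def fetch_air_quality(region: str) -> list[str]:
    raise NotImplementedError("plug in the cloud platform query here")

def instructor_worker(user_request: str, region: str, n_workers: int = 4) -> str:
    records = fetch_air_quality(region)                  # Instructor retrieves data
    shards = [records[i::n_workers] for i in range(n_workers)]
    summaries = [call_llm(f"Summarize these air-quality records:\n{shard}")
                 for shard in shards]                    # Workers analyze shards
    return call_llm("Given these summaries, answer the request with "
                    f"health recommendations.\nRequest: {user_request}\n"
                    + "\n".join(summaries))              # Instructor's final analysis
```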

[594] A Planning Compilation to Reason about Goal Achievement at Planning Time

Alberto Pozanco, Marianela Morales, Daniel Borrajo, Manuela Veloso

Main category: cs.AI

TL;DR: A method to identify permanent goal-achieving actions during planning by extending tasks with commit actions, without added overhead.

DetailsMotivation: Identifying specific actions that permanently achieve goals can benefit planning applications, traditionally done post-search.

Method: Proposes a compilation extending planning tasks with commit actions to enforce goal persistence during planning.

Result: Reformulated tasks show no additional overhead in optimal or suboptimal planning, aiding downstream tasks.

Conclusion: The method effectively identifies permanent goal achievement during planning without performance cost.

Abstract: Identifying the specific actions that achieve goals when solving a planning task might be beneficial for various planning applications. Traditionally, this identification occurs post-search, as some actions may temporarily achieve goals that are later undone and re-achieved by other actions. In this paper, we propose a compilation that extends the original planning task with commit actions that enforce the persistence of specific goals once achieved, allowing planners to identify permanent goal achievement during planning. Experimental results indicate that solving the reformulated tasks does not incur any additional overhead in either optimal or suboptimal planning, while providing useful information for some downstream tasks.

[595] Reasoning Capabilities of Large Language Models on Dynamic Tasks

Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova

Main category: cs.AI

TL;DR: Larger language models outperform smaller ones in dynamic tasks, but strategic prompting can reduce the gap. Advanced techniques benefit smaller models more, but introduce instability. True emergent reasoning is lacking, revealing fundamental limitations in planning and spatial coordination.

DetailsMotivation: To assess the self-learning capabilities of large language models in dynamic environments and evaluate the effectiveness of different prompting strategies.

Method: Evaluated three prompting strategies (self-reflection, heuristic mutation, planning) on dynamic tasks using open-source models of varying sizes.

Result: Larger models generally perform better, but strategic prompting helps smaller models. Advanced techniques improve smaller models on complex tasks but introduce variability. No evidence of true emergent reasoning; models struggle with planning and spatial coordination.

Conclusion: Large language models have persistent limitations in reasoning and dynamic tasks, suggesting the need for benchmarks beyond static evaluations to better capture reasoning complexity.

Abstract: Large language models excel on static benchmarks, but their ability as self-learning agents in dynamic environments remains unclear. We evaluate three prompting strategies, self-reflection, heuristic mutation, and planning, across dynamic tasks with open-source models. We find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to substantial performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in areas like planning and spatial coordination, suggesting that large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while methods like chain-of-thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

[596] Identification of Probabilities of Causation: A Complete Characterization

Xin Shu, Shuai Wang, Ang Li

Main category: cs.AI

TL;DR: The paper resolves the long-standing gap in characterizing probabilities of causation for multi-valued treatments and outcomes, providing tight bounds and practical relevance.

DetailsMotivation: The unresolved theoretical characterization of probabilities of causation with multi-valued treatments and outcomes limits causality-based decision-making.

Method: Proposes a complete set of representative probabilities of causation and derives tight bounds using formal proofs within Structural Causal Models (SCMs).

Result: The derived bounds and representative quantities are proven sufficient to characterize all possible probabilities of causation.

Conclusion: The work fills a foundational gap, enhancing causality-based decision-making with practical applications demonstrated through examples.

Abstract: Probabilities of causation are fundamental to modern decision-making. Pearl first introduced three binary probabilities of causation, and Tian and Pearl later derived tight bounds for them using Balke’s linear programming. The theoretical characterization of probabilities of causation with multi-valued treatments and outcomes has remained unresolved for decades, limiting the scope of causality-based decision-making. In this paper, we resolve this foundational gap by proposing a complete set of representative probabilities of causation and proving that they are sufficient to characterize all possible probabilities of causation within the framework of Structural Causal Models (SCMs). We then formally derive tight bounds for these representative quantities using formal mathematical proofs. Finally, we demonstrate the practical relevance of our results through illustrative toy examples.
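
For orientation, the binary quantities the paper generalizes, with Pearl's definitions and the classical tight bound on PNS from experimental distributions (a standard result, stated here for the binary case only):

```latex
\begin{align*}
\mathrm{PN}  &= P(y'_{x'} \mid x, y)  && \text{probability of necessity}\\
\mathrm{PS}  &= P(y_{x} \mid x', y')  && \text{probability of sufficiency}\\
\mathrm{PNS} &= P(y_{x},\, y'_{x'})   && \text{necessity and sufficiency}\\
\max\{0,\; P(y_x) - P(y_{x'})\} &\le \mathrm{PNS} \le \min\{P(y_x),\; P(y'_{x'})\}
\end{align*}
```

The paper's contribution is the analogous complete characterization and tight bounds when treatment and outcome take arbitrarily many values.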

[597] SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs’ Mathematical Problem Solving

Yujie Hou, Ting Zhang, Mei Wang, Xuetao Ma, Hua Huang

Main category: cs.AI

TL;DR: SMART is a framework for evaluating LLMs’ problem-solving across four cognitive dimensions, revealing weaknesses and introducing the All-Pass Score.

DetailsMotivation: Address concerns about LLMs' reasoning vs. pattern recognition by assessing the entire problem-solving process.

Method: Introduce SMART, a framework decomposing problem-solving into Understanding, Reasoning, Arithmetic, and Reflection & Refinement, evaluated via SMART-Bench.

Result: Applied to 21 LLMs, SMART uncovered discrepancies in abilities across dimensions, highlighting weaknesses.

Conclusion: SMART provides interpretable analysis and motivates the All-Pass Score for better assessment of LLMs’ problem-solving.

Abstract: Large Language Models (LLMs) have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine reasoning or superficial pattern recognition. Common evaluation methods, which focus on either the final answer or the reasoning process, fail to assess the entire problem-solving procedure. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework, together with its corresponding benchmark, SMART-Bench. SMART decomposes the entire problem-solving process into four distinct cognitive dimensions: Understanding, Reasoning, Arithmetic, and Reflection & Refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings reveal genuine weaknesses in current LLMs and motivate a new metric, the All-Pass Score, to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.

[598] Reinforcement Learning for Hybrid Charging Stations Planning and Operation Considering Fixed and Mobile Chargers

Yanchen Zhu, Honghui Zou, Chufan Liu, Yuyu Luo, Yuankai Wu, Yuxuan Liang

Main category: cs.AI

TL;DR: The paper proposes a hybrid charging infrastructure combining fixed and mobile chargers, optimized via deep reinforcement learning, to improve coverage and reduce waiting times.

DetailsMotivation: Address inefficiencies in fixed-location charging stations and leverage mobile chargers' flexibility to meet fluctuating demand in urban areas.

Method: Formulates the HCSPO problem, uses MPC for demand prediction, and solves it with deep reinforcement learning enhanced by heuristic scheduling.

Result: Achieves up to 244.4% increase in coverage and reduces waiting times by up to 79.8% compared to existing solutions.

Conclusion: Hybrid infrastructure with dynamic optimization significantly enhances charging availability and user convenience.

Abstract: The success of vehicle electrification relies on efficient and adaptable charging infrastructure. Fixed-location charging stations often suffer from underutilization or congestion due to fluctuating demand, while mobile chargers offer flexibility by relocating as needed. This paper studies the optimal planning and operation of hybrid charging infrastructures that combine both fixed and mobile chargers within urban road networks. We formulate the Hybrid Charging Station Planning and Operation (HCSPO) problem, jointly optimizing the placement of fixed stations and the scheduling of mobile chargers. A charging demand prediction model based on Model Predictive Control (MPC) supports dynamic decision-making. To solve the HCSPO problem, we propose a deep reinforcement learning approach enhanced with heuristic scheduling. Experiments on real-world urban scenarios show that our method improves infrastructure availability, achieving up to a 244.4% increase in coverage, and reduces user inconvenience with waiting times up to 79.8% shorter, compared to existing solutions.

[599] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li

Main category: cs.AI

TL;DR: UI-AGILE enhances GUI agents with improved training (continuous reward, ‘Simple Thinking’ reward, cropping-based resampling) and inference (decomposed grounding) methods, achieving state-of-the-art performance on benchmarks.

DetailsMotivation: Addressing the limitations of existing GUI agent training and inference techniques, such as ineffective rewards and visual noise, to improve grounding accuracy and general agent capabilities.

Method: Proposes training enhancements (continuous reward, ‘Simple Thinking’ reward, cropping-based resampling) and inference improvements (decomposed grounding) for GUI agents.

Result: Achieves state-of-the-art grounding performance on benchmarks (ScreenSpot-Pro and ScreenSpot-v2) with a 23% accuracy improvement over baselines.

Conclusion: UI-AGILE effectively addresses key challenges in GUI agent training and inference, demonstrating superior performance and generalizability.

Abstract: The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from difficult trade-offs in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE for enhancing GUI agents at both training and inference. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a “Simple Thinking” reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art grounding performance on the two benchmarks ScreenSpot-Pro and ScreenSpot-v2, while also exhibiting strong general agent capabilities. For instance, using both our training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. We provide the code at https://github.com/KDEGroup/UI-AGILE.
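
One plausible reading of the continuous grounding reward, offered as an assumption rather than the paper's exact form: reward decays smoothly with the distance between the predicted click point and the target element's center, instead of a binary hit/miss.

```python
import math

def grounding_reward(pred_xy, box, sigma_frac=0.5):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sigma = sigma_frac * max(x2 - x1, y2 - y1)       # scale by element size
    d2 = (pred_xy[0] - cx) ** 2 + (pred_xy[1] - cy) ** 2
    reward = math.exp(-d2 / (2 * sigma ** 2))        # 1 at center, -> 0 far away
    inside = x1 <= pred_xy[0] <= x2 and y1 <= pred_xy[1] <= y2
    return reward if inside else 0.5 * reward        # assumed off-element penalty

print(grounding_reward((105, 52), (90, 40, 120, 60)))  # near center -> ~0.99
```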

[600] TextQuests: How Good are LLMs at Text-Based Video Games?

Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks

Main category: cs.AI

TL;DR: TextQuests is a benchmark for evaluating AI agents’ long-context reasoning in exploratory environments, using interactive fiction games to test self-contained problem-solving without external tools.

DetailsMotivation: Existing benchmarks lack the ability to assess AI agents' autonomous, long-horizon reasoning in exploratory settings.

Method: TextQuests uses Infocom interactive fiction games to create a benchmark that requires sustained, self-directed problem-solving over long contexts.

Result: The benchmark evaluates agents’ intrinsic reasoning capabilities in trial-and-error learning and long-horizon tasks.

Conclusion: TextQuests aims to advance AI agents’ robust reasoning in complex, exploratory environments.

Abstract: Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.

[601] H2C: Hippocampal Circuit-inspired Continual Learning for Lifelong Trajectory Prediction in Autonomous Driving

Yunlong Lin, Zirui Li, Guodong Du, Xiaocong Zhao, Cheng Gong, Xinwei Wang, Chao Lu, Jianwei Gong

Main category: cs.AI

TL;DR: The paper introduces H2C, a hippocampal circuit-inspired continual learning method for trajectory prediction in autonomous driving, reducing catastrophic forgetting by 22.71%.

DetailsMotivation: DL-based trajectory prediction methods suffer from catastrophic forgetting, limiting real-world applicability in dynamic scenarios. Neuroscience-inspired memory replay offers a solution.

Method: H2C uses two strategies for selective sample recall (diversity maximization and equiprobable sampling) and updates via a memory replay loss function.

Result: H2C reduces catastrophic forgetting by 22.71% on average in task-free experiments on the INTERACTION dataset.

Conclusion: H2C effectively retains prior knowledge while learning new data, improving adaptability in varying autonomous driving scenarios.

Abstract: Deep learning (DL) has shown state-of-the-art performance in trajectory prediction, which is critical to safe navigation in autonomous driving (AD). However, most DL-based methods suffer from catastrophic forgetting, where adapting to a new distribution may cause significant performance degradation in previously learned ones. Such inability to retain learned knowledge limits their applicability in the real world, where AD systems need to operate across varying scenarios with dynamic distributions. As revealed by neuroscience, the hippocampal circuit plays a crucial role in memory replay, effectively reconstructing learned knowledge based on limited resources. Inspired by this, we propose a hippocampal circuit-inspired continual learning method (H2C) for trajectory prediction across varying scenarios. H2C retains prior knowledge by selectively recalling a small subset of learned samples. First, two complementary strategies are developed to select the subset to represent learned knowledge. Specifically, one strategy maximizes inter-sample diversity to represent the distinctive knowledge, and the other estimates the overall knowledge by equiprobable sampling. Then, H2C updates via a memory replay loss function calculated by these selected samples to retain knowledge while learning new data. Experiments based on various scenarios from the INTERACTION dataset are designed to evaluate H2C. Experimental results show that H2C reduces catastrophic forgetting of DL baselines by 22.71% on average in a task-free manner, without relying on manually informed distributional shifts. The implementation is available at https://github.com/BIT-Jack/H2C-lifelong.
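
The generic form of the memory-replay update that H2C builds on: each gradient step combines the loss on the new batch with a loss on samples recalled from the memory buffer. `model`, `criterion`, and the buffer's selection strategy (diversity maximization and equiprobable sampling in the paper) are left abstract here.

```python
import torch

def replay_step(model, criterion, optimizer, new_batch, memory_batch, lam=1.0):
    x_new, y_new = new_batch
    x_mem, y_mem = memory_batch          # recalled subset of learned samples
    loss = criterion(model(x_new), y_new) + lam * criterion(model(x_mem), y_mem)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```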

[602] Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play

Lucia Cipolina-Kun, Marianna Nezhurina, Jenia Jitsev

Main category: cs.AI

TL;DR: The Game Reasoning Arena library evaluates LLMs’ decision-making in strategic games, comparing them to other agents, and supports distributed execution.

DetailsMotivation: To empirically assess LLMs' reasoning and game-theoretic behavior through systematic comparisons in diverse game scenarios.

Method: The framework wraps board and matrix games, integrates API and local model access, and uses distributed execution via Ray.

Result: Provides a structured way to compare LLM-based agents with random, heuristic, and reinforcement learning agents.

Conclusion: The library advances the evaluation of LLMs’ strategic decision-making and contributes to understanding their game-theoretic behavior.

Abstract: The Game Reasoning Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in the Google OpenSpiel library. The framework enables systematic comparisons between LLM-based agents and other agents (random, heuristic, reinforcement learning agents, etc.) in various game scenarios by wrapping multiple board and matrix games and supporting different agent types. It integrates API access to models via liteLLM, local model deployment via vLLM, and offers distributed execution through Ray. This paper summarises the library's structure, key characteristics, and motivation, highlighting how it contributes to the empirical evaluation of LLM reasoning and game-theoretic behaviour.
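
A minimal example of the evaluation pattern the library supports: stepping an OpenSpiel game while an agent picks legal actions. The LLM policy is a stand-in here (the framework routes prompts through liteLLM or vLLM); only the `pyspiel` calls are the real API.

```python
import random
import pyspiel

def llm_choose_action(state) -> int:
    # Placeholder: a real agent would prompt an LLM with str(state) and the
    # legal action list, then parse the reply into an action id.
    return random.choice(state.legal_actions())

game = pyspiel.load_game("tic_tac_toe")
state = game.new_initial_state()
while not state.is_terminal():
    state.apply_action(llm_choose_action(state))
print(state.returns())   # per-player outcomes, e.g. [1.0, -1.0]
```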

[603] MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media

Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian

Main category: cs.AI

TL;DR: MV-Debate is a multi-view agent debate framework for detecting harmful content in social media, outperforming existing methods by leveraging diverse interpretive perspectives and dynamic reflection.

DetailsMotivation: Social media's multimodal nature makes detecting harmful intent (e.g., sarcasm, hate speech) challenging due to cross-modal contradictions and cultural shifts.

Method: MV-Debate uses four debate agents (a surface analyst, a deep reasoner, a modality-contrast analyst, and a social contextualist) for iterative analysis under a reflection-gain criterion.

Result: Experiments show MV-Debate outperforms single-model and multi-agent baselines on three benchmark datasets.

Conclusion: Multi-agent debate frameworks like MV-Debate enhance reliable harmful content detection in safety-critical online environments.

Abstract: Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents, a surface analyst, a deep reasoner, a modality-contrast analyst, and a social contextualist, to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts.
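
The dynamic reflection gating amounts to a control loop: keep debating while another round of reflection is still expected to change the answer. A minimal sketch of that control flow, with the agents and the gain estimator as hypothetical stubs (the paper's actual criterion is not reproduced here):

```python
def debate(agents, content, gain_fn, min_gain=0.05, max_rounds=4):
    """Run debate rounds until the estimated reflection gain drops below
    a threshold (the 'reflection-gain criterion'). `agents` are callables
    (content, history) -> view; `gain_fn` scores how much the latest
    round changed relative to the previous one. Both are stubs here."""
    history = []
    for _ in range(max_rounds):
        views = [agent(content, history) for agent in agents]
        if history and gain_fn(history[-1], views) < min_gain:
            break  # further reflection unlikely to change the verdict
        history.append(views)
    return history[-1]
```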

[604] Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution

Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, Lizi Liao

Main category: cs.AI

TL;DR: The paper advocates for confidence-driven, risk-aware LLM-as-a-Judge systems, addressing overconfidence in current models and introducing TH-Score and LLM-as-a-Fuser for improved reliability and accuracy.

DetailsMotivation: Existing LLM-as-a-Judge systems focus on accuracy but lack well-calibrated confidence, which is crucial for trustworthy and adaptive evaluation.

Method: The authors introduce TH-Score to measure confidence-accuracy alignment and propose LLM-as-a-Fuser, an ensemble framework for risk-aware evaluation.

Result: Experiments show the approach improves calibration, enabling adaptive evaluation with superior reliability and accuracy over baselines.

Conclusion: The work highlights the importance of confidence calibration in LLM-as-a-Judge systems and demonstrates the effectiveness of the proposed methods.

Abstract: Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing the necessity of well-calibrated confidence for trustworthy and adaptive evaluation. We systematically identify the Overconfidence Phenomenon in current LLM-as-a-Judge systems, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce TH-Score, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose LLM-as-a-Fuser, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.
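
TH-Score is the paper's own metric and its formula is not reproduced in this summary; as a reference point for what "confidence-accuracy alignment" measures, the standard expected calibration error (ECE) compares average confidence to empirical accuracy within confidence bins. A minimal sketch:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard ECE: occupancy-weighted mean |accuracy - confidence|
    over equal-width confidence bins. An overconfident judge has bin
    confidence well above bin accuracy, inflating this value."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```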

[605] Symmetry breaking for inductive logic programming

Andrew Cropper, David M. Cerna, Matti Järvisalo

Main category: cs.AI

TL;DR: A method to break symmetries in hypothesis spaces for inductive logic programming, reducing solving times significantly.

DetailsMotivation: Addressing the challenge of searching vast and logically equivalent hypothesis spaces in inductive logic programming.

Method: Introducing a symmetry-breaking technique implemented in answer set programming.

Result: Experiments show solving times reduced from over an hour to just 17 seconds in domains like visual reasoning and game playing.

Conclusion: The proposed method effectively improves efficiency in inductive logic programming by tackling symmetry issues.

Abstract: The goal of inductive logic programming is to search for a hypothesis that generalises training data and background knowledge. The challenge is searching vast hypothesis spaces, which is exacerbated because many logically equivalent hypotheses exist. To address this challenge, we introduce a method to break symmetries in the hypothesis space. We implement our idea in answer set programming. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce solving times from over an hour to just 17 seconds.
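
The authors implement their symmetry breaking in answer set programming; the underlying idea can be shown in toy Python form. Hypotheses that differ only by literal order (one source of logical equivalence) are mapped to a single canonical representative, so the solver explores one point in the search space instead of many. Real symmetry breaking must also handle variable renamings and other equivalences; this sketch is illustrative only.

```python
def canonical(clause):
    """A clause as a collection of literals: order is irrelevant, so
    sorting yields one canonical representative per equivalence class."""
    return tuple(sorted(clause))

# Two orderings of the same hypothesis collapse to one canonical form:
h1 = [("parent", "X", "Y"), ("parent", "Y", "Z")]
h2 = [("parent", "Y", "Z"), ("parent", "X", "Y")]
assert canonical(h1) == canonical(h2)  # one candidate, not two
```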

cs.SD

[606] AutoMashup: Automatic Music Mashups Creation

Marine Delabaere, Léa Miqueu, Michael Moreno, Gautier Bigois, Hoang Duong, Ella Fernandez, Flavie Manent, Maria Salgado-Herrera, Bastien Pasdeloup, Nicolas Farrugia, Axel Marmoret

Main category: cs.SD

TL;DR: AutoMashup automates mashup creation using source separation, music analysis, and compatibility estimation, revealing limitations of general-purpose audio models.

DetailsMotivation: To automate mashup creation by assessing compatibility between separated stems and evaluating the effectiveness of pretrained audio models.

Method: Uses COCOLA for compatibility assessment and tests CLAP and MERT for zero-shot compatibility estimation.

Result: Mashup compatibility is asymmetric (role-dependent), and current embeddings fail to match COCOLA’s perceptual coherence.

Conclusion: General-purpose audio representations are limited for mashup compatibility estimation.

Abstract: We introduce AutoMashup, a system for automatic mashup creation based on source separation, music analysis, and compatibility estimation. We propose using COCOLA to assess compatibility between separated stems and investigate whether general-purpose pretrained audio models (CLAP and MERT) can support zero-shot estimation of track pair compatibility. Our results show that mashup compatibility is asymmetric – it depends on the role assigned to each track (vocals or accompaniment) – and that current embeddings fail to reproduce the perceptual coherence measured by COCOLA. These findings underline the limitations of general-purpose audio representations for compatibility estimation in mashup creation.

[607] Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody

Jinsung Yoon, Wooyeol Jeong, Jio Gim, Young-Joo Suh

Main category: cs.SD

TL;DR: Maestro-EVC is a controllable emotional voice conversion framework that disentangles content, speaker identity, and emotion for high-quality, expressive speech synthesis.

DetailsMotivation: Existing EVC methods struggle with disentangling attributes and modeling fine-grained emotional expressions like temporal dynamics.

Method: Maestro-EVC uses separate references for each attribute, introduces temporal emotion representation, and explicit prosody modeling with augmentation.

Result: The framework achieves high-quality, controllable, and emotionally expressive speech synthesis.

Conclusion: Maestro-EVC effectively addresses limitations in existing EVC methods, enabling robust and expressive emotional voice conversion.

Abstract: Emotional voice conversion (EVC) aims to modify the emotional style of speech while preserving its linguistic content. In practical EVC, controllability, the ability to independently control speaker identity and emotional style using distinct references, is crucial. However, existing methods often struggle to fully disentangle these attributes and lack the ability to model fine-grained emotional expressions such as temporal dynamics. We propose Maestro-EVC, a controllable EVC framework that enables independent control of content, speaker identity, and emotion by effectively disentangling each attribute from separate references. We further introduce a temporal emotion representation and an explicit prosody modeling with prosody augmentation to robustly capture and transfer the temporal dynamics of the target emotion, even under prosody-mismatched conditions. Experimental results confirm that Maestro-EVC achieves high-quality, controllable, and emotionally expressive speech synthesis.

[608] Whisfusion: Parallel ASR Decoding via a Diffusion Transformer

Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Nam-Joon Kim, Jangchan Kim, Hyun Gon Ryu, Hyuk-Jae Lee

Main category: cs.SD

TL;DR: Whisfusion combines a Whisper encoder with a text diffusion decoder to enable parallel ASR decoding, reducing latency for long-form audio while maintaining accuracy.

DetailsMotivation: Address the latency bottleneck in AR decoders and context limitations in NAR methods for real-time ASR applications.

Method: Fuses a pre-trained Whisper encoder with a text diffusion decoder, uses a cross-attention adapter for modality bridging, and employs a batch-parallel decoding strategy.

Result: Achieves lower WER (8.3%) than Whisper-tiny (9.7%) and is up to 2.6x faster for long utterances.

Conclusion: Whisfusion provides an efficient solution for long-form ASR, balancing speed and accuracy.

Abstract: Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding strategy that improves accuracy by increasing the number of candidates with minimal impact on speed. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny (8.3% vs. 9.7%), and offers comparable latency on short audio. For longer utterances (>20s), it is up to 2.6x faster than the AR baseline, establishing a new, efficient operating point for long-form ASR. The implementation and training scripts are available at https://github.com/taeyoun811/Whisfusion.

[609] SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization

Beilong Tang, Xiaoxiao Miao, Xin Wang, Ming Li

Main category: cs.SD

TL;DR: SEF-MK anonymizes SSL voice representations using multiple k-means models, improving content preservation but increasing privacy attack risks.

DetailsMotivation: To protect speaker privacy while retaining linguistic and emotional content in voice anonymization.

Method: Proposes SEF-MK, a speaker-embedding-free framework using multiple k-means models trained on speaker subsets for anonymization.

Result: SEF-MK better preserves content for users but increases vulnerability to privacy attacks.

Conclusion: The findings help design anonymization systems balancing content preservation and privacy risks.

Abstract: Voice anonymization protects speaker privacy by concealing identity while preserving linguistic and paralinguistic content. Self-supervised learning (SSL) representations encode linguistic features but preserve speaker traits. We propose a novel speaker-embedding-free framework called SEF-MK. Instead of using a single k-means model trained on the entire dataset, SEF-MK anonymizes SSL representations for each utterance by randomly selecting one of multiple k-means models, each trained on a different subset of speakers. We explore this approach from both attacker and user perspectives. Extensive experiments show that, compared to a single k-means model, SEF-MK with multiple k-means models better preserves linguistic and emotional content from the user’s viewpoint. However, from the attacker’s perspective, utilizing multiple k-means models boosts the effectiveness of privacy attacks. These insights can aid users in designing voice anonymization systems to mitigate attacker threats.
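
The anonymization step is easy to picture as code: train several k-means quantizers, each on SSL features from a different speaker subset, then for every utterance pick one quantizer at random and snap each frame to its nearest centroid. A minimal sketch with scikit-learn, under assumed data structures (a dict of per-speaker SSL feature matrices); not the authors' exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_kmeans_pool(feats_by_speaker, n_models=4, speakers_per_model=20,
                      n_clusters=128, seed=0):
    """Each k-means model sees features from a different speaker subset."""
    rng = np.random.default_rng(seed)
    speakers = list(feats_by_speaker)
    pool = []
    for _ in range(n_models):
        subset = rng.choice(speakers, size=speakers_per_model, replace=False)
        feats = np.concatenate([feats_by_speaker[s] for s in subset])
        pool.append(KMeans(n_clusters=n_clusters).fit(feats))
    return pool

def anonymize_utterance(frames, pool, rng):
    """Pick one quantizer at random and replace each SSL frame with its
    nearest centroid; no speaker embedding is ever computed."""
    km = pool[rng.integers(len(pool))]
    return km.cluster_centers_[km.predict(frames)]
```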

[610] Inversion of Arctic dual-channel sound speed profile based on random airgun signal

Jinbao Weng, Yubo Qi, Yanming Yang, Hongtao Wen, Hongtao Zhou, Benqing Chen, Dewei Xu, Ruichao Xue, Caigao Zeng

Main category: cs.SD

TL;DR: A method for inverting dual-channel sound speed profiles in the Arctic using refracted normal modes is proposed, offering fewer parameters, faster speed, and horizontal variation handling.

DetailsMotivation: To address the unique dual-channel sound speed profiles in the Arctic and improve inversion efficiency and accuracy.

Method: Proposes a dual-parameter representation for sound speed profiles and a dispersion structure extraction method, combining them for inversion. Also handles horizontal variation.

Result: Verified effective in Arctic experiments, outperforming previous methods with fewer parameters, faster speed, and single-hydrophone usability.

Conclusion: The method is cost-effective, easy to deploy, and computationally efficient, solving key inversion challenges.

Abstract: For the unique dual-channel sound speed profiles of the Canadian Basin and the Chukchi Plateau in the Arctic, an inversion method based on the propagation characteristics of refracted normal modes under such profiles is proposed. The method introduces a dual-parameter representation tailored to dual-channel sound speed profiles, together with a technique for extracting the dispersion structure of refracted normal modes; combining the two yields the inversion procedure. To handle the horizontal variation of sound speed profiles common in long-distance acoustic propagation, an extension for inverting horizontally varying dual-channel profiles is also proposed. Finally, this article verifies the effectiveness of the method using the Arctic low-frequency long-range acoustic propagation experiment. Compared with previous sound speed profile inversion methods, the proposed approach requires fewer inversion parameters and runs faster, can be implemented with a single hydrophone passively receiving random airgun signals, and also solves the inversion problem for horizontally varying profiles, giving it significant advantages in cost, ease of deployment, and computation speed.

[611] Acoustic source depth estimation method based on a single hydrophone in Arctic underwater

Jinbao Weng, Yubo Qi, Yanming Yang, Hongtao Wen, Hongtao Zhou, Benqing Chen, Dewei Xu, Ruichao Xue, Caigao Zeng

Main category: cs.SD

TL;DR: The paper explores depth estimation methods for surface sound sources using normal modes and ray theory, proposing a method based on modal frequency limits and verifying its effectiveness with experimental data.

DetailsMotivation: To address the challenge of accurately estimating the depth of sound sources in surface layers and deep Arctic seas by leveraging normal mode and ray theory.

Method: Uses warping transformation to separate modes, matches amplitude and frequency characteristics for depth estimation, and analyzes ray trajectories for deep-sea applications.

Result: Proposes and validates methods for sound source depth estimation, demonstrating their applicability and limitations through experimental data.

Conclusion: The proposed methods based on normal modes and ray theory are effective for depth estimation, with specific techniques tailored for surface and deep-sea scenarios.

Abstract: Based on normal mode and ray theory, this article discusses the characteristics of surface sound sources received at the surface layer, explores depth estimation methods based on normal modes and rays, and proposes a depth estimation method based on the upper limit of modal frequency. Data verification is conducted to discuss the applicability and limitations of the different methods. For the surface refracted normal mode waveguide, modes can be separated through a warping transformation. Based on how normal mode amplitude varies with frequency and mode number, the sound source depth can be estimated by matching amplitude information. Based on the spatial variation of eigenfunctions with frequency, a depth estimation method matching the cutoff frequency of normal modes is proposed. For the deep Arctic sea, the ray arrival structure at the receiver is obtained through the analysis of deep inversion sound ray trajectories, and the sound source depth can be estimated by matching the time differences of ray arrivals. Experimental data are used to verify the sound field patterns and the effectiveness of the depth estimation method.

[612] Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation

Yuanjian Chen, Yang Xiao, Han Yin, Yadong Guan, Xubo Liu

Main category: cs.SD

TL;DR: A new co-training framework (EAD + SED) improves sound event detection in noisy environments by leveraging event appearance detection and multi-task learning.

DetailsMotivation: Existing SED systems degrade in noisy settings, and current LASS models lack explicit event guidance and require complex training.

Method: Proposes event appearance detection (EAD) for counting events, then co-trains EAD and SED with a task-based constraint for consistency.

Result: Outperforms existing methods on DESED and WildDESED datasets, especially in high-noise scenarios.

Conclusion: The framework enhances SED robustness and timestamp detection, offering a simpler yet effective alternative to multi-stage LASS models.

Abstract: Most sound event detection (SED) systems perform well on clean datasets but degrade significantly in noisy environments. Language-queried audio source separation (LASS) models show promise for robust SED by separating target events; however, existing methods require elaborate multi-stage training and lack explicit guidance for target events. To address these challenges, we introduce event appearance detection (EAD), a counting-based approach that counts event occurrences at both the clip and frame levels. Based on EAD, we propose a co-training-based multi-task learning framework for EAD and SED to enhance SED’s performance in noisy environments. Because SED alone struggles to learn the same patterns as EAD, a task-based constraint is designed to improve prediction consistency between SED and EAD. This framework provides more reliable clip-level predictions for LASS models and strengthens timestamp detection capability. Experiments on DESED and WildDESED datasets demonstrate better performance compared to existing methods, with advantages becoming more pronounced at higher noise levels.

[613] Keyword Mamba: Spoken Keyword Spotting with State Space Models

Hanyu Ding, Wenlong Dong, Qirong Mao

Main category: cs.SD

TL;DR: Keyword Mamba, a new KWS architecture using a neural state space model (Mamba), outperforms traditional models like CNNs and Transformers in efficiency and accuracy.

DetailsMotivation: Existing deep learning models (CNNs, RNNs, Transformers) struggle with long-term patterns and efficiency in keyword spotting (KWS).

Method: Proposes Keyword Mamba, applying Mamba (a neural SSM) along the time axis and replacing Transformer self-attention.

Result: Tested on Google Speech Commands datasets, Keyword Mamba achieves high accuracy with fewer parameters and lower computational cost.

Conclusion: Mamba shows strong potential for speech-related tasks, marking its first use in KWS.

Abstract: Keyword spotting (KWS) is an essential task in speech processing. It is widely used in voice assistants and smart devices. Deep learning models like CNNs, RNNs, and Transformers have performed well in KWS. However, they often struggle to handle long-term patterns and stay efficient at the same time. In this work, we present Keyword Mamba, a new architecture for KWS. It uses a neural state space model (SSM) called Mamba. We apply Mamba along the time axis and also explore how it can replace the self-attention part in Transformer models. We test our model on the Google Speech Commands datasets. The results show that Keyword Mamba reaches strong accuracy with fewer parameters and lower computational cost. To our knowledge, this is the first time a state space model has been used for KWS. These results suggest that Mamba has strong potential in speech-related tasks.

[614] A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions

Yiheng Jiang, Tian Biao

Main category: cs.SD

TL;DR: A neural network-based Acoustic Echo Cancellation (AEC) method for mobile scenarios, using data augmentation, progressive learning, and post-processing for improved speech quality and downstream tasks like VAD and ASR.

DetailsMotivation: Address challenges in mobile AEC, including hardware variability, nonlinear distortions, and latency, to enhance speech quality and downstream applications.

Method: Incorporates diverse data augmentation, progressive learning, and a novel post-processing strategy for downstream tasks. Uses a small-footprint model with streaming inference.

Result: Improved Echo Return Loss Enhancement, Perceptual Evaluation of Speech Quality, and significant gains in VAD and ASR performance.

Conclusion: The proposed AEC solution is effective for mobile deployment, enhancing speech quality and downstream task performance.

Abstract: In full-duplex speech interaction systems, effective Acoustic Echo Cancellation (AEC) is crucial for recovering echo-contaminated speech. This paper presents a neural network-based AEC solution to address challenges in mobile scenarios with varying hardware, nonlinear distortions, and long latency. We first incorporate diverse data augmentation strategies to enhance the model’s robustness across various environments. Moreover, progressive learning is employed to incrementally improve AEC effectiveness, resulting in a considerable improvement in speech quality. To further optimize AEC’s downstream applications, we introduce a novel post-processing strategy employing tailored parameters designed specifically for tasks such as Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR), thus enhancing their overall efficacy. Finally, our method employs a small-footprint model with streaming inference, enabling seamless deployment on mobile devices. Empirical results demonstrate the effectiveness of the proposed method in Echo Return Loss Enhancement and Perceptual Evaluation of Speech Quality, alongside significant improvements in both VAD and ASR results.

[615] Exploring Efficient Directional and Distance Cues for Regional Speech Separation

Yiheng Jiang, Haoxu Wang, Yafeng Chen, Gang Qiao, Biao Tian

Main category: cs.SD

TL;DR: A neural network-based method for regional speech separation using spatial cues and distance-based features achieves state-of-the-art performance.

DetailsMotivation: To improve speech separation by leveraging spatial and distance cues for better source discrimination in real-world scenarios.

Method: Uses an improved delay-and-sum technique for directional cues and incorporates direct-to-reverberant ratio for distance-based separation.

Result: Substantial gains in objective metrics and state-of-the-art performance on the CHiME-8 MMCSG dataset.

Conclusion: The method is effective for practical speech separation, especially in real-world conversational settings.

Abstract: In this paper, we introduce a neural network-based method for regional speech separation using a microphone array. This approach leverages novel spatial cues to extract the sound source not only from specified direction but also within defined distance. Specifically, our method employs an improved delay-and-sum technique to obtain directional cues, substantially enhancing the signal from the target direction. We further enhance separation by incorporating the direct-to-reverberant ratio into the input features, enabling the model to better discriminate sources within and beyond a specified distance. Experimental results demonstrate that our proposed method leads to substantial gains across multiple objective metrics. Furthermore, our method achieves state-of-the-art performance on the CHiME-8 MMCSG dataset, which was recorded in real-world conversational scenarios, underscoring its effectiveness for speech separation in practical applications.
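
As a reference point for the directional cue, a basic time-domain delay-and-sum beamformer for a linear array is sketched below (the paper's improved variant is not reproduced): it aligns the channels for a chosen look direction and averages them, boosting signals from that direction.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle_rad, fs, c=343.0):
    """signals: (n_mics, n_samples); mic_positions: positions (m) along a
    linear array. Align channels for a far-field plane wave arriving from
    `angle_rad` using integer-sample delays, then average."""
    delays = mic_positions * np.sin(angle_rad) / c         # seconds per mic
    shifts = np.round((delays - delays.min()) * fs).astype(int)
    n = signals.shape[1]
    out = np.zeros(n)
    for sig, s in zip(signals, shifts):
        out[: n - s] += sig[s:]                            # advance channel by s
    return out / len(signals)
```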

[616] Filling MIDI Velocity using U-Net Image Colorizer

Zhanhong He, David Cooper, Defeng Huang, Roberto Togneri

Main category: cs.SD

TL;DR: The paper proposes using a U-Net architecture, adapted from image colorization, to predict MIDI velocity values, enhancing music expressiveness. It outperforms previous methods on piano datasets.

DetailsMotivation: MIDI files often lack expressive human performance characteristics, particularly in velocity (note loudness), which defaults to flat values. Enhancing this can improve music expressiveness.

Method: The U-Net architecture is adapted for MIDI velocity prediction by treating MIDI data as images. Window attention and a custom loss function address sparsity issues. Experiments are limited to piano data.

Result: The method outperforms previous approaches on the MAESTRO v3 and SMD datasets in both quantitative metrics and qualitative listening tests.

Conclusion: The U-Net-based approach effectively predicts MIDI velocity, improving music expressiveness, though current limitations include dataset restrictions to piano data.

Abstract: Modern music producers commonly use MIDI (Musical Instrument Digital Interface) to store their musical compositions. However, MIDI files created with digital software may lack the expressive characteristics of human performances, essentially leaving the velocity parameter - a control for note loudness - undefined, so it defaults to a flat value. The task of filling MIDI velocity is termed MIDI velocity prediction, which uses regression models to enhance music expressiveness by adjusting only this parameter. In this paper, we introduce the U-Net, a widely adopted architecture in image colorization, to this task. By conceptualizing MIDI data as images, we adopt window attention and develop a custom loss function to address the sparsity of MIDI-converted images. Current dataset availability restricts our experiments to piano data. Evaluated on the MAESTRO v3 and SMD datasets, our proposed method for filling MIDI velocity outperforms previous approaches in both quantitative metrics and qualitative listening tests.
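
Conceptualizing MIDI as an image means rendering it as a piano roll whose pixel intensity is velocity; the U-Net then learns to fill in those intensities. A sketch of that conversion using the pretty_midi library (the paper's exact preprocessing, frame rate, and normalization are not specified here):

```python
import numpy as np
import pretty_midi

def midi_to_velocity_image(path, fs=50):
    """Render a MIDI file as a (128 pitches x T frames) float array whose
    values are velocities scaled to [0, 1] -- the 'image' to colorize.
    A flat-velocity input yields a constant-intensity image."""
    pm = pretty_midi.PrettyMIDI(path)
    n_frames = int(np.ceil(pm.get_end_time() * fs)) + 1
    roll = np.zeros((128, n_frames), dtype=np.float32)
    for inst in pm.instruments:
        for note in inst.notes:
            lo = int(note.start * fs)
            hi = max(int(note.end * fs), lo + 1)   # at least one frame
            roll[note.pitch, lo:hi] = note.velocity / 127.0
    return roll
```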

[617] Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning

Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, Dong Yu

Main category: cs.SD

TL;DR: Audio-Thinker, a reinforcement learning framework, enhances reasoning in large audio language models (LALMs) by introducing adaptive rewards and external evaluation, outperforming existing models.

DetailsMotivation: Existing LALMs lack significant benefits from explicit reasoning for audio question answering and fall short of human-level auditory-language reasoning.

Method: Proposes Audio-Thinker, using adaptive think accuracy rewards, external reward models, and think-based rewards to improve reasoning.

Result: Audio-Thinker outperforms existing reasoning-oriented LALMs in benchmark tasks, showing superior reasoning and generalization.

Conclusion: The framework effectively enhances LALMs’ reasoning capabilities, addressing adaptability, consistency, and effectiveness.

Abstract: Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.

[618] SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis

Vojtěch Staněk, Karel Srna, Anton Firc, Kamil Malinka

Main category: cs.SD

TL;DR: The paper introduces the SCDF dataset to evaluate biases in deepfake speech detection, revealing performance disparities across demographics and synthesizer types.

DetailsMotivation: To address the underexplored aspects of bias and fairness in deepfake speech detection.

Method: Creation of the SCDF dataset with balanced demographic representation and evaluation of state-of-the-art detectors.

Result: Speaker characteristics significantly influence detection performance, showing disparities across sex, language, age, and synthesizer type.

Conclusion: Highlights the need for bias-aware development in deepfake detection systems to align with ethical standards.

Abstract: Despite growing attention to deepfake speech detection, the aspects of bias and fairness remain underexplored in the speech domain. To address this gap, we introduce the Speaker Characteristics Deepfake (SCDF) dataset: a novel, richly annotated resource enabling systematic evaluation of demographic biases in deepfake speech detection. SCDF contains over 237,000 utterances in a balanced representation of both male and female speakers spanning five languages and a wide age range. We evaluate several state-of-the-art detectors and show that speaker characteristics significantly influence detection performance, revealing disparities across sex, language, age, and synthesizer type. These findings highlight the need for bias-aware development and provide a foundation for building non-discriminatory deepfake detection systems aligned with ethical and regulatory standards.

[619] Joint Transcription of Acoustic Guitar Strumming Directions and Chords

Sebastian Murgul, Johannes Schimper, Michael Heizmann

Main category: cs.SD

TL;DR: A deep learning-based approach improves guitar strumming transcription by combining real-world and synthetic datasets, outperforming baseline methods.

DetailsMotivation: The task of transcribing guitar strumming (directions and chords) is underrepresented and challenging due to limited datasets.

Method: A CRNN model is trained on a hybrid dataset (real-world recordings and synthetic audio) to detect strumming events, classify directions, and identify chords.

Result: The hybrid method achieves higher accuracy for strumming detection and chord classification compared to baselines.

Conclusion: Deep learning shows promise for robust guitar strumming transcription, enabling new possibilities for rhythm guitar analysis.

Abstract: Automatic transcription of guitar strumming is an underrepresented and challenging task in Music Information Retrieval (MIR), particularly for extracting both strumming directions and chord progressions from audio signals. While existing methods show promise, their effectiveness is often hindered by limited datasets. In this work, we extend a multimodal approach to guitar strumming transcription by introducing a novel dataset and a deep learning-based transcription model. We collect 90 min of real-world guitar recordings using an ESP32 smartwatch motion sensor and a structured recording protocol, complemented by a synthetic dataset of 4h of labeled strumming audio. A Convolutional Recurrent Neural Network (CRNN) model is trained to detect strumming events, classify their direction, and identify the corresponding chords using only microphone audio. Our evaluation demonstrates significant improvements over baseline onset detection algorithms, with a hybrid method combining synthetic and real-world data achieving the highest accuracy for both strumming action detection and chord classification. These results highlight the potential of deep learning for robust guitar strumming transcription and open new avenues for automatic rhythm guitar analysis.

[620] Exploring Procedural Data Generation for Automatic Acoustic Guitar Fingerpicking Transcription

Sebastian Murgul, Michael Heizmann

Main category: cs.SD

TL;DR: The paper explores using procedurally generated synthetic data to train acoustic guitar transcription models, achieving comparable results to real data and improving accuracy with finetuning.

DetailsMotivation: Challenges in transcribing acoustic guitar fingerpicking due to limited labeled data and legal issues with recordings motivate the use of synthetic data.

Method: A four-stage pipeline synthesizes training data: tablature composition, MIDI rendering, physical modeling, and audio augmentation. A CRNN model is trained on both real and synthetic data.

Result: Procedural data achieves reasonable note-tracking results, with finetuning on small real datasets further enhancing accuracy.

Conclusion: Procedurally generated audio is a viable solution for data-scarce music transcription tasks.

Abstract: Automatic transcription of acoustic guitar fingerpicking performances remains a challenging task due to the scarcity of labeled training data and legal constraints connected with musical recordings. This work investigates a procedural data generation pipeline as an alternative to real audio recordings for training transcription models. Our approach synthesizes training data through four stages: knowledge-based fingerpicking tablature composition, MIDI performance rendering, physical modeling using an extended Karplus-Strong algorithm, and audio augmentation including reverb and distortion. We train and evaluate a CRNN-based note-tracking model on both real and synthetic datasets, demonstrating that procedural data can be used to achieve reasonable note-tracking results. Finetuning with a small amount of real data further enhances transcription accuracy, improving over models trained exclusively on real recordings. These results highlight the potential of procedurally generated audio for data-scarce music information retrieval tasks.
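
Stage three of the pipeline uses an extended Karplus-Strong algorithm; for readers unfamiliar with it, the textbook core is a noise burst circulating through a delay line with a low-pass filter in the feedback path. A minimal sketch of that core (the paper's extensions are not shown):

```python
import numpy as np

def karplus_strong(freq, duration, fs=44100, decay=0.996):
    """Textbook Karplus-Strong plucked string: a white-noise burst fed
    through a delay line whose feedback applies a two-tap average
    (a gentle low-pass) scaled by a decay factor."""
    n = int(fs * duration)
    period = int(fs / freq)                      # delay-line length
    buf = np.random.uniform(-1.0, 1.0, period)   # the initial 'pluck'
    out = np.empty(n)
    for i in range(n):
        out[i] = buf[i % period]
        buf[i % period] = decay * 0.5 * (buf[i % period]
                                         + buf[(i + 1) % period])
    return out
```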

[621] Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches

Ahmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata

Main category: cs.SD

TL;DR: Benchmarking self-supervised ASR models for dysarthric speech, introducing LLM-enhanced decoding, and analyzing generalization and errors.

DetailsMotivation: To evaluate the effectiveness of self-supervised ASR models (Wav2Vec, HuBERT, Whisper) for dysarthric speech due to phoneme distortions and variability.

Method: Systematically benchmarks these models with CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna).

Result: LLM-enhanced decoding improves dysarthric ASR by leveraging linguistic constraints for phoneme restoration and grammatical correction.

Conclusion: LLM-based decoding enhances intelligibility and generalization in dysarthric speech recognition.

Abstract: Dysarthric speech poses significant challenges for Automatic Speech Recognition (ASR) due to phoneme distortions and high variability. While self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown promise, their effectiveness in dysarthric speech remains unclear. This study systematically benchmarks these models with different decoding strategies, including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). Our contributions include (1) benchmarking ASR architectures for dysarthric speech, (2) introducing LLM-based decoding to improve intelligibility, (3) analyzing generalization across datasets, and (4) providing insights into recognition errors across severity levels. Findings highlight that LLM-enhanced decoding improves dysarthric ASR by leveraging linguistic constraints for phoneme restoration and grammatical correction.

[622] Exploring Adapter Design Tradeoffs for Low Resource Music Generation

Atharva Mehta, Shivam Chauhan, Monojit Choudhury

Main category: cs.SD

TL;DR: The paper explores adapter-based PEFT techniques for fine-tuning music generation models (MusicGen and Mustango) on low-resource genres, identifying trade-offs between adapter types and model performance.

DetailsMotivation: To address the high computational cost of fine-tuning large music models by identifying optimal adapter configurations for specific genres.

Method: Study various adapter architectures (convolution-based and transformer-based) for MusicGen and Mustango on Hindustani Classical and Turkish Makam music, analyzing performance and resource requirements.

Result: Convolution-based adapters capture local details better, while transformer-based adapters handle long-range dependencies. Mid-sized adapters (40M parameters) balance expressivity and quality. Mustango offers diversity but lacks stability; MusicGen is faster and more efficient.

Conclusion: Adapter design significantly impacts model performance, with trade-offs between detail capture and computational efficiency. Optimal configurations depend on genre-specific needs and resource constraints.

Abstract: Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given case of low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the description in the input prompt, while lacking stability in notes, rhythm alignment, and aesthetics; it is also computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen train faster and more efficiently and can produce higher-quality output, though with slightly more redundancy in their generations.
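
The transformer-based adapters discussed here generally follow the standard residual bottleneck design: project down, apply a nonlinearity, project back up, and add to the input, training only the small projection matrices while the host model stays frozen. A generic PyTorch sketch (the dimensions and placement are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter inserted into a frozen host model."""
    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as a near-identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```

For scale: at d_model = 1024, a 64-unit bottleneck adds roughly 2 × 1024 × 64 ≈ 131K parameters per adapter, so a stack across many layers lands in the tens of millions of parameters the paper calls mid-sized.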

[623] Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching

Yu Pan, Yuguang Yang, Jixun Yao, Lei Ma, Jianjun Zhao

Main category: cs.SD

TL;DR: CTEFM-VC is a zero-shot voice conversion framework that improves speaker similarity and naturalness by integrating content-aware timbre modeling with conditional flow matching.

DetailsMotivation: To address the challenge of achieving high speaker similarity and naturalness in zero-shot voice conversion, which current methods struggle with.

Method: CTEFM-VC decouples speech into content and timbre, uses conditional flow matching for Mel-spectrogram reconstruction, and employs context-aware timbre ensemble modeling with a cross-attention module. A structural similarity-based timbre loss is also introduced.

Result: CTEFM-VC outperforms state-of-the-art systems in speaker similarity, naturalness, and intelligibility metrics.

Conclusion: The proposed framework effectively enhances zero-shot voice conversion performance by combining advanced modeling techniques.

Abstract: Despite recent advances in zero-shot voice conversion (VC), achieving speaker similarity and naturalness comparable to ground-truth recordings remains a significant challenge. In this letter, we propose CTEFM-VC, a zero-shot VC framework that integrates content-aware timbre ensemble modeling with conditional flow matching. Specifically, CTEFM-VC decouples utterances into content and timbre representations and leverages a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. To enhance its timbre modeling capability and naturalness of generated speech, we first introduce a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the effective utilization of source content and target timbre elements through a cross-attention module. Furthermore, a structural similarity-based timbre loss is presented to jointly train CTEFM-VC end-to-end. Experiments show that CTEFM-VC consistently achieves the best performance in all metrics assessing speaker similarity, speech naturalness, and intelligibility, significantly outperforming state-of-the-art zero-shot VC systems.

[624] Learning Perceptually Relevant Temporal Envelope Morphing

Satvik Dixit, Sungjoon Park, Chris Donahue, Laurie M. Heller

Main category: cs.SD

TL;DR: The paper introduces a perceptually guided workflow for temporal envelope morphing, combining human listening studies, dataset synthesis, and machine learning to produce natural intermediate audio morphs.

DetailsMotivation: Existing audio morphing techniques often fail to produce perceptually natural intermediate temporal envelopes, limiting applications in creative media and psychoacoustics.

Method: The workflow involves deriving perceptual principles from listening studies, synthesizing datasets, and training models (including an autoencoder) to create intermediate morphs.

Result: The approach outperforms existing methods in producing temporally intermediate morphs, validated by benchmarks on synthetic and naturalistic data.

Conclusion: The proposed method provides a perceptually grounded solution for temporal envelope morphing, with potential applications in sound blending and perceptual research.

Abstract: Temporal envelope morphing, the process of interpolating between the amplitude dynamics of two audio signals, is an emerging problem in generative audio systems that lacks sufficient perceptual grounding. Morphing of temporal envelopes in a perceptually intuitive manner should enable new methods for sound blending in creative media and for probing perceptual organization in psychoacoustics. However, existing audio morphing techniques often fail to produce intermediate temporal envelopes when input sounds have distinct temporal structures; many morphers effectively overlay both temporal structures, leading to perceptually unnatural results. In this paper, we introduce a novel workflow for learning envelope morphing with perceptual guidance: we first derive perceptually grounded morphing principles through human listening studies, then synthesize large-scale datasets encoding these principles, and finally train machine learning models to create perceptually intermediate morphs. Specifically, we present: (1) perceptual principles that guide envelope morphing, derived from our listening studies, (2) a supervised framework to learn these principles, (3) an autoencoder that learns to compress temporal envelope structures into latent representations, and (4) benchmarks for evaluating audio envelope morphs, using both synthetic and naturalistic data, and show that our approach outperforms existing methods in producing temporally intermediate morphs. All code, models, and checkpoints are available at https://github.com/TemporalMorphing/EnvelopeMorphing.
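
For contrast with the learned approach, the kind of naive envelope interpolation the paper argues against can be written in a few lines: extract each amplitude envelope via the analytic signal and impose a linear blend on one of the sounds. A sketch assuming equal-length inputs; this baseline tends to overlay both temporal structures rather than produce a perceptual intermediate.

```python
import numpy as np
from scipy.signal import hilbert

def envelope(x):
    """Amplitude envelope as the magnitude of the analytic signal."""
    return np.abs(hilbert(x))

def naive_envelope_morph(x, y, alpha):
    """Flatten x's envelope, then impose a linear blend of the two
    envelopes (assumes len(x) == len(y)). Baseline only: the paper's
    point is that such blends are not perceptually intermediate."""
    ex, ey = envelope(x), envelope(y)
    target = (1.0 - alpha) * ex + alpha * ey
    return x / np.maximum(ex, 1e-8) * target
```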

[625] Direction Estimation of Sound Sources Using Microphone Arrays and Signal Strength

Mahdi Ali Pour, Zahra Habibzadeh

Main category: cs.SD

TL;DR: A lightweight sound-tracking method using three electret microphones achieves high accuracy and precision with a simple, cost-effective design.

DetailsMotivation: Sound-tracking is crucial for applications like security and acoustic monitoring but faces challenges in accuracy and hardware complexity.

Method: Uses three strategically placed microphones to analyze signal power levels and infer sound direction.

Result: Achieves less than 6 degrees localization error and 98% precision with a straightforward hardware setup.

Conclusion: The method offers a robust, affordable solution for sound-tracking, adaptable to diverse applications like smart homes and security systems.

Abstract: Sound-tracking refers to the process of determining the direction from which a sound originates, making it a fundamental component of sound source localization. This capability is essential in a variety of applications, including security systems, acoustic monitoring, and speaker tracking, where accurately identifying the direction of a sound source enables real-time responses, efficient resource allocation, and improved situational awareness. While sound-tracking is closely related to localization, it specifically focuses on identifying the direction of the sound source rather than estimating its exact position in space. Despite its utility, sound-tracking systems face several challenges, such as maintaining directional accuracy and precision, along with the need for sophisticated hardware configurations and complex signal processing algorithms. This paper presents a lightweight sound-tracking method that estimates the direction of a sound source by analyzing the signals from three strategically placed electret microphones. By comparing the average power of the received signals, the system infers the most probable direction of the sound. The results indicate that the power level from each microphone effectively determines the sound source direction. Our system employs a straightforward and cost-effective hardware design, ensuring simplicity and affordability in implementation. It achieves a localization error of less than 6 degrees and a precision of 98%. Additionally, its effortless integration with various systems makes it versatile and adaptable. Consequently, this technique presents a robust and reliable solution for sound-tracking and localization, with potential applications spanning diverse domains such as security systems, smart homes, and acoustic monitoring.
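
The core computation is small enough to sketch directly: take the average power of each channel and use it to weight that microphone's pointing direction. The power-weighted circular mean below is one plausible realization of "comparing the average power of the received signals"; the paper's exact decision rule and the 120-degree layout are assumptions here.

```python
import numpy as np

MIC_ANGLES = np.deg2rad([0.0, 120.0, 240.0])  # assumed microphone layout

def estimate_direction(frames):
    """frames: (3, n_samples). Weight each mic's pointing direction by
    its channel's mean power; return the bearing in degrees [0, 360)."""
    power = (frames ** 2).mean(axis=1)
    vec = (power[:, None] * np.stack(
        [np.cos(MIC_ANGLES), np.sin(MIC_ANGLES)], axis=1)).sum(axis=0)
    return float(np.degrees(np.arctan2(vec[1], vec[0])) % 360.0)
```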

[626] DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction

Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Gangming Zhao, Zhao Lv

Main category: cs.SD

TL;DR: The paper proposes DMF2Mel, a Dynamic Multiscale Fusion Network, to improve mel spectrogram reconstruction of imagined speech by addressing temporal dependency and information retention challenges.

DetailsMotivation: Existing methods struggle with precise reconstruction of continuous imagined speech due to inefficiencies in temporal dependency modeling and long-sequence decoding.

Method: DMF2Mel includes DC-FAM for feature separation, HAMS-Net for cross-scale fusion, SplineMap attention for global-local modeling, and convMamba for long-range dependencies.

Result: DMF2Mel improves Pearson correlation coefficients by 48% for known subjects and 35% for unknown subjects on the SparrKULee dataset.

Conclusion: DMF2Mel effectively enhances mel spectrogram reconstruction for imagined speech, outperforming baselines.

Abstract: Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related “foreground features” from noisy “background features” through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: https://github.com/fchest/DMF2Mel.

[627] Enhancing Target Speaker Extraction with Explicit Speaker Consistency Modeling

Shu Wu, Anbin Qi, Yanzhang Xie, Xiang Xie

Main category: cs.SD

TL;DR: The paper proposes a speaker consistency-aware method for Target Speaker Extraction (TSE) using a centroid-based loss and conditional loss suppression to improve performance.

DetailsMotivation: Speaker embeddings in TSE systems can suffer from identity confusion, so the study focuses on improving speaker consistency rather than just embedding extraction.

Method: A centroid-based speaker consistency loss and conditional loss suppression are integrated into the training process.

Result: Experiments show the proposed methods effectively enhance TSE performance.

Conclusion: The approach advances TSE by ensuring speaker consistency between enrolled and extracted speech, with validated effectiveness.

Abstract: Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems relying on audio cues, the speaker embedding from the enrolled speech is crucial to performance. However, these embeddings may suffer from speaker identity confusion. Unlike previous studies that focus on improving speaker embedding extraction, we improve TSE performance from the perspective of speaker consistency. In this paper, we propose a speaker consistency-aware target speaker extraction method that incorporates a centroid-based speaker consistency loss. This approach enhances TSE performance by ensuring speaker consistency between the enrolled and extracted speech. In addition, we integrate conditional loss suppression into the training process. The experimental results validate the effectiveness of our proposed methods in advancing the TSE performance. A speech demo is available online: https://sc-tse.netlify.app/
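
One plausible form of a centroid-based speaker consistency loss, sketched in PyTorch: average the enrolled utterances' speaker embeddings into a centroid and penalize the cosine distance between it and the embedding of the extracted speech. The exact formulation (and the conditional loss suppression) is the paper's own; this is an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def centroid_consistency_loss(extracted_emb, enrolled_embs):
    """extracted_emb: (B, d) embeddings of extracted speech;
    enrolled_embs: (N, d) embeddings of the enrollment utterances.
    Pull the extracted speech toward the enrollment centroid."""
    centroid = enrolled_embs.mean(dim=0, keepdim=True)      # (1, d)
    return 1.0 - F.cosine_similarity(extracted_emb, centroid).mean()
```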

[628] Live Music Models

Lyria Team, Antoine Caillon, Brian McWilliams, Cassie Tarakajian, Ian Simon, Ilaria Manco, Jesse Engel, Noah Constant, Yunpeng Li, Timo I. Denk, Alberto Lalama, Andrea Agostinelli, Cheng-Zhi Anna Huang, Ethan Manilow, George Brower, Hakan Erdogan, Heidi Lei, Itai Rolnick, Ivan Grishchenko, Manu Orsini, Matej Kastelic, Mauricio Zuluaga, Mauro Verzetti, Michael Dooley, Ondrej Skopek, Rafael Ferrer, Zalán Borsos, Aäron van den Oord, Douglas Eck, Eli Collins, Jason Baldridge, Tom Hume, Chris Donahue, Kehang Han, Adam Roberts

Main category: cs.SD

TL;DR: Magenta RealTime and Lyria RealTime are new generative models for live music, offering real-time, user-controlled music generation with text or audio prompts. They outperform other models in quality and introduce a human-in-the-loop paradigm.

DetailsMotivation: To advance AI-assisted music creation by enabling real-time, interactive music generation with user control.

Method: Developed Magenta RealTime (open-weights) and Lyria RealTime (API-based) models, using text/audio prompts for style control.

Result: Magenta RealTime outperforms other models in music quality metrics, despite fewer parameters, and introduces live generation capabilities.

Conclusion: These models pioneer a new paradigm for live music performance with AI, emphasizing human-AI collaboration.

Abstract: We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

cs.LG

[629] Self-Organizing Survival Manifolds: A Theory for Unsupervised Discovery of Prognostic Structures in Biological Systems

Atahan Karagoz

Main category: cs.LG

TL;DR: The paper proposes a geometric theory of survival, framing it as an emergent property of biological state space curvature and flow, rather than a supervised learning task.

DetailsMotivation: To move beyond traditional supervised survival modeling by grounding survival in biophysical and geometric principles.

Method: Develops Self-Organizing Survival Manifolds (SOSM), using geodesic curvature minimization to align prognosis with geometric flow stability.

Result: Theoretical proofs show survival-aligned trajectories emerge under biologically plausible conditions, linking survival to thermodynamics and optimal transport.

Conclusion: Survival is reframed as a geometric property, offering a universal, label-free foundation bridging machine learning, biophysics, and geometry.

Abstract: Survival is traditionally modeled as a supervised learning task, reliant on curated outcome labels and fixed covariates. This work rejects that premise. It proposes that survival is not an externally annotated target but a geometric consequence: an emergent property of the curvature and flow inherent in biological state space. We develop a theory of Self-Organizing Survival Manifolds (SOSM), in which survival-relevant dynamics arise from low-curvature geodesic flows on latent manifolds shaped by internal biological constraints. A survival energy functional based on geodesic curvature minimization is introduced and shown to induce structures where prognosis aligns with geometric flow stability. We derive discrete and continuous formulations of the objective and prove theoretical results demonstrating the emergence and convergence of survival-aligned trajectories under biologically plausible conditions. The framework draws connections to thermodynamic efficiency, entropy flow, Ricci curvature, and optimal transport, grounding survival modeling in physical law. Health, disease, aging, and death are reframed as geometric phase transitions in the manifold’s structure. This theory offers a universal, label-free foundation for modeling survival as a property of form, not annotation, bridging machine learning, biophysics, and the geometry of life itself.
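
The abstract does not give the energy functional's exact form; as intuition for the discrete formulation of "geodesic curvature minimization", a standard stand-in is the sum of squared second differences along a latent trajectory, which vanishes exactly for straight, evenly spaced paths. Illustrative only:

```python
import numpy as np

def discrete_curvature_energy(traj):
    """traj: (T, d) latent trajectory. Sum of squared second differences,
    a common discrete surrogate for curvature; zero for straight,
    uniformly sampled paths. Not the paper's exact functional."""
    second_diff = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]
    return float((second_diff ** 2).sum())
```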

[630] Semi-Supervised Supply Chain Fraud Detection with Unsupervised Pre-Filtering

Fatemeh Moradi, Mehran Tarif, Mohammadhossein Homaei

Main category: cs.LG

TL;DR: A two-phase learning framework combines unsupervised anomaly detection (Isolation Forest) and semi-supervised refinement (self-training SVM) to detect fraud in supply chains, achieving an F1-score of 0.817.

DetailsMotivation: Fraud detection in supply chains is challenging due to complexity, class imbalance, and limited labeled data. Traditional methods are ineffective.

Method: Phase 1: Isolation Forest for unsupervised anomaly detection. Phase 2: Self-training SVM for semi-supervised refinement using labeled and pseudo-labeled data.

Result: Achieves an F1-score of 0.817 with a false positive rate below 3.0% on the DataCo Smart Supply Chain Dataset.

Conclusion: The framework is effective for supply chain fraud detection but has limitations like concept drift and lacks comparison with deep learning.

Abstract: Detecting fraud in modern supply chains is a growing challenge, driven by the complexity of global networks and the scarcity of labeled data. Traditional detection methods often struggle with class imbalance and limited supervision, reducing their effectiveness in real-world applications. This paper proposes a novel two-phase learning framework to address these challenges. In the first phase, the Isolation Forest algorithm performs unsupervised anomaly detection to identify potential fraud cases and reduce the volume of data requiring further analysis. In the second phase, a self-training Support Vector Machine (SVM) refines the predictions using both labeled and high-confidence pseudo-labeled samples, enabling robust semi-supervised learning. The proposed method is evaluated on the DataCo Smart Supply Chain Dataset, a comprehensive real-world supply chain dataset with fraud indicators. It achieves an F1-score of 0.817 while maintaining a false positive rate below 3.0%. These results demonstrate the effectiveness and efficiency of combining unsupervised pre-filtering with semi-supervised refinement for supply chain fraud detection under real-world constraints, though we acknowledge limitations regarding concept drift and the need for comparison with deep learning approaches.
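
A minimal sketch of the two-phase pipeline, assuming scikit-learn estimators, binary 0/1 labels, and illustrative hyperparameters (contamination rate, pseudo-label confidence threshold); the paper's exact settings are not given in the abstract.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import SVC

def two_phase_fraud_detector(X, X_lab, y_lab, contamination=0.05, conf=0.9):
    # Phase 1: unsupervised pre-filtering with Isolation Forest.
    iso = IsolationForest(contamination=contamination, random_state=0)
    flags = iso.fit_predict(X)            # -1 = anomaly, 1 = normal
    candidates = X[flags == -1]           # only flagged rows reach phase 2

    # Phase 2: self-training SVM on labeled + high-confidence pseudo-labels.
    svm = SVC(probability=True, random_state=0).fit(X_lab, y_lab)
    proba = svm.predict_proba(candidates)
    keep = proba.max(axis=1) >= conf      # trust only confident predictions
    X_aug = np.vstack([X_lab, candidates[keep]])
    y_aug = np.concatenate([y_lab, svm.classes_[proba[keep].argmax(axis=1)]])
    return SVC(probability=True, random_state=0).fit(X_aug, y_aug)
```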

[631] GFlowNets for Learning Better Drug-Drug Interaction Representations

Azmine Toushik Wasi

Main category: cs.LG

TL;DR: A framework combining GFlowNet and VGAE addresses severe class imbalance in drug-drug interaction prediction by generating synthetic samples for rare classes, improving model performance.

DetailsMotivation: Drug-drug interaction datasets suffer from severe class imbalance, with rare but critical interactions underrepresented, leading to poor model performance on infrequent cases.

Method: The proposed framework integrates Generative Flow Networks (GFlowNet) and Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare interaction classes.

Result: The approach enhances predictive performance across all interaction types, ensuring better clinical reliability.

Conclusion: The framework effectively mitigates class imbalance in DDI prediction, improving model reliability for rare interactions.

Abstract: Drug-drug interactions pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.

[632] Hypergraph Neural Network with State Space Models for Node Classification

A. Quadir, M. Tanveer

Main category: cs.LG

TL;DR: The paper introduces HGMN, a hypergraph neural network with a state space model, to enhance node classification by integrating role-aware representations and addressing limitations of traditional GNNs.

DetailsMotivation: Traditional GNNs focus on adjacency relationships but overlook role-based features, leading to suboptimal performance. Existing role-based methods are unsupervised and ineffective.

Method: HGMN uses hypergraph construction for higher-order relationships, combines role-based and adjacency-based features via a mamba transformer, and employs hypergraph convolution and residual networks to avoid over-smoothing.

Result: HGMN outperforms state-of-the-art GNNs on node classification tasks across multiple datasets, demonstrating improved representational power.

Conclusion: HGMN effectively integrates role-based features with adjacency information, offering a robust solution for graph-based learning tasks.

Abstract: In recent years, graph neural networks (GNNs) have gained significant attention for node classification tasks on graph-structured data. However, traditional GNNs primarily focus on adjacency relationships between nodes, often overlooking the rich role-based characteristics that are crucial for learning more expressive node representations. Existing methods for capturing role-based features are largely unsupervised and fail to achieve optimal performance in downstream tasks. To address these limitations, we propose a novel hypergraph neural network with state space model (HGMN) that effectively integrates role-aware representations into GNNs and the state space model. HGMN utilizes hypergraph construction techniques to model higher-order relationships and combines role-based and adjacency-based representations through a learnable mamba transformer mechanism. By leveraging two distinct hypergraph construction methods, based on node degree and neighborhood levels, it strengthens the connections among nodes with similar roles, enhancing the model’s representational power. Additionally, the inclusion of hypergraph convolution layers enables the model to capture complex dependencies within hypergraph structures. To mitigate the over-smoothing problem inherent in deep GNNs, we incorporate a residual network, ensuring improved stability and better feature propagation across layers. Extensive experiments conducted on one newly introduced dataset and four benchmark datasets demonstrate the superiority of HGMN. The model achieves significant performance improvements on node classification tasks compared to state-of-the-art GNN methods. These results highlight HGMN’s ability to provide enriched node representations by effectively embedding role-based features alongside adjacency information, making it a versatile and powerful tool for a variety of graph-based learning applications.
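
A hedged sketch of the degree-based variant of the two hypergraph constructions mentioned in the abstract: nodes with similar degrees are grouped into shared hyperedges so that structurally similar roles become connected. The quantile bucketing is an illustrative choice, not the paper's exact procedure.

```python
import numpy as np

def degree_hypergraph(adj, n_buckets=4):
    """adj: (n, n) 0/1 adjacency matrix -> (n, n_buckets) incidence matrix H."""
    deg = adj.sum(axis=1)
    # Internal quantile breakpoints define n_buckets degree bins (hyperedges).
    bins = np.quantile(deg, np.linspace(0, 1, n_buckets + 1)[1:-1])
    edge_of = np.digitize(deg, bins)       # hyperedge index per node
    H = np.zeros((adj.shape[0], n_buckets))
    H[np.arange(adj.shape[0]), edge_of] = 1.0
    return H  # feed H into a hypergraph convolution layer
```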

[633] PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems

Arman Dogru, R. Irem Bor-Yaliniz, Nimal Gamini Senarath

Main category: cs.LG

TL;DR: PANAMA, a novel algorithm using Priority Asymmetry for Network Aware MARL, improves multi-agent pathfinding in Digital Twin ecosystems, outperforming benchmarks in accuracy, speed, and scalability.

DetailsMotivation: The need for efficient data-sharing and robust algorithms in Digital Twin ecosystems as robotics and automation scale.

Method: PANAMA employs a CTDE framework and asynchronous actor-learner architectures for MARL-based multi-agent pathfinding.

Result: Superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks.

Conclusion: PANAMA bridges network-aware decision-making and multi-agent coordination, advancing DTs, wireless networks, and AI-driven automation.

Abstract: Digital Twins (DTs) are transforming industries through advanced data processing and analysis, positioning the world of DTs, the Digital World, as a cornerstone of next-generation technologies including embodied AI. As robotics and automated systems scale, efficient data-sharing frameworks and robust algorithms become critical. We explore the pivotal role of data handling in next-gen networks, focusing on dynamics between application and network providers (AP/NP) in DT ecosystems. We introduce PANAMA, a novel algorithm with Priority Asymmetry for Network Aware Multi-agent Reinforcement Learning (MARL) based multi-agent path finding (MAPF). By adopting a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor-learner architectures, PANAMA accelerates training while enabling autonomous task execution by embodied AI. Our approach demonstrates superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks. Through simulations, we highlight optimized data-sharing strategies for scalable, automated systems, ensuring resilience in complex, real-world environments. PANAMA bridges the gap between network-aware decision-making and robust multi-agent coordination, advancing the synergy between DTs, wireless networks, and AI-driven automation.

[634] Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning

Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang

Main category: cs.LG

TL;DR: The paper addresses codebook collapse in graph VQ, identifies causes, and proposes RGVQ, a framework improving codebook utilization and token diversity.

DetailsMotivation: Codebook collapse in graph VQ limits expressiveness and generalization of graph tokens, yet remains underexplored.

Method: Proposes RGVQ, integrating graph topology and feature similarity as regularization, using soft assignments and contrastive regularization.

Result: RGVQ improves codebook utilization and boosts performance of graph VQ backbones in downstream tasks.

Conclusion: RGVQ enables more expressive and transferable graph token representations.

Abstract: Vector Quantization (VQ) has recently emerged as a promising approach for learning discrete representations of graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens. In this paper, we present the first empirical study showing that codebook collapse consistently occurs when applying VQ to graph data, even with mitigation strategies proposed in vision or language domains. To understand why graph VQ is particularly vulnerable to collapse, we provide a theoretical analysis and identify two key factors: early assignment imbalances caused by redundancy in graph features and structural patterns, and self-reinforcing optimization loops in deterministic VQ. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize the token co-assignments among similar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.
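
A hedged sketch of the Gumbel-Softmax soft assignment the abstract describes, which lets every codeword receive gradient updates; the negative-squared-distance logits and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_assign(node_embs, codebook, tau=1.0):
    # Logits from negative squared distances to each codeword.
    logits = -torch.cdist(node_embs, codebook).pow(2)
    # Soft, differentiable assignment: every codeword gets nonzero weight,
    # so none is starved of gradients (the collapse mechanism above).
    weights = F.gumbel_softmax(logits, tau=tau, hard=False)   # (n, K)
    return weights @ codebook          # soft-quantized node embeddings
```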

[635] A Federated Learning Framework for Handling Subtype Confounding and Heterogeneity in Large-Scale Neuroimaging Diagnosis

Xinglin Zhao, Yanwen Wang, Xiaobo Liu, Yanrong Hao, Rui Cao, Xin Wen

Main category: cs.LG

TL;DR: A federated learning framework for neuroimaging CAD systems improves diagnostic accuracy by addressing subtype heterogeneity and data variability.

DetailsMotivation: Small-sample studies lack reproducibility, while large-scale datasets introduce confounding heterogeneity due to mislabeled disease subtypes.

Method: Proposes a federated learning framework with dynamic navigation for sample routing and meta-integration for unified predictions.

Result: Achieved 74.06% accuracy across multiple sites, outperforming traditional methods.

Conclusion: The framework enhances CAD reliability and reproducibility, benefiting personalized medicine and clinical decision-making.

Abstract: Computer-aided diagnosis (CAD) systems play a crucial role in analyzing neuroimaging data for neurological and psychiatric disorders. However, small-sample studies suffer from low reproducibility, while large-scale datasets introduce confounding heterogeneity due to multiple disease subtypes being labeled under a single category. To address these challenges, we propose a novel federated learning framework tailored for neuroimaging CAD systems. Our approach includes a dynamic navigation module that routes samples to the most suitable local models based on latent subtype representations, and a meta-integration module that combines predictions from heterogeneous local models into a unified diagnostic output. We evaluated our framework using a comprehensive dataset comprising fMRI data from over 1300 MDD patients and 1100 healthy controls across multiple study cohorts. Experimental results demonstrate significant improvements in diagnostic accuracy and robustness compared to traditional methods. Specifically, our framework achieved an average accuracy of 74.06% across all tested sites, showcasing its effectiveness in handling subtype heterogeneity and enhancing model generalizability. Ablation studies further confirmed the importance of both the dynamic navigation and meta-integration modules in improving performance. By addressing data heterogeneity and subtype confounding, our framework advances reliable and reproducible neuroimaging CAD systems, offering significant potential for personalized medicine and clinical decision-making in neurology and psychiatry.

[636] Conformal Set-based Human-AI Complementarity with Multiple Experts

Helbert Paat, Guohao Shen

Main category: cs.LG

TL;DR: The paper explores selecting instance-specific experts from a pool for human-AI collaboration, using conformal prediction sets to improve classification performance. A greedy algorithm for subset selection outperforms naive methods, achieving near-optimal results on real datasets.

DetailsMotivation: Existing research focuses on single-expert scenarios, but multiple experts can enhance classification. The study aims to identify when and how multiple experts benefit from conformal sets.

Method: A greedy algorithm is introduced to select subsets of expert predictions using conformal sets, tested on CIFAR-10H and ImageNet-16H datasets.

Result: The greedy algorithm outperforms naive methods, achieving near-optimal subsets and improved classification performance.

Conclusion: Leveraging multiple experts with conformal sets and a greedy selection algorithm enhances classification accuracy effectively.

Abstract: Decision support systems are designed to assist human experts in classification tasks by providing conformal prediction sets derived from a pre-trained model. This human-AI collaboration has demonstrated enhanced classification performance compared to using either the model or the expert independently. In this study, we focus on the selection of instance-specific experts from a pool of multiple human experts, contrasting it with existing research that typically focuses on single-expert scenarios. We characterize the conditions under which multiple experts can benefit from the conformal sets. With the insight that only certain experts may be relevant for each instance, we explore the problem of subset selection and introduce a greedy algorithm that utilizes conformal sets to identify the subset of expert predictions that will be used in classifying an instance. This approach is shown to yield better performance compared to naive methods for human subset selection. Based on real expert predictions from the CIFAR-10H and ImageNet-16H datasets, our simulation study indicates that our proposed greedy algorithm achieves near-optimal subsets, resulting in improved classification performance among multiple experts.
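
A hedged sketch pairing a standard split-conformal set with a simplified stand-in for the paper's greedy expert-subset step (here: keep the experts whose predictions fall inside the set, then majority-vote); the actual greedy criterion may differ.

```python
import numpy as np

def conformal_set(probs, cal_scores, alpha=0.1):
    """probs: class probabilities for one instance; cal_scores: calibration
    nonconformity scores, e.g. 1 - p(true class) on held-out data."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(cal_scores, level)
    return set(np.where(1.0 - probs <= qhat)[0])   # classes kept in the set

def expert_vote(expert_preds, pred_set):
    kept = [p for p in expert_preds if p in pred_set]
    pool = kept if kept else list(expert_preds)    # fall back to all experts
    return np.bincount(pool).argmax()
```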

[637] Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials

Rachel K. Luu, Jingyu Deng, Mohammed Shahrudin Ibrahim, Nam-Joon Cho, Ming Dao, Subra Suresh, Markus J. Buehler

Main category: cs.LG

TL;DR: A framework integrating generative AI with multi-disciplinary literature to design bioinspired materials, validated by real-world experiments.

DetailsMotivation: To bridge the gap in applying LLMs to discipline-specific experimental science, especially in multi-disciplinary fields like materials science.

Method: Uses fine-tuned models (BioinspiredLLM), RAG, agentic systems, and Hierarchical Sampling to extract insights and design experiments.

Result: Developed a novel pollen-based adhesive with tunable morphology and measured shear strength, validating the AI-assisted approach.

Conclusion: AI-assisted ideation can effectively drive real-world materials design and enhance human-AI collaboration.

Abstract: Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration.

[638] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman

Main category: cs.LG

TL;DR: The paper explores filtering dual-use topics from training data to enhance tamper-resistance in open-weight AI systems, showing improved resistance to adversarial attacks without degrading unrelated capabilities.

DetailsMotivation: Open-weight AI systems are vulnerable to tampering attacks, and existing safety methods are insufficient. The study aims to assess if data filtering can serve as a robust safeguard.

Method: A multi-stage pipeline for scalable data filtering is introduced, and 6.9B-parameter models are pretrained to test resistance to adversarial fine-tuning.

Result: Filtered models resist adversarial attacks (up to 10,000 steps) better than baselines, but can still use dangerous knowledge if provided in context.

Conclusion: Pretraining data curation is a promising defense layer for open-weight AI, though a defense-in-depth approach is needed.

Abstract: Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text – outperforming existing post-training baselines by over an order of magnitude – with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.
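
A minimal sketch of a multi-stage filter of the kind the pipeline describes: a cheap lexical screen routes the bulk of documents, and a model-based scorer adjudicates only the flagged minority. The blocklist, classifier callable, and threshold are illustrative assumptions, not the paper's components.

```python
def filter_corpus(documents, blocklist, classifier, threshold=0.5):
    """classifier(doc) -> estimated probability the doc carries dual-use content."""
    kept = []
    for doc in documents:
        text = doc.lower()
        if not any(term in text for term in blocklist):
            kept.append(doc)                 # stage 1: fast keyword screen
        elif classifier(doc) < threshold:
            kept.append(doc)                 # stage 2: classifier pass on flagged docs
    return kept
```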

[639] Local Diffusion Models and Phases of Data Distributions

Fangjun Hu, Guangkuo Liu, Yifan Zhang, Xun Gao

Main category: cs.LG

TL;DR: The paper introduces a new perspective on data distribution phases in diffusion models, proposing local denoisers for efficiency and identifying critical phase transitions where global models are necessary.

DetailsMotivation: Real-life data often has low-dimensional spatial structure, but standard diffusion models ignore this, leading to computationally expensive global score functions. The work aims to leverage local structure for efficiency.

Method: The authors define data distribution phases based on local operations, analyze the denoising process into trivial and data phases with a rapid transition, and derive an information-theoretic bound for local denoiser fidelity. Numerical experiments validate the approach.

Result: The study shows that small local neural networks suffice for most of the denoising process, with global networks needed only near phase transitions, reducing computational costs.

Conclusion: This work simplifies diffusion model architectures and opens new research directions in generative AI and physics-inspired neural network design.

Abstract: As a class of generative artificial intelligence frameworks inspired by statistical physics, diffusion models have shown extraordinary performance in synthesizing complicated data distributions through a denoising process gradually guided by score functions. Real-life data, like images, is often spatially structured in low-dimensional spaces. However, ordinary diffusion models ignore this local structure and learn spatially global score functions, which are often computationally expensive. In this work, we introduce a new perspective on the phases of data distributions, which provides insight into constructing local denoisers with reduced computational costs. We define two distributions as belonging to the same data distribution phase if they can be mutually connected via spatially local operations such as local denoisers. Then, we show that the reverse denoising process consists of an early trivial phase and a late data phase, sandwiching a rapid phase transition where local denoisers must fail. To diagnose such phase transitions, we prove an information-theoretic bound on the fidelity of local denoisers based on conditional mutual information, and conduct numerical experiments in a real-world dataset. This work suggests simpler and more efficient architectures of diffusion models: far from the phase transition point, we can use small local neural networks to compute the score function; global neural networks are only necessary around the narrow time interval of phase transitions. This result also opens up new directions for studying phases of data distributions, the broader science of generative artificial intelligence, and guiding the design of neural networks inspired by physics concepts.

[640] LLM-based Agents for Automated Confounder Discovery and Subgroup Analysis in Causal Inference

Po-Han Lee, Yu-Cheng Lin, Chan-Tung Ku, Chan Hsu, Pei-Cing Huang, Ping-Hsun Wu, Yihuang Kang

Main category: cs.LG

TL;DR: The paper proposes using LLM-based agents to automate confounder discovery and subgroup analysis in causal ML, reducing human dependency and improving robustness in treatment effect estimation.

DetailsMotivation: Addressing limitations of causal ML methods in complex real-world environments due to latent confounders, unstructured data, and high annotation costs from expert reliance.

Method: Integrates LLM-based agents into the causal ML pipeline to simulate domain expertise for automated confounder discovery and subgroup analysis.

Result: Experiments on medical datasets show improved robustness in treatment effect estimation, narrower confidence intervals, and discovery of unrecognized biases.

Conclusion: LLM-based agents provide a scalable, trustworthy, and semantically aware solution for causal inference.

Abstract: Estimating individualized treatment effects from observational data presents a persistent challenge due to unmeasured confounding and structural bias. Causal Machine Learning (causal ML) methods, such as causal trees and doubly robust estimators, provide tools for estimating conditional average treatment effects. These methods have limited effectiveness in complex real-world environments due to the presence of latent confounders or those described in unstructured formats. Moreover, reliance on domain experts for confounder identification and rule interpretation introduces high annotation cost and scalability concerns. In this work, we propose Large Language Model-based agents for automated confounder discovery and subgroup analysis, integrating them into the causal ML pipeline to simulate domain expertise. Our framework systematically performs subgroup identification and confounding structure discovery by leveraging the reasoning capabilities of LLM-based agents, which reduces human dependency while preserving interpretability. Experiments on real-world medical datasets show that our proposed approach enhances treatment effect estimation robustness by narrowing confidence intervals and uncovering unrecognized confounding biases. Our findings suggest that LLM-based agents offer a promising path toward scalable, trustworthy, and semantically aware causal inference.

[641] Generalizing Scaling Laws for Dense and Sparse Large Language Models

Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari

Main category: cs.LG

TL;DR: A generalized scaling law is proposed for both dense and sparse large language models, addressing the challenge of resource allocation and model size prediction.

DetailsMotivation: The rapid growth of language models and their computational costs has created a need for efficient training techniques, but existing scaling laws are architecture-specific.

Method: The authors revisit existing scaling laws and introduce a unified framework applicable to both dense and sparse models.

Result: The proposed scaling law is evaluated and compared with existing laws, demonstrating its effectiveness.

Conclusion: The generalized scaling law provides a versatile solution for optimizing large language model training.

Abstract: Over the past few years, the size of language models has grown exponentially, as has the computational cost to train these large models. This rapid growth has motivated researchers to develop new techniques aimed at enhancing the efficiency of the training process. Despite these advancements, optimally predicting the model size or allocating optimal resources remains a challenge. Several efforts have addressed the challenge by proposing different scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws to demonstrate its effectiveness.
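
The abstract does not state the unified law's functional form. For orientation, dense scaling laws are commonly Chinchilla-shaped, and one plausible (purely illustrative) dense/sparse unification lets the parameter term depend on the active-parameter count at a given sparsity level:

```latex
% Dense reference form (Chinchilla-style):
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Illustrative unification (assumption, not the paper's law): N_a(S) is the
% number of active parameters at sparsity level S.
L(N, D, S) = E(S) + \frac{A}{N_a(S)^{\alpha}} + \frac{B}{D^{\beta}}
```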

[642] Learning to Forget with Information Divergence Reweighted Objectives for Noisy Labels

Jeremiah Birrell, Reza Ebrahimi

Main category: cs.LG

TL;DR: ANTIDOTE is a novel objective for learning with noisy labels, using adversarial training to reduce the impact of noisy samples, outperforming existing methods with similar computational cost to standard cross-entropy loss.

DetailsMotivation: Addressing the challenge of learning from datasets with noisy labels, which is common in real-world scenarios or adversarial settings.

Method: Defines objectives via relaxation over an information-divergence neighborhood, reformulated as adversarial training using convex duality.

Result: Effectively reduces influence of noisy labels, behaving like forgetting them, and outperforms comparable methods in various noise settings.

Conclusion: ANTIDOTE is a robust and efficient solution for learning under noisy labels, with practical advantages in computational cost and performance.

Abstract: We introduce ANTIDOTE, a new class of objectives for learning under noisy labels which are defined in terms of a relaxation over an information-divergence neighborhood. Using convex duality, we provide a reformulation as an adversarial training method that has similar computational cost to training with standard cross-entropy loss. We show that our approach adaptively reduces the influence of the samples with noisy labels during learning, exhibiting a behavior that is analogous to forgetting those samples. ANTIDOTE is effective in practical environments where label noise is inherent in the training data or where an adversary can alter the training labels. Extensive empirical evaluations on different levels of symmetric, asymmetric, human annotation, and real-world label noise show that ANTIDOTE outperforms leading comparable losses in the field and enjoys a time complexity that is very close to that of the standard cross entropy loss.
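
The abstract specifies a relaxation over an information-divergence neighborhood with an adversarial dual, but not the exact divergence. For a KL neighborhood, such objectives take the shape below, whose dual soft-min form makes the down-weighting ("forgetting") of high-loss samples explicit; treat this as a hedged illustration of the family, not ANTIDOTE's precise objective.

```latex
% Inner relaxation over a KL-neighborhood of the empirical distribution P_n:
\min_{\theta}\ \inf_{Q:\ \mathrm{KL}(Q\,\|\,P_n)\le\rho}\ \mathbb{E}_{Q}\big[\ell(\theta;x,y)\big]
% Convex duality yields a soft-min over sample losses,
\min_{\theta}\ \sup_{\lambda>0}\ \Big(-\lambda\rho-\lambda\log\mathbb{E}_{P_n}\big[e^{-\ell(\theta;x,y)/\lambda}\big]\Big),
% so effective sample weights scale as e^{-\ell/\lambda}: high-loss (likely
% mislabeled) samples are adaptively down-weighted.
```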

[643] Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Record

Mosbah Aouad, Anirudh Choudhary, Awais Farooq, Steven Nevers, Lusine Demirkhanyan, Bhrandon Harris, Suguna Pappu, Christopher Gondi, Ravishankar Iyer

Main category: cs.LG

TL;DR: A multimodal approach using EHR data improves early detection of pancreatic cancer, outperforming existing methods by 6.5-15.5% in AUC.

DetailsMotivation: Early detection of pancreatic ductal adenocarcinoma (PDAC) is challenging due to lack of symptoms and biomarkers.

Method: Combines neural controlled differential equations, pretrained language models, recurrent networks, and cross-attention to integrate lab and diagnosis code data.

Result: Achieves 6.5-15.5% higher AUC than state-of-the-art methods and identifies new biomarkers.

Conclusion: The approach enhances early PDAC detection and reveals novel risk-associated factors.

Abstract: Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at https://github.com/MosbahAouad/EarlyPDAC-MML.
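
A hedged sketch of the cross-attention fusion step, where diagnosis-code trajectory embeddings (queries) attend to lab time-series embeddings (keys/values); the dimensions, mean pooling, and single-layer design are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)            # PDAC risk logit

    def forward(self, code_emb, lab_emb):
        # Diagnosis-code trajectories query the irregular lab representation.
        fused, _ = self.attn(query=code_emb, key=lab_emb, value=lab_emb)
        return self.head(fused.mean(dim=1)).squeeze(-1)
```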

[644] Using Imperfect Synthetic Data in Downstream Inference Tasks

Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder

Main category: cs.LG

TL;DR: The paper introduces a hyperparameter-free estimator using generalized method of moments to combine synthetic data from large language models with real data for statistically valid conclusions in computational social science.

DetailsMotivation: To address the challenge of combining synthetic data (generated by large language models) with real data for statistically valid conclusions in limited data regimes.

Method: A new estimator based on generalized method of moments, leveraging interactions between synthetic and real data moment residuals.

Result: The estimator improves target parameter estimates and shows empirical gains in computational social science regression tasks.

Conclusion: The proposed method provides a principled, theoretically grounded solution for integrating synthetic and real data, enhancing research in limited data settings.

Abstract: Predictions and generations from large language models are increasingly being explored as an aid to computational social science and human subject research in limited data regimes. While previous technical work has explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (also termed as synthetic simulations), such as in responses to surveys. However, it is not immediately clear by what means practitioners can combine such data with real data and yet produce statistically valid conclusions upon them. In this work, we introduce a new estimator based on generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address the challenge at hand. Surprisingly, we find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter. We empirically validate the finite-sample performance of our estimator across different regression tasks in computational social science applications, demonstrating large empirical gains.
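
The abstract does not spell out the moment functions; the generic stacked-moment GMM shape below illustrates where the real/synthetic residual interaction enters, with $g$, $W$, and the sample splits used purely as placeholders.

```latex
% Stack real-data and synthetic-data moment residuals:
\bar{g}(\theta) =
\begin{pmatrix}
\tfrac{1}{n}\sum_{i=1}^{n} g(x_i;\theta) \\[2pt]
\tfrac{1}{m}\sum_{j=1}^{m} g(\tilde{x}_j;\theta)
\end{pmatrix},
\qquad
\hat{\theta} = \arg\min_{\theta}\ \bar{g}(\theta)^{\top} W\,\bar{g}(\theta)
% A weighting matrix W that accounts for the covariance between the two
% residual blocks is what lets their interaction sharpen the estimate.
```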

[645] Segmented Confidence Sequences and Multi-Scale Adaptive Confidence Segments for Anomaly Detection in Nonstationary Time Series

Muyan Anna Li, Aditi Gautam

Main category: cs.LG

TL;DR: The paper introduces two adaptive thresholding frameworks, SCS and MACS, for anomaly detection in nonstationary time series data, outperforming traditional methods in F1-score.

DetailsMotivation: Traditional static thresholds fail in nonstationary environments due to regime shifts and concept drift, necessitating adaptive solutions.

Method: Proposes Segmented Confidence Sequences (SCS) and Multi-Scale Adaptive Confidence Segments (MACS), leveraging statistical online learning and segmentation.

Result: Experiments on Wafer Manufacturing datasets show significant F1-score improvement over traditional methods.

Conclusion: Adaptive thresholds like SCS and MACS enable reliable, interpretable, and timely anomaly detection in evolving environments.

Abstract: As time series data become increasingly prevalent in domains such as manufacturing, IT, and infrastructure monitoring, anomaly detection must adapt to nonstationary environments where statistical properties shift over time. Traditional static thresholds are easily rendered obsolete by regime shifts, concept drift, or multi-scale changes. To address these challenges, we introduce and empirically evaluate two novel adaptive thresholding frameworks: Segmented Confidence Sequences (SCS) and Multi-Scale Adaptive Confidence Segments (MACS). Both leverage statistical online learning and segmentation principles for local, contextually sensitive adaptation, maintaining guarantees on false alarm rates even under evolving distributions. Our experiments across Wafer Manufacturing benchmark datasets show significant F1-score improvement compared to traditional percentile and rolling quantile approaches. This work demonstrates that robust, statistically principled adaptive thresholds enable reliable, interpretable, and timely detection of diverse real-world anomalies.
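
A hedged sketch of a segmented adaptive threshold in the spirit of SCS: within each regime segment, a running mean plus a time-uniform confidence width replaces a fixed global percentile. Segment detection and the width formula are simplified assumptions.

```python
import numpy as np

def segmented_thresholds(x, segment_breaks, z=3.0):
    """x: 1-D series; segment_breaks: sorted indices where regimes change."""
    thresholds = np.empty(len(x))
    for start, end in zip([0] + list(segment_breaks),
                          list(segment_breaks) + [len(x)]):
        seg = x[start:end].astype(float)
        n = np.arange(1, len(seg) + 1)
        mean = np.cumsum(seg) / n
        var = np.maximum(np.cumsum(seg**2) / n - mean**2, 1e-12)
        # Widen early, data-poor estimates so false-alarm control holds
        # uniformly over time within the segment.
        thresholds[start:end] = mean + z * np.sqrt(var * np.log(np.e * n) / n)
    return thresholds   # flag x[t] > thresholds[t] as anomalous
```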

[646] Robust Reinforcement Learning over Wireless Networks with Homomorphic State Representations

Pietro Talli, Federico Mason, Federico Chiariotti, Andrea Zanella

Main category: cs.LG

TL;DR: A novel architecture, HR3L, is proposed to train RL agents over lossy or delayed wireless networks without exchanging gradients, improving efficiency and adaptability.

DetailsMotivation: Addressing the challenge of training RL agents with partial/intermittent information due to imperfect wireless communication.

Method: HR3L uses a transmitter-receiver setup to encode/decode environment representations, avoiding gradient exchange for faster training.

Result: HR3L outperforms baselines in sample efficiency and adapts to various communication issues like delays and losses.

Conclusion: HR3L is a robust solution for remote RL training in non-ideal wireless conditions, offering efficiency and adaptability.

Abstract: In this work, we address the problem of training Reinforcement Learning (RL) agents over communication networks. The RL paradigm requires the agent to instantaneously perceive the state evolution to infer the effects of its actions on the environment. This is impossible if the agent receives state updates over lossy or delayed wireless systems and thus operates with partial and intermittent information. In recent years, numerous frameworks have been proposed to manage RL with imperfect feedback; however, they often offer specific solutions with a substantial computational burden. To address these limits, we propose a novel architecture, named Homomorphic Robust Remote Reinforcement Learning (HR3L), that enables the training of remote RL agents exchanging observations across a non-ideal wireless channel. HR3L considers two units: the transmitter, which encodes meaningful representations of the environment, and the receiver, which decodes these messages and performs actions to maximize a reward signal. Importantly, HR3L does not require the exchange of gradient information across the wireless channel, allowing for quicker training and a lower communication overhead than state-of-the-art solutions. Experimental results demonstrate that HR3L significantly outperforms baseline methods in terms of sample efficiency and adapts to different communication scenarios, including packet losses, delayed transmissions, and capacity limitations.

[647] Fractal Language Modelling by Universal Sequence Maps (USM)

Jonas S Almeida, Daniel E Russ, Susana Vinga, Ines Duarte, Lee Mason, Praphulla Bhawsar, Aaron Ge, Arlindo Oliveira, Jeya Balaji Balasubramanian

Main category: cs.LG

TL;DR: The paper introduces an improved Universal Sequence Map (USM) method for bijective fractal encoding of symbolic sequences, resolving seeding biases and revealing USM’s efficient convergence to steady-state embeddings.

DetailsMotivation: To explore encoding procedures for symbolic sequences using Language Models and Transformers, addressing the need for mechanisms that retain contextual information for neural network modeling.

Method: USM uses two Chaos Game Representations (CGR) iterated forward and backward, projected into the frequency domain (FCGR), enabling Chebyshev distance and k-mer frequency computation without recomputing coordinates.

Result: Resolved seeding biases in USM, achieving full reconciliation of numeric positioning with sequence identity and uncovering USM’s efficient convergence to steady-state embeddings. Demonstrated with genomic sequences but applicable to any alphabet.

Conclusion: USM provides a robust and scalable method for encoding symbolic sequences, with potential applications beyond genomics due to its adaptability to alphabets of arbitrary cardinality.

Abstract: Motivation: With the advent of Language Models using Transformers, popularized by ChatGPT, there is a renewed interest in exploring encoding procedures that numerically represent symbolic sequences at multiple scales and embedding dimensions. The challenge that encoding addresses is the need for mechanisms that uniquely retain contextual information about the succession of individual symbols, which can then be modeled by nonlinear formulations such as neural networks. Context: Universal Sequence Maps (USM) are iterated functions that bijectively encode symbolic sequences onto embedded numerical spaces. USM is composed of two Chaos Game Representations (CGR), iterated forwardly and backwardly, that can be projected into the frequency domain (FCGR). The corresponding USM coordinates can be used to compute a Chebyshev distance metric as well as k-mer frequencies, without having to recompute the embedded numeric coordinates, and, paradoxically, allowing for non-integer values of k. Results: This report advances the bijective fractal encoding by Universal Sequence Maps (USM) by resolving seeding biases affecting the iterated process. The resolution had two results, the first expected, the second an intriguing outcome: 1) full reconciliation of numeric positioning with sequence identity; and 2) uncovering the nature of USM as an efficient numeric process converging towards a steady state sequence embedding solution. We illustrate these results for genomic sequences because of the convenience of a planar representation defined by an alphabet with only 4 tokens (the 4 nucleotides). Nevertheless, the application to alphabets of arbitrary cardinality was found to be straightforward.
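
A minimal sketch of the forward/backward Chaos Game Representation pair underlying a USM, with the four nucleotides mapped to unit-square corners. The corner assignment and the (0.5, 0.5) seed are the conventional choices; seed-dependent starts of this kind are the source of the seeding bias the report resolves.

```python
import numpy as np

CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr(seq, seed=(0.5, 0.5)):
    pos, coords = np.array(seed), []
    for s in seq:
        pos = 0.5 * (pos + np.array(CORNERS[s]))   # midpoint move toward corner
        coords.append(pos.copy())
    return np.array(coords)

forward = cgr("GATTACA")          # encodes prefix (past) context
backward = cgr("GATTACA"[::-1])   # backward iteration encodes suffix context
```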

[648] Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN

Andrey Sidorenko, Paul Tiwald

Main category: cs.LG

TL;DR: TabularARGN is a neural network for generating high-quality synthetic tabular data, balancing privacy and utility effectively.

DetailsMotivation: Traditional anonymization techniques often fail to preserve privacy adequately, necessitating better synthetic data generation methods.

Method: TabularARGN uses a discretization-based auto-regressive approach for efficient and high-fidelity synthetic data generation.

Result: TabularARGN outperforms existing methods in statistical similarity, machine learning utility, and detection robustness, with strong privacy protection.

Conclusion: TabularARGN offers a robust and efficient solution for synthetic data generation, achieving a strong privacy-utility balance.

Abstract: Synthetic data generation has become essential for securely sharing and analyzing sensitive data sets. Traditional anonymization techniques, however, often fail to adequately preserve privacy. We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a neural network architecture specifically designed for generating high-quality synthetic tabular data. Using a discretization-based auto-regressive approach, TabularARGN achieves high data fidelity while remaining computationally efficient. We evaluate TabularARGN against existing synthetic data generation methods, showing competitive results in statistical similarity, machine learning utility, and detection robustness. We further perform an in-depth privacy evaluation using systematic membership-inference attacks, highlighting the robustness and effective privacy-utility balance of our approach.
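
A hedged sketch of discretization-based auto-regressive sampling of a single table row, in the spirit of the approach above; the fixed column order, bin indices, and the conditional-model interface are illustrative assumptions rather than TabularARGN's implementation.

```python
import numpy as np

def sample_row(columns, conditional_probs, rng=None):
    """conditional_probs(col, prefix_row) -> probability vector over col's bins."""
    rng = np.random.default_rng() if rng is None else rng
    row = {}
    for col in columns:                   # fixed auto-regressive column order
        p = conditional_probs(col, row)   # condition on columns sampled so far
        row[col] = int(rng.choice(len(p), p=p))
    return row                            # bin indices; map back to values after
```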

[649] In-Context Reinforcement Learning via Communicative World Models

Fernando Martinez-Lopez, Tao Li, Yingdong Lu, Juntao Chen

Main category: cs.LG

TL;DR: CORAL framework improves RL generalization by using emergent communication between two agents, enabling zero-shot adaptation in unseen environments.

DetailsMotivation: RL agents often overfit to training environments, limiting generalization. CORAL addresses this by decoupling representation learning from control.

Method: CORAL uses an Information Agent (IA) to learn a communicative context via a Causal Influence Loss, and a Control Agent (CA) to interpret this context for task-solving.

Result: Experiments show CORAL improves sample efficiency and enables zero-shot adaptation in unseen sparse-reward environments.

Conclusion: Learning transferable communicative representations enhances RL generalization and adaptability.

Abstract: Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents’ in-context RL (ICRL) ability, this work formulates ICRL as a two-agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by decoupling latent representation learning from control. In CORAL, an Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks. Its objective is not to maximize task reward, but to build a world model and distill its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero-shot adaptation with the help of pre-trained IA in entirely unseen sparse-reward environments, validating the efficacy of learning a transferable communicative representation.
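
A hedged sketch of a causal-influence term of the kind the abstract names: the message's influence is the divergence between the Control Agent's action distribution given the message and that distribution with the message marginalized out. The policy interface (returning action probabilities) and the Monte Carlo marginalization are illustrative assumptions.

```python
import torch

def causal_influence(policy, obs, message, message_samples):
    """policy(obs, msg) -> (batch, n_actions) action probabilities."""
    p_with = policy(obs, message)
    # Marginalize the message out with Monte Carlo samples from its prior.
    p_marg = torch.stack([policy(obs, m) for m in message_samples]).mean(dim=0)
    # KL(p_with || p_marg): how much the message shifts the next action.
    return (p_with * (p_with.log() - p_marg.log())).sum(dim=-1).mean()
```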

[650] Transferring Social Network Knowledge from Multiple GNN Teachers to Kolmogorov-Arnold Networks

Yuan-Hung Chao, Chia-Hsun Lu, Chih-Ya Shen

Main category: cs.LG

TL;DR: Integration of KANs into GNNs (KGAT, KSGC, KAPPNP) improves node classification accuracy, and knowledge amalgamation enhances student model performance.

DetailsMotivation: GNNs' reliance on graph connectivity limits scalability and efficiency, while KANs offer strong nonlinear expressiveness and efficient inference.

Method: KANs are integrated into GNN architectures (GAT, SGC, APPNP), and a multi-teacher knowledge amalgamation framework is used to distill knowledge into a graph-independent KAN student model.

Result: Proposed models improve node classification accuracy, and knowledge amalgamation boosts student model performance.

Conclusion: KANs enhance GNN expressiveness and enable efficient, graph-free inference.

Abstract: Graph Neural Networks (GNNs) have shown strong performance on graph-structured data, but their reliance on graph connectivity often limits scalability and efficiency. Kolmogorov-Arnold Networks (KANs), a recent architecture with learnable univariate functions, offer strong nonlinear expressiveness and efficient inference. In this work, we integrate KANs into three popular GNN architectures (GAT, SGC, and APPNP), resulting in three new models: KGAT, KSGC, and KAPPNP. We further adopt a multi-teacher knowledge amalgamation framework, where knowledge from multiple KAN-based GNNs is distilled into a graph-independent KAN student model. Experiments on benchmark datasets show that the proposed models improve node classification accuracy, and the knowledge amalgamation approach significantly boosts student model performance. Our findings highlight the potential of KANs for enhancing GNN expressiveness and for enabling efficient, graph-free inference.
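
A hedged sketch of the multi-teacher amalgamation step: temperature-softened logits from the KAN-based GNN teachers are fused by plain averaging and distilled into the graph-free KAN student. The averaging fusion and temperature are assumptions, not necessarily the paper's scheme.

```python
import torch
import torch.nn.functional as F

def amalgamation_loss(student_logits, teacher_logits_list, T=2.0):
    # Fuse teachers by averaging their softened class distributions.
    soft_targets = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    # Standard distillation KL, rescaled by T^2 to keep gradient magnitudes.
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    soft_targets, reduction="batchmean") * T * T
```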

[651] Watermarking Kolmogorov-Arnold Networks for Emerging Networked Applications via Activation Perturbation

Chia-Hsun Lu, Guan-Jhih Wu, Ya-Chi Ho, Chih-Ya Shen

Main category: cs.LG

TL;DR: A novel watermarking method, DCT-AW, is proposed for Kolmogorov-Arnold Networks (KAN) to protect intellectual property by embedding watermarks via discrete cosine transform, ensuring robustness against attacks.

DetailsMotivation: Protecting intellectual property in machine learning, especially for KAN, which lacks effective watermarking methods due to its unique architecture.

Method: Discrete Cosine Transform-based Activation Watermarking (DCT-AW) embeds watermarks by perturbing activation outputs in KAN.

Result: DCT-AW minimally impacts model performance and is robust against attacks like fine-tuning, pruning, and retraining.

Conclusion: DCT-AW is an effective and robust watermarking solution tailored for KAN, addressing its unique challenges.

Abstract: With the increasing importance of protecting intellectual property in machine learning, watermarking techniques have gained significant attention. As advanced models are increasingly deployed in domains such as social network analysis, the need for robust model protection becomes even more critical. While existing watermarking methods have demonstrated effectiveness for conventional deep neural networks, they often fail to adapt to the novel architecture, Kolmogorov-Arnold Networks (KAN), which feature learnable activation functions. KAN holds strong potential for modeling complex relationships in network-structured data. However, their unique design also introduces new challenges for watermarking. Therefore, we propose a novel watermarking method, Discrete Cosine Transform-based Activation Watermarking (DCT-AW), tailored for KAN. Leveraging the learnable activation functions of KAN, our method embeds watermarks by perturbing activation outputs using discrete cosine transform, ensuring compatibility with diverse tasks and achieving task independence. Experimental results demonstrate that DCT-AW has a small impact on model performance and provides superior robustness against various watermark removal attacks, including fine-tuning, pruning, and retraining after pruning.
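
A hedged sketch of embedding a bit string into one activation vector via the DCT, as the method's name suggests; the SciPy DCT-II, the coefficient index range, and the perturbation strength eps are illustrative assumptions rather than the paper's parameters.

```python
import numpy as np
from scipy.fft import dct, idct

def embed_watermark(activations, bits, eps=0.01, start=1):
    coeffs = dct(np.asarray(activations, dtype=float), norm="ortho")
    for i, b in enumerate(bits):
        coeffs[start + i] += eps if b else -eps   # sign encodes the bit; skip DC
    return idct(coeffs, norm="ortho")             # perturbed activations

def read_watermark(marked, original, n_bits, start=1):
    delta = dct(marked, norm="ortho") - dct(original, norm="ortho")
    return [int(d > 0) for d in delta[start:start + n_bits]]
```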

[652] Stabilizing Federated Learning under Extreme Heterogeneity with HeteRo-Select

Md. Akmol Masud, Md Abrar Jahin, Mahmud Hasan

Main category: cs.LG

TL;DR: HeteRo-Select improves FL training stability and accuracy by smartly selecting clients based on usefulness, fairness, update speed, and data variety, outperforming Oort.

DetailsMotivation: FL suffers from instability due to diverse client data. Existing methods like Oort struggle with accuracy drops in later training stages.

Method: Proposes HeteRo-Select, a framework with a scoring system for client selection, ensuring convergence under strong regularization.

Result: HeteRo-Select achieves higher peak (74.75%) and final accuracy (72.76%) with minimal stability drop (1.99%) compared to Oort.

Conclusion: HeteRo-Select is a reliable solution for heterogeneous FL, supported by theory and experiments.

Abstract: Federated Learning (FL) is a machine learning technique that often suffers from training instability due to the diverse nature of client data. Although utility-based client selection methods like Oort accelerate convergence by prioritizing high-loss clients, they frequently experience significant drops in accuracy during later stages of training. We propose a theoretical HeteRo-Select framework designed to maintain high performance and ensure long-term training stability. We provide a theoretical analysis showing that when client data is very different (high heterogeneity), choosing a smart subset of client participation can reduce communication more effectively compared to full participation. Our HeteRo-Select method uses a clear, step-by-step scoring system that considers client usefulness, fairness, update speed, and data variety. It also shows convergence guarantees under strong regularization. Our experimental results on the CIFAR-10 dataset under significant label skew ($\alpha=0.1$) support the theoretical findings. The HeteRo-Select method performs better than existing approaches in terms of peak accuracy, final accuracy, and training stability. Specifically, HeteRo-Select achieves a peak accuracy of 74.75%, a final accuracy of 72.76%, and a minimal stability drop of 1.99%. In contrast, Oort records a lower peak accuracy of 73.98%, a final accuracy of 71.25%, and a larger stability drop of 2.73%. The theoretical foundations and empirical performance in our study make HeteRo-Select a reliable solution for real-world heterogeneous FL problems.
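
A hedged sketch of a multi-criteria client score along the lines described (usefulness, fairness, update speed, data variety), with the top-scoring clients selected each round; the min-max normalization and the weights are illustrative assumptions.

```python
import numpy as np

def heteroselect_scores(loss, rounds_unpicked, speed, diversity,
                        w=(0.4, 0.2, 0.2, 0.2)):
    def norm(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min() + 1e-12)
    # usefulness ~ local loss, fairness ~ rounds since last participation,
    # plus update speed and data variety; select top-k clients by this score.
    return (w[0] * norm(loss) + w[1] * norm(rounds_unpicked)
            + w[2] * norm(speed) + w[3] * norm(diversity))
```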

[653] Schema-Guided Scene-Graph Reasoning based on Multi-Agent Large Language Model System

Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell

Main category: cs.LG

TL;DR: SG^2 is a framework using multi-agent LLMs for scene-graph reasoning, outperforming existing methods in tasks like Q&A and planning.

DetailsMotivation: To improve structured reasoning with LLMs by reducing hallucination and irrelevant information in scene graphs.

Method: Uses two modules (Reasoner and Retriever) iteratively, guided by a scene graph schema for efficient reasoning and retrieval.

Result: Outperforms existing LLM-based approaches and single-agent baselines in numerical Q&A and planning tasks.

Conclusion: SG^2 demonstrates effective iterative reasoning and retrieval, enhancing LLM performance in grounded spatial tasks.

Abstract: Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with Large Language Models (LLMs). In this work, we propose SG^2, an iterative Schema-Guided Scene-Graph reasoning framework based on multi-agent LLMs. The agents are grouped into two modules: a (1) Reasoner module for abstract task planning and graph information queries generation, and a (2) Retriever module for extracting corresponding graph information based on code-writing following the queries. Two modules collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. The scene graph schema, prompted to both modules, serves to not only streamline both reasoning and retrieval process, but also guide the cooperation between two modules. This eliminates the need to prompt LLMs with full graph data, reducing the chance of hallucination due to irrelevant information. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM-based approaches and baseline single-agent, tool-based Reason-while-Retrieve strategy in numerical Q&A and planning tasks.

[654] CISO: Species Distribution Modeling Conditioned on Incomplete Species Observations

Hager Radi Abdelwahed, Mélisande Teng, Robin Zbinden, Laura Pollock, Hugo Larochelle, Devis Tuia, David Rolnick

Main category: cs.LG

TL;DR: CISO is a deep learning method for species distribution modeling that incorporates incomplete biotic data alongside environmental variables, improving predictive performance.

DetailsMotivation: Current SDMs often overlook biotic interactions due to sparse and inconsistent species co-occurrence data.

Method: CISO uses deep learning to condition predictions on flexible, incomplete species observations and environmental variables.

Result: CISO outperforms alternatives in predicting species distributions and benefits from combining multiple datasets.

Conclusion: CISO is a promising tool for ecological research, integrating incomplete biotic data and identifying cross-taxa interactions.

Abstract: Species distribution models (SDMs) are widely used to predict species’ geographic distributions, serving as critical tools for ecological research and conservation planning. Typically, SDMs relate species occurrences to environmental variables representing abiotic factors, such as temperature, precipitation, and soil properties. However, species distributions are also strongly influenced by biotic interactions with other species, which are often overlooked. While some methods partially address this limitation by incorporating biotic interactions, they often assume symmetrical pairwise relationships between species and require consistent co-occurrence data. In practice, species observations are sparse, and the availability of information about the presence or absence of other species varies significantly across locations. To address these challenges, we propose CISO, a deep learning-based method for species distribution modeling Conditioned on Incomplete Species Observations. CISO enables predictions to be conditioned on a flexible number of species observations alongside environmental variables, accommodating the variability and incompleteness of available biotic data. We demonstrate our approach using three datasets representing different species groups: sPlotOpen for plants, SatBird for birds, and a new dataset, SatButterfly, for butterflies. Our results show that including partial biotic information improves predictive performance on spatially separate test sets. When conditioned on a subset of species within the same dataset, CISO outperforms alternative methods in predicting the distribution of the remaining species. Furthermore, we show that combining observations from multiple datasets can improve performance. CISO is a promising ecological tool, capable of incorporating incomplete biotic information and identifying potential interactions between species from disparate taxa.

[655] Analysis of Schedule-Free Nonconvex Optimization

Connor Brown

Main category: cs.LG

TL;DR: The paper introduces a robust Lyapunov framework to analyze the Schedule-Free (SF) method in nonconvex optimization, providing horizon-agnostic convergence guarantees under minimal assumptions.

DetailsMotivation: Classical first-order methods require step-sizes dependent on the total horizon, which is often unknown. The SF method avoids this but lacks nonconvex analysis without strong assumptions.

Method: A Lyapunov framework is developed to simplify SF analysis under L-smoothness and lower-boundedness, enabling horizon-free convergence proofs.

Result: Horizon-agnostic bounds are derived: O(1/log T), O(log T/T), and O(T^{-(1-α)}). PEP experiments suggest tighter rates.

Conclusion: The work extends SF’s guarantees to nonconvex settings and opens avenues for optimal nonconvex rates.

Abstract: First-order methods underpin most large-scale learning algorithms, yet their classical convergence guarantees hinge on carefully scheduled step-sizes that depend on the total horizon $T$, which is rarely known in advance. The Schedule-Free (SF) method promises optimal performance with hyperparameters that are independent of $T$ by interpolating between Polyak–Ruppert averaging and momentum, but nonconvex analysis of SF has been limited or reliant on strong global assumptions. We introduce a robust Lyapunov framework that, under only $L$-smoothness and lower-boundedness, reduces SF analysis to a single-step descent inequality. This yields horizon-agnostic bounds in the nonconvex setting: $O(1/\log T)$ for constant step + PR averaging, $O(\log T/T)$ for a linearly growing step-size, and a continuum of $O(T^{-(1-\alpha)})$ rates for polynomial averaging. We complement these proofs with Performance Estimation Problem (PEP) experiments that numerically validate our rates and suggest that our $O(1/\log T)$ bound on the original nonconvex SF algorithm may tighten to $O(1/T)$. Our work extends SF’s horizon-free guarantees to smooth nonconvex optimization and charts future directions for optimal nonconvex rates.
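
For context, the Schedule-Free iteration being analyzed is compact enough to state directly. Below is a minimal SGD-flavored sketch following the published SF formulation (gradients taken at an interpolation point, uniform Polyak–Ruppert averaging weights c_t = 1/t); the hyperparameter values are illustrative.

```python
import numpy as np

def schedule_free_sgd(grad, x0, gamma=0.1, beta=0.9, T=1000):
    """Minimal Schedule-Free SGD sketch: constant step size, no horizon-
    dependent schedule. `grad` returns the gradient of the objective."""
    z = x0.copy()   # base SGD iterate
    x = x0.copy()   # running average (the returned solution)
    for t in range(1, T + 1):
        y = (1 - beta) * z + beta * x   # gradient is evaluated at interpolation y
        z = z - gamma * grad(y)
        c = 1.0 / t                     # uniform (Polyak-Ruppert) averaging weight
        x = (1 - c) * x + c * z
    return x

# toy usage: minimize f(x) = ||x||^2 / 2, whose gradient is x
x_star = schedule_free_sgd(lambda v: v, np.ones(5))
```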

[656] Fed MobiLLM: Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning

Xingke Yang, Liang Li, Sicong Li, Liwei Guan, Hao Wang, Xiaoqi Qi, Jiang Liu, Xin Fu, Miao Pan

Main category: cs.LG

TL;DR: Fed MobiLLM enables efficient federated fine-tuning of large language models on heterogeneous mobile devices by reducing computational overhead and enabling asynchronous updates.

DetailsMotivation: To address the challenges of computational and memory burdens in federated LLM fine-tuning on mobile devices with diverse speeds and architectures.

Method: Uses a server-assisted federated side-tuning paradigm where mobile devices perform lightweight forward propagation and upload activations, while the server trains a shared side-network asynchronously. Adaptive layer-wise feature alignment bridges model heterogeneity.

Result: Achieves 95.2% reduction in computation, 93.2% reduction in communication costs, and 5.1x faster convergence compared to existing methods.

Conclusion: Fed MobiLLM is effective for practical LLM adaptation on heterogeneous mobile devices, maintaining performance while significantly reducing overhead.

Abstract: Collaboratively fine-tuning (FT) large language models (LLMs) over heterogeneous mobile devices fosters immense potential applications of personalized intelligence. However, such a vision faces critical system challenges. Conventional federated LLM FT approaches place prohibitive computational and memory burdens on mobile hardware, and their synchronous model aggregation protocols stall for slower devices. In this paper, we propose Fed MobiLLM, a novel design to facilitate efficient federated LLM FT across mobile devices with diverse computing/communication speeds and local model architectures. In particular, Fed MobiLLM implements a pioneering server-assisted federated side-tuning paradigm. Briefly, mobile devices perform lightweight forward propagation computations on local data using their frozen pre-scaled backbone LLMs, and then upload selected intermediate activations. The server trains a shared side-network independently, eliminating client-side backpropagation and enabling asynchronous updates. To bridge model heterogeneity across different devices, we introduce an adaptive layer-wise feature alignment method, which ensures consistent representations for collaboratively tuning a shared side network. Extensive experimental results demonstrate that Fed MobiLLM can maintain robust fine-tuning performance while achieving extremely low on-device memory, with at least 95.2% reduction in computation overhead, 93.2% reduction in communication costs and 5.1x faster convergence compared to existing methods, validating its efficacy for practical LLM adaptation over heterogeneous mobile devices.
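
A toy sketch of the split computation may clarify the paradigm: the device runs a frozen backbone forward-only and ships pooled activations, while the server alone trains a small side network on them, so no backpropagation ever happens on-device. Model sizes, the pooling, and the two-class head are illustrative assumptions, not Fed MobiLLM's design.

```python
import torch
import torch.nn as nn

# Illustrative split: frozen on-device backbone, server-side trainable side net.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
side_net = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.AdamW(side_net.parameters(), lr=1e-3)

def client_step(tokens):                       # on-device: forward pass only
    with torch.no_grad():                      # no client-side backpropagation
        return backbone(tokens).mean(dim=1)    # pooled intermediate activations

def server_step(acts, labels):                 # server: asynchronous side-tuning
    loss = nn.functional.cross_entropy(side_net(acts), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

server_step(client_step(torch.randn(8, 16, 64)), torch.randint(0, 2, (8,)))
```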

[657] Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift

Amit Pandey

Main category: cs.LG

TL;DR: Zero-Direction Probing (ZDP) detects model drift via transformer activations’ null directions, with theoretical guarantees and metrics like Spectral Null-Leakage (SNL).

DetailsMotivation: To detect model drift without task labels or output evaluations, leveraging null directions of transformer activations.

Method: Proposes ZDP framework, proves theorems (Variance-Leak, Fisher Null-Conservation, Rank-Leak bound), and derives SNL metric with non-asymptotic bounds.

Result: Theoretical guarantees on representational change via monitoring null spaces and Fisher geometry.

Conclusion: ZDP provides a practical, theory-backed method for drift detection with testable guarantees.

Abstract: We present Zero-Direction Probing (ZDP), a theory-only framework for detecting model drift from null directions of transformer activations without task labels or output evaluations. Under assumptions A1–A6, we prove: (i) the Variance–Leak Theorem, (ii) Fisher Null-Conservation, (iii) a Rank–Leak bound for low-rank updates, and (iv) a logarithmic-regret guarantee for online null-space trackers. We derive a Spectral Null-Leakage (SNL) metric with non-asymptotic tail bounds and a concentration inequality, yielding a-priori thresholds for drift under a Gaussian null model. These results show that monitoring right/left null spaces of layer activations and their Fisher geometry provides concrete, testable guarantees on representational change.
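
The paper is theory-only, but the monitoring idea can be sketched: estimate the right null directions of reference activations, then measure how much energy new activations leak into them. The score below is an illustrative stand-in; the paper's exact SNL definition and tail bounds are not reproduced.

```python
import numpy as np

def null_leakage_score(acts_ref, acts_new, tol=1e-6):
    """Illustrative drift score in the spirit of SNL (not the paper's exact
    metric). Rows are samples, columns are hidden units."""
    ref = acts_ref - acts_ref.mean(0)
    _, s, vt = np.linalg.svd(ref, full_matrices=True)
    null_dirs = vt[np.sum(s > tol):]    # right null directions of the reference
    if null_dirs.size == 0:
        return 0.0                      # reference activations span every direction
    new = acts_new - acts_new.mean(0)
    leak = np.sum((new @ null_dirs.T) ** 2)   # energy in formerly-null directions
    return float(leak / (np.sum(new ** 2) + 1e-12))
```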

[658] PROPS: Progressively Private Self-alignment of Large Language Models

Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon

Main category: cs.LG

TL;DR: PROPS is a privacy-preserving alignment framework for LLMs, offering better utility than DP-SGD and RR while maintaining high privacy.

DetailsMotivation: Addressing privacy concerns in LLM alignment by focusing on preference-level privacy, as human feedback can reveal personal values and traits.

Method: Introduces PROPS, a multi-stage alignment framework where privately aligned models from earlier stages label data for later stages.

Result: PROPS achieves up to 3x higher win-rates than DP-SGD and 2.5x higher than RR for the same privacy budget.

Conclusion: PROPS effectively balances privacy and utility in LLM alignment, outperforming existing methods.

Abstract: Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler’s preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment but can provide more privacy than necessary as human preferences are tied only to labels of (prompt, response) pairs and can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy preserving alignment framework where privately aligned models in previous stages can serve as labelers for supplementing training data in the subsequent stages of alignment. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing high privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win-rates compared to DP-SGD, and 2.5x higher win-rates compared to Randomized Response (RR) based alignment.
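
A hypothetical outline of the staging idea (every function name below is illustrative; the paper's actual mechanism and guarantees are not reproduced): only the first stage touches privatized human preference labels, and each subsequent stage is labeled by the previously aligned model.

```python
# Sketch of progressive private self-alignment under stated assumptions.
# `privatize` applies a label-level DP mechanism (e.g., randomized response);
# `align` runs one round of preference alignment (e.g., DPO/RLHF-style).
def props_pipeline(model, stages, privatize, align):
    for k, (prompts, responses, human_prefs) in enumerate(stages):
        if k == 0:
            labels = privatize(human_prefs)  # only stage 0 sees (privatized) human labels
        else:
            # the model aligned at stage k-1 relabels fresh data for stage k
            labels = [model.prefer(p, r) for p, r in zip(prompts, responses)]
        model = align(model, prompts, responses, labels)
    return model
```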

[659] Mode-Aware Non-Linear Tucker Autoencoder for Tensor-based Unsupervised Learning

Junjing Zheng, Chengliang Song, Weidong Jiang, Xinyu Zhang

Main category: cs.LG

TL;DR: MA-NTAE is a novel autoencoder for high-dimensional tensor data, addressing limitations of MLP-based AEs and tensor networks by combining non-linear Tucker decomposition with flexible per-mode encoding.

DetailsMotivation: High-dimensional tensor data challenges self-supervised learning due to computational and optimization issues in existing methods like MLP-based AEs and tensor networks.

Method: MA-NTAE generalizes Tucker decomposition into a non-linear framework, using a Pick-and-Unfold strategy for per-mode encoding via recursive operations.

Result: MA-NTAE shows linear complexity growth with tensor order and outperforms standard AEs and tensor networks in compression and clustering, especially for higher-order tensors.

Conclusion: MA-NTAE effectively integrates tensor structural priors, offering a scalable and efficient solution for high-dimensional tensor learning.

Abstract: High-dimensional data, particularly in the form of high-order tensors, presents a major challenge in self-supervised learning. While MLP-based autoencoders (AE) are commonly employed, their dependence on flattening operations exacerbates the curse of dimensionality, leading to excessively large model sizes, high computational overhead, and challenging optimization for deep structural feature capture. Although existing tensor networks alleviate computational burdens through tensor decomposition techniques, most exhibit limited capability in learning non-linear relationships. To overcome these limitations, we introduce the Mode-Aware Non-linear Tucker Autoencoder (MA-NTAE). MA-NTAE generalizes classical Tucker decomposition to a non-linear framework and employs a Pick-and-Unfold strategy, facilitating flexible per-mode encoding of high-order tensors via recursive unfold-encode-fold operations, effectively integrating tensor structural priors. Notably, MA-NTAE exhibits linear growth in computational complexity with tensor order and proportional growth with mode dimensions. Extensive experiments demonstrate MA-NTAE’s performance advantages over standard AE and current tensor networks in compression and clustering tasks, which become increasingly pronounced for higher-order, higher-dimensional tensors.
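
The unfold-encode-fold recursion is easy to sketch with standard mode-n unfolding; the non-linear per-mode encoder below (a weight matrix plus tanh) is an illustrative stand-in for MA-NTAE's actual layers.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def fold(mat, mode, shape):
    """Inverse of `unfold` for a tensor of the given (updated) shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(mat.reshape([shape[mode]] + rest), 0, mode)

def per_mode_encode(x, encoders):
    """Unfold-encode-fold sweep over modes with a non-linearity, in the spirit
    of Pick-and-Unfold. `encoders[n]` is a (d_n, k_n) matrix compressing
    mode n from size d_n to k_n."""
    shape = list(x.shape)
    for mode, w in enumerate(encoders):
        m = np.tanh(w.T @ unfold(x, mode))   # non-linear per-mode compression
        shape[mode] = w.shape[1]
        x = fold(m, mode, shape)
    return x

core = per_mode_encode(np.random.randn(6, 7, 8),
                       [np.random.randn(6, 2), np.random.randn(7, 3), np.random.randn(8, 4)])
assert core.shape == (2, 3, 4)
```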

[660] Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities

Rui Liu, Haolin Zuo, Zheng Lian, Hongyu Yuan, Qi Fan

Main category: cs.LG

TL;DR: The paper introduces HARDY-MER, a framework for multimodal emotion recognition that dynamically adjusts training to focus on hard samples by evaluating reconstruction difficulty and cross-modal mutual information.

DetailsMotivation: Addressing limitations of conventional missing modality reconstruction methods, which ignore varying reconstruction difficulty across samples, leading to ineffective handling of hard samples.

Method: Proposes a two-stage framework: (1) Multi-view Hardness Evaluation to quantify sample hardness (Direct and Indirect Hardness), and (2) Retrieval-based Dynamic Curriculum Learning to adjust training focus.

Result: Outperforms existing methods in missing-modality scenarios on benchmark datasets.

Conclusion: HARDY-MER effectively enhances model performance on hard samples by dynamically adapting training, offering a robust solution for missing-modality MER.

Abstract: Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model’s ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at https://github.com/HARDY-MER/HARDY-MER.
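
An illustrative fusion of the two hardness views follows; the paper's exact combination rule is not reproduced. The intuition: high reconstruction error (Direct Hardness) and low cross-modal mutual information (Indirect Hardness) both mark a sample as hard.

```python
import numpy as np

def sample_hardness(recon_err, cross_modal_mi, lam=0.5):
    """Hypothetical hardness score per sample, in [0, 1]. `lam` trades off
    the direct (reconstruction-error) and indirect (mutual-information)
    views; the actual HARDY-MER fusion may differ."""
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-8)
    return lam * norm(recon_err) + (1 - lam) * (1.0 - norm(cross_modal_mi))
```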

[661] Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

Xiao Huang, Xu Liu, Enze Zhang, Tong Yu, Shuai Li

Main category: cs.LG

TL;DR: The paper proposes Classifier-Free Diffusion Generation (CFDG) for offline-to-online RL, improving data augmentation by aligning generated data with online distributions, achieving a 15% performance boost.

DetailsMotivation: Existing methods struggle to bridge the gap between offline and online data distributions, limiting performance. CFDG aims to enhance generation quality without extra classifier training.

Method: CFDG uses classifier-free guidance diffusion for high-quality data generation and a reweighting method to align data with online distributions.

Result: CFDG outperforms replaying data or standard diffusion models, achieving a 15% average improvement on D4RL benchmarks like MuJoCo and AntMaze.

Conclusion: CFDG is versatile, integrates with existing RL methods, and significantly enhances offline-to-online RL performance.

Abstract: Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent’s stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By applying CFDG to the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmarks such as MuJoCo and AntMaze.
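
The classifier-free guidance combination CFDG builds on is standard (Ho & Salimans): one denoiser evaluated with and without conditioning, no auxiliary classifier. How CFDG conditions on the offline/online distinction and reweights generated samples is the paper's contribution and is not reproduced in this sketch.

```python
def cfg_noise_estimate(eps_model, x_t, t, cond, w=2.0):
    """Standard classifier-free guidance: blend conditional and unconditional
    noise predictions from the same network. `w` is the guidance weight;
    w = 0 recovers the unconditional model."""
    eps_uncond = eps_model(x_t, t, cond=None)   # conditioning dropped
    eps_cond = eps_model(x_t, t, cond=cond)     # conditioning applied
    return eps_uncond + w * (eps_cond - eps_uncond)
```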

[662] Technical Report: Full-Stack Fine-Tuning for the Q Programming Language

Brendan R. Hogan, Will Brown, Adel Boyarsky, Anderson Schneider, Yuriy Nevmyvaka

Main category: cs.LG

TL;DR: The paper presents an open-source approach to adapt LLMs for the Q programming language, a niche tool in finance, achieving superior performance over frontier models like Claude Opus-4 and GPT-4.1.

DetailsMotivation: Addressing the challenge of leveraging LLMs for under-represented tasks, particularly niche programming languages like Q, which are less prevalent online.

Method: Introduces a Leetcode-style Q evaluation dataset, benchmarks frontier models, and trains reasoning/non-reasoning models via pretraining, supervised fine-tuning, and reinforcement learning.

Result: The best model achieves 59% pass@1 accuracy on the Q benchmark, outperforming Claude Opus-4 by 29.5% and all models surpassing GPT-4.1.

Conclusion: The methodology is broadly applicable, offering a blueprint for adapting LLMs to niche domains and tasks with subjective evaluation.

Abstract: Even though large language models are becoming increasingly capable, it is still unreasonable to expect them to excel at tasks that are under-represented on the Internet. Leveraging LLMs for specialized applications, particularly in niche programming languages and private domains, remains challenging and largely unsolved. In this work, we address this gap by presenting a comprehensive, open-source approach for adapting LLMs to the Q programming language, a popular tool in quantitative finance that is much less present on the Internet compared to Python, C, Java, and other “mainstream” languages and is therefore not a strong suit of general-purpose AI models. We introduce a new Leetcode-style evaluation dataset for Q, benchmark major frontier models on the dataset, then perform pretraining, supervised fine-tuning, and reinforcement learning to train a suite of reasoning and non-reasoning models based on the Qwen-2.5 series, spanning five parameter sizes (1.5B, 3B, 7B, 14B, 32B). Our best model achieves a pass@1 accuracy of 59 percent on our Q benchmark, surpassing the best-performing frontier model, Claude Opus-4, by 29.5 percent. Additionally, all models, even our 1.5B model, outperform GPT-4.1 on this task. In addition to releasing models, code, and data, we provide a detailed blueprint for dataset construction, model pretraining, supervised fine-tuning, and reinforcement learning. Our methodology is broadly applicable, and we discuss how these techniques can be extended to other tasks, including those where evaluation may rely on soft or subjective signals.
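
For reference, pass@1 is the standard functional-correctness metric for code generation; the commonly used unbiased estimator (from the Codex paper, not specific to this work) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
    c of which pass the tests. pass@1 reduces to c / n."""
    if n - c < k:
        return 1.0   # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```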

[663] Who’s the Evil Twin? Differential Auditing for Undesired Behavior

Ishwar Balappanawar, Venkata Hasith Vattikuti, Greta Kintzley, Ronan Azimi-Mancel, Satvik Golechha

Main category: cs.LG

TL;DR: The paper explores detecting hidden behaviors in neural networks through an adversarial game between red and blue teams, achieving high accuracy with adversarial-attack-based methods, while noting challenges in applying similar techniques to LLMs.

DetailsMotivation: The challenge of detecting hidden behaviors in neural networks due to minimal prior knowledge and adversarial obfuscation motivates this work.

Method: The red team trains two similar models (one benign, one compromised), while the blue team tries to identify the compromised model using strategies like Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks.

Result: Adversarial-attack-based methods achieve 100% accuracy with hints, while other techniques vary. LLM auditing requires hints about undesired distributions for effective probing.

Conclusion: The findings contribute to better audit designs, with open-sourced games and models to aid future research.

Abstract: Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behavior, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then be used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits.

[664] Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning

Aleksandar Todorov, Juan Cardenas-Cartagena, Rafael F. Cunha, Marco Zullich, Matthia Sabatelli

Main category: cs.LG

TL;DR: Sparsification methods like GMP and SET improve plasticity and performance in multi-task reinforcement learning (MTRL) by mitigating plasticity loss.

DetailsMotivation: Addressing plasticity loss in MTRL, where adaptability is crucial for handling diverse tasks.

Method: Evaluated GMP and SET across MTRL architectures (shared backbone, Mixture of Experts, etc.) against dense baselines and other methods.

Result: GMP and SET reduced plasticity degradation (e.g., neuron dormancy) and improved performance, often outperforming dense models.

Conclusion: Dynamic sparsification is a robust tool for enhancing plasticity and adaptability in MTRL systems.

Abstract: Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where higher representational flexibility is crucial for managing diverse and potentially conflicting task demands. We systematically explore how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MTRL benchmarks, comparing against dense baselines, and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL designs, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems.
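
For reference, GMP is typically driven by the cubic sparsity schedule of Zhu & Gupta (2017); whether the paper uses this exact schedule is an assumption. A minimal sketch:

```python
def gmp_sparsity(t, t0, n_steps, s_init=0.0, s_final=0.9):
    """Cubic sparsity schedule commonly used for Gradual Magnitude Pruning.
    At each step, the smallest-magnitude weights are masked until the
    current target sparsity is reached."""
    progress = min(max((t - t0) / n_steps, 0.0), 1.0)
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3

# example: ramp from dense to 90% sparse over 10k steps starting at step 1k
targets = [gmp_sparsity(t, t0=1_000, n_steps=10_000) for t in range(0, 12_000, 1_000)]
```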

[665] Conformal Prediction and Trustworthy AI

Anthony Bellotti, Xindi Zhao

Main category: cs.LG

TL;DR: Conformal predictors provide reliable set predictions with guaranteed confidence, aiding trustworthy AI by addressing uncertainty, generalization risk, and bias.

DetailsMotivation: To explore conformal prediction's role in trustworthy AI beyond its validity, tackling issues like generalization risk and governance.

Method: Review of conformal prediction’s applications, supported by experiments and examples for calibration and bias mitigation.

Result: Conformal prediction proves effective for well-calibrated uncertainty and bias handling in AI.

Conclusion: Conformal prediction enhances trustworthy AI by addressing uncertainty, generalization, and bias, with practical applications demonstrated.

Abstract: Conformal predictors are machine learning algorithms developed in the 1990’s by Gammerman, Vovk, and their research team, to provide set predictions with guaranteed confidence level. Over recent years, they have grown in popularity and have become a mainstream methodology for uncertainty quantification in the machine learning community. From its beginning, there was an understanding that they enable reliable machine learning with well-calibrated uncertainty quantification. This makes them extremely beneficial for developing trustworthy AI, a topic that has also risen in interest over the past few years, in both the AI community and society more widely. In this article, we review the potential for conformal prediction to contribute to trustworthy AI beyond its marginal validity property, addressing problems such as generalization risk and AI governance. Experiments and examples are also provided to demonstrate its use as a well-calibrated predictor and for bias identification and mitigation.
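
The core recipe reviewed here, split conformal prediction, fits in a few lines: score a held-out calibration set, take a finite-sample-corrected quantile, and widen point predictions by it. A regression sketch (exchangeability of calibration and test data assumed):

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    """Split conformal regression with absolute-residual nonconformity scores;
    yields intervals with >= 1 - alpha marginal coverage."""
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    # finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n),
                    method="higher")
    pred = model.predict(x_new)
    return pred - q, pred + q   # set prediction: an interval per test point
```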

[666] QuiZSF: An efficient data-model interaction framework for zero-shot time-series forecasting

Shichao Ma, Zhengyang Zhou, Qihe Huang, Binwu Wang, Kuo Yang, Huan Li, Yang Wang

Main category: cs.LG

TL;DR: QuiZSF integrates retrieval-augmented generation (RAG) with time series pre-trained models (TSPMs) to enhance zero-shot forecasting, achieving top performance in efficiency and accuracy.

DetailsMotivation: Traditional models struggle with zero-shot time-series forecasting (ZSF) in data-scarce scenarios, and existing TSPMs lack dynamic external knowledge integration.

Method: Proposes QuiZSF, a lightweight framework with ChronoRAG Base for scalable storage, Multi-grained Series Interaction Learner for feature extraction, and Model Cooperation Coherer for aligning retrieved knowledge with TSPMs.

Result: QuiZSF ranks Top1 in 75% (Non-LLM) and 87.5% (LLM) of prediction settings, maintaining high efficiency.

Conclusion: QuiZSF effectively combines RAG and TSPMs, improving ZSF performance and adaptability.

Abstract: Time series forecasting has become increasingly important to empower diverse applications with streaming data. Zero-shot time-series forecasting (ZSF), particularly valuable in data-scarce scenarios, such as domain transfer or forecasting under extreme conditions, is difficult for traditional models to deal with. While time series pre-trained models (TSPMs) have demonstrated strong performance in ZSF, they often lack mechanisms to dynamically incorporate external knowledge. Fortunately, emerging retrieval-augmented generation (RAG) offers a promising path for injecting such knowledge on demand, yet it is rarely integrated with TSPMs. To leverage the strengths of both worlds, we introduce RAG into TSPMs to enhance zero-shot time series forecasting. In this paper, we propose QuiZSF (Quick Zero-Shot Time Series Forecaster), a lightweight and modular framework that couples efficient retrieval with representation learning and model adaptation for ZSF. Specifically, we construct a hierarchical tree-structured ChronoRAG Base (CRB) for scalable time-series storage and domain-aware retrieval, introduce a Multi-grained Series Interaction Learner (MSIL) to extract fine- and coarse-grained relational features, and develop a dual-branch Model Cooperation Coherer (MCC) that aligns retrieved knowledge with two kinds of TSPMs: Non-LLM based and LLM based. Compared with contemporary baselines, QuiZSF ranks Top1 in 75% of prediction settings with Non-LLM based TSPMs as the base model and in 87.5% with LLM based TSPMs, while maintaining high efficiency in memory and inference time.

[667] mAIstro: an open-source multi-agentic system for automated end-to-end development of radiomics and deep learning models for medical imaging

Eleftherios Tzanis, Michail E. Klontzas

Main category: cs.LG

TL;DR: mAIstro is an open-source, autonomous multi-agent framework for developing and deploying medical AI models without coding, using LLMs for diverse healthcare tasks.

DetailsMotivation: To automate complex healthcare AI workflows by providing a no-code, modular, and extensible framework for end-to-end AI model development and deployment.

Method: mAIstro uses a natural language interface to orchestrate tasks like data analysis, feature extraction, segmentation, and classification, supporting both open- and closed-source LLMs.

Result: The system successfully executed tasks across 16 datasets, producing validated models and interpretable outputs.

Conclusion: mAIstro is the first agentic framework unifying healthcare AI workflows, offering reproducibility and extensibility for clinical and research applications.

Abstract: Agentic systems built on large language models (LLMs) offer promising capabilities for automating complex workflows in healthcare AI. We introduce mAIstro, an open-source, autonomous multi-agentic framework for end-to-end development and deployment of medical AI models. The system orchestrates exploratory data analysis, radiomic feature extraction, image segmentation, classification, and regression through a natural language interface, requiring no coding from the user. Built on a modular architecture, mAIstro supports both open- and closed-source LLMs, and was evaluated using a large and diverse set of prompts across 16 open-source datasets, covering a wide range of imaging modalities, anatomical regions, and data types. The agents successfully executed all tasks, producing interpretable outputs and validated models. This work presents the first agentic framework capable of unifying data analysis, AI model development, and inference across varied healthcare applications, offering a reproducible and extensible foundation for clinical and research AI integration. The code is available at: https://github.com/eltzanis/mAIstro

[668] Class Unbiasing for Generalization in Medical Diagnosis

Lishi Zuo, Man-Wai Mak, Lu Yi, Youzhi Tu

Main category: cs.LG

TL;DR: The paper addresses class-feature bias in medical diagnosis models, proposing a class-unbiased model (Cls-unbias) with a class-wise inequality loss and group distributionally robust optimization to improve generalization.

DetailsMotivation: To mitigate class-feature bias and class imbalance in medical diagnosis models, which can lead to biased performance and poor generalization.

Method: Proposes a class-wise inequality loss and a class-weighted training objective (group distributionally robust optimization) to balance contributions from all classes.

Result: Empirical results show the method effectively reduces class-feature bias and class imbalance, improving model generalization.

Conclusion: The proposed Cls-unbias model successfully addresses bias and imbalance, enhancing diagnostic model performance.

Abstract: Medical diagnosis might fail due to bias. In this work, we identified class-feature bias, which refers to models’ potential reliance on features that are strongly correlated with only a subset of classes, leading to biased performance and poor generalization on other classes. We aim to train a class-unbiased model (Cls-unbias) that mitigates both class imbalance and class-feature bias simultaneously. Specifically, we propose a class-wise inequality loss which promotes equal contributions of classification loss from positive-class and negative-class samples. We propose to optimize a class-wise group distributionally robust optimization objective-a class-weighted training objective that upweights underperforming classes-to enhance the effectiveness of the inequality loss under class imbalance. Through synthetic and real-world datasets, we empirically demonstrate that class-feature bias can negatively impact model performance. Our proposed method effectively mitigates both class-feature bias and class imbalance, thereby improving the model’s generalization ability.
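
A sketch of the class-weighted group-DRO step described above: keep one weight per class on the simplex, upweight classes whose average loss is high, and return the weighted objective. The exponentiated-gradient update and step size below are illustrative; the paper's class-wise inequality loss is a separate term not shown here.

```python
import torch

def class_dro_loss(per_sample_loss, labels, class_weights, eta=0.01):
    """Class-wise group-DRO sketch. `class_weights` is a probability vector
    over classes, updated in place to emphasize underperforming classes."""
    losses = []
    for c in range(class_weights.numel()):
        mask = labels == c
        losses.append(per_sample_loss[mask].mean() if mask.any()
                      else per_sample_loss.new_zeros(()))
    class_losses = torch.stack(losses)
    with torch.no_grad():                      # mirror-ascent step on the weights
        class_weights.mul_(torch.exp(eta * class_losses))
        class_weights.div_(class_weights.sum())
    return (class_weights * class_losses).sum()
```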

[669] AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

Lixuan He, Jie Feng, Yong Li

Main category: cs.LG

TL;DR: AMFT introduces a single-stage algorithm to balance SFT and RL using implicit rewards, achieving state-of-the-art performance and generalization.

DetailsMotivation: Addressing the suboptimal trade-offs and catastrophic forgetting in the traditional two-stage SFT-RL pipeline for LLM fine-tuning.

Method: AMFT uses a meta-gradient adaptive weight controller to dynamically balance SFT and RL rewards, optimizing long-term task performance.

Result: AMFT achieves state-of-the-art results on benchmarks and superior OOD generalization.

Conclusion: AMFT provides a principled, stable, and effective paradigm for LLM alignment, validated by ablation studies.

Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of \textbf{implicit rewards}, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce \textbf{Adaptive Meta Fine-Tuning (AMFT)}, a novel single-stage algorithm that learns the optimal balance between SFT’s implicit, path-level reward and RL’s explicit, outcome-based reward. The core of AMFT is a \textbf{meta-gradient adaptive weight controller} that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT’s stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced via https://github.com/hlxtsyj/AMFT.
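
The single-stage combination can be sketched as a learnable mixing weight between the two loss terms, with entropy regularization for stability. AMFT's actual meta-gradient controller, which updates the balance against long-term task performance rather than by ordinary backpropagation, is not reproduced here.

```python
import torch

def amft_style_loss(sft_loss, rl_loss, balance_logit, entropy, ent_coef=0.01):
    """Illustrative weighted combination of an imitation (SFT) term and an RL
    term. `balance_logit` is a learnable scalar; in AMFT its update comes
    from a meta-gradient on task performance, which this sketch omits."""
    lam = torch.sigmoid(balance_logit)          # SFT-vs-RL mixing weight in (0, 1)
    loss = lam * sft_loss + (1 - lam) * rl_loss
    return loss - ent_coef * entropy            # policy-entropy regularizer
```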

[670] Continual Multiple Instance Learning for Hematologic Disease Diagnosis

Zahra Ebrahimi, Raheleh Salehi, Nassir Navab, Carsten Marr, Ario Sadafi

Main category: cs.LG

TL;DR: Proposes a rehearsal-based continual learning method for Multiple Instance Learning (MIL), tailored for single-cell-based disease diagnosis, outperforming state-of-the-art methods.

DetailsMotivation: Addresses the challenge of updating machine learning models in dynamic environments like labs, where data streams require continual learning without catastrophic forgetting, especially for MIL in disease diagnosis.

Method: Uses a rehearsal-based approach, selecting instances based on attention scores and distance from bag/class means to preserve data diversity in exemplary sets.

Result: Outperforms existing continual learning methods in a class incremental scenario using real-world leukemia lab data.

Conclusion: Introduces the first continual learning method for MIL, enabling model adaptation to shifting data distributions in medical diagnostics.

Abstract: The dynamic environment of laboratories and clinics, with streams of data arriving on a daily basis, requires regular updates of trained machine learning models for consistent performance. Continual learning is supposed to help train models without catastrophic forgetting. However, state-of-the-art methods are ineffective for multiple instance learning (MIL), which is often used in single-cell-based hematologic disease diagnosis (e.g., leukemia detection). Here, we propose the first continual learning method tailored specifically to MIL. Our method is rehearsal-based over a selection of single instances from various bags. We use a combination of the instance attention score and distance from the bag mean and class mean vectors to carefully select which samples and instances to store in exemplary sets from previous tasks, preserving the diversity of the data. Using the real-world input of one month of data from a leukemia laboratory, we study the effectiveness of our approach in a class incremental scenario, comparing it to well-known continual learning methods. We show that our method considerably outperforms state-of-the-art methods, providing the first continual learning approach for MIL. This enables the adaptation of models to shifting data distributions over time, such as those caused by changes in disease occurrence or underlying genetic alterations.

[671] BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity

Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He, Ruixuan Li

Main category: cs.LG

TL;DR: BoRA improves LoRA by partitioning matrices into blocks and introducing diagonal matrices, enhancing rank with minimal extra parameters.

DetailsMotivation: To improve the rank of LoRA weights without significantly increasing trainable parameters.

Method: Partitions matrices into blocks, introduces block-wise diagonal matrices, and leverages block multiplication.

Result: BoRA increases rank by a factor of block count with only a small parameter overhead, outperforming LoRA in experiments.

Conclusion: BoRA is a scalable and efficient enhancement to LoRA, validated by extensive testing.

Abstract: Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). It approximates the update of a pretrained weight matrix $W\in\mathbb{R}^{m\times n}$ by the product of two low-rank matrices, $BA$, where $A \in\mathbb{R}^{r\times n}$ and $B\in\mathbb{R}^{m\times r}$ ($r\ll\min\{m,n\}$). Increasing the dimension $r$ can raise the rank of LoRA weights (i.e., $BA$), which typically improves fine-tuning performance but also significantly increases the number of trainable parameters. In this paper, we propose Block Diversified Low-Rank Adaptation (BoRA), which improves the rank of LoRA weights with a small number of additional parameters. Specifically, BoRA treats the product $BA$ as a block matrix multiplication, where $A$ and $B$ are partitioned into $b$ blocks along the columns and rows, respectively (i.e., $A=[A_1,\dots,A_b]$ and $B=[B_1,\dots,B_b]^\top$). Consequently, the product $BA$ becomes the concatenation of the block products $B_iA_j$ for $i,j\in[b]$. To enhance the diversity of different block products, BoRA introduces a unique diagonal matrix $\Sigma_{i,j} \in \mathbb{R}^{r\times r}$ for each block multiplication, resulting in $B_i \Sigma_{i,j} A_j$. By leveraging these block-wise diagonal matrices, BoRA increases the rank of LoRA weights by a factor of $b$ while only requiring $b^2r$ additional parameters. Extensive experiments across multiple datasets and models demonstrate the superiority of BoRA, and ablation studies further validate its scalability.
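
The construction is concrete enough to sketch directly from the abstract's notation: the update is a block matrix whose (i, j) block is B_i Σ_{i,j} A_j, with each Σ_{i,j} a learnable r-dimensional diagonal. The assertion at the end checks the b²r parameter count.

```python
import torch

def bora_delta_w(B_blocks, A_blocks, sigmas):
    """Assemble the BoRA weight update. B_blocks[i]: (m/b, r);
    A_blocks[j]: (r, n/b); sigmas[i][j]: length-r diagonal entries."""
    rows = [torch.cat([B_i @ torch.diag(sigmas[i][j]) @ A_j
                       for j, A_j in enumerate(A_blocks)], dim=1)
            for i, B_i in enumerate(B_blocks)]
    return torch.cat(rows, dim=0)

b, r = 2, 4
delta = bora_delta_w([torch.randn(8, r) for _ in range(b)],       # m = 16
                     [torch.randn(r, 16) for _ in range(b)],      # n = 32
                     [[torch.randn(r) for _ in range(b)] for _ in range(b)])
assert delta.shape == (16, 32)   # extra parameters: b^2 * r = 16 diagonal entries
```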

[672] Can Multitask Learning Enhance Model Explainability?

Hiba Najjar, Bushra Alshbib, Andreas Dengel

Main category: cs.LG

TL;DR: The paper proposes using multitask learning with satellite data modalities as additional targets to improve model interpretability and performance, without requiring extra inputs during deployment.

DetailsMotivation: To address the trade-off between model complexity and interpretability in multimodal learning networks for remote sensing.

Method: Leverages modalities as additional targets in multitask learning, using rich satellite data as inputs.

Result: Shows benefits like no need for extra data at deployment, comparable or better performance, and explainable errors via auxiliary tasks.

Conclusion: Demonstrates efficiency on segmentation, classification, and regression tasks, with code provided.

Abstract: Remote sensing provides satellite data in diverse types and formats. The usage of multimodal learning networks exploits this diversity to improve model performance, except that the complexity of such networks comes at the expense of their interpretability. In this study, we explore how modalities can be leveraged through multitask learning to intrinsically explain model behavior. In particular, instead of additional inputs, we use certain modalities as additional targets to be predicted along with the main task. The success of this approach relies on the rich information content of satellite data, which remains as input modalities. We show how this modeling context provides numerous benefits: (1) in case of data scarcity, the additional modalities do not need to be collected for model inference at deployment, (2) the model performance remains comparable to the multimodal baseline performance, and in some cases achieves better scores, (3) prediction errors in the main task can be explained via the model behavior in the auxiliary task(s). We demonstrate the efficiency of our approach on three datasets, including segmentation, classification, and regression tasks. Code available at git.opendfki.de/hiba.najjar/mtl_explainability/.

[673] Structure-Preserving Digital Twins via Conditional Neural Whitney Forms

Brooks Kinch, Benjamin Shaffer, Elizabeth Armstrong, Michael Meehan, John Hewson, Nathaniel Trask

Main category: cs.LG

TL;DR: A framework for real-time digital twins using reduced finite element models with conditional attention mechanisms, ensuring numerical stability and exact conservation laws.

DetailsMotivation: To enable real-time, accurate digital twins with sparse data and complex geometries, integrating learned models with conventional finite element methods.

Method: Uses conditional attention mechanisms within finite element exterior calculus (FEEC) to learn reduced bases and nonlinear conservation laws, supporting real-time calibration.

Result: Achieves accurate predictions on complex problems (e.g., turbulence, battery thermal runaway) with sparse data, offering a 3.1x10^8 speedup over LES simulations.

Conclusion: The framework provides a non-invasive, efficient solution for real-time digital twins, validated across diverse benchmarks.

Abstract: We present a framework for constructing real-time digital twins based on structure-preserving reduced finite element models conditioned on a latent variable Z. The approach uses conditional attention mechanisms to learn both a reduced finite element basis and a nonlinear conservation law within the framework of finite element exterior calculus (FEEC). This guarantees numerical well-posedness and exact preservation of conserved quantities, regardless of data sparsity or optimization error. The conditioning mechanism supports real-time calibration to parametric variables, allowing the construction of digital twins which support closed loop inference and calibration to sensor data. The framework interfaces with conventional finite element machinery in a non-invasive manner, allowing treatment of complex geometries and integration of learned models with conventional finite element techniques. Benchmarks include advection diffusion, shock hydrodynamics, electrostatics, and a complex battery thermal runaway problem. The method achieves accurate predictions on complex geometries with sparse data (25 LES simulations), including capturing the transition to turbulence and achieving real-time inference ~0.1s with a speedup of 3.1x10^8 relative to LES. An open-source implementation is available on GitHub.

[674] Discovery Learning accelerates battery design evaluation

Jiawei Zhang, Yifei Zhang, Baozhao Yi, Yao Ren, Qi Jiao, Hanyu Bai, Weiran Jiang, Ziyou Song

Main category: cs.LG

TL;DR: Discovery Learning (DL) integrates active, physics-guided, and zero-shot learning to predict battery lifetimes efficiently, reducing prototyping needs by leveraging historical data.

DetailsMotivation: Battery R&D is bottlenecked by high costs and time for prototyping and testing. Existing methods lack efficiency and require labeled data, limiting rapid feedback for design improvements.

Method: DL combines active learning, physics-guided learning, and zero-shot learning in a human-like reasoning loop, using historical battery designs to predict lifetimes without additional labeling.

Result: DL achieved 7.2% test error in predicting cycle life, saving 98% time and 95% energy compared to industrial practices, using public datasets of small cells.

Conclusion: DL accelerates battery development by reducing prototyping needs and leveraging historical insights, advancing data-driven modeling for scientific and engineering innovation.

Abstract: Fast and reliable validation of novel designs in complex physical systems such as batteries is critical to accelerating technological innovation. However, battery research and development remain bottlenecked by the prohibitively high time and energy costs required to evaluate numerous new design candidates, particularly in battery prototyping and life testing. Despite recent progress in data-driven battery lifetime prediction, existing methods require labeled data of target designs to improve accuracy and cannot make reliable predictions until after prototyping, thus falling far short of the efficiency needed to enable rapid feedback for battery design. Here, we introduce Discovery Learning (DL), a scientific machine-learning paradigm that integrates active learning, physics-guided learning, and zero-shot learning into a human-like reasoning loop, drawing inspiration from learning theories in educational psychology. DL can learn from historical battery designs and actively reduce the need for prototyping, thus enabling rapid lifetime evaluation for unobserved material-design combinations without requiring additional data labeling. To test DL, we present 123 industrial-grade large-format lithium-ion pouch cells, spanning eight material-design combinations and diverse cycling protocols. Trained solely on public datasets of small-capacity cylindrical cells, DL achieves 7.2% test error in predicting the average cycle life under unknown device variability. This results in savings of 98% in time and 95% in energy compared to industrial practices. This work highlights the potential of uncovering insights from historical designs to inform and accelerate the development of next-generation battery technologies. DL represents a key advance toward efficient data-driven modeling and helps realize the promise of machine learning for accelerating scientific discovery and engineering innovation.

[675] UniMove: A Unified Model for Multi-city Human Mobility Prediction

Chonghua Han, Yuan Yuan, Yukun Liu, Jingtao Ding, Jie Feng, Yong Li

Main category: cs.LG

TL;DR: UniMove is a unified model for multi-city human mobility prediction, addressing universal spatial representation and heterogeneous mobility patterns, improving accuracy by 10.2%.

DetailsMotivation: Human mobility prediction is challenging due to randomness, non-uniform time intervals, and city-specific heterogeneity. Existing models require separate training for each city.

Method: Proposes a dual-tower architecture (location and trajectory towers) and MoE Transformer blocks to handle diverse movement patterns and universal spatial encoding.

Result: UniMove improves mobility prediction accuracy by over 10.2% and enables joint training on multi-city data.

Conclusion: UniMove advances toward a foundational model for human mobility with a unified architecture.

Abstract: Human mobility prediction is vital for urban planning, transportation optimization, and personalized services. However, the inherent randomness, non-uniform time intervals, and complex patterns of human mobility, compounded by the heterogeneity introduced by varying city structures, infrastructure, and population densities, present significant challenges in modeling. Existing solutions often require training separate models for each city due to distinct spatial representations and geographic coverage. In this paper, we propose UniMove, a unified model for multi-city human mobility prediction, addressing two challenges: (1) constructing universal spatial representations for effective token sharing across cities, and (2) modeling heterogeneous mobility patterns from varying city characteristics. We propose a trajectory-location dual-tower architecture, with a location tower for universal spatial encoding and a trajectory tower for sequential mobility modeling. We also design MoE Transformer blocks to adaptively select experts to handle diverse movement patterns. Extensive experiments across multiple datasets from diverse cities demonstrate that UniMove truly embodies the essence of a unified model. By enabling joint training on multi-city data with mutual data enhancement, it significantly improves mobility prediction accuracy by over 10.2%. UniMove represents a key advancement toward realizing a true foundational model with a unified architecture for human mobility. We release the implementation at https://github.com/tsinghua-fib-lab/UniMove/.

[676] A Comparative Study of Feature Selection in Tsetlin Machines

Vojtech Halenka, Ole-Christoffer Granmo, Lei Jiao, Per-Arne Andersen

Main category: cs.LG

TL;DR: The paper evaluates feature selection (FS) techniques for Tsetlin machines (TMs), including classical and novel methods, showing TM-internal scorers perform competitively while leveraging interpretability.

DetailsMotivation: Feature selection is essential for model interpretability and accuracy, but TMs lack established tools for feature importance estimation.

Method: Adapts and evaluates FS techniques like filter, embedded methods, SHAP, LIME, and novel TM-specific scorers. Benchmarked on 12 datasets using ROAR and ROAD protocols.

Result: TM-internal scorers perform well, reveal feature interactions, and simpler TM-specific scorers match accuracy at lower computational cost.

Conclusion: Establishes a baseline for FS in TMs and encourages development of specialized interpretability techniques.

Abstract: Feature Selection (FS) is crucial for improving model interpretability, reducing complexity, and sometimes for enhancing accuracy. The recently introduced Tsetlin machine (TM) offers interpretable clause-based learning, but lacks established tools for estimating feature importance. In this paper, we adapt and evaluate a range of FS techniques for TMs, including classical filter and embedded methods as well as post-hoc explanation methods originally developed for neural networks (e.g., SHAP and LIME) and a novel family of embedded scorers derived from TM clause weights and Tsetlin automaton (TA) states. We benchmark all methods across 12 datasets, using evaluation protocols such as the Remove and Retrain (ROAR) strategy and Remove and Debias (ROAD) to assess causal impact. Our results show that TM-internal scorers not only perform competitively but also exploit the interpretability of clauses to reveal interacting feature patterns. Simpler TM-specific scorers achieve similar accuracy retention at a fraction of the computational cost. This study establishes the first comprehensive baseline for FS in TMs and paves the way for developing specialized TM-specific interpretability techniques.

[677] TLCCSP: A Scalable Framework for Enhancing Time Series Forecasting with Time-Lagged Cross-Correlations

Jianfei Wu, Wenmian Yang, Bingning Liu, Weijia Jia

Main category: cs.LG

TL;DR: The paper introduces TLCCSP, a framework for time series forecasting that leverages time-lagged cross-correlations using SSDTW and a contrastive learning encoder, improving accuracy and computational efficiency.

DetailsMotivation: Existing deep learning models often ignore time-lagged cross-correlations, which are vital for capturing complex temporal relationships in forecasting tasks.

Method: The proposed TLCCSP framework uses the SSDTW algorithm to capture lagged correlations and a contrastive learning encoder to approximate SSDTW distances efficiently.

Result: TLCCSP significantly reduces MSE across weather (16.01% with SSDTW, 17.88% with CLE), finance (9.95% with SSDTW, 6.13% with CLE), and real estate (21.29% with SSDTW, 8.62% with CLE) datasets. CLE also cuts SSDTW computational time by ~99%.

Conclusion: TLCCSP enhances forecasting accuracy and scalability by effectively integrating time-lagged cross-correlations and optimizing computational efficiency.

Abstract: Time series forecasting is critical across various domains, such as weather, finance and real estate forecasting, as accurate forecasts support informed decision-making and risk mitigation. While recent deep learning models have improved predictive capabilities, they often overlook time-lagged cross-correlations between related sequences, which are crucial for capturing complex temporal relationships. To address this, we propose the Time-Lagged Cross-Correlations-based Sequence Prediction framework (TLCCSP), which enhances forecasting accuracy by effectively integrating time-lagged cross-correlated sequences. TLCCSP employs the Sequence Shifted Dynamic Time Warping (SSDTW) algorithm to capture lagged correlations and a contrastive learning-based encoder to efficiently approximate SSDTW distances. Experimental results on weather, finance and real estate time series datasets demonstrate the effectiveness of our framework. On the weather dataset, SSDTW reduces mean squared error (MSE) by 16.01% compared with single-sequence methods, while the contrastive learning encoder (CLE) further decreases MSE by 17.88%. On the stock dataset, SSDTW achieves a 9.95% MSE reduction, and CLE reduces it by 6.13%. For the real estate dataset, SSDTW and CLE reduce MSE by 21.29% and 8.62%, respectively. Additionally, the contrastive learning approach decreases SSDTW computational time by approximately 99%, ensuring scalability and real-time applicability across multiple time series forecasting tasks.

[678] From Imitation to Optimization: A Comparative Study of Offline Learning for Autonomous Driving

Antonio Guillen-Perez

Main category: cs.LG

TL;DR: The paper compares Behavioral Cloning (BC) and Offline Reinforcement Learning (CQL) for autonomous driving, showing CQL’s superior robustness and success rates.

DetailsMotivation: Address the brittleness of BC policies in autonomous driving by exploring offline RL for robust, long-horizon performance.

Method: Develops sophisticated BC baselines, including a Transformer model, then applies CQL with a structured state representation and an engineered reward.

Result: CQL achieves 3.2x higher success and 7.4x lower collision rates than BC in large-scale evaluations.

Conclusion: Offline RL (CQL) is essential for robust driving policies from static expert data.

Abstract: Learning robust driving policies from large-scale, real-world datasets is a central challenge in autonomous driving, as online data collection is often unsafe and impractical. While Behavioral Cloning (BC) offers a straightforward approach to imitation learning, policies trained with BC are notoriously brittle and suffer from compounding errors in closed-loop execution. This work presents a comprehensive pipeline and a comparative study to address this limitation. We first develop a series of increasingly sophisticated BC baselines, culminating in a Transformer-based model that operates on a structured, entity-centric state representation. While this model achieves low imitation loss, we show that it still fails in long-horizon simulations. We then demonstrate that by applying a state-of-the-art Offline Reinforcement Learning algorithm, Conservative Q-Learning (CQL), to the same data and architecture, we can learn a significantly more robust policy. Using a carefully engineered reward function, the CQL agent learns a conservative value function that enables it to recover from minor errors and avoid out-of-distribution states. In a large-scale evaluation on 1,000 unseen scenarios from the Waymo Open Motion Dataset, our final CQL agent achieves a 3.2x higher success rate and a 7.4x lower collision rate than the strongest BC baseline, proving that an offline RL approach is critical for learning robust, long-horizon driving policies from static expert data.
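
For reference, the conservative term that distinguishes CQL from plain offline Q-learning is short; below is a discrete-action CQL(H)-style sketch (the paper's driving agent may use a different variant, e.g., for continuous actions).

```python
import torch

def cql_loss(q_net, states, actions, bellman_target, alpha=1.0):
    """Conservative Q-Learning sketch: TD error plus a penalty that pushes
    Q down on all actions (logsumexp) and up on dataset actions, keeping
    the learned policy close to the data support."""
    q_all = q_net(states)                                   # (batch, num_actions)
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    td = torch.nn.functional.mse_loss(q_data, bellman_target)
    return td + alpha * conservative
```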

[679] A Stage-Aware Mixture of Experts Framework for Neurodegenerative Disease Progression Modelling

Tiantian He, Keyue Jiang, An Zhao, Anna Schroder, Elinor Thompson, Sonja Soskic, Frederik Barkhof, Daniel C. Alexander

Main category: cs.LG

TL;DR: The paper proposes a stage-aware Mixture of Experts (MoE) framework to model neurodegenerative disease progression, addressing data scarcity and complex pathological mechanisms by dynamically integrating time-dependent expert weighting, inhomogeneous graph neural diffusion, and localized neural reaction modules.

DetailsMotivation: To overcome challenges in modeling neurodegenerative disease progression due to scarce longitudinal data and complex, stage-varying pathological mechanisms.

Method: A novel stage-aware MoE framework with time-dependent expert weighting, inhomogeneous graph neural diffusion (IGND), and localized neural reaction modules, optimized via iterative dual optimization.

Result: The IGND-MoE model provides stage-specific insights, showing graph-related processes dominate early stages while other mechanisms become influential later.

Conclusion: The framework offers a principled approach to understanding stage-specific pathological contributions in neurodegenerative diseases, aligning with clinical literature.

Abstract: The long-term progression of neurodegenerative diseases is commonly conceptualized as a spatiotemporal diffusion process that consists of a graph diffusion process across the structural brain connectome and a localized reaction process within brain regions. However, modeling this progression remains challenging due to 1) the scarcity of longitudinal data obtained through irregular and infrequent subject visits and 2) the complex interplay of pathological mechanisms across brain regions and disease stages, where traditional models assume fixed mechanisms throughout disease progression. To address these limitations, we propose a novel stage-aware Mixture of Experts (MoE) framework that explicitly models how different contributing mechanisms dominate at different disease stages through time-dependent expert weighting. Data-wise, we utilize an iterative dual optimization method to properly estimate the temporal position of individual observations, constructing a cohort-level progression trajectory from irregular snapshots. Model-wise, we enhance the spatial component with an inhomogeneous graph neural diffusion model (IGND) that allows diffusivity to vary based on node states and time, providing more flexible representations of brain networks. We also introduce a localized neural reaction module to capture complex dynamics beyond standard processes. The resulting IGND-MoE model dynamically integrates these components across temporal states, offering a principled way to understand how stage-specific pathological mechanisms contribute to progression. The stage-wise weights yield novel clinical insights that align with literature, suggesting that graph-related processes are more influential at early stages, while other unknown physical processes become dominant later on.
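
The time-dependent expert weighting can be pictured with the minimal sketch below: a small network maps the estimated disease-stage time to a softmax over experts, so different mechanisms can dominate at different stages. Module names and sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StageAwareGate(nn.Module):
    """Maps a (batch of) stage times t to softmax weights over experts."""
    def __init__(self, num_experts: int, hidden: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, num_experts)
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.mlp(t.unsqueeze(-1)), dim=-1)  # (B, E)

def moe_dynamics(x, t, experts, gate):
    """Mixture output: stage-weighted sum of per-expert dynamics dx/dt.
    experts: list of modules, each mapping node states (B, D) -> (B, D)."""
    w = gate(t)                                      # (B, E)
    outs = torch.stack([e(x) for e in experts], -1)  # (B, D, E)
    return (outs * w.unsqueeze(1)).sum(-1)           # (B, D)
```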

[680] Differentiable Adaptive Kalman Filtering via Optimal Transport

Yangguang He, Wenhao Li, Minzhe Li, Juan Zhang, Xiangfeng Wang, Bo Jin

Main category: cs.LG

TL;DR: OTAKNet is an online solution for noise-statistics drift in learning-based adaptive Kalman filtering, outperforming offline methods and classical approaches.

DetailsMotivation: Addressing degradation of learning-based filtering due to unobserved noise-statistics drift caused by environmental factors.

Method: Uses one-step predictive measurement likelihood and optimal transport for online adaptation without retraining or ground truth labels.

Result: Demonstrates superior performance on synthetic and real-world datasets, especially with limited training data.

Conclusion: OTAKNet effectively mitigates noise-statistics drift, enabling robust online adaptation in dynamic environments.

Abstract: Learning-based filtering has demonstrated strong performance in non-linear dynamical systems, particularly when the statistics of noise are unknown. However, in real-world deployments, environmental factors, such as changing wind conditions or electromagnetic interference, can induce unobserved noise-statistics drift, leading to substantial degradation of learning-based methods. To address this challenge, we propose OTAKNet, the first online solution to noise-statistics drift within learning-based adaptive Kalman filtering. Unlike existing learning-based methods that perform offline fine-tuning using batch pointwise matching over entire trajectories, OTAKNet establishes a connection between the state estimate and the drift via one-step predictive measurement likelihood, and addresses it using optimal transport. This leverages OT’s geometry-aware cost and stable gradients to enable fully online adaptation without ground truth labels or retraining. We compare OTAKNet against classical model-based adaptive Kalman filtering and offline learning-based filtering. The performance is demonstrated on both synthetic and real-world NCLT datasets, particularly under limited training data.
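
The optimal-transport ingredient can be pictured with a generic entropy-regularized Sinkhorn solver like the one below, applied for instance to histograms of predicted one-step measurement residuals versus recently observed innovations. The binning, cost, and the way the resulting cost would drive the noise-covariance update are all illustrative framing, not the paper's construction.

```python
import numpy as np

def sinkhorn_cost(mu, nu, C, eps=0.1, iters=200):
    """Entropy-regularized OT between histograms mu, nu (each summing
    to 1) with cost matrix C; returns the transport cost, usable as a
    drift signal."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)  # transport plan
    return float((P * C).sum())
```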

[681] Membership and Memorization in LLM Knowledge Distillation

Ziqi Zhang, Ali Shahin Shamsabadi, Hanxiao Lu, Yifeng Cai, Hamed Haddadi

Main category: cs.LG

TL;DR: The paper examines privacy risks in Knowledge Distillation (KD) for Large Language Models (LLMs), showing that all tested KD techniques transfer privacy risks from teacher to student models, with varying severity.

DetailsMotivation: To address the privacy risks inherited by student models in KD, given the widespread use of LLMs trained on private data.

Method: Systematic evaluation of six KD techniques across seven NLP tasks, three teacher model families (GPT-2, LLAMA-2, OPT), and various student sizes.

Result: All KD techniques transfer privacy risks, but severity varies. Key components (objective functions, training data, tasks) impact risks. Memorization and membership risks disagree. Privacy risk varies significantly across model blocks.

Conclusion: KD techniques inherently carry privacy risks, necessitating careful consideration of components and block-level risks to mitigate exposure.

Abstract: Recent advances in Knowledge Distillation (KD) aim to mitigate the high computational demands of Large Language Models (LLMs) by transferring knowledge from a large “teacher” to a smaller “student” model. However, students may inherit the teacher’s privacy risks when the teacher is trained on private data. In this work, we systematically characterize and investigate membership and memorization privacy risks inherent in six LLM KD techniques. Using instruction-tuning settings that span seven NLP tasks, together with three teacher model families (GPT-2, LLAMA-2, and OPT), and students of various sizes, we demonstrate that all existing LLM KD approaches carry membership and memorization privacy risks from the teacher to its students. However, the extent of privacy risks varies across different KD techniques. We systematically analyse how key LLM KD components (KD objective functions, student training data and NLP tasks) impact such privacy risks. We also demonstrate a significant disagreement between memorization and membership privacy risks of LLM KD techniques. Finally, we characterize per-block privacy risk and demonstrate that the privacy risk varies across different blocks by a large margin.
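
As a deliberately simple picture of how membership risk on a student model can be quantified, the baseline below thresholds per-sample loss; the paper's evaluation across six KD techniques is of course far more thorough, and this sketch only shows the shape of such an audit.

```python
import numpy as np

def loss_threshold_mia(losses_members, losses_nonmembers):
    """Simplest membership-inference baseline: predict 'member' when the
    student's per-sample loss falls below a threshold; report the
    resulting true/false positive rates."""
    thr = float(np.median(np.concatenate([losses_members, losses_nonmembers])))
    tpr = float((losses_members < thr).mean())      # members correctly flagged
    fpr = float((losses_nonmembers < thr).mean())   # non-members wrongly flagged
    return thr, tpr, fpr
```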

[682] Surgical Knowledge Rewrite in Compact LLMs: An ‘Unlearn-then-Learn’ Strategy with ($IA^3$) for Localized Factual Modulation and Catastrophic Forgetting Mitigation

Stanley Ngugi

Main category: cs.LG

TL;DR: The paper introduces an ‘unlearn-then-learn’ strategy using PEFT (IA³) for precise knowledge editing in LLMs, addressing resistance to updates and catastrophic forgetting. It achieves high accuracy for new facts while minimizing forgetting.

DetailsMotivation: LLMs struggle with dynamic knowledge updates, especially when new information conflicts with embedded facts, leading to resistance and catastrophic forgetting.

Method: A two-stage ‘unlearn-then-learn’ approach using IA³, preceded by circuit localization to target conflicting fact components.

Result: Achieves 98.50% accuracy for new facts, 96.00% forget rate for conflicting facts, and 72.00% F_control accuracy, mitigating catastrophic forgetting.

Conclusion: The strategy advances precise, localized, and safe knowledge management in LLMs, with ‘soft forgetting’ enhancing model safety.

Abstract: Large Language Models (LLMs) struggle with dynamic knowledge updates, especially when new information conflicts with deeply embedded facts. Such conflicting factual edits often lead to two critical issues: resistance to adopting the new fact and severe catastrophic forgetting of unrelated knowledge. This paper introduces and evaluates a novel “unlearn-then-learn” strategy for precise knowledge editing in LLMs, leveraging the parameter-efficient fine-tuning (PEFT) technique, Infused Adapter by Inhibiting and Amplifying Inner Activations ($IA^3$). Crucially, this two-stage approach is powered by an initial circuit localization phase that identifies and targets the specific internal components responsible for encoding the conflicting fact. Through a rigorous experimental methodology on microsoft/Phi-3-mini-4k-instruct, we demonstrate that this mechanistically informed two-stage approach achieves near-perfect accuracy (98.50%) for the new, modulated fact while simultaneously effectively suppressing the original conflicting fact (96.00% forget rate). Critically, our strategy exhibits unprecedented localization (72.00% F_control accuracy), dramatically mitigating catastrophic forgetting observed in direct fine-tuning approaches (which showed as low as ~20% F_control accuracy), a direct benefit of our targeted interpretability-guided intervention. Furthermore, qualitative analysis reveals a nuanced mechanism of “soft forgetting,” where original knowledge is suppressed from default retrieval but remains latent and conditionally accessible, enhancing model safety and control. These findings represent a significant advancement towards precise, localized, and safe knowledge management in compact LLMs.
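
IA³ itself is small enough to sketch: a frozen layer gains a learned per-feature multiplicative vector (initialized at one), and only these vectors are trained in each of the two stages. The wrapper below is a minimal rendition; where the scales are injected (keys, values, FFN activations) and how circuit localization picks the target modules follow the paper, not this sketch.

```python
import torch
import torch.nn as nn

class IA3Linear(nn.Module):
    """Wraps a frozen linear layer with a learned per-feature scale, the
    core of IA^3. Stage 1 ('unlearn') would train one set of scales to
    suppress the conflicting fact; stage 2 ('learn') trains toward the
    new fact."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # backbone stays frozen
        self.scale = nn.Parameter(torch.ones(base.out_features))

    def forward(self, x):
        return self.base(x) * self.scale          # element-wise rescaling
```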

[683] Improving Real-Time Concept Drift Detection using a Hybrid Transformer-Autoencoder Framework

N Harshit, K Mounvik

Main category: cs.LG

TL;DR: A hybrid framework using Transformers and Autoencoders for early and sensitive online concept drift detection, outperforming baseline methods.

DetailsMotivation: Concept drift in machine learning reduces model performance, and existing detection methods are reactive and insensitive to early changes.

Method: Proposes a hybrid framework combining Transformers and Autoencoders, with a Trust Score methodology integrating statistical, reconstruction, prediction uncertainty, and rule violation metrics.

Result: The framework detects drift earlier and more sensitively than baseline methods, validated on a time-sequenced airline passenger dataset with synthetic drift.

Conclusion: A robust framework for reliable real-time concept drift monitoring was developed.

Abstract: In applied machine learning, concept drift, whether a gradual or abrupt change in the data distribution, can significantly reduce model performance. Typical detection methods, such as statistical tests or reconstruction-based models, are generally reactive and not very sensitive to early changes. Our study proposes a hybrid framework consisting of Transformers and Autoencoders to model complex temporal dynamics and provide online drift detection. We create a distinct Trust Score methodology that combines signals from (1) statistical and reconstruction-based drift metrics, specifically PSI, JSD, and Transformer-AE reconstruction error, (2) prediction uncertainty, (3) rule violations, and (4) the trend of classifier error. We evaluated performance on a time-sequenced airline passenger dataset with gradually injected synthetic drift, e.g., permuted ticket prices in later batches, broken into 10 time segments [1]. Compared with baseline methods, the proposed framework detected drift earlier and with greater sensitivity across a range of detection thresholds, while remaining interpretable, and the Transformer-Autoencoder outperformed the autoencoders commonly used in the literature on error rates and logical violations. The result is a robust pipeline for reliable real-time concept drift monitoring in applied machine learning.
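
A toy version of the combined signal might look like the sketch below: standard PSI plus a weighted blend of the metrics the abstract lists. The weights, normalization, and the JSD/uncertainty/violation inputs are placeholders, not the paper's calibration.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a new sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges)
    a, _ = np.histogram(actual, bins=edges)
    e = np.clip(e / e.sum(), 1e-6, None)
    a = np.clip(a / a.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

def trust_score(psi_val, jsd_val, recon_err, uncertainty, violations,
                weights=(0.25, 0.25, 0.25, 0.15, 0.10)):
    """Illustrative weighted blend of the drift signals named in the
    abstract; inputs are assumed pre-normalized to [0, 1]."""
    signals = np.array([psi_val, jsd_val, recon_err, uncertainty, violations])
    return float(np.dot(weights, signals))
```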

[684] Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria

Yang Cao, Yubin Chen, Zhao Song, Jiahao Zhang

Main category: cs.LG

TL;DR: The paper introduces Second-Order MeanFlow, extending MeanFlow by incorporating average acceleration fields, proving feasibility, expressivity, and efficient implementation criteria.

DetailsMotivation: To enhance generative modeling by extending MeanFlow with higher-order dynamics while maintaining practical sampling efficiency.

Method: Theoretical study of Second-Order MeanFlow, proving consistency, analyzing expressivity via circuit complexity, and deriving efficient implementation criteria.

Result: Second-Order MeanFlow supports stable one-step sampling, is implementable in TC0, and allows efficient attention approximations.

Conclusion: The work provides a theoretical foundation for high-order flow matching models with rich dynamics and practical efficiency.

Abstract: Generative modelling has seen significant advances through simulation-free paradigms such as Flow Matching, and in particular, the MeanFlow framework, which replaces instantaneous velocity fields with average velocities to enable efficient single-step sampling. In this work, we introduce a theoretical study on Second-Order MeanFlow, a novel extension that incorporates average acceleration fields into the MeanFlow objective. We first establish the feasibility of our approach by proving that the average acceleration satisfies a generalized consistency condition analogous to first-order MeanFlow, thereby supporting stable, one-step sampling and tractable loss functions. We then characterize its expressivity via circuit complexity analysis, showing that under mild assumptions, the Second-Order MeanFlow sampling process can be implemented by uniform threshold circuits within the $\mathsf{TC}^0$ class. Finally, we derive provably efficient criteria for scalable implementation by leveraging fast approximate attention computations: we prove that attention operations within the Second-Order MeanFlow architecture can be approximated to within $1/\mathrm{poly}(n)$ error in time $n^{2+o(1)}$. Together, these results lay the theoretical foundation for high-order flow matching models that combine rich dynamics with practical sampling efficiency.
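
For orientation, the first-order MeanFlow quantities and one plausible reading of the second-order extension (our notation, inferred from the abstract rather than taken from the paper) are:

```latex
% First-order MeanFlow: average velocity over [r, t] and its identity
u(z_t, r, t) = \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau,
\qquad
u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{d}{dt}\,u(z_t, r, t).

% A natural second-order analogue: the average acceleration over [r, t]
a(z_t, r, t) = \frac{1}{t-r}\int_r^t \frac{d}{d\tau}\,v(z_\tau, \tau)\,d\tau
             = \frac{v(z_t, t) - v(z_r, r)}{t - r}.
```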

[685] BrainATCL: Adaptive Temporal Brain Connectivity Learning for Functional Link Prediction and Age Estimation

Yiran Huang, Amirhossein Nouranizadeh, Christine Ahrends, Mengjia Xu

Main category: cs.LG

TL;DR: BrainATCL is an unsupervised framework for adaptive temporal brain connectivity learning in fMRI data, outperforming conventional GNNs in tasks like functional link prediction and age estimation.

DetailsMotivation: To address the limitations of conventional GNNs in capturing long-range temporal dependencies in dynamic fMRI data, which is crucial for understanding brain connectivity dynamics.

Method: Proposes BrainATCL, a nonparametric framework with adaptive lookback windows and a GINE-Mamba2 backbone for spatial-temporal representation learning, incorporating brain structure and function-informed edge attributes.

Result: Demonstrates superior performance in functional link prediction and age estimation, with strong generalization, including cross-session scenarios.

Conclusion: BrainATCL effectively models dynamic functional connectivity, offering insights into transient neural states and potential applications in neuropsychiatric research.

Abstract: Functional Magnetic Resonance Imaging (fMRI) is an imaging technique widely used to study human brain activity. fMRI signals in areas across the brain transiently synchronise and desynchronise their activity in a highly structured manner, even when an individual is at rest. These functional connectivity dynamics may be related to behaviour and neuropsychiatric disease. To model these dynamics, temporal brain connectivity representations are essential, as they reflect evolving interactions between brain regions and provide insight into transient neural states and network reconfigurations. However, conventional graph neural networks (GNNs) often struggle to capture long-range temporal dependencies in dynamic fMRI data. To address this challenge, we propose BrainATCL, an unsupervised, nonparametric framework for adaptive temporal brain connectivity learning, enabling functional link prediction and age estimation. Our method dynamically adjusts the lookback window for each snapshot based on the rate of newly added edges. Graph sequences are subsequently encoded using a GINE-Mamba2 backbone to learn spatial-temporal representations of dynamic functional connectivity in resting-state fMRI data of 1,000 participants from the Human Connectome Project. To further improve spatial modeling, we incorporate brain structure and function-informed edge attributes, i.e., the left/right hemispheric identity and subnetwork membership of brain regions, enabling the model to capture biologically meaningful topological patterns. We evaluate our BrainATCL on two tasks: functional link prediction and age estimation. The experimental results demonstrate superior performance and strong generalization, including in cross-session prediction scenarios.
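
The adaptive lookback mechanism can be sketched as below. The direction of the rule (more newly added edges shrinks the window) and all constants are one plausible instantiation, since the abstract only states that the window depends on the rate of new edges.

```python
def adaptive_lookback(new_edge_counts, max_window=10, kappa=0.5):
    """Pick a lookback window for the current fMRI snapshot from the
    rate of newly added functional edges: here, a bursty snapshot (many
    new edges relative to the recent average) gets a shorter window."""
    recent_avg = max(1.0, sum(new_edge_counts[:-1]) / max(1, len(new_edge_counts) - 1))
    rate = new_edge_counts[-1] / recent_avg
    return max(1, round(max_window / (1.0 + kappa * rate)))
```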

[686] Approaching Maximal Information Extraction in Low-Signal Regimes via Multiple Instance Learning

Atakan Azakli, Bernd Stelzer

Main category: cs.LG

TL;DR: A new ML methodology improves prediction precision and discriminative power in challenging hypothesis-testing problems, with theoretical and empirical validation using MIL and SMEFT at the LHC.

DetailsMotivation: To enhance ML model precision and discriminative power in hypothesis testing, especially where state-of-the-art classifiers struggle.

Method: Proposes Multiple Instance Learning (MIL) for better predictive power, supported by theoretical analysis and scaling behavior. Applied to SMEFT at the LHC.

Result: Demonstrates MIL’s superiority over single-instance methods and potential to extract maximum Fisher Information from datasets.

Conclusion: MIL offers a systematic way to reduce ML prediction errors and improve accuracy in complex scenarios like SMEFT analysis.

Abstract: In this work, we propose a new machine learning (ML) methodology to obtain more precise predictions for some parameters of interest in a given hypothesis testing problem. Our proposed method also allows ML models to have more discriminative power in cases where it is extremely challenging for state-of-the-art classifiers to make accurate predictions at any level. This method also allows us to systematically decrease the error in ML model predictions. In this paper, we provide a mathematical motivation for why Multiple Instance Learning (MIL) has more predictive power than its single-instance counterpart. We support our theoretical claims by analyzing the behavior of MIL models through their scaling with respect to the number of instances on which the model makes predictions. As a concrete application, we constrain Wilson coefficients of the Standard Model Effective Field Theory (SMEFT) using kinematic information from subatomic particle collision events at the Large Hadron Collider (LHC). We show that under certain circumstances, it might be possible to extract the theoretical maximum Fisher Information latent in a dataset.
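
A minimal MIL classifier, to make the "predict on bags of instances" setup concrete; the architecture and pooling choice are illustrative, and the physics analysis in the paper is independent of these details.

```python
import torch
import torch.nn as nn

class MeanPoolMIL(nn.Module):
    """Minimal multiple-instance classifier: embed each instance (e.g.,
    one collision event), mean-pool over the bag, then classify the bag.
    Predicting on bags of N events is what gives MIL its scaling
    behaviour with N."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (N, in_dim) -> one logit for the whole bag
        return self.head(self.embed(bag).mean(dim=0))
```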

[687] From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context

Peyman Baghershahi, Gregoire Fournier, Pranav Nyati, Sourav Medya

Main category: cs.LG

TL;DR: LOGIC is a lightweight framework using LLMs to generate interpretable explanations for GNN predictions, improving fidelity and human-centric metrics.

DetailsMotivation: Existing GNN explanation methods struggle with interpretable, fine-grained rationales, especially for text-attributed graphs.

Method: LOGIC projects GNN embeddings into LLM space, using hybrid prompts to generate natural language explanations and subgraphs.

Result: Experiments show LOGIC balances fidelity and sparsity, enhancing insightfulness in human evaluations.

Conclusion: LOGIC advances LLM-based explainability in graph learning by aligning GNN internals with human reasoning.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning over structured data, including text-attributed graphs, which are common in domains such as citation networks, social platforms, and knowledge graphs. GNNs are not inherently interpretable and thus, many explanation methods have been proposed. However, existing explanation methods often struggle to generate interpretable, fine-grained rationales, especially when node attributes include rich natural language. In this work, we introduce LOGIC, a lightweight, post-hoc framework that uses large language models (LLMs) to generate faithful and interpretable explanations for GNN predictions. LOGIC projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave soft prompts with textual inputs from the graph structure. This enables the LLM to reason about GNN internal representations and produce natural language explanations along with concise explanation subgraphs. Our experiments across four real-world TAG datasets demonstrate that LOGIC achieves a favorable trade-off between fidelity and sparsity, while significantly improving human-centric metrics such as insightfulness. LOGIC sets a new direction for LLM-based explainability in graph learning by aligning GNN internals with human reasoning.
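
The embedding-space projection at the heart of such a pipeline can be as small as the module below, which turns one GNN node embedding into a handful of soft tokens for the LLM; dimensions and token count are placeholders.

```python
import torch
import torch.nn as nn

class NodeToSoftPrompt(nn.Module):
    """Projects a GNN node embedding into k 'soft tokens' in the LLM's
    embedding space; these are interleaved with the tokenised textual
    prompt built from the graph structure."""
    def __init__(self, gnn_dim: int, llm_dim: int, k_tokens: int = 4):
        super().__init__()
        self.k = k_tokens
        self.proj = nn.Linear(gnn_dim, llm_dim * k_tokens)

    def forward(self, h_node: torch.Tensor) -> torch.Tensor:
        # h_node: (gnn_dim,) -> (k_tokens, llm_dim)
        return self.proj(h_node).view(self.k, -1)

# Usage sketch: prepend the soft tokens to the text-token embeddings, e.g.
# inputs = torch.cat([soft_tokens, llm_token_embeddings], dim=0)
```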

[688] Multi-Level Service Performance Forecasting via Spatiotemporal Graph Neural Networks

Zhihao Xue, Yun Zi, Nia Qi, Ming Gong, Yujun Zou

Main category: cs.LG

TL;DR: A spatiotemporal graph neural network predicts performance in distributed backend systems by modeling service states and relationships, outperforming existing methods.

DetailsMotivation: To forecast performance fluctuations in complex distributed systems with multi-level service call structures.

Method: Abstracts system states into graphs, integrates runtime features and service relationships, and uses graph convolutional and gated recurrent networks for spatiotemporal modeling.

Result: Outperforms existing methods in MAE, RMSE, and R2 metrics, showing robustness under varying conditions.

Conclusion: The model is effective for backend service performance management, demonstrating practical potential.

Abstract: This paper proposes a spatiotemporal graph neural network-based performance prediction algorithm to address the challenge of forecasting performance fluctuations in distributed backend systems with multi-level service call structures. The method abstracts system states at different time slices into a sequence of graph structures. It integrates the runtime features of service nodes with the invocation relationships among services to construct a unified spatiotemporal modeling framework. The model first applies a graph convolutional network to extract high-order dependency information from the service topology. Then it uses a gated recurrent network to capture the dynamic evolution of performance metrics over time. A time encoding mechanism is also introduced to enhance the model’s ability to represent non-stationary temporal sequences. The architecture is trained in an end-to-end manner, optimizing the multi-layer nested structure to achieve high-precision regression of future service performance metrics. To validate the effectiveness of the proposed method, a large-scale public cluster dataset is used. A series of multi-dimensional experiments are designed, including variations in time windows and concurrent load levels. These experiments comprehensively evaluate the model’s predictive performance and stability. The experimental results show that the proposed model outperforms existing representative methods across key metrics such as MAE, RMSE, and R2. It maintains strong robustness under varying load intensities and structural complexities. These results demonstrate the model’s practical potential for backend service performance management tasks.
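
The GCN-then-gated-recurrence pattern the abstract describes reduces to something like this sketch (a single graph-conv layer and last-step readout; the paper adds a time encoding and a deeper nested structure):

```python
import torch
import torch.nn as nn

class STForecaster(nn.Module):
    """Graph convolution over the service-call graph per time slice,
    then a GRU over each node's feature sequence."""
    def __init__(self, in_dim, hidden, horizon=1):
        super().__init__()
        self.w_gcn = nn.Linear(in_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, horizon)

    def gcn(self, x, a_norm):
        # x: (N, in_dim); a_norm: (N, N) normalized adjacency
        return torch.relu(a_norm @ self.w_gcn(x))

    def forward(self, x_seq, a_norm):
        # x_seq: (T, N, in_dim) -> per-node forecasts (N, horizon)
        h_seq = torch.stack([self.gcn(x, a_norm) for x in x_seq])  # (T, N, H)
        out, _ = self.gru(h_seq.permute(1, 0, 2))                  # (N, T, H)
        return self.out(out[:, -1])
```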

[689] Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Zhengran Ji, Boyuan Chen

Main category: cs.LG

TL;DR: Pref-GUIDE transforms noisy real-time scalar feedback into preference-based data to improve reward model learning in online reinforcement learning, outperforming baselines and even expert-designed rewards.

DetailsMotivation: Human feedback is essential for tasks with hard-to-specify objectives, but scalar feedback is noisy and limits reward model accuracy.

Method: Pref-GUIDE converts scalar feedback into preference-based data, using temporal comparisons (Individual) and population consensus (Voting).

Result: Pref-GUIDE outperforms scalar-feedback baselines and matches/exceeds expert-designed rewards in three environments.

Conclusion: Pref-GUIDE provides a scalable, principled method for leveraging human input in online reinforcement learning.

Abstract: Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.
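
The core scalar-to-preference conversion (the "Individual" variant) can be pictured as follows; the window size, the ambiguity margin, and the use of mean scalar feedback per segment are illustrative choices.

```python
def scalar_to_preferences(segments, window=5, margin=0.5):
    """Turn (behaviour_segment, mean_scalar_feedback) pairs into
    preference tuples, comparing only segments within a short temporal
    window and dropping ambiguous pairs (|difference| < margin)."""
    prefs = []
    for i in range(len(segments)):
        for j in range(i + 1, min(i + 1 + window, len(segments))):
            (seg_i, f_i), (seg_j, f_j) = segments[i], segments[j]
            if abs(f_i - f_j) < margin:
                continue  # ambiguous feedback: filter out
            winner, loser = (seg_i, seg_j) if f_i > f_j else (seg_j, seg_i)
            prefs.append((winner, loser))
    return prefs  # feed to a Bradley-Terry-style reward-model loss
```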

[690] How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?

Niranjana Arun Menon, Iqra Farooq, Yulong Li, Sara Ahmed, Yutong Xie, Muhammad Awais, Imran Razzak

Main category: cs.LG

TL;DR: The paper explores using fine-tuned LLMs to predict CVD risk and SNPs from genomic data, leveraging CoT reasoning for early detection and personalized medicine.

DetailsMotivation: CVD prediction is challenging due to multifactorial causes and noisy genomic data. LLMs offer potential to extract meaningful insights.

Method: Fine-tuned LLMs analyze genetic markers and patterns, using CoT reasoning to predict disease labels and clinical deductions.

Result: LLMs show promise in early CVD detection, risk assessment, and advancing personalized cardiac care.

Conclusion: LLMs can enhance CVD prediction and personalized medicine by learning latent biological relationships from genomic data.

Abstract: Cardiovascular disease (CVD) prediction remains a tremendous challenge due to its multifactorial etiology and global burden of morbidity and mortality. Despite the growing availability of genomic and electrophysiological data, extracting biologically meaningful insights from such high-dimensional, noisy, and sparsely annotated datasets remains a non-trivial task. Recently, LLMs have been applied effectively to predict structural variations in biological sequences. In this work, we explore the potential of fine-tuned LLMs to predict cardiac diseases and SNPs potentially leading to CVD risk using genetic markers derived from high-throughput genomic profiling. We investigate the effect of genetic patterns associated with cardiac conditions and evaluate how LLMs can learn latent biological relationships from structured and semi-structured genomic data obtained by mapping genetic aspects that are inherited from the family tree. By framing the problem as a Chain of Thought (CoT) reasoning task, the models are prompted to generate disease labels and articulate informed clinical deductions across diverse patient profiles and phenotypes. The findings highlight the promise of LLMs in contributing to early detection, risk assessment, and ultimately, the advancement of personalized medicine in cardiac care.
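
A CoT-style prompt for this task might be assembled as below; the template wording is entirely illustrative, since the abstract does not give the paper's prompt.

```python
def build_cot_prompt(snps, phenotype_summary):
    """Illustrative CoT prompt for LLM-based risk prediction.
    snps: iterable of (rsid, genotype) pairs; phenotype_summary: str."""
    snp_str = ", ".join(f"{rsid}:{genotype}" for rsid, genotype in snps)
    return (
        "Patient genetic markers: " + snp_str + "\n"
        "ECG/phenotype summary: " + phenotype_summary + "\n"
        "Reason step by step about inherited cardiovascular risk, "
        "then output a final label in {low, moderate, high}."
    )
```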

[691] A Globally Optimal Analytic Solution for Semi-Nonnegative Matrix Factorization with Nonnegative or Mixed Inputs

Lu Chenggang

Main category: cs.LG

TL;DR: A novel method for globally optimal Semi-NMF is proposed, outperforming existing methods in reconstruction accuracy.

DetailsMotivation: Existing semi-NMF algorithms are iterative, non-convex, and prone to local minima, necessitating a globally optimal solution.

Method: The method uses orthogonal decomposition from the scatter matrix to achieve a globally optimal solution under the Frobenius norm.

Result: The solution attains the global minimum of reconstruction error and outperforms NMF/semi-NMF in accuracy, especially in low-rank cases.

Conclusion: The proposed method provides theoretical guarantees and empirical advantages, advancing matrix factorization in optimization and data analysis.

Abstract: Semi-Nonnegative Matrix Factorization (semi-NMF) extends classical Nonnegative Matrix Factorization (NMF) by allowing the basis matrix to contain both positive and negative entries, making it suitable for decomposing data with mixed signs. However, most existing semi-NMF algorithms are iterative, non-convex, and prone to local minima. In this paper, we propose a novel method that yields a globally optimal solution to the semi-NMF problem under the Frobenius norm, through an orthogonal decomposition derived from the scatter matrix of the input data. We rigorously prove that our solution attains the global minimum of the reconstruction error. Furthermore, we demonstrate that when the input matrix is nonnegative, our method often achieves lower reconstruction error than standard NMF algorithms, although unfortunately the basis matrix may not satisfy nonnegativity. In particular, in low-rank cases such as rank 1 or 2, our solution reduces exactly to a nonnegative factorization, recovering the NMF structure. We validate our approach through experiments on both synthetic data and the UCI Wine dataset, showing that our method consistently outperforms existing NMF and semi-NMF methods in terms of reconstruction accuracy. These results confirm that our globally optimal, non-iterative formulation offers both theoretical guarantees and empirical advantages, providing a new perspective on matrix factorization in optimization and data analysis.
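
A sketch of the scatter-matrix route: take the top-k eigenvectors of the scatter matrix as the mixed-sign basis and fit nonnegative coefficients per sample. The paper's closed-form optimality argument is more involved; this only shows the general shape of such a construction.

```python
import numpy as np
from scipy.optimize import nnls

def seminmf_scatter(X: np.ndarray, k: int):
    """Semi-NMF sketch for X (features x samples) ~ F @ G.T with G >= 0:
    basis F from the leading eigenvectors of the scatter matrix,
    coefficients G by nonnegative least squares per sample."""
    S = X @ X.T                                        # scatter matrix (d, d)
    eigvals, eigvecs = np.linalg.eigh(S)
    F = eigvecs[:, np.argsort(eigvals)[::-1][:k]]      # (d, k), mixed-sign
    G = np.stack([nnls(F, X[:, j])[0] for j in range(X.shape[1])])  # (n, k)
    return F, G
```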

[692] A Stable and Principled Loss Function for Direct Language Model Alignment

Yuandong Tan

Main category: cs.LG

TL;DR: The paper critiques Direct Preference Optimization (DPO) for misalignment in its loss function and proposes a novel, theoretically grounded loss function for more stable and effective alignment of LLMs with human preferences.

DetailsMotivation: DPO's loss function promotes indefinite maximization of logits differences, causing instability and reward hacking. The paper aims to address this by deriving a loss function from RLHF optimality conditions.

Method: The authors propose a new loss function targeting a finite logits difference, avoiding DPO’s pitfalls. Theoretical analysis and gradient comparisons are provided.

Result: Fine-tuning a Qwen2.5-7B model with the proposed method shows significant win-rate improvements over DPO and competitive performance against larger models like Llama-3.1-8B.

Conclusion: The proposed loss function offers a more stable and effective alternative to DPO for aligning LLMs with human preferences.

Abstract: The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to demonstrate that our method avoids the large gradients that plague DPO when the probability of dispreferred responses approaches zero. This inherent stability prevents reward hacking and leads to more effective alignment. We validate our approach by fine-tuning a Qwen2.5-7B model, showing significant win-rate improvements over a standard DPO baseline and achieving competitive performance against larger models like Llama-3.1-8B.
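
The contrast with DPO can be shown in two short loss functions. Here `logratio_diff` is the usual DPO quantity, (log π − log π_ref) on the preferred response minus the same on the dispreferred one; the squared-error form of the second loss is an illustrative stand-in for "target a finite value", not necessarily the paper's exact functional form.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logratio_diff, beta=0.1):
    """Standard DPO: -log sigmoid(beta * delta). The gradient keeps
    pushing delta upward without bound."""
    return -F.logsigmoid(beta * logratio_diff).mean()

def finite_target_loss(logratio_diff, target_delta, beta=0.1):
    """Sketch of the paper's idea: regress the log-ratio difference onto
    a finite target dictated by the underlying reward, instead of
    maximizing it indefinitely."""
    return ((beta * logratio_diff - target_delta) ** 2).mean()
```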

[693] Strategic Incentivization for Locally Differentially Private Federated Learning

Yashwant Krishna Pagoti, Arunesh Sinha, Shamik Sural

Main category: cs.LG

TL;DR: The paper models the privacy-accuracy trade-off in Federated Learning (FL) as a game, introducing a token-based incentivization mechanism to balance noise addition for privacy and model accuracy.

DetailsMotivation: To address the degradation in global model accuracy caused by Local Differential Privacy (LDP) noise in FL, while preserving client privacy.

Method: A game-theoretic approach where the server incentivizes clients to add less noise via tokens, and clients balance privacy and accuracy. Token credits depend on gradient perturbation.

Result: Strategic analysis and experiments show the impact of parameters on the privacy-accuracy trade-off.

Conclusion: The token-based mechanism effectively balances privacy and accuracy in FL, with strategic incentives encouraging optimal noise levels.

Abstract: In Federated Learning (FL), multiple clients jointly train a machine learning model by sharing gradient information, instead of raw data, with a server over multiple rounds. To address the possibility of information leakage in spite of sharing only the gradients, Local Differential Privacy (LDP) is often used. In LDP, clients add a selective amount of noise to the gradients before sending the same to the server. Although such noise addition protects the privacy of clients, it leads to a degradation in global model accuracy. In this paper, we model this privacy-accuracy trade-off as a game, where the server incentivizes the clients to add a lower degree of noise for achieving higher accuracy, while the clients attempt to preserve their privacy at the cost of a potential loss in accuracy. A token based incentivization mechanism is introduced in which the quantum of tokens credited to a client in an FL round is a function of the degree of perturbation of its gradients. The client can later access a newly updated global model only after acquiring enough tokens, which are to be deducted from its balance. We identify the players, their actions and payoff, and perform a strategic analysis of the game. Extensive experiments were carried out to study the impact of different parameters.
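
Mechanically, the incentive layer could be as simple as the sketch below (credit decreasing in the client's noise scale, model access gated on token balance); the actual payoff functions analyzed in the paper are game-theoretic and richer.

```python
def token_credit(noise_std: float, noise_max: float, budget: int = 100) -> int:
    """Illustrative credit rule: a client adding less LDP noise this
    round earns more tokens (all constants are placeholders)."""
    frac = max(0.0, 1.0 - noise_std / noise_max)
    return int(budget * frac)

def can_download_update(balance: int, price: int) -> bool:
    """Client may fetch the new global model only if it can pay."""
    return balance >= price
```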

[694] SGD Convergence under Stepsize Shrinkage in Low-Precision Training

Vincent-Daniel Yun

Main category: cs.LG

TL;DR: The paper analyzes how low-precision training affects SGD convergence due to gradient shrinkage and quantization noise, showing slower convergence and higher error.

DetailsMotivation: To understand the impact of gradient quantization on SGD convergence, as low-precision training is crucial for reducing computational costs in deep learning.

Method: Models gradient quantization as shrinkage and noise in SGD, analyzing convergence under smoothness and bounded-variance assumptions.

Result: Low-precision SGD converges but slower, with an increased error floor due to quantization noise.

Conclusion: Gradient shrinkage and noise from low-precision training degrade SGD performance, though convergence is still guaranteed.

Abstract: Low-precision training has become essential for reducing the computational and memory costs of large-scale deep learning. However, quantization of gradients introduces both magnitude shrinkage and additive noise, which can alter the convergence behavior of stochastic gradient descent (SGD). In this work, we study the convergence of SGD under a gradient shrinkage model, where each stochastic gradient is scaled by a factor $q_k \in (0,1]$ and perturbed by zero-mean quantization noise. We show that this shrinkage is equivalent to replacing the nominal stepsize $\mu_k$ with an effective stepsize $\mu_k q_k$, which slows convergence when $q_{\min} < 1$. Under standard smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a reduced rate determined by $q_{\min}$, and with an increased asymptotic error floor due to quantization noise. We theoretically analyze how reduced numerical precision slows down training by modeling it as gradient shrinkage in the standard SGD convergence framework.
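
The shrinkage argument in one display, following the abstract's notation:

```latex
% Quantized gradient model: g_k = q_k \nabla f(x_k) + \xi_k,
% with q_k \in (0, 1] and \mathbb{E}[\xi_k] = 0. The SGD step
%   x_{k+1} = x_k - \mu_k g_k
% then behaves like exact SGD with the effective stepsize \mu_k q_k:
%   x_{k+1} = x_k - (\mu_k q_k)\,\nabla f(x_k) - \mu_k \xi_k,
% so convergence slows by the factor q_{\min} = \min_k q_k, and the
% residual noise term \mu_k \xi_k raises the asymptotic error floor.
```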

[695] What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

Chanakya Ekbote, Marco Bondaschi, Nived Rajaraman, Jason D. Lee, Michael Gastpar, Ashok Vardhan Makkuva, Paul Pu Liang

Main category: cs.LG

TL;DR: A two-layer transformer with one head per layer can represent any conditional k-gram, resolving the open question about its capability for higher-order Markov processes.

DetailsMotivation: To understand the interplay between transformer depth and Markov order for in-context learning (ICL) and determine if a two-layer, single-head transformer can represent any kth-order Markov process.

Method: Theoretical analysis and construction of a two-layer transformer with one head per layer, along with learning dynamics analysis for a simplified first-order Markov chain variant.

Result: The two-layer transformer can represent any conditional k-gram, providing the tightest known characterization of transformer depth and Markov order for ICL.

Conclusion: Shallow transformers can exhibit strong ICL capabilities, deepening understanding of transformer-based ICL for structured sequence modeling.

Abstract: In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.
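
For reference, the conditional k-gram that an induction head implements is the in-context empirical estimate:

```latex
\hat p\,(x_{t+1} = s \mid x_{1:t})
  = \frac{\#\{\, k \le i < t : x_{i-k+1:i} = x_{t-k+1:t},\; x_{i+1} = s \,\}}
         {\#\{\, k \le i < t : x_{i-k+1:i} = x_{t-k+1:t} \,\}} ,
```

i.e., predict the next symbol from what followed earlier occurrences of the current length-k context.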

[696] Neural Bridge Processes

Jian Xu, Yican Liu, Qibin Zhao, John Paisley, Delu Zeng

Main category: cs.LG

TL;DR: Neural Bridge Processes (NBPs) improve stochastic function modeling by dynamically anchoring inputs to the diffusion trajectory, ensuring endpoint coherence and stronger gradients, outperforming baselines on synthetic and real-world tasks.

DetailsMotivation: Traditional models like GPs and NPs have limitations in scalability, flexibility, and capturing multi-modal distributions, while NDPs suffer from weak input coupling and semantic mismatch.

Method: NBPs reformulate the forward kernel to explicitly depend on inputs, enforcing a constrained diffusion path that terminates at the supervised target.

Result: NBPs achieve significant improvements on synthetic data, EEG signal regression, and image regression tasks compared to baselines.

Conclusion: NBPs enhance performance and theoretical consistency for structured prediction tasks using DDPM-style bridge sampling.

Abstract: Learning stochastic functions from partially observed context-target pairs is a fundamental problem in probabilistic modeling. Traditional models like Gaussian Processes (GPs) face scalability issues with large datasets and assume Gaussianity, limiting their applicability. While Neural Processes (NPs) offer more flexibility, they struggle with capturing complex, multi-modal target distributions. Neural Diffusion Processes (NDPs) enhance expressivity through a learned diffusion process but rely solely on conditional signals in the denoising network, resulting in weak input coupling from an unconditional forward process and semantic mismatch at the diffusion endpoint. In this work, we propose Neural Bridge Processes (NBPs), a novel method for modeling stochastic functions where inputs x act as dynamic anchors for the entire diffusion trajectory. By reformulating the forward kernel to explicitly depend on x, NBP enforces a constrained path that strictly terminates at the supervised target. This approach not only provides stronger gradient signals but also guarantees endpoint coherence. We validate NBPs on synthetic data, EEG signal regression and image regression tasks, achieving substantial improvements over baselines. These results underscore the effectiveness of DDPM-style bridge sampling in enhancing both performance and theoretical consistency for structured prediction tasks.
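
To see what "the forward kernel depends on the input and terminates at the target" means, compare a standard forward kernel with a generic Gaussian bridge pinned at the supervised target y. This is the generic Brownian-bridge-style construction; how NBP makes the kernel depend on x is the paper's specific design.

```latex
% Standard DDPM forward kernel (unconditional in the input x):
q(z_t \mid z_0) = \mathcal{N}\!\big(z_t;\ \sqrt{\bar\alpha_t}\, z_0,\ (1-\bar\alpha_t) I\big).

% Bridge-style kernel pinned at the target y, for t \in [0, 1]:
q(z_t \mid z_0, y) = \mathcal{N}\!\big(z_t;\ (1-t)\, z_0 + t\, y,\ t(1-t)\,\sigma^2 I\big),
% which terminates exactly at y when t = 1.
```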

[697] EDGE: A Theoretical Framework for Misconception-Aware Adaptive Learning

Ananda Prakash Verma

Main category: cs.LG

TL;DR: EDGE is an adaptive learning framework with four stages (Evaluate, Diagnose, Generate, Exercise) that integrates psychometrics, cognitive diagnostics, contrastive item generation, and scheduling to address learner misconceptions.

DetailsMotivation: To create a unified framework for adaptive learning that addresses misconceptions by combining psychometrics, cognitive diagnostics, and principled scheduling.

Method: EDGE uses IRT/Bayesian models for ability estimation, distractor analysis for misconception diagnosis, contrastive item generation to invalidate shortcuts, and a restless bandit model for scheduling.

Result: The framework introduces EdgeScore, a readiness metric with proven properties, and a near-optimal index policy. It also shows counterfactual items can reduce misconceptions faster than standard methods.

Conclusion: EDGE provides a theoretical and implementable solution for misconception-aware adaptive learning, with empirical validation left for future work.

Abstract: We present EDGE, a general-purpose, misconception-aware adaptive learning framework composed of four stages: Evaluate (ability and state estimation), Diagnose (posterior inference of misconceptions), Generate (counterfactual item synthesis), and Exercise (index-based retrieval scheduling). EDGE unifies psychometrics (IRT/Bayesian state space models), cognitive diagnostics (misconception discovery from distractor patterns and response latencies), contrastive item generation (minimal perturbations that invalidate learner shortcuts while preserving psychometric validity), and principled scheduling (a restless bandit approximation to spaced retrieval). We formalize a composite readiness metric, EdgeScore, prove its monotonicity and Lipschitz continuity, and derive an index policy that is near-optimal under mild assumptions on forgetting and learning gains. We further establish conditions under which counterfactual items provably reduce the posterior probability of a targeted misconception faster than standard practice. The paper focuses on theory and implementable pseudocode; empirical study is left to future work.

[698] Causal Negative Sampling via Diffusion Model for Out-of-Distribution Recommendation

Chu Zhao, Eneng Yang, Yizhou Dang, Jianzhe Zhao, Guibing Guo, Xingwei Wang

Main category: cs.LG

TL;DR: CNSDiff improves recommendation performance by addressing false hard negatives caused by environmental confounders, using a diffusion-based method and causal regularization.

DetailsMotivation: Heuristic negative sampling can introduce false hard negatives due to biases in candidate pools, harming model generalization.

Method: Proposes CNSDiff, a method using conditional diffusion to synthesize negative samples and causal regularization to mitigate confounders.

Result: CNSDiff outperforms baselines by 13.96% on average in OOD scenarios.

Conclusion: CNSDiff effectively reduces bias and enhances robustness in recommendation tasks.

Abstract: Heuristic negative sampling enhances recommendation performance by selecting negative samples of varying hardness levels from predefined candidate pools to guide the model toward learning more accurate decision boundaries. However, our empirical and theoretical analyses reveal that unobserved environmental confounders (e.g., exposure or popularity biases) in candidate pools may cause heuristic sampling methods to introduce false hard negatives (FHNS). These misleading samples can encourage the model to learn spurious correlations induced by such confounders, ultimately compromising its generalization ability under distribution shifts. To address this issue, we propose a novel method named Causal Negative Sampling via Diffusion (CNSDiff). By synthesizing negative samples in the latent space via a conditional diffusion process, CNSDiff avoids the bias introduced by predefined candidate pools and thus reduces the likelihood of generating FHNS. Moreover, it incorporates a causal regularization term to explicitly mitigate the influence of environmental confounders during the negative sampling process, leading to robust negatives that promote out-of-distribution (OOD) generalization. Comprehensive experiments under four representative distribution shift scenarios demonstrate that CNSDiff achieves an average improvement of 13.96% across all evaluation metrics compared to state-of-the-art baselines, verifying its effectiveness and robustness in OOD recommendation tasks.
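
The "synthesize the negative instead of mining it" step amounts to a conditional reverse-diffusion loop in the item-latent space; the DDIM-style sketch below shows the generic shape. The denoiser signature, schedule, and conditioning are placeholders, and the paper's causal regularizer is omitted.

```python
import torch

@torch.no_grad()
def sample_latent_negative(denoiser, user_emb, steps, alphas_bar):
    """Generic conditional DDIM-style reverse loop in item-latent space.
    alphas_bar: 1-D tensor of cumulative noise-schedule products."""
    z = torch.randn_like(user_emb)
    for t in reversed(range(steps)):
        ab_t = alphas_bar[t]
        ab_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(z, t, cond=user_emb)                 # predicted noise
        z0 = (z - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()    # predicted clean latent
        z = ab_prev.sqrt() * z0 + (1 - ab_prev).sqrt() * eps
    return z  # latent negative for BPR-style training
```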

[699] Policy Newton methods for Distortion Riskmetrics

Soumen Pachal, Mizhaan Prajit Maniyar, Prashanth L. A

Main category: cs.LG

TL;DR: The paper proposes a risk-sensitive reinforcement learning method using distortion risk metrics (DRM) and a cubic-regularized policy Newton algorithm, ensuring convergence to a second-order stationary point with proven sample complexity.

DetailsMotivation: To address the gap in risk-sensitive control by optimizing DRM in RL, ensuring robust policies that avoid saddle points.

Method: Derives a policy Hessian theorem for DRM, proposes a Hessian estimator, and introduces a cubic-regularized policy Newton algorithm for on-policy RL.

Result: The algorithm converges to an ε-second-order stationary point with sample complexity O(ε^(-3.5)), validated by experiments.

Conclusion: This is the first work to achieve ε-SOSP convergence for a risk-sensitive objective, advancing beyond prior first-order or risk-neutral results.

Abstract: We consider the problem of risk-sensitive control in a reinforcement learning (RL) framework. In particular, we aim to find a risk-optimal policy by maximizing the distortion riskmetric (DRM) of the discounted reward in a finite horizon Markov decision process (MDP). DRMs are a rich class of risk measures that include several well-known risk measures as special cases. We derive a policy Hessian theorem for the DRM objective using the likelihood ratio method. Using this result, we propose a natural DRM Hessian estimator from sample trajectories of the underlying MDP. Next, we present a cubic-regularized policy Newton algorithm for solving this problem in an on-policy RL setting using estimates of the DRM gradient and Hessian. Our proposed algorithm is shown to converge to an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) of the DRM objective, and this guarantee ensures the escaping of saddle points. The sample complexity of our algorithms to find an $\epsilon$-SOSP is $\mathcal{O}(\epsilon^{-3.5})$. Our experiments validate the theoretical findings. To the best of our knowledge, ours is the first work to present convergence to an $\epsilon$-SOSP of a risk-sensitive objective, while existing works in the literature have either shown convergence to a first-order stationary point of a risk-sensitive objective, or a SOSP of a risk-neutral one.
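
The cubic-regularized Newton step (Nesterov-Polyak style, here written for maximizing the DRM objective) that underlies the saddle-escape guarantee:

```latex
% With gradient estimate g_k and Hessian estimate H_k of the DRM
% objective at policy parameters \theta_k, the update solves
s_k = \arg\max_s \; \langle g_k, s \rangle + \tfrac{1}{2}\, s^\top H_k s
      - \tfrac{M}{6}\, \|s\|^3,
\qquad \theta_{k+1} = \theta_k + s_k,
% where M bounds the Lipschitz constant of the Hessian. The cubic term
% is what yields convergence to an \epsilon-second-order stationary
% point rather than just a first-order one.
```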

[700] PySeizure: A single machine learning classifier framework to detect seizures in diverse datasets

Bartlomiej Chybowski, Shima Abdullateef, Hollan Haule, Alfredo Gonzalez-Sulser, Javier Escudero

Main category: cs.LG

TL;DR: An open-source machine-learning framework for robust, generalizable seizure detection across diverse EEG datasets, achieving high performance and cross-dataset transferability.

DetailsMotivation: To address the limitations of manual EEG interpretation and dataset-specific machine learning models in seizure detection, enabling broader clinical applicability.

Method: The framework includes automated pre-processing, majority voting for decision-making, and cross-dataset model evaluation.

Result: High within-dataset AUC scores (0.904 for CHB-MIT, 0.864 for TUSZ) and strong cross-dataset performance (0.615-0.762 AUC). Post-processing further improved results.

Conclusion: The framework offers a reproducible, clinically viable solution for seizure detection, complementing expert interpretation and accelerating adoption.

Abstract: Reliable seizure detection is critical for diagnosing and managing epilepsy, yet clinical workflows remain dependent on time-consuming manual EEG interpretation. While machine learning has shown promise, existing approaches often rely on dataset-specific optimisations, limiting their real-world applicability and reproducibility. Here, we introduce an innovative, open-source machine-learning framework that enables robust and generalisable seizure detection across varied clinical datasets. We evaluate our approach on two publicly available EEG datasets that differ in patient populations and electrode configurations. To enhance robustness, the framework incorporates an automated pre-processing pipeline to standardise data and a majority voting mechanism, in which multiple models independently assess each second of EEG before reaching a final decision. We train, tune, and evaluate models within each dataset, assessing their cross-dataset transferability. Our models achieve high within-dataset performance (AUC 0.904+/-0.059 for CHB-MIT and 0.864+/-0.060 for TUSZ) and demonstrate strong generalisation across datasets despite differences in EEG setups and populations (AUC 0.615+/-0.039 for models trained on CHB-MIT and tested on TUSZ and 0.762+/-0.175 in the reverse case) without any post-processing. Furthermore, a mild post-processing improved the within-dataset results to 0.913+/-0.064 and 0.867+/-0.058 and cross-dataset results to 0.619+/-0.036 and 0.768+/-0.172. These results underscore the potential of, and essential considerations for, deploying our framework in diverse clinical settings. By making our methodology fully reproducible, we provide a foundation for advancing clinically viable, dataset-agnostic seizure detection systems. This approach has the potential for widespread adoption, complementing rather than replacing expert interpretation, and accelerating clinical integration.
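
The per-second majority vote is straightforward to sketch; the threshold and tie handling here are illustrative choices.

```python
import numpy as np

def majority_vote(per_model_preds: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Per-second ensemble decision: each row of per_model_preds holds one
    model's seizure probability for every 1-s window; the final label is
    the strict majority of thresholded votes."""
    votes = (per_model_preds >= threshold).astype(int)   # (n_models, n_seconds)
    return (votes.sum(axis=0) > per_model_preds.shape[0] / 2).astype(int)
```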

[701] Revisiting Data Attribution for Influence Functions

Hongbo Zhu, Angelo Cangelosi

Main category: cs.LG

TL;DR: This paper reviews influence functions for data attribution in deep learning, covering theory, algorithmic advances, and applications like mislabel detection, while addressing challenges for real-world use.

DetailsMotivation: To understand how individual training examples influence model predictions, aiding interpretability, debugging, and accountability.

Method: Uses influence functions for efficient first-order approximation of data impact without retraining, focusing on inverse-Hessian-vector product estimation.

Result: Demonstrates effectiveness for data attribution and mislabel detection, with insights into scalability and practical challenges.

Conclusion: Influence functions hold promise for large-scale deep learning but require further research to overcome current limitations.

Abstract: The goal of data attribution is to trace the model’s predictions through the learning algorithm and back to its training data, thereby identifying the most influential training samples and understanding how the model’s behavior leads to particular predictions. Understanding how individual training examples influence a model’s predictions is fundamental for machine learning interpretability, data debugging, and model accountability. Influence functions, originating from robust statistics, offer an efficient, first-order approximation to estimate the impact of marginally upweighting or removing a data point on a model’s learned parameters and its subsequent predictions, without the need for expensive retraining. This paper comprehensively reviews the data attribution capability of influence functions in deep learning. We discuss their theoretical foundations and recent algorithmic advances for efficient inverse-Hessian-vector product estimation, and evaluate their effectiveness for data attribution and mislabel detection. Finally, we highlight current challenges and promising directions for unleashing the huge potential of influence functions in large-scale, real-world deep learning scenarios.
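
The central formula under review, for reference:

```latex
% Influence of upweighting training point z on the loss at a test point
% z_{test} (Koh & Liang, 2017):
\mathcal{I}(z, z_{test}) = -\,\nabla_\theta L(z_{test}, \hat\theta)^\top
                           H_{\hat\theta}^{-1}\,
                           \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \tfrac{1}{n}\textstyle\sum_{i=1}^{n}
                 \nabla^2_\theta L(z_i, \hat\theta).
% The practical bottleneck is the inverse-Hessian-vector product
% H^{-1} v, estimated with methods such as LiSSA or conjugate gradients.
```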

[702] When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective

Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.LG

TL;DR: The paper unifies SSL and Nesy learning by extending Nesy theory to unreliable knowledge (assumptions), identifying three key factors for pretext task impact, and proposing a method to predict task effectiveness.

DetailsMotivation: To bridge the gap between heuristic-based pretext task selection in SSL and theory-driven Nesy learning by unifying their frameworks.

Method: Extends Nesy theory to unreliable knowledge, identifies three factors (learnability, reliability, completeness), and develops a predictive method for pretext task effectiveness.

Result: High correlation between predicted and actual performance in experiments validates the theory and evaluation method.

Conclusion: The proposed framework and method provide a theory-backed approach for selecting pretext tasks, improving SSL and Nesy learning.

Abstract: Neuro-symbolic (Nesy) learning improves the target task performance of models by enabling them to satisfy knowledge, while semi/self-supervised learning (SSL) improves the target task performance by designing unsupervised pretext tasks for unlabeled data to make models satisfy corresponding assumptions. We extend the Nesy theory based on reliable knowledge to the scenario of unreliable knowledge (i.e., assumptions), thereby unifying the theoretical frameworks of SSL and Nesy. Through rigorous theoretical analysis, we demonstrate that, in theory, the impact of pretext tasks on target performance hinges on three factors: knowledge learnability with respect to the model, knowledge reliability with respect to the data, and knowledge completeness with respect to the target. We further propose schemes to operationalize these theoretical metrics, and thereby develop a method that can predict the effectiveness of pretext tasks in advance. This will change the current status quo in practical applications, where the selections of unsupervised tasks are heuristic-based rather than theory-based, and it is difficult to evaluate the rationality of unsupervised pretext task selection before testing the model on the target task. In experiments, we verify a high correlation between the predicted performance (estimated using minimal data) and the actual performance achieved after large-scale semi-supervised or self-supervised learning, thus confirming the validity of the theory and the effectiveness of the evaluation method.

[703] Efficient Edge LLMs Deployment via Hessian-Aware Quantization and CPU-GPU Collaborative Inference

Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang

Main category: cs.LG

TL;DR: Proposes an efficient MoE edge deployment scheme using Hessian-Aware Quantization (HAQ) and CPU-GPU collaboration to address quantization accuracy and memory challenges.

DetailsMotivation: Efficiently deploying large language models (LLMs) on resource-constrained edge devices is critical, but MoE architectures face quantization accuracy and memory issues.

Method: Uses Hessian-Aware Quantization (HAQ) for 8-bit joint quantization and a CPU-GPU collaborative inference mechanism for expert modules.

Result: Achieves near full-precision accuracy, reduces GPU memory by ~60%, and improves inference latency on models like OPT and Mixtral 8x7B.

Conclusion: The proposed scheme effectively addresses MoE deployment challenges, enhancing efficiency and performance on edge devices.

Abstract: With the breakthrough progress of large language models (LLMs) in natural language processing and multimodal tasks, efficiently deploying them on resource-constrained edge devices has become a critical challenge. The Mixture of Experts (MoE) architecture enhances model capacity through sparse activation, but faces two major difficulties in practical deployment: (1) The presence of numerous outliers in activation distributions leads to severe degradation in quantization accuracy for both activations and weights, significantly impairing inference performance; (2) Under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput. To address these issues, this paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU-GPU collaborative inference. First, by introducing smoothed Hessian matrix quantization, we achieve joint 8-bit quantization of activations and weights, which significantly alleviates the accuracy loss caused by outliers while ensuring efficient implementation on mainstream hardware. Second, we design an expert-level collaborative offloading and inference mechanism, which, combined with expert activation path statistics, enables efficient deployment and scheduling of expert modules between CPU and GPU, greatly reducing memory footprint and inference latency. Extensive experiments validate the effectiveness of our method on mainstream large models such as the OPT series and Mixtral 8x7B: on datasets like Wikitext2 and C4, the inference accuracy of the low-bit quantized model approaches that of the full-precision model, while GPU memory usage is reduced by about 60%, and inference latency is significantly improved.

[704] Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants

Yuhao Liu, Rui Hu, Yu Chen, Longbo Huang

Main category: cs.LG

TL;DR: The paper analyzes finite-time convergence guarantees for numerical ODE schemes derived from stochastic interpolants, providing error bounds and optimized schedules for computational efficiency.

DetailsMotivation: Stochastic interpolants show promise for generative modeling, but rigorous finite-time convergence guarantees for practical numerical schemes are lacking.

Method: The study examines the finite-time convergence of two numerical integrators (forward Euler and Heun’s methods) for ODEs from stochastic interpolants, deriving error bounds and complexity analyses.

Result: Novel finite-time error bounds in total variation distance are established, and optimized schedules for computational efficiency are provided. Numerical experiments validate the findings.

Conclusion: The work bridges a gap in finite-time convergence analysis for stochastic interpolants, offering practical insights for generative modeling applications.

Abstract: Stochastic interpolants offer a robust framework for continuously transforming samples between arbitrary data distributions, holding significant promise for generative modeling. Despite their potential, rigorous finite-time convergence guarantees for practical numerical schemes remain largely unexplored. In this work, we address the finite-time convergence analysis of numerical implementations for ordinary differential equations (ODEs) derived from stochastic interpolants. Specifically, we establish novel finite-time error bounds in total variation distance for two widely used numerical integrators: the first-order forward Euler method and the second-order Heun’s method. Furthermore, our analysis on the iteration complexity of specific stochastic interpolant constructions provides optimized schedules to enhance computational efficiency. Our theoretical findings are corroborated by numerical experiments, which validate the derived error bounds and complexity analyses.
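
The two integrators analyzed are standard; as a quick reference, here is a minimal sketch of forward Euler and Heun updates applied to a generic probability-flow ODE, with the learned drift stubbed out by a `velocity` callable (our illustration, not the authors' implementation):

```python
import numpy as np

def integrate_probability_flow(velocity, x0, ts, method="euler"):
    """Integrate dx/dt = velocity(x, t) from ts[0] to ts[-1].

    `velocity` stands in for the learned drift of the probability-flow ODE;
    `method` selects first-order Euler or second-order Heun updates.
    """
    x = np.array(x0, dtype=float)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0
        v0 = velocity(x, t0)
        if method == "euler":
            x = x + h * v0
        else:                       # Heun: predictor-corrector, second order
            x_pred = x + h * v0
            x = x + 0.5 * h * (v0 + velocity(x_pred, t1))
    return x

# Example with a known linear drift, dx/dt = -x, whose exact flow is x0 * exp(-t)
ts = np.linspace(0.0, 1.0, 11)
print(integrate_probability_flow(lambda x, t: -x, 1.0, ts, "euler"))
print(integrate_probability_flow(lambda x, t: -x, 1.0, ts, "heun"))
print(np.exp(-1.0))  # ground truth; Heun lands noticeably closer per step
```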

[705] ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis

Samiha Afaf Neha, Abir Ahammed Bhuiyan, Md. Ishrak Khan

Main category: cs.LG

TL;DR: ProteoKnight, an image-based encoding method, improves PVP classification accuracy using CNNs and evaluates uncertainty via MCD.

DetailsMotivation: Accurate PVP prediction is vital for genomic studies, but existing methods lack spatial efficiency.

Method: Adapts DNA-Walk for proteins, uses CNNs for classification, and MCD for uncertainty analysis.

Result: Achieved 90.8% binary classification accuracy; uncertainty varies by protein class/length.

Conclusion: ProteoKnight outperforms FCGR, offering robust PVP predictions and identifying low-confidence cases.

Abstract: Introduction: Accurate prediction of Phage Virion Proteins (PVP) is essential for genomic studies due to their crucial role as structural elements in bacteriophages. Computational tools, particularly machine learning, have emerged for annotating phage protein sequences from high-throughput sequencing. However, effective annotation requires specialized sequence encodings. Our paper introduces ProteoKnight, a new image-based encoding method that addresses spatial constraints in existing techniques, yielding competitive performance in PVP classification using pre-trained convolutional neural networks. Additionally, our study evaluates prediction uncertainty in binary PVP classification through Monte Carlo Dropout (MCD). Methods: ProteoKnight adapts the classical DNA-Walk algorithm for protein sequences, incorporating pixel colors and adjusting walk distances to capture intricate protein features. Encoded sequences were classified using multiple pre-trained CNNs. Variance and entropy measures assessed prediction uncertainty across proteins of various classes and lengths. Results: Our experiments achieved 90.8% accuracy in binary classification, comparable to state-of-the-art methods. Multi-class classification accuracy remains suboptimal. Our uncertainty analysis unveils variability in prediction confidence influenced by protein class and sequence length. Conclusions: Our study surpasses frequency chaos game representation (FCGR) by introducing novel image encoding that mitigates spatial information loss limitations. Our classification technique yields accurate and robust PVP predictions while identifying low-confidence predictions.
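
Monte Carlo Dropout, used here for uncertainty analysis, is straightforward to sketch: keep dropout active at test time and aggregate several stochastic passes. The snippet below is a generic MCD sketch in PyTorch (illustrative model and names, not ProteoKnight's pipeline):

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_passes: int = 30):
    """Monte Carlo Dropout: keep dropout active at inference and average
    multiple stochastic forward passes; the spread estimates uncertainty."""
    model.train()  # enables dropout (note: would also affect batchnorm if present)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_passes)])
    mean = probs.mean(dim=0)                                  # predictive distribution
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)   # predictive entropy
    variance = probs.var(dim=0).sum(-1)                       # total variance over classes
    return mean, entropy, variance

# Toy classifier with dropout between layers
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))
mean, ent, var = mc_dropout_predict(model, torch.randn(4, 16))
```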

[706] Intrinsic training dynamics of deep neural networks

Sibylle Marcotte, Gabriel Peyré, Rémi Gribonval

Main category: cs.LG

TL;DR: The paper explores gradient-based training in high-dimensional spaces and its potential reduction to lower-dimensional structures, focusing on intrinsic dynamics and relaxed balanced initializations.

DetailsMotivation: To understand if gradient flows in high-dimensional parameter spaces can be simplified into lower-dimensional dynamics, aiding the study of implicit bias in deep learning.

Method: Analyzes intrinsic dynamic properties, relates them to conservation laws, and applies the theory to ReLU and linear networks, including relaxed balanced initializations.

Result: Shows that gradient flows can be rewritten as lower-dimensional dynamics for ReLU networks and generalizes balanced initialization results for linear networks.

Conclusion: The study provides insights into dimensionality reduction in gradient flows and identifies conditions under which intrinsic dynamics hold, particularly for relaxed balanced initializations.

Abstract: A fundamental challenge in the theory of deep learning is to understand whether gradient-based training in high-dimensional parameter spaces can be captured by simpler, lower-dimensional structures, leading to so-called implicit bias. As a stepping stone, we study when a gradient flow on a high-dimensional variable $\theta$ implies an intrinsic gradient flow on a lower-dimensional variable $z = \phi(\theta)$, for an architecture-related function $\phi$. We express a so-called intrinsic dynamic property and show how it is related to the study of conservation laws associated with the factorization $\phi$. This leads to a simple criterion based on the inclusion of kernels of linear maps which yields a necessary condition for this property to hold. We then apply our theory to general ReLU networks of arbitrary depth and show that, for any initialization, it is possible to rewrite the flow as an intrinsic dynamic in a lower dimension that depends only on $z$ and the initialization, when $\phi$ is the so-called path-lifting. In the case of linear networks with $\phi$ the product of weight matrices, so-called balanced initializations are also known to enable such a dimensionality reduction; we generalize this result to a broader class of relaxed balanced initializations, showing that, in certain configurations, these are the only initializations that ensure the intrinsic dynamic property. Finally, for the linear neural ODE associated with the limit of infinitely deep linear networks, with relaxed balanced initialization, we explicitly express the corresponding intrinsic dynamics.

[707] Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems

Nikita Puchkin, Denis Suchkov, Alexey Naumov, Denis Belomestny

Main category: cs.LG

TL;DR: The paper explores generative modeling and unpaired image-to-image translation using Schrödinger bridges and stochastic optimal control, focusing on scenarios with only i.i.d. samples from initial and target distributions. It uses an Ornstein-Uhlenbeck process and derives generalization bounds for empirical risk minimization.

DetailsMotivation: To address generative modeling and image translation with limited data access (i.i.d. samples) and optimize the transformation process using Schrödinger bridges and stochastic control.

Method: Employ an Ornstein-Uhlenbeck process as a reference, estimate the Schrödinger potential, and use Kullback-Leibler divergence for risk minimization in Gaussian mixtures.

Result: Tight generalization bounds for empirical risk minimizers, achieving near-optimal convergence rates in favorable cases. Numerical experiments validate the approach.

Conclusion: The proposed method effectively handles generative tasks with limited data, offering theoretical guarantees and practical performance.

Abstract: Modern methods of generative modelling and unpaired image-to-image translation based on Schrödinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired image-to-image translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrödinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrödinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.

[708] Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs

Behnoush Khavari, Mehran Shakerinava, Jayesh Khullar, Jerry Huang, François Rivest, Siamak Ravanbakhsh, Sarath Chandar

Main category: cs.LG

TL;DR: Combining input-independent and non-negative SSMs in multilayer models fails to solve parity tasks, requiring input-dependent transitions with negative eigenvalues for state-tracking.

DetailsMotivation: Address the lack of state-tracking capability in LRNN models like S4D, Mamba, and DeltaNet due to time-invariant or restricted eigenvalue transition matrices.

Method: Investigate combining input-independent and non-negative SSMs in multilayer models with diagonal transition matrices, and test an SSM model combining S4D and Mamba layers.

Result: Multilayer combinations still fail to solve parity tasks, indicating the necessity of input-dependent transitions with negative eigenvalues.

Conclusion: Effective state-tracking in SSMs requires input-dependent transition matrices with negative eigenvalues, as shown by theoretical and experimental results.

Abstract: Recent work has shown that LRNN models such as S4D, Mamba, and DeltaNet lack state-tracking capability due to either time-invariant transition matrices or restricted eigenvalue ranges. To address this, input-dependent transition matrices, particularly those that are complex or non-triangular, have been proposed to enhance SSM performance on such tasks. While existing theorems demonstrate that both input-independent and non-negative SSMs are incapable of solving simple state-tracking tasks, such as parity, regardless of depth, they do not explore whether combining these two types in a multilayer SSM could help. We investigate this question for efficient SSMs with diagonal transition matrices and show that such combinations still fail to solve parity. This implies that a recurrence layer must both be input-dependent and include negative eigenvalues. Our experiments support this conclusion by analyzing an SSM model that combines S4D and Mamba layers.
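
The paper's claim has a one-dimensional illustration: with a diagonal recurrence $h_t = a(x_t)\,h_{t-1}$, parity is solvable exactly when the input-dependent transition can be negative. A minimal sketch (our example, not the paper's experiments):

```python
def parity_via_diagonal_recurrence(bits):
    """A 1-D diagonal linear recurrence h_t = a(x_t) * h_{t-1} solves parity
    only if the input-dependent transition a(x) can be negative:
    a(x) = 1 - 2x maps x=1 to -1, so h_T = (-1)^(number of ones)."""
    h = 1.0
    for x in bits:
        h = (1.0 - 2.0 * x) * h     # input-dependent, negative-eigenvalue transition
    return int(h < 0)               # h = -1 -> odd parity, h = +1 -> even

bits = [1, 0, 1, 1, 0]
assert parity_via_diagonal_recurrence(bits) == sum(bits) % 2
# With a(x) restricted to non-negative values, h can never flip sign, so no
# such recurrence tracks parity -- matching the paper's claim that input
# dependence alone, without negative eigenvalues, is not enough.
```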

[709] Efficient Reward Identification In Max Entropy Reinforcement Learning with Sparsity and Rank Priors

Mohamad Louai Shehab, Alperen Tercan, Necmiye Ozay

Main category: cs.LG

TL;DR: The paper addresses recovering time-varying reward functions from optimal policies or demonstrations, using sparsity and linear feature priors for efficient solutions.

DetailsMotivation: Reward recovery is ill-posed without assumptions; parsimonious rewards and prior information enable tractable solutions.

Method: Two priors: 1) sparse changes in rewards, solved via sparsification; 2) linear feature combinations, solved via rank minimization.

Result: Polynomial-time algorithm for sparsification and convex relaxations for rank minimization yield accurate, generalizable rewards.

Conclusion: Efficient optimization-based methods successfully recover rewards under sparsity and feature-based assumptions.

Abstract: In this paper, we consider the problem of recovering time-varying reward functions from either optimal policies or demonstrations coming from a max entropy reinforcement learning problem. This problem is highly ill-posed without additional assumptions on the underlying rewards. However, in many applications, the rewards are indeed parsimonious, and some prior information is available. We consider two such priors on the rewards: 1) rewards are mostly constant and they change infrequently, 2) rewards can be represented by a linear combination of a small number of feature functions. We first show that the reward identification problem with the former prior can be recast as a sparsification problem subject to linear constraints. Moreover, we give a polynomial-time algorithm that solves this sparsification problem exactly. Then, we show that identifying rewards representable with the minimum number of features can be recast as a rank minimization problem subject to linear constraints, for which convex relaxations of rank can be invoked. In both cases, these observations lead to efficient optimization-based reward identification algorithms. Several examples are given to demonstrate the accuracy of the recovered rewards as well as their generalizability.

[710] Lightning Prediction under Uncertainty: DeepLight with Hazy Loss

Md Sultanul Arifin, Abu Nowshed Sakib, Yeasir Rayhan, Tanzima Hashem

Main category: cs.LG

TL;DR: DeepLight is a deep learning model for lightning prediction, addressing limitations of existing methods by using multi-source data and a novel loss function, achieving significant performance improvements.

DetailsMotivation: Lightning poses severe risks, worsened by climate change. Early prediction can mitigate these risks, but current models fail to capture dynamic spatial context and uncertainty, and rely too much on expensive NWP systems.

Method: DeepLight uses multi-source meteorological data (radar reflectivity, cloud properties, historical lightning) with a dual-encoder architecture and multi-branch convolution. It introduces a Hazy Loss function to handle spatio-temporal uncertainty.

Result: DeepLight improves the Equitable Threat Score (ETS) by 18%-30% over state-of-the-art methods.

Conclusion: DeepLight is a robust solution for lightning prediction, outperforming existing methods by better capturing spatial correlations and uncertainty.

Abstract: Lightning, a common feature of severe meteorological conditions, poses significant risks, from direct human injuries to substantial economic losses. These risks are further exacerbated by climate change. Early and accurate prediction of lightning would enable preventive measures to safeguard people, protect property, and minimize economic losses. In this paper, we present DeepLight, a novel deep learning architecture for predicting lightning occurrences. Existing prediction models face several critical limitations: they often struggle to capture the dynamic spatial context and inherent uncertainty of lightning events, underutilize key observational data, such as radar reflectivity and cloud properties, and rely heavily on Numerical Weather Prediction (NWP) systems, which are both computationally expensive and highly sensitive to parameter settings. To overcome these challenges, DeepLight leverages multi-source meteorological data, including radar reflectivity, cloud properties, and historical lightning occurrences through a dual-encoder architecture. By employing multi-branch convolution techniques, it dynamically captures spatial correlations across varying extents. Furthermore, its novel Hazy Loss function explicitly addresses the spatio-temporal uncertainty of lightning by penalizing deviations based on proximity to true events, enabling the model to better learn patterns amidst randomness. Extensive experiments show that DeepLight improves the Equitable Threat Score (ETS) by 18%-30% over state-of-the-art methods, establishing it as a robust solution for lightning prediction.
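
The abstract does not spell out the Hazy Loss, so the following is only a plausible reading of the stated idea: a BCE whose negative-class weights shrink with proximity to true events, so near-misses are penalized gently. Every detail below (the Gaussian weighting, the distance transform) is our assumption, not the paper's definition:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def hazy_loss(pred: np.ndarray, target: np.ndarray, sigma: float = 2.0) -> float:
    """Distance-weighted BCE over a (T, H, W) grid: negative cells close to a
    true lightning event get a small weight; far-away false alarms get full weight."""
    eps = 1e-7
    dist = distance_transform_edt(target == 0)     # spatio-temporal distance to nearest event
    weight = np.where(target == 1, 1.0,
                      1.0 - np.exp(-dist ** 2 / (2 * sigma ** 2)))
    bce = -(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))
    return float((weight * bce).mean())

target = np.zeros((4, 8, 8)); target[2, 4, 4] = 1.0   # one event in space-time
pred = np.full_like(target, 0.3)
print(hazy_loss(pred, target))
```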

[711] Unsupervised operator learning approach for dissipative equations via Onsager principle

Zhipeng Chang, Zhenye Wen, Xiaofei Zhao

Main category: cs.LG

TL;DR: The paper introduces DOOL, an unsupervised operator learning method for dissipative equations, avoiding costly supervised training by leveraging the Onsager variational principle and a spatiotemporal decoupling strategy.

DetailsMotivation: To reduce computational costs of supervised operator learning methods by proposing an unsupervised framework for solving dissipative equations.

Method: DOOL minimizes the OVP-defined Rayleighian functional without labeled data, uses a trunk network for spatial coordinates, and integrates external time stepping for temporal extrapolation.

Result: Numerical experiments confirm DOOL’s effectiveness, outperforming supervised methods like DeepONet and MIONet.

Conclusion: DOOL is a promising unsupervised approach for dissipative equations, with potential extensions to more complex models.

Abstract: Existing operator learning methods rely on supervised training with high-fidelity simulation data, introducing significant computational cost. In this work, we propose the deep Onsager operator learning (DOOL) method, a novel unsupervised framework for solving dissipative equations. Rooted in the Onsager variational principle (OVP), DOOL trains a deep operator network by directly minimizing the OVP-defined Rayleighian functional, requiring no labeled data, and then proceeds in time explicitly through conservation/change laws for the solution. Another key innovation here lies in the spatiotemporal decoupling strategy: the operator’s trunk network processes spatial coordinates exclusively, thereby enhancing training efficiency, while integrated external time stepping enables temporal extrapolation. Numerical experiments on typical dissipative equations validate the effectiveness of the DOOL method, and systematic comparisons with supervised DeepONet and MIONet demonstrate its enhanced performance. Extensions are made to cover the second-order wave models with dissipation that do not directly follow OVP.

[712] Stackelberg Coupling of Online Representation Learning and Reinforcement Learning

Fernando Martinez, Tao Li, Yingdong Lu, Juntao Chen

Main category: cs.LG

TL;DR: SCORER improves RL performance by structuring perception-control interaction as a Stackelberg game, avoiding complex auxiliary objectives.

DetailsMotivation: Addresses the challenge of learning effective features from sparse rewards in RL without decoupling or naive end-to-end learning.

Method: Introduces SCORER, modeling perception-control interaction as a Stackelberg game with a two-timescale algorithm.

Result: Improves sample efficiency and final performance in benchmark tasks.

Conclusion: Principled design of perception-control dynamics enhances performance without added complexity.

Abstract: Integrated, end-to-end learning of representations and policies remains a cornerstone of deep reinforcement learning (RL). However, to address the challenge of learning effective features from a sparse reward signal, recent trends have shifted towards adding complex auxiliary objectives or fully decoupling the two processes, often at the cost of increased design complexity. This work proposes an alternative to both decoupling and naive end-to-end learning, arguing that performance can be significantly improved by structuring the interaction between distinct perception and control networks with a principled, game-theoretic dynamic. We formalize this dynamic by introducing the Stackelberg Coupled Representation and Reinforcement Learning (SCORER) framework, which models the interaction between perception and control as a Stackelberg game. The perception network (leader) strategically learns features to benefit the control network (follower), whose own objective is to minimize its Bellman error. We approximate the game’s equilibrium with a practical two-timescale algorithm. Applied to standard DQN variants on benchmark tasks, SCORER improves sample efficiency and final performance. Our results show that performance gains can be achieved through principled algorithmic design of the perception-control dynamic, without requiring complex auxiliary objectives or architectures.
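
The leader-follower structure can be sketched as a two-timescale update, with the perception encoder (leader) moving on a slower learning rate than the Q-head (follower) that chases its Bellman error. The snippet below is an illustrative skeleton only; the rates, architectures, and TD loss are placeholders, not the paper's algorithm:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU())            # leader: perception
q_head = nn.Linear(32, 4)                                       # follower: control
opt_leader = torch.optim.Adam(encoder.parameters(), lr=1e-4)    # slow timescale
opt_follower = torch.optim.Adam(q_head.parameters(), lr=1e-3)   # fast timescale

def td_loss(s, a, r, s_next, gamma=0.99):
    q = q_head(encoder(s)).gather(1, a)
    with torch.no_grad():
        target = r + gamma * q_head(encoder(s_next)).max(dim=1, keepdim=True).values
    return nn.functional.mse_loss(q, target)

s, s_next = torch.randn(32, 8), torch.randn(32, 8)
a, r = torch.randint(0, 4, (32, 1)), torch.randn(32, 1)
loss = td_loss(s, a, r, s_next)
opt_leader.zero_grad(); opt_follower.zero_grad()
loss.backward()
opt_follower.step()   # follower responds every step...
opt_leader.step()     # ...leader adapts too, but on a much slower timescale
```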

[713] Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten

Wei Qian, Chenxu Zhao, Yangyi Li, Wenqian Ye, Mengdi Huai

Main category: cs.LG

TL;DR: The paper introduces a new class of malicious unlearning attacks targeting predictive uncertainties in deep learning models, demonstrating their effectiveness and the inadequacy of existing defenses.

DetailsMotivation: To explore vulnerabilities in predictive uncertainties due to malicious unlearning attacks, a gap left unaddressed by current research.

Method: Proposes novel optimization frameworks for malicious unlearning attacks and conducts extensive experiments, including black-box scenarios.

Result: The attacks are more effective in manipulating predictive uncertainties than traditional label misclassification attacks, and existing defenses fail against them.

Conclusion: The study highlights the need for new defenses against malicious unlearning attacks targeting predictive uncertainties.

Abstract: Currently, various uncertainty quantification methods have been proposed to provide certainty and probability estimates for deep learning models’ label predictions. Meanwhile, with the growing demand for the right to be forgotten, machine unlearning has been extensively studied as a means to remove the impact of requested sensitive data from a pre-trained model without retraining the model from scratch. However, the vulnerabilities of such generated predictive uncertainties with regard to dedicated malicious unlearning attacks remain unexplored. To bridge this gap, for the first time, we propose a new class of malicious unlearning attacks against predictive uncertainties, where the adversary aims to cause the desired manipulations of specific predictive uncertainty results. We also design novel optimization frameworks for our attacks and conduct extensive experiments, including black-box scenarios. Notably, our extensive experiments show that our attacks are more effective in manipulating predictive uncertainties than traditional attacks that focus on label misclassifications, and existing defenses against conventional attacks are ineffective against our attacks.

[714] MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification

Tiantian Yang, Zhiqian Chen

Main category: cs.LG

TL;DR: MOTGNN is a novel framework for binary disease classification using multi-omics data, outperforming baselines by 5-10% in accuracy and offering interpretability.

DetailsMotivation: High dimensionality and complex interactions in multi-omics data challenge predictive modeling, necessitating an interpretable and accurate solution.

Method: MOTGNN uses XGBoost for supervised graph construction, modality-specific GNNs for representation learning, and a deep feedforward network for cross-omics integration.

Result: MOTGNN achieves superior accuracy (5-10% improvement) and robustness to class imbalance, with computational efficiency and interpretability.

Conclusion: MOTGNN enhances predictive accuracy and interpretability in multi-omics disease modeling, showcasing its potential for biomedical applications.

Abstract: Integrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality and complex interactions among omics layers present major challenges for predictive modeling. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) to perform omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. On three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance (e.g., 87.2% vs. 33.4% F1 on imbalanced data). The model maintains computational efficiency through sparse graphs (2.1-2.8 edges per node) and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. These results highlight MOTGNN’s potential to improve both predictive accuracy and interpretability in multi-omics disease modeling.

[715] Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications

Zijian Liu

Main category: cs.LG

TL;DR: The paper analyzes Online Convex Optimization (OCO) under heavy-tailed gradient noise, showing classical algorithms like Online Gradient Descent achieve optimal regret without modifications like gradient clipping.

DetailsMotivation: Limited results exist for OCO with heavy-tailed gradients (finite p-th moment, p ∈ (1,2]). This work fills the gap by examining classical algorithms in this setting.

Method: The study revisits old OCO algorithms (e.g., Online Gradient Descent) under heavy-tailed noise, analyzing their performance without algorithmic changes.

Result: Optimal regret bounds are proven for these methods, even without knowing p. The results also apply to nonsmooth nonconvex optimization under heavy-tailed noise.

Conclusion: OCO with heavy tails can be solved effectively using classical methods without extra operations, with broader applications in optimization.

Abstract: In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee a sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\mathsf{p}$-th central moment for some $\mathsf{p}\in\left(1,2\right]$. Motivated by it, this work examines different old algorithms for OCO (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regrets for these classical methods without any algorithmic modification. Remarkably, these regret bounds are fully optimal in all parameters (can be achieved even without knowing $\mathsf{p}$), suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping. Furthermore, we explore broader settings (e.g., smooth OCO) and extend our ideas to optimistic algorithms to handle different cases simultaneously.
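
The headline message, that unmodified classics suffice, can be demonstrated with plain projected Online Gradient Descent under heavy-tailed noise. Our toy below uses Student-t gradient noise with 1.5 degrees of freedom, which has infinite variance but a finite p-th moment for p < 1.5, matching the paper's regime:

```python
import numpy as np

def online_gradient_descent(grads, radius=1.0, eta=0.1, d=3, rounds=100):
    """Plain projected OGD on the Euclidean ball: no clipping, no other tweaks."""
    x = np.zeros(d)
    iterates = []
    for t in range(1, rounds + 1):
        g = grads(t, x)
        x = x - (eta / np.sqrt(t)) * g       # standard 1/sqrt(t) step size
        norm = np.linalg.norm(x)
        if norm > radius:                    # project back onto the bounded domain
            x = x * (radius / norm)
        iterates.append(x.copy())
    return np.array(iterates)

rng = np.random.default_rng(0)
x_star = np.array([0.5, -0.2, 0.1])
noisy_grad = lambda t, x: (x - x_star) + rng.standard_t(df=1.5, size=3)
print(online_gradient_descent(noisy_grad)[-10:].mean(axis=0))  # drifts toward x_star
```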

[716] N-BEATS-MOE: N-BEATS with a Mixture-of-Experts Layer for Heterogeneous Time Series Forecasting

Ricardo Matos, Luis Roque, Vitor Cerqueira

Main category: cs.LG

TL;DR: N-BEATS-MOE extends N-BEATS with a Mixture-of-Experts layer, improving adaptability and interpretability for time series forecasting, especially on heterogeneous datasets.

DetailsMotivation: To enhance N-BEATS by incorporating a dynamic weighting strategy for better adaptation to diverse time series characteristics and improved interpretability.

Method: Extends N-BEATS with a Mixture-of-Experts (MoE) layer and a gating network for dynamic block weighting.

Result: Consistent improvements on 12 benchmark datasets, particularly for heterogeneous time series.

Conclusion: N-BEATS-MOE outperforms existing methods, offering better adaptability and interpretability in time series forecasting.

Abstract: Deep learning approaches are increasingly relevant for time series forecasting tasks. Methods such as N-BEATS, which is built on stacks of multilayer perceptrons (MLPs) blocks, have achieved state-of-the-art results on benchmark datasets and competitions. N-BEATS is also more interpretable relative to other deep learning approaches, as it decomposes forecasts into different time series components, such as trend and seasonality. In this work, we present N-BEATS-MOE, an extension of N-BEATS based on a Mixture-of-Experts (MoE) layer. N-BEATS-MOE employs a dynamic block weighting strategy based on a gating network which allows the model to better adapt to the characteristics of each time series. We also hypothesize that the gating mechanism provides additional interpretability by identifying which expert is most relevant for each series. We evaluate our method across 12 benchmark datasets against several approaches, achieving consistent improvements on several datasets, especially those composed of heterogeneous time series.
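
The gating idea is easy to sketch: a small network maps the input window to softmax weights over expert blocks, and the forecast is the weighted sum of block outputs. The sketch below stubs the blocks with MLPs and is our simplification, not the exact N-BEATS-MOE architecture:

```python
import torch
import torch.nn as nn

class GatedBlockMixture(nn.Module):
    """Mixture-of-Experts over forecasting blocks with a per-series gate."""

    def __init__(self, backcast_len: int, horizon: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(backcast_len, 64), nn.ReLU(), nn.Linear(64, horizon))
            for _ in range(n_experts)
        )
        self.gate = nn.Sequential(nn.Linear(backcast_len, n_experts), nn.Softmax(dim=-1))

    def forward(self, x):                                             # x: (B, backcast_len)
        forecasts = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, H)
        weights = self.gate(x).unsqueeze(-1)                          # (B, E, 1)
        return (weights * forecasts).sum(dim=1)                       # weighted forecast

model = GatedBlockMixture(backcast_len=48, horizon=12)
y_hat = model(torch.randn(16, 48))
# Inspecting the gate weights also hints at which expert "owns" each series,
# which is the interpretability angle the paper raises.
```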

[717] Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach

Yueyang Quan, Chang Wang, Shengjie Zhai, Minghong Fang, Zhuqing Liu

Main category: cs.LG

TL;DR: The paper introduces DPMixSGD, a privacy-preserving algorithm for decentralized min-max optimization, addressing privacy risks while maintaining convergence performance.

DetailsMotivation: Privacy concerns arise in decentralized min-max optimization due to potential exposure of sensitive data during model updates. Differential privacy (DP) is used but can hinder convergence.

Method: The proposed DPMixSGD algorithm builds on the STORM-based method, adding noise to local gradients while ensuring minimal impact on convergence. Theoretical bounds for privacy guarantees are provided.

Result: Theoretical analysis shows noise does not significantly affect convergence. Experiments validate the algorithm’s effectiveness across tasks and models.

Conclusion: DPMixSGD successfully balances privacy and performance in decentralized min-max optimization, offering a robust solution for non-convex scenarios.

Abstract: Decentralized min-max optimization allows multi-agent systems to collaboratively solve global min-max optimization problems by facilitating the exchange of model updates among neighboring agents, eliminating the need for a central server. However, sharing model updates in such systems carries a risk of exposing sensitive data to inference attacks, raising significant privacy concerns. To mitigate these privacy risks, differential privacy (DP) has become a widely adopted technique for safeguarding individual data. Despite its advantages, implementing DP in decentralized min-max optimization poses challenges, as the added noise can hinder convergence, particularly in non-convex scenarios with complex agent interactions in min-max optimization problems. In this work, we propose an algorithm called DPMixSGD (Differential Private Minmax Hybrid Stochastic Gradient Descent), a novel privacy-preserving algorithm specifically designed for non-convex decentralized min-max optimization. Our method builds on the state-of-the-art STORM-based algorithm, one of the fastest decentralized min-max solutions. We rigorously prove that the noise added to local gradients does not significantly compromise convergence performance, and we provide theoretical bounds to ensure privacy guarantees. To validate our theoretical findings, we conduct extensive experiments across various tasks and models, demonstrating the effectiveness of our approach.
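
The privacy mechanism rests on the usual clip-then-noise pattern for local gradients. A minimal sketch of that building block follows; the STORM-based variance reduction and the decentralized mixing that define DPMixSGD are omitted, so this is only the DP layer:

```python
import torch

def privatize_gradient(grad: torch.Tensor, clip_norm: float, noise_multiplier: float):
    """Gaussian mechanism for gradients: clip to bound per-agent sensitivity,
    then add Gaussian noise calibrated to the clipping norm."""
    norm = grad.norm()
    grad = grad * min(1.0, clip_norm / (norm + 1e-12))   # bound sensitivity
    noise = torch.randn_like(grad) * noise_multiplier * clip_norm
    return grad + noise

g = torch.randn(1000)
g_private = privatize_gradient(g, clip_norm=1.0, noise_multiplier=0.5)
```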

[718] FairDRL-ST: Disentangled Representation Learning for Fair Spatio-Temporal Mobility Prediction

Sichen Zhao, Wei Shao, Jeffrey Chan, Ziqi Xu, Flora Salim

Main category: cs.LG

TL;DR: FairDRL-ST is a novel framework using disentangled representation learning to address fairness in spatio-temporal predictions, specifically for mobility demand forecasting, achieving fairness without significant performance loss.

DetailsMotivation: Biased predictions in spatio-temporal applications can disproportionately disadvantage certain groups, reinforcing inequalities and undermining ethical AI deployment in public services.

Method: Leverages adversarial and disentangled representation learning to separate sensitive attributes, achieving fairness unsupervisedly.

Result: Effectively closes fairness gaps while maintaining competitive predictive performance on real-world urban mobility datasets.

Conclusion: FairDRL-ST provides a viable solution for fairness in spatio-temporal predictions, balancing ethical concerns with performance.

Abstract: As deep spatio-temporal neural networks are increasingly utilised in urban computing contexts, the deployment of such methods can have a direct impact on users of critical urban infrastructure, such as public transport, emergency services, and traffic management systems. While many spatio-temporal methods focus on improving accuracy, fairness has recently gained attention due to growing evidence that biased predictions in spatio-temporal applications can disproportionately disadvantage certain demographic or geographic groups, thereby reinforcing existing socioeconomic inequalities and undermining the ethical deployment of AI in public services. In this paper, we propose a novel framework, FairDRL-ST, based on disentangled representation learning, to address fairness concerns in spatio-temporal prediction, with a particular focus on mobility demand forecasting. By leveraging adversarial learning and disentangled representation learning, our framework learns to separate attributes that contain sensitive information. Unlike existing methods that enforce fairness through supervised learning, which may lead to overcompensation and degraded performance, our framework achieves fairness in an unsupervised manner with minimal performance loss. We apply our framework to real-world urban mobility datasets and demonstrate its ability to close fairness gaps while delivering competitive predictive performance compared to state-of-the-art fairness-aware methods.

[719] Physics-Informed Multimodal Bearing Fault Classification under Variable Operating Conditions using Transfer Learning

Tasfiq E. Alam, Md Manjurul Ahsan, Shivakumar Raman

Main category: cs.LG

TL;DR: A physics-informed multimodal CNN with late fusion integrates vibration and motor current signals, outperforming baselines in bearing fault classification under variable conditions. Transfer learning strategies further enhance generalization, with LAS performing best.

DetailsMotivation: To improve the reliability of rotating machinery by addressing domain shifts and performance degradation in bearing fault classification under variable operating conditions.

Method: A physics-informed multimodal CNN with late fusion, incorporating vibration and motor current signals, a physics-based feature extraction branch, and a novel physics-informed loss function. Three transfer learning strategies (TSFT, LAS, HFR) are evaluated.

Result: The proposed model achieves higher accuracy, reduced false classifications, and improved robustness. LAS yields the best generalization, with up to 98% accuracy on cross-dataset validation.

Conclusion: Integrating domain knowledge with data-driven learning enhances robustness, interpretability, and generalizability for real-world industrial fault diagnosis.

Abstract: Accurate and interpretable bearing fault classification is critical for ensuring the reliability of rotating machinery, particularly under variable operating conditions where domain shifts can significantly degrade model performance. This study proposes a physics-informed multimodal convolutional neural network (CNN) with a late fusion architecture, integrating vibration and motor current signals alongside a dedicated physics-based feature extraction branch. The model incorporates a novel physics-informed loss function that penalizes physically implausible predictions based on characteristic bearing fault frequencies - Ball Pass Frequency Outer (BPFO) and Ball Pass Frequency Inner (BPFI) - derived from bearing geometry and shaft speed. Comprehensive experiments on the Paderborn University dataset demonstrate that the proposed physics-informed approach consistently outperforms a non-physics-informed baseline, achieving higher accuracy, reduced false classifications, and improved robustness across multiple data splits. To address performance degradation under unseen operating conditions, three transfer learning (TL) strategies - Target-Specific Fine-Tuning (TSFT), Layer-Wise Adaptation Strategy (LAS), and Hybrid Feature Reuse (HFR) - are evaluated. Results show that LAS yields the best generalization, with additional performance gains when combined with physics-informed modeling. Validation on the KAIST bearing dataset confirms the framework’s cross-dataset applicability, achieving up to 98 percent accuracy. Statistical hypothesis testing further verifies significant improvements (p < 0.01) in classification performance. The proposed framework demonstrates the potential of integrating domain knowledge with data-driven learning to achieve robust, interpretable, and generalizable fault diagnosis for real-world industrial applications.
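
The characteristic frequencies that anchor the physics-informed loss follow directly from bearing geometry: for n rolling elements, shaft frequency f_r, ball diameter d, pitch diameter D, and contact angle φ, BPFO = (n/2) f_r (1 - (d/D) cos φ) and BPFI = (n/2) f_r (1 + (d/D) cos φ). A small helper (our sketch; the parameter values in the example are arbitrary):

```python
import math

def bearing_fault_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_angle_rad=0.0):
    """Characteristic defect frequencies from bearing geometry and shaft speed."""
    ratio = (ball_d / pitch_d) * math.cos(contact_angle_rad)
    bpfo = 0.5 * n_balls * shaft_hz * (1.0 - ratio)   # Ball Pass Frequency, Outer race
    bpfi = 0.5 * n_balls * shaft_hz * (1.0 + ratio)   # Ball Pass Frequency, Inner race
    return bpfo, bpfi

# Example: 8 rolling elements, 7.9 mm balls, 34.5 mm pitch diameter, 25 Hz shaft
print(bearing_fault_frequencies(25.0, 8, 7.9, 34.5))  # ~(77.1, 122.9) Hz
```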

[720] Multimodal Remote Inference

Keyuan Zhang, Yin Sun, Bo Ji

Main category: cs.LG

TL;DR: A study on optimizing remote inference accuracy by scheduling two-modality feature updates to minimize ML model error, using an index-based threshold policy proven optimal for non-monotonic, non-additive AoI functions and heterogeneous transmission times.

DetailsMotivation: Fresh sensor features are critical for accurate real-time inference, but limited network resources make timely delivery of all modalities infeasible.

Method: Developed an index-based threshold policy where the scheduler switches modalities when the current modality’s index exceeds a shared threshold, efficiently computable for general AoI functions.

Result: The policy reduces inference error by up to 55% compared to round-robin and random policies, demonstrating significant improvement in accuracy.

Conclusion: The study provides a framework for optimizing remote inference by focusing on task-oriented AoI functions, enhancing accuracy under resource constraints.

Abstract: We consider a remote inference system with multiple modalities, where a multimodal machine learning (ML) model performs real-time inference using features collected from remote sensors. As sensor observations may change dynamically over time, fresh features are critical for inference tasks. However, timely delivering features from all modalities is often infeasible due to limited network resources. To this end, we study a two-modality scheduling problem to minimize the ML model’s inference error, which is expressed as a penalty function of AoI for both modalities. We develop an index-based threshold policy and prove its optimality. Specifically, the scheduler switches modalities when the current modality’s index function exceeds a threshold. We show that the two modalities share the same threshold, and both the index functions and the threshold can be computed efficiently. The optimality of our policy holds for (i) general AoI functions that are non-monotonic and non-additive and (ii) heterogeneous transmission times. Numerical results show that our policy reduces inference error by up to 55% compared to round-robin and uniform random policies, which are oblivious to the AoI-based inference error function. Our results shed light on how to improve remote inference accuracy by optimizing task-oriented AoI functions.
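
The policy itself is simple to state in code: transmit the current modality until its index exceeds the shared threshold, then switch. The sketch below uses toy index functions and unit transmission times; the paper's actual index functions and AoI penalties are not reproduced:

```python
def schedule(index_fns, aoi, threshold, horizon):
    """Index-based threshold policy: stay on the current modality until its
    index (a function of both modalities' AoI) exceeds the shared threshold."""
    current, decisions = 0, []
    for _ in range(horizon):
        if index_fns[current](aoi) > threshold:
            current = 1 - current            # switch to the other modality
        decisions.append(current)
        aoi = [a + 1 for a in aoi]           # both ages grow by one slot...
        aoi[current] = 1                     # ...except the freshly updated one
    return decisions

# Toy quadratic-AoI indices; both modalities share the same threshold.
toy_index = lambda m: (lambda aoi: aoi[m] ** 2 - 0.5 * aoi[1 - m])
print(schedule([toy_index(0), toy_index(1)], aoi=[3, 5], threshold=4.0, horizon=10))
```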

[721] Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning

Stephan Rabanser

Main category: cs.LG

TL;DR: This paper explores uncertainty estimation in ML for safer, more reliable selective prediction, introducing lightweight, privacy-compatible methods and analyzing error sources and adversarial risks.

DetailsMotivation: To enhance the safety and trustworthiness of ML systems in high-stakes domains by improving uncertainty estimation for selective prediction.

Method: Proposes a post-hoc abstention method using model training trajectory signals, studies privacy-uncertainty trade-offs, decomposes selective classification errors, and designs defenses against adversarial manipulation.

Result: Achieves state-of-the-art selective prediction performance, robust under differential privacy, identifies error sources, and provides defenses against adversarial attacks.

Conclusion: Advances reliable ML by improving uncertainty estimation, enabling models to abstain when uncertain, and addressing privacy and adversarial challenges.

Abstract: Machine learning (ML) systems are increasingly deployed in high-stakes domains where reliability is paramount. This thesis investigates how uncertainty estimation can enhance the safety and trustworthiness of ML, focusing on selective prediction – where models abstain when confidence is low. We first show that a model’s training trajectory contains rich uncertainty signals that can be exploited without altering its architecture or loss. By ensembling predictions from intermediate checkpoints, we propose a lightweight, post-hoc abstention method that works across tasks, avoids the cost of deep ensembles, and achieves state-of-the-art selective prediction performance. Crucially, this approach is fully compatible with differential privacy (DP), allowing us to study how privacy noise affects uncertainty quality. We find that while many methods degrade under DP, our trajectory-based approach remains robust, and we introduce a framework for isolating the privacy-uncertainty trade-off. Next, we develop a finite-sample decomposition of the selective classification gap – the deviation from the oracle accuracy-coverage curve – identifying five interpretable error sources and clarifying which interventions can close the gap. This explains why calibration alone cannot fix ranking errors, motivating methods that improve uncertainty ordering. Finally, we show that uncertainty signals can be adversarially manipulated to hide errors or deny service while maintaining high accuracy, and we design defenses combining calibration audits with verifiable inference. Together, these contributions advance reliable ML by improving, evaluating, and safeguarding uncertainty estimation, enabling models that not only make accurate predictions – but also know when to say “I do not know”.
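
The trajectory-based abstention method admits a short sketch: average softmax outputs over saved intermediate checkpoints and abstain when the ensemble's confidence is low. Names and the threshold below are illustrative, not the thesis's exact recipe:

```python
import torch

def selective_predict(checkpoint_models, x, confidence_threshold=0.7):
    """Average softmax outputs across saved training checkpoints and abstain
    whenever the trajectory ensemble's confidence falls below the threshold."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in checkpoint_models])
    mean = probs.mean(dim=0)                    # trajectory-ensemble prediction
    confidence, prediction = mean.max(dim=-1)
    return prediction, confidence < confidence_threshold   # abstain mask

ckpts = [torch.nn.Linear(10, 3) for _ in range(5)]  # stand-ins for loaded checkpoints
pred, abstain = selective_predict(ckpts, torch.randn(8, 10))
```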

[722] Towards Theoretical Understanding of Transformer Test-Time Computing: Investigation on In-Context Linear Regression

Xingwu Chen, Miao Lu, Beining Wu, Difan Zou

Main category: cs.LG

TL;DR: The paper explores how increased test-time computation, like generating more intermediate thoughts or sampling answers, improves language model performance. It bridges practical inference and theoretical analysis by studying randomness and sampling in in-context linear regression.

DetailsMotivation: To bridge the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling.

Method: The study uses in-context linear regression with continuous/binary coefficients, simulating language model decoding via noise injection and binary coefficient sampling.

Result: The theoretical framework and empirical results provide insights into inference behaviors in real-world language models.

Conclusion: The framework demonstrates potential for deeper understanding of language model inference, supported by empirical and theoretical analysis.

Abstract: Using more test-time computation during language model inference, such as generating more intermediate thoughts or sampling multiple candidate answers, has proven effective in significantly improving model performance. This paper takes an initial step toward bridging the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling. We focus on in-context linear regression with continuous/binary coefficients, where our framework simulates language model decoding through noise injection and binary coefficient sampling. Through this framework, we provide detailed analyses of widely adopted inference techniques. Supported by empirical results, our theoretical framework and analysis demonstrate the potential for offering new insights into understanding inference behaviors in real-world language models.

[723] When and how can inexact generative models still sample from the data manifold?

Nisha Chandramoorthy, Adriaan de Clercq

Main category: cs.LG

TL;DR: The paper investigates why some generative models remain robust in sample generation despite learning errors, focusing on the alignment of perturbations with the data manifold.

DetailsMotivation: To understand the phenomenon where generated samples stay close to the data distribution even with errors in the score function or drift vector field.

Method: Uses dynamical systems and perturbation analysis to study the probability flow and Lyapunov vectors’ alignment with the data manifold.

Result: Infinitesimal errors affect the predicted density only on the data manifold, and alignment of Lyapunov vectors with tangent spaces ensures robustness.

Conclusion: The alignment condition is efficient to compute and provides theoretical guarantees for robustness in various generative models.

Abstract: A curious phenomenon observed in some dynamical generative models is the following: despite learning errors in the score function or the drift vector field, the generated samples appear to shift \emph{along} the support of the data distribution but not \emph{away} from it. In this work, we investigate this phenomenon of \emph{robustness of the support} by taking a dynamical systems approach on the generating stochastic/deterministic process. Our perturbation analysis of the probability flow reveals that infinitesimal learning errors cause the predicted density to be different from the target density only on the data manifold for a wide class of generative models. Further, what is the dynamical mechanism that leads to the robustness of the support? We show that the alignment of the top Lyapunov vectors (most sensitive infinitesimal perturbation directions) with the tangent spaces along the boundary of the data manifold leads to robustness and prove a sufficient condition on the dynamics of the generating process to achieve this alignment. Moreover, the alignment condition is efficient to compute and, in practice, for robust generative models, automatically leads to accurate estimates of the tangent bundle of the data manifold. Using a finite-time linear perturbation analysis on samples paths as well as probability flows, our work complements and extends existing works on obtaining theoretical guarantees for generative models from a stochastic analysis, statistical learning and uncertainty quantification points of view. Our results apply across different dynamical generative models, such as conditional flow-matching and score-based generative models, and for different target distributions that may or may not satisfy the manifold hypothesis.

[724] Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Guorui Zhou

Main category: cs.LG

TL;DR: Klear-Reasoner is a high-performance reasoning model with detailed post-training workflow insights, emphasizing quality over quantity in data and introducing GPPO for better RL performance.

DetailsMotivation: Addressing the lack of reproducibility in high-performance inference models due to incomplete training details and improving reasoning capabilities.

Method: Uses long Chain-of-Thought supervised fine-tuning (long CoT SFT) and reinforcement learning (RL) with Gradient-Preserving clipping Policy Optimization (GPPO).

Result: Achieves 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6.

Conclusion: High-quality data and GPPO enhance reasoning performance, making Klear-Reasoner a robust model for complex tasks.

Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model’s exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
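
The abstract describes GPPO only as gently backpropagating gradients from clipped tokens; one plausible construction (our assumption, not the paper's operator) keeps the standard clipped value in the forward pass while routing a damped gradient through out-of-range importance ratios:

```python
import torch

def gradient_preserving_clip(ratio: torch.Tensor, eps: float = 0.2, beta: float = 0.1):
    """Forward pass: the usual PPO clip. Backward pass: tokens outside
    [1-eps, 1+eps] keep a damped gradient (scale beta) instead of a zero one,
    via a straight-through-style construction."""
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps)
    inside = (ratio > 1.0 - eps) & (ratio < 1.0 + eps)
    scale = torch.where(inside, torch.ones_like(ratio), torch.full_like(ratio, beta))
    # value of `clipped`, gradient of `scale * ratio`
    return clipped.detach() + scale * ratio - (scale * ratio).detach()

ratio = torch.tensor([0.5, 1.0, 2.0], requires_grad=True)
gradient_preserving_clip(ratio).sum().backward()
print(ratio.grad)   # tensor([0.1000, 1.0000, 0.1000]) -- clipped tokens still learn
```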

[725] Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo

Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, Sanjay Shakkottai

Main category: cs.LG

TL;DR: The paper addresses the challenge of posterior sampling in score-based generative models, proposing a method to sample from a distribution close to the posterior under minimal assumptions.

DetailsMotivation: Despite the intractability of exact posterior sampling under computational hardness assumptions, empirical success in tasks like image super-resolution motivates exploring approximate solutions.

Method: The paper frames the problem as a ’tilting’ problem, biasing a distribution towards a measurement, and shows tractable sampling under minimal assumptions.

Result: The method produces samples close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence.

Conclusion: This work provides the first formal results for approximate posterior sampling in polynomial time, balancing consistency with measurements and prior.

Abstract: We study the problem of posterior sampling in the context of score based generative models. We have a trained score network for a prior $p(x)$, a measurement model $p(y|x)$, and are tasked with sampling from the posterior $p(x|y)$. Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general “tilting” problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time.
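
A standard annealed Langevin loop makes the "tilting" intuition concrete: the drift sums the prior score and the measurement's log-likelihood gradient, so samples are pulled toward both. The sketch below is our illustration, using a Gaussian toy where the exact posterior is known; `score` and `grad_log_lik` stand in for a trained score network and a measurement model:

```python
import numpy as np

def annealed_langevin_posterior(score, grad_log_lik, x0, sigmas, steps=200, eta=0.05):
    """Annealed Langevin sketch for approximate posterior sampling: the drift
    adds the measurement's log-likelihood gradient to the noised prior score."""
    rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for sigma in sigmas:                    # anneal from large to small noise
        step = eta * sigma ** 2             # common sigma^2 step-size scaling
        for _ in range(steps):
            drift = score(x, sigma) + grad_log_lik(x)
            x = x + 0.5 * step * drift + np.sqrt(step) * rng.normal(size=x.shape)
    return x

# Gaussian toy where the exact posterior is known: prior N(0,1), y = x + noise.
score = lambda x, sigma: -x / (1.0 + sigma ** 2)   # score of the noised prior
y, obs_var = 2.0, 0.25
grad_log_lik = lambda x: (y - x) / obs_var
sample = annealed_langevin_posterior(score, grad_log_lik, np.zeros(1), [1.0, 0.5, 0.1])
print(sample)   # scatters around the true posterior mean y/(1+obs_var) = 1.6
```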

[726] Attribution Explanations for Deep Neural Networks: A Theoretical Perspective

Huiqi Deng, Hongbin Pei, Quanshi Zhang, Mengnan Du

Main category: cs.LG

TL;DR: The paper discusses challenges in evaluating the faithfulness of attribution methods for explaining DNNs and highlights three core issues: unstructured heterogeneity, lack of theoretical foundations, and empirical evaluation difficulties. It reviews recent theoretical advances addressing these challenges and suggests future directions.

DetailsMotivation: The motivation is to address the unresolved issue of whether attribution methods faithfully reflect input contributions to DNN decisions, which affects their reliability and utility.

Method: The paper reviews and summarizes recent theoretical advances in three key directions: theoretical unification, theoretical rationale, and theoretical evaluation of attribution methods.

Result: The review provides insights into unifying, clarifying, and rigorously evaluating attribution methods, aiding theoretical understanding and practical method selection.

Conclusion: The paper concludes by identifying open problems and future directions for improving the faithfulness and reliability of attribution explanations in DNNs.

Abstract: Attribution explanation is a typical approach for explaining deep neural networks (DNNs), inferring an importance or contribution score for each input variable to the final output. In recent years, numerous attribution methods have been developed to explain DNNs. However, a persistent concern remains unresolved, i.e., whether and which attribution methods faithfully reflect the actual contribution of input variables to the decision-making process. The faithfulness issue undermines the reliability and practical utility of attribution explanations. We argue that these concerns stem from three core challenges. First, difficulties arise in comparing attribution methods due to their unstructured heterogeneity, differences in heuristics, formulations, and implementations that lack a unified organization. Second, most methods lack solid theoretical underpinnings, with their rationales remaining absent, ambiguous, or unverified. Third, empirically evaluating faithfulness is challenging without ground truth. Recent theoretical advances provide a promising way to tackle these challenges, attracting increasing attention. We summarize these developments, with emphasis on three key directions: (i) Theoretical unification, which uncovers commonalities and differences among methods, enabling systematic comparisons; (ii) Theoretical rationale, clarifying the foundations of existing methods; (iii) Theoretical evaluation, rigorously proving whether methods satisfy faithfulness principles. Beyond a comprehensive review, we provide insights into how these studies help deepen theoretical understanding, inform method selection, and inspire new attribution methods. We conclude with a discussion of promising open problems for further work.

[727] Extracting Complex Topology from Multivariate Functional Approximation: Contours, Jacobi Sets, and Ridge-Valley Graphs

Guanqun Ma, David Lenz, Hanqi Guo, Tom Peterka, Bei Wang

Main category: cs.LG

TL;DR: The paper introduces a framework for extracting complex topological features directly from continuous implicit models like MFA, bypassing discrete representations.

DetailsMotivation: To enable direct topological feature extraction from continuous implicit models, enhancing data analysis and visualization.

Method: Proposes a framework to extract contours, Jacobi sets, and ridge-valley graphs from MFA models without discretization.

Result: The framework successfully extracts topological features from MFA and is generalizable to other continuous implicit models.

Conclusion: This work lays the foundation for topological data analysis and visualization on continuous implicit models.

Abstract: Implicit continuous models, such as functional models and implicit neural networks, are an increasingly popular method for replacing discrete data representations with continuous, high-order, and differentiable surrogates. These models offer new perspectives on the storage, transfer, and analysis of scientific data. In this paper, we introduce the first framework to directly extract complex topological features – contours, Jacobi sets, and ridge-valley graphs – from a type of continuous implicit model known as multivariate functional approximation (MFA). MFA replaces discrete data with continuous piecewise smooth functions. Given an MFA model as the input, our approach enables direct extraction of complex topological features from the model, without reverting to a discrete representation of the model. Our work is easily generalizable to any continuous implicit model that supports the queries of function values and high-order derivatives. Our work establishes the building blocks for performing topological data analysis and visualization on implicit continuous models.

[728] Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals

Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.LG

TL;DR: The paper introduces Direct Multi-Preference Optimization (DMPO) to address noise and conflicts in fine-grained preference data, proposing a data selection method based on Preference Divergence (PD) for improved LLM alignment.

DetailsMotivation: Existing methods like DPO struggle with noise and conflicts in fine-grained preference datasets, necessitating a more robust approach for aligning LLMs with diverse human values.

Method: The authors derive the DMPO objective, identify a PD term to quantify preference conflicts, and use it to select high-consensus data for efficient DPO training. Practical methods for PD estimation and bias mitigation are introduced.

Result: Evaluation on the UltraFeedback dataset shows a 10%+ improvement over standard holistic preference methods, enhancing training efficiency and avoiding the need for holistic annotations.

Conclusion: The proposed PD selection method effectively leverages fine-grained preference signals, enabling robust LLM alignment without intractable holistic annotations.

Abstract: Aligning Large Language Models (LLMs) with diverse human values requires moving beyond a single holistic “better-than” preference criterion. While collecting fine-grained, aspect-specific preference data is more reliable and scalable, existing methods like Direct Preference Optimization (DPO) struggle with the severe noise and conflicts inherent in such aggregated datasets. In this paper, we tackle this challenge from a data-centric perspective. We first derive the Direct Multi-Preference Optimization (DMPO) objective, and uncover a key Preference Divergence (PD) term that quantifies inter-aspect preference conflicts. Instead of using this term for direct optimization, we leverage it to formulate a novel, theoretically-grounded data selection principle. Our principle advocates for selecting a subset of high-consensus data, identified by the most negative PD values, for efficient DPO training. We prove the optimality of this strategy by analyzing the loss bounds of the DMPO objective in the selection problem. To operationalize our approach, we introduce practical methods of PD term estimation and length bias mitigation, thereby proposing our PD selection method. Evaluation on the UltraFeedback dataset with three varying conflict levels shows that our simple yet effective strategy achieves over 10% relative improvement against both the standard holistic preference and a stronger oracle using aggregated preference signals, all while boosting training efficiency and obviating the need for intractable holistic preference annotation. This unlocks the potential of robust LLM alignment via fine-grained preference signals.
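
A small sketch of what PD-style selection could look like in practice, assuming each sample carries per-aspect preference margins (positive means the “chosen” response wins on that aspect). The conflict proxy and keep ratio below are assumptions rather than the paper's estimator.

```python
# Toy preference-divergence-style selection: keep the most-consensual samples.
import numpy as np

def pd_score(aspect_margins):
    # Conflict proxy: negated agreement with the majority direction, so fully
    # consistent samples get the most negative score (assumed form).
    margins = np.asarray(aspect_margins, dtype=float)
    majority = np.sign(margins.sum()) or 1.0
    return -np.sum(margins * majority)   # more negative = higher consensus

def select_high_consensus(dataset, keep_ratio=0.5):
    scored = sorted(dataset, key=lambda s: pd_score(s["aspect_margins"]))
    return scored[: int(len(scored) * keep_ratio)]

data = [
    {"id": 0, "aspect_margins": [0.9, 0.8, 0.7]},   # aspects agree
    {"id": 1, "aspect_margins": [0.6, -0.5, 0.4]},  # conflicting aspects
]
print([s["id"] for s in select_high_consensus(data)])  # -> [0]
```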

[729] Multi-Turn Jailbreaks Are Simpler Than They Seem

Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz

Main category: cs.LG

TL;DR: Multi-turn jailbreak attacks on LLMs are as effective as repeated single-turn attacks, challenging assumptions about their sophistication. Success rates correlate among similar models, and higher reasoning effort increases attack success.

DetailsMotivation: To analyze the effectiveness of multi-turn jailbreak attacks on state-of-the-art LLMs and challenge the perceived complexity of such attacks.

Method: Empirical analysis using the StrongREJECT benchmark across models like GPT-4, Claude, and Gemini variants.

Result: Multi-turn attacks are no more sophisticated than repeated single-turn attacks; success correlates among similar models; higher reasoning effort increases attack success.

Conclusion: The findings highlight vulnerabilities in AI safety evaluation and the need for improved jailbreak-resistant system designs.

Abstract: While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker’s ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find surprisingly that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at https://github.com/diogo-cruz/multi_turn_simpler
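
The resampling baseline is easy to quantify: if a single-turn attempt succeeds with probability p, then k independent resamples succeed with probability 1 - (1 - p)^k, which is the bar a multi-turn attack must clear to count as genuinely more sophisticated. A tiny illustration with made-up numbers:

```python
# Success probability of k independent single-turn resamples (illustrative p).
def resample_success(p_single: float, k: int) -> float:
    return 1.0 - (1.0 - p_single) ** k

for k in (1, 5, 10):
    print(f"k={k:2d}  success={resample_success(0.15, k):.2f}")
```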

[730] Discovering Spatial Correlations between Earth Observations in Global Atmospheric State Estimation by using Adaptive Graph Structure Learning

Hyeon-Ju Jeon, Jeon-Ho Kang, In-Hyuk Kwon, O-Joun Lee

Main category: cs.LG

TL;DR: The study improves atmospheric state forecasting by using spatiotemporal graph neural networks (STGNNs) with adaptive edge sampling to handle dynamic spatial correlations between Earth observations and NWP grid points.

DetailsMotivation: Conventional NWP systems lack dynamic handling of spatial correlations between observations and atmospheric states, limiting forecasting accuracy.

Method: Employed STGNNs with structure learning, regulated edge sampling by adaptive node degrees and spatial distances to avoid information loss and over-smoothing.

Result: Outperformed existing STGNN models in high-variability regions using real-world East Asia data.

Conclusion: The proposed method enhances forecasting accuracy by dynamically managing spatial correlations, addressing limitations of traditional NWP and STGNN approaches.

Abstract: This study aims to discover spatial correlations between Earth observations and atmospheric states to improve the forecasting accuracy of global atmospheric state estimation, which is usually conducted using conventional numerical weather prediction (NWP) systems and is the first step of weather forecasting. NWP systems predict future atmospheric states at fixed locations, which are called NWP grid points, by analyzing previous atmospheric states and newly acquired Earth observations without fixed locations. Thus, the surrounding meteorological context and the changing locations of the observations create spatial correlations between atmospheric states and observations that evolve over time. To handle these complicated, dynamically changing spatial correlations, we employ spatiotemporal graph neural networks (STGNNs) with structure learning. However, structure learning has an inherent limitation: it can cause structural information loss and over-smoothing by generating excessive edges. To solve this problem, we regulate edge sampling by adaptively determining node degrees and considering the spatial distances between NWP grid points and observations. We validated the effectiveness of the proposed method by using real-world atmospheric state and observation data from East Asia. Even in areas with high atmospheric variability, the proposed method outperformed existing STGNN models with and without structure learning.
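
A rough sketch of distance-regulated edge sampling between observations and grid points, where the node degree adapts to how far the local candidates are. The degree rule and constants are illustrative assumptions, not the paper's procedure.

```python
# Toy adaptive edge sampling: nearer candidates permit a higher node degree.
import numpy as np

rng = np.random.default_rng(0)

def sample_edges(obs_xy, grid_xy, radius=0.3, max_degree=4):
    edges = []
    for i, o in enumerate(obs_xy):
        d = np.linalg.norm(grid_xy - o, axis=1)
        nearest = np.argsort(d)[:max_degree]
        nearest = nearest[d[nearest] <= radius]   # drop far candidates
        if len(nearest) == 0:
            continue
        # Adaptive degree: closer local candidates allow more edges (assumed rule).
        k = max(1, int(round(max_degree * (1 - d[nearest].mean() / radius))))
        edges += [(i, int(j)) for j in nearest[:k]]
    return edges

obs, grid = rng.random((5, 2)), rng.random((60, 2))
print(sample_edges(obs, grid)[:8])
```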

[731] GLiClass: Generalist Lightweight Model for Sequence Classification Tasks

Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, Mykyta Yaroshenko

Main category: cs.LG

TL;DR: GLiClass is a novel method for sequence classification, combining efficiency and accuracy for zero-shot and few-shot learning, while addressing limitations of generative LLMs, cross-encoders, and embedding-based approaches.

DetailsMotivation: Modern AI systems require efficient and accurate classification methods that adapt dynamically to changing needs, but existing approaches (generative LLMs, cross-encoders, embedding-based) have significant limitations in consistency, efficiency, or flexibility.

Method: GLiClass adapts the GLiNER architecture for sequence classification and uses proximal policy optimization (PPO) for multi-label classification in data-sparse conditions or human feedback.

Result: The method achieves strong accuracy and efficiency comparable to embedding-based approaches while maintaining flexibility for zero-shot and few-shot learning.

Conclusion: GLiClass offers a promising solution for dynamic classification tasks, balancing efficiency, accuracy, and adaptability.

Abstract: Classification is one of the most widespread tasks in AI applications, often serving as the first step in filtering, sorting, and categorizing data. Since modern AI systems must handle large volumes of input data and early pipeline stages can propagate errors downstream, achieving high efficiency and accuracy is critical. Moreover, classification requirements can change dynamically based on user needs, necessitating models with strong zero-shot capabilities. While generative LLMs have become mainstream for zero-shot classification due to their versatility, they suffer from inconsistent instruction following and computational inefficiency. Cross-encoders, commonly used as rerankers in RAG pipelines, face a different bottleneck: they must process text-label pairs sequentially, significantly reducing efficiency with large label sets. Embedding-based approaches offer good efficiency but struggle with complex scenarios involving logical and semantic constraints. We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks. Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios. Additionally, we adapted proximal policy optimization (PPO) for multi-label text classification, enabling the training of classifiers in data-sparse conditions or from human feedback.

[732] AIS-LLM: A Unified Framework for Maritime Trajectory Prediction, Anomaly Detection, and Collision Risk Assessment with Explainable Forecasting

Hyobin Park, Jinwook Jung, Minseok Seo, Hyunsoo Choi, Deukjae Cho, Sekil Park, Dong-Geol Choi

Main category: cs.LG

TL;DR: AIS-LLM integrates AIS data with a large language model to perform trajectory prediction, anomaly detection, and collision risk assessment simultaneously, outperforming existing methods.

DetailsMotivation: Existing approaches address maritime tasks individually, lacking holistic analysis. AIS-LLM aims to unify these tasks for better maritime traffic management.

Method: AIS-LLM combines a Time-Series Encoder, LLM-based Prompt Encoder, Cross-Modality Alignment Module, and Multi-Task Decoder to process AIS data and textual prompts.

Result: AIS-LLM outperforms existing methods in trajectory prediction, anomaly detection, and collision risk assessment.

Conclusion: AIS-LLM enables intelligent, efficient maritime traffic management by integrating task outputs for situation summaries.

Abstract: With the increase in maritime traffic and the mandatory implementation of the Automatic Identification System (AIS), the importance and diversity of maritime traffic analysis tasks based on AIS data, such as vessel trajectory prediction, anomaly detection, and collision risk assessment, is rapidly growing. However, existing approaches tend to address these tasks individually, making it difficult to holistically consider complex maritime situations. To address this limitation, we propose a novel framework, AIS-LLM, which integrates time-series AIS data with a large language model (LLM). AIS-LLM consists of a Time-Series Encoder for processing AIS sequences, an LLM-based Prompt Encoder, a Cross-Modality Alignment Module for semantic alignment between time-series data and textual prompts, and an LLM-based Multi-Task Decoder. This architecture enables the simultaneous execution of three key tasks: trajectory prediction, anomaly detection, and risk assessment of vessel collisions within a single end-to-end system. Experimental results demonstrate that AIS-LLM outperforms existing methods across individual tasks, validating its effectiveness. Furthermore, by integratively analyzing task outputs to generate situation summaries and briefings, AIS-LLM presents the potential for more intelligent and efficient maritime traffic management.

[733] Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Main category: cs.LG

TL;DR: A learning-based framework for semantic cache eviction in LLMs addresses scalability and sustainability challenges by optimizing for unknown query and cost distributions.

DetailsMotivation: High inference costs of LLMs and inefficiencies of traditional caching methods necessitate a principled approach to semantic caching.

Method: Develops offline optimization and online learning algorithms for semantic cache eviction under uncertain query and cost distributions.

Result: Proposed algorithms show matching or superior performance compared to baselines on synthetic datasets.

Conclusion: The framework provides a theoretical foundation and practical solution for efficient semantic caching in LLMs.

Abstract: Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms achieve matching or superior performance compared with baselines.
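
As a minimal sketch of the problem setting, the toy cache below serves a stored response when embedding similarity clears a threshold, and evicts by an assumed net-savings score. It illustrates the mismatch-cost trade-off only; the paper's provably efficient algorithms are different.

```python
# Toy semantic cache with mismatch-cost-aware eviction (scoring rule assumed).
import numpy as np

class SemanticCache:
    def __init__(self, capacity=3, sim_threshold=0.85):
        self.capacity, self.sim_threshold = capacity, sim_threshold
        self.entries = []  # each entry: [embedding, response, hit_count]

    def lookup(self, emb):
        for e in self.entries:
            sim = emb @ e[0] / (np.linalg.norm(emb) * np.linalg.norm(e[0]))
            if sim >= self.sim_threshold:     # semantically close enough
                e[2] += 1
                return e[1]
        return None

    def insert(self, emb, response, serve_cost=1.0, mismatch_cost=0.2):
        if len(self.entries) >= self.capacity:
            # Evict the entry with the lowest estimated net savings.
            value = lambda e: e[2] * (serve_cost - mismatch_cost)
            self.entries.remove(min(self.entries, key=value))
        self.entries.append([emb, response, 0])

cache = SemanticCache()
cache.insert(np.array([1.0, 0.0]), "answer A")
print(cache.lookup(np.array([0.95, 0.05])))   # similar query -> cached answer
```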

[734] Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment

Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu

Main category: cs.LG

TL;DR: GRAO (Group Relative Alignment Optimization) is a unified framework combining SFT and RL strengths, offering improved alignment for language models with innovations like multi-sample generation and group-based loss.

DetailsMotivation: Addressing the limitations of SFT (constrained by offline policy) and RL (low sample efficiency), GRAO aims to synergize their strengths for better alignment.

Method: GRAO introduces multi-sample generation, Group Direct Alignment Loss, and reference-aware parameter updates to unify SFT and RL.

Result: GRAO achieves significant improvements (57.70%, 17.65%, 7.95%, 5.18%) over SFT, DPO, PPO, and GRPO baselines in alignment tasks.

Conclusion: GRAO provides a theoretically grounded and empirically validated framework for efficient language model alignment.

Abstract: Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO’s convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO’s superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over the SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
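
A minimal sketch of the group-relative ingredient, assuming G sampled responses per prompt scored by a reward model; the normalization and loss form are assumptions in the spirit of intra-group relative advantage weighting, not GRAO's full objective.

```python
# Toy intra-group relative advantage weighting for one prompt's sample group.
import torch

def group_relative_loss(logps, rewards):
    """logps: (G,) summed log-probs of G sampled responses for one prompt;
    rewards: (G,) scalar rewards for the same responses."""
    adv = rewards - rewards.mean()
    adv = adv / (adv.std() + 1e-6)           # intra-group relative advantage
    return -(adv.detach() * logps).mean()    # push up above-average samples

logps = torch.randn(4, requires_grad=True)
rewards = torch.tensor([0.1, 0.9, 0.4, 0.7])
group_relative_loss(logps, rewards).backward()
print(logps.grad)
```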

[735] Multi-Hop Privacy Propagation for Differentially Private Federated Learning in Social Networks

Chenchen Lin, Xuehe Wang

Main category: cs.LG

TL;DR: A socially-aware privacy-preserving federated learning mechanism is proposed to address indirect privacy leakage in decentralized networks, using a Stackelberg game and mean-field estimator to optimize incentives and privacy budgets.

DetailsMotivation: Privacy loss in federated learning is influenced by multi-hop social connections, requiring a mechanism to quantify and mitigate indirect leakage while maintaining collaboration.

Method: The approach uses a two-stage Stackelberg game for server-client interaction, a mean-field estimator for privacy risk approximation, and derives Nash Equilibrium and convergence proofs.

Result: The mechanism improves client utilities, reduces server costs, and maintains model performance, outperforming baseline methods.

Conclusion: The proposed method effectively balances privacy and collaboration in federated learning, achieving near-optimal social welfare.

Abstract: Federated learning (FL) enables collaborative model training across decentralized clients without sharing local data, thereby enhancing privacy and facilitating collaboration among clients connected via social networks. However, these social connections introduce privacy externalities: a client’s privacy loss depends not only on its privacy protection strategy but also on the privacy decisions of others, propagated through the network via multi-hop interactions. In this work, we propose a socially-aware privacy-preserving FL mechanism that systematically quantifies indirect privacy leakage through a multi-hop propagation model. We formulate the server-client interaction as a two-stage Stackelberg game, where the server, as the leader, optimizes incentive policies, and clients, as followers, strategically select their privacy budgets, which determine their privacy-preserving levels by controlling the magnitude of added noise. To mitigate information asymmetry in networked privacy estimation, we introduce a mean-field estimator to approximate the average external privacy risk. We theoretically prove the existence and convergence of the fixed point of the mean-field estimator and derive closed-form expressions for the Stackelberg Nash Equilibrium. Despite being designed from a client-centric incentive perspective, our mechanism achieves approximately-optimal social welfare, as revealed by Price of Anarchy (PoA) analysis. Experiments on diverse datasets demonstrate that our approach significantly improves client utilities and reduces server costs while maintaining model performance, outperforming both Social-Agnostic (SA) baselines and methods that account for social externalities.
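
As a toy illustration of the mean-field idea, the loop below iterates a made-up best-response map until the population-average privacy budget stabilizes; the response function is a hypothetical stand-in for the paper's game, shown only to convey the fixed-point computation.

```python
# Toy mean-field fixed-point iteration for an average privacy budget.
import numpy as np

def best_response(mean_risk, reward=1.0, sensitivity=0.5):
    # Hypothetical best response: higher ambient risk -> smaller budget.
    return reward / (1.0 + sensitivity * mean_risk)

def mean_field_fixed_point(n_clients=10, iters=100, tol=1e-8):
    m = 1.0                                   # initial mean-field estimate
    for _ in range(iters):
        budgets = np.array([best_response(m) for _ in range(n_clients)])
        m_new = budgets.mean()                # update the population average
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

print(round(mean_field_fixed_point(), 4))     # converges to ~0.7321
```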

[736] Pareto Multi-Objective Alignment for Language Models

Qiang He, Setareh Maghsudi

Main category: cs.LG

TL;DR: PAMA is a scalable, efficient algorithm for multi-objective alignment in LLMs, addressing the limitations of single-reward RLHF by optimizing multiple conflicting objectives with theoretical guarantees.

DetailsMotivation: Current alignment methods like RLHF optimize LLMs for a single reward, leading to rigid behavior that doesn't capture diverse human preferences, hindering adaptability in real-world applications.

Method: Proposes Pareto Multi-Objective Alignment (PAMA), a convex optimization-based algorithm that transforms multi-objective RLHF into a scalable problem with a closed-form solution, reducing complexity from O(n^2*d) to O(n).

Result: PAMA demonstrates robust performance in experiments across models (125M to 7B parameters), converging to Pareto stationary points efficiently (milliseconds).

Conclusion: PAMA provides a practical, theoretically grounded solution for multi-objective alignment in LLMs, enabling adaptable AI deployments aligned with diverse human values.

Abstract: Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives, such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability. Traditional MOO approaches suffer from prohibitive O(n^2*d) complexity, where d represents the number of model parameters, typically in the billions for LLMs, rendering direct optimization infeasible. PAMA reduces this complexity to O(n) where n is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees that PAMA converges to a Pareto stationary point, where no objective can be improved without degrading at least one other. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA’s robust and effective MOA capabilities, aligning with its theoretical advantages. PAMA provides a highly efficient solution to the MOA problem that was previously considered intractable, offering a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments.
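
For intuition on closed-form multi-objective steps, here is the classic min-norm combination of two gradients, a textbook Pareto descent construction. PAMA's actual update rule may differ; this only illustrates how a cheap closed-form weight can replace an expensive general multi-objective solver.

```python
# Closed-form min-norm convex combination of two objective gradients
# (textbook construction, not PAMA's exact update).
import numpy as np

def pareto_direction(g1, g2):
    diff = g1 - g2
    denom = diff @ diff
    # Weight minimizing || a*g1 + (1-a)*g2 ||^2, clipped to [0, 1].
    alpha = 0.5 if denom == 0 else np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return alpha * g1 + (1 - alpha) * g2      # common descent direction

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(pareto_direction(g1, g2))               # -> [0.5, 0.5]
```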

[737] MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation

Yooseok Lim, ByoungJun Jeon, Seong-A Park, Jisoo Lee, Sae Won Choi, Chang Wook Jeong, Ho-Geol Ryu, Hongyeol Lee, Hyun-Lim Yang

Main category: cs.LG

TL;DR: MORE-CLEAR, a multimodal offline RL framework, uses LLMs to extract semantic info from clinical notes, improving sepsis management by enhancing patient state representation and outperforming single-modal RL methods.

DetailsMotivation: Early sepsis detection and management are critical, but existing RL approaches lack comprehensive patient understanding due to reliance on structured data alone.

Method: MORE-CLEAR integrates LLMs for semantic extraction from clinical notes, uses gated fusion and cross-modal attention for dynamic multimodal data integration, and validates on MIMIC-III, MIMIC-IV, and a private dataset.

Result: MORE-CLEAR significantly improves survival rates and policy performance over single-modal RL methods.

Conclusion: This is the first use of LLMs in multimodal offline RL for medical state representation, potentially enhancing sepsis treatment by providing a more comprehensive patient understanding.

Abstract: Sepsis, a life-threatening inflammatory response to infection, causes organ dysfunction, making early detection and optimal management critical. Previous reinforcement learning (RL) approaches to sepsis management rely primarily on structured data, such as lab results or vital signs, and therefore lack a comprehensive understanding of the patient’s condition. In this work, we propose a Multimodal Offline REinforcement learning for Clinical notes Leveraged Enhanced stAte Representation (MORE-CLEAR) framework for sepsis control in intensive care units. MORE-CLEAR employs pre-trained large-scale language models (LLMs) to facilitate the extraction of rich semantic representations from clinical notes, preserving clinical context and improving patient state representation. Gated fusion and cross-modal attention allow dynamic weight adjustment in the context of time and the effective integration of multimodal data. Extensive cross-validation using two public (MIMIC-III and MIMIC-IV) and one private dataset demonstrates that MORE-CLEAR significantly improves estimated survival rate and policy performance compared to single-modal RL approaches. To our knowledge, this is the first work to leverage LLM capabilities within multimodal offline RL for better state representation in medical applications. This approach can potentially expedite the treatment and management of sepsis by enabling reinforcement learning models to propose enhanced actions based on a more comprehensive understanding of patient conditions.
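
A compact sketch of a gated-fusion layer over a structured-data embedding and a clinical-note embedding, in the spirit of the described module; the dimensions and single-gate design are assumptions.

```python
# Toy gated fusion of two modality embeddings (design details assumed).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_struct, h_text):
        # Per-dimension gate decides how much each modality contributes.
        g = torch.sigmoid(self.gate(torch.cat([h_struct, h_text], dim=-1)))
        return g * h_struct + (1 - g) * h_text

fusion = GatedFusion(dim=8)
print(fusion(torch.randn(2, 8), torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```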

[738] Semantic-Enhanced Time-Series Forecasting via Large Language Models

Hao Liu, Chun Yang, Zhang xiaoxing, Xiaobin Zhu

Main category: cs.LG

TL;DR: The paper introduces SE-LLM, a semantic-enhanced LLM for time series forecasting, addressing modality gaps and improving interpretability and performance.

DetailsMotivation: Existing LLMs for time series forecasting focus on token-level alignment, missing intrinsic modality gaps and semantic representation.

Method: Proposes SE-LLM to embed time series periodicity and anomalies into semantic space, and a plugin module for long/short-term dependency modeling.

Result: SE-LLM outperforms SOTA methods, reducing computational consumption while enhancing performance.

Conclusion: SE-LLM effectively bridges modality gaps and improves LLM adaptability for time series analysis.

Abstract: Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modality alignment rather than bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that embeds the inherent periodicity and anomalous characteristics of time series into the semantic space to enhance token embeddings. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superior performance of our SE-LLM against the state-of-the-art (SOTA) methods.
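
One simple way to surface the periodicity the abstract refers to is a spectral peak. The toy below extracts a dominant period via FFT, the kind of signal that could then be lifted into the semantic space; the embedding step itself is omitted and the data are synthetic.

```python
# Toy dominant-period extraction via FFT (the semantic embedding is omitted).
import numpy as np

def dominant_period(series):
    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    k = spectrum[1:].argmax() + 1             # skip the DC component
    return len(series) / k                    # period in time steps

t = np.arange(256)
series = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.default_rng(0).normal(size=256)
print(round(dominant_period(series), 1))      # ~23.3, close to the 24-step cycle
```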

[739] From Source to Target: Leveraging Transfer Learning for Predictive Process Monitoring in Organizations

Sven Weinzierl, Sandra Zilker, Annina Liessmann, Martin Käppel, Weixin Wang, Martin Matzner

Main category: cs.LG

TL;DR: The paper introduces a transfer learning-based predictive process monitoring (PPM) technique to enable organizations with limited event data to implement PPM effectively.

DetailsMotivation: Existing PPM techniques require ample event data or resources, which may not be available to all organizations, limiting their ability to use PPM for proactive decision-making.

Method: The proposed technique uses transfer learning to apply knowledge from one business process to another, even across organizations. It is tested in two real-life IT service management use cases.

Result: Experiments show that knowledge transfer between similar processes, intra- or inter-organizationally, enables effective PPM in target contexts.

Conclusion: The technique allows organizations to leverage transfer learning for PPM, overcoming data scarcity by sharing pre-trained models within or across organizational boundaries.

Abstract: Event logs reflect the behavior of business processes that are mapped in organizational information systems. Predictive process monitoring (PPM) transforms these data into value by creating process-related predictions that provide the insights required for proactive interventions at process runtime. Existing PPM techniques require sufficient amounts of event data or other relevant resources that might not be readily available, preventing some organizations from utilizing PPM. The transfer learning-based PPM technique presented in this paper allows organizations without suitable event data or other relevant resources to implement PPM for effective decision support. The technique is instantiated in two real-life use cases, based on which numerical experiments are performed using event logs for IT service management processes in an intra- and inter-organizational setting. The results of the experiments suggest that knowledge of one business process can be transferred to a similar business process in the same or a different organization to enable effective PPM in the target context. With the proposed technique, organizations can benefit from transfer learning in an intra- and inter-organizational setting, where resources like pre-trained models are transferred within and across organizational boundaries.

[740] Energy Consumption in Parallel Neural Network Training

Philipp Huber, David Li, Juan Pedro Gutiérrez Hermosillo Muriedas, Deifilia Kieckhefen, Markus Götz, Achim Streit, Charlotte Debus

Main category: cs.LG

TL;DR: The paper examines how parallelization in neural network training affects energy consumption, showing linear scaling with GPU hours but varying factors based on model and hardware.

DetailsMotivation: Address the overlooked impact of parallelization on energy consumption in neural network training.

Method: Conducted scaling experiments with ResNet50 and FourCastNet, varying GPU count, global/local batch sizes, and measuring performance, time, and energy.

Result: Energy consumption scales linearly with GPU hours, but scaling factors vary by model, hardware, and training specifics.

Conclusion: Findings highlight the complex relationship between scaling and energy use, guiding sustainable AI research.

Abstract: The increasing demand for computational resources to train neural networks is leading to concerning growth in energy consumption. While parallelization has enabled upscaling model and dataset sizes and accelerated training, its impact on energy consumption is often overlooked. To close this research gap, we conducted scaling experiments for data-parallel training of two models, ResNet50 and FourCastNet, and evaluated the impact of parallelization parameters, i.e., GPU count, global batch size, and local batch size, on predictive performance, training time, and energy consumption. We show that energy consumption scales approximately linearly with the consumed resources, i.e., GPU hours; however, the respective scaling factor differs substantially between distinct model trainings and hardware, and is systematically influenced by the number of samples and gradient updates per GPU hour. Our results shed light on the complex interplay of scaling up neural network training and can inform future developments towards more sustainable AI research.
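
The reported relationship amounts to a one-parameter fit: energy is approximately linear in GPU hours, with a hardware- and model-specific slope. A minimal sketch with made-up measurements:

```python
# Fit energy ~ slope * GPU-hours + intercept (numbers are illustrative).
import numpy as np

gpu_hours = np.array([10.0, 20.0, 40.0, 80.0])
energy_kwh = np.array([3.1, 6.0, 12.4, 24.5])   # hypothetical measurements

slope, intercept = np.polyfit(gpu_hours, energy_kwh, 1)
print(f"~{slope:.2f} kWh per GPU hour (intercept {intercept:.2f})")
```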

[741] Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformer

Jingya Wang, Xin Deng, Wenjie Wei, Dehao Zhang, Shuai Wang, Qian Sun, Jieyuan Zhang, Hanwen Liu, Ning Xie, Malu Zhang

Main category: cs.LG

TL;DR: A training-free ANN-to-SNN conversion framework for Transformers using Multi-basis Exponential Decay (MBE) neurons achieves near-lossless accuracy with low latency.

DetailsMotivation: Existing ANN-to-SNN conversion methods struggle with nonlinear operations in Transformers and require fine-tuning, limiting efficiency.

Method: Proposes MBE neurons with exponential decay and multi-basis encoding to approximate nonlinear operations without modifying pre-trained ANN weights.

Result: Achieves near-lossless conversion accuracy across tasks (CV, NLU, NLG) and architectures (ViT, RoBERTa, GPT-2) with low latency.

Conclusion: The framework enables efficient, scalable deployment of Spiking Transformers in real-world applications.

Abstract: Leveraging the event-driven paradigm, Spiking Neural Networks (SNNs) offer a promising approach for constructing energy-efficient Transformer architectures. Compared to directly trained Spiking Transformers, ANN-to-SNN conversion methods bypass the high training costs. However, existing methods still suffer from notable limitations, failing to effectively handle nonlinear operations in Transformer architectures and requiring additional fine-tuning processes for pre-trained ANNs. To address these issues, we propose a high-performance and training-free ANN-to-SNN conversion framework tailored for Transformer architectures. Specifically, we introduce a Multi-basis Exponential Decay (MBE) neuron, which employs an exponential decay strategy and multi-basis encoding method to efficiently approximate various nonlinear operations. It removes the requirement for weight modifications in pre-trained ANNs. Extensive experiments across diverse tasks (CV, NLU, NLG) and mainstream Transformer architectures (ViT, RoBERTa, GPT-2) demonstrate that our method achieves near-lossless conversion accuracy with significantly lower latency. This provides a promising pathway for the efficient and scalable deployment of Spiking Transformers in real-world applications.
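
A speculative toy of multi-basis exponential-decay encoding: a scalar activation is greedily decomposed into spikes weighted by several decaying bases. The bases, horizon, and greedy rule are assumptions meant only to convey the flavor of such an encoding, not the paper's MBE neuron.

```python
# Toy multi-basis exponential-decay encoding of a scalar activation value.
import numpy as np

def mbe_encode(value, decays=(0.5, 0.3), T=8):
    spikes, residual = [], value
    # Spike at basis k, time t contributes weight d_k^(t+1) (assumed form).
    weights = [(k, t, d ** (t + 1)) for k, d in enumerate(decays) for t in range(T)]
    for k, t, w in sorted(weights, key=lambda z: -z[2]):   # largest first
        if w <= residual:
            residual -= w
            spikes.append((k, t))
    return spikes, value - residual   # spike train and reconstructed value

spikes, approx = mbe_encode(0.8)
print(len(spikes), round(approx, 3))   # a few spikes reconstruct ~0.8
```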

[742] Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information

Jinghan Yang, Jiayu Weng

Main category: cs.LG

TL;DR: A mutual information-based framework for filtering noisy or mislabeled data in deep learning, improving model accuracy by up to 15% under label corruption.

DetailsMotivation: Deep neural networks can memorize corrupted labels, and real-world datasets often suffer from both label and input noise, necessitating robust data selection methods.

Method: Proposes a mutual information-based framework to quantify statistical dependencies between inputs and labels, identifying noisy samples by their low contribution to mutual information.

Result: Empirical validation on MNIST shows the method effectively filters low-quality samples, improving classification accuracy by up to 15% under label corruption.

Conclusion: The framework is robust to benign input modifications, preserving valid data while filtering truly corrupted samples, enhancing model performance.

Abstract: Deep neural networks can memorize corrupted labels, making data quality critical for model performance, yet real-world datasets are frequently compromised by both label noise and input noise. This paper proposes a mutual information-based framework for data selection under hybrid noise scenarios that quantifies statistical dependencies between inputs and labels. We compute each sample’s pointwise contribution to the overall mutual information and find that lower contributions indicate noisy or mislabeled instances. Empirical validation on MNIST with different synthetic noise settings demonstrates that the method effectively filters low-quality samples. Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling. Furthermore, the method exhibits robustness to benign input modifications, preserving semantically valid data while filtering truly corrupted samples.
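
A minimal sketch of the scoring idea, assuming a trained classifier's predicted probabilities: each sample's pointwise contribution is log p(y|x) - log p(y), and the lowest-scoring fraction is dropped. The marginal estimate and threshold below are assumptions.

```python
# Toy pointwise-mutual-information filtering of suspected mislabeled samples.
import numpy as np

def pmi_scores(probs, labels):
    """probs: (N, C) predicted class probabilities; labels: (N,) int labels."""
    p_y_given_x = probs[np.arange(len(labels)), labels]
    p_y = probs.mean(axis=0)[labels]          # crude marginal label probability
    return np.log(p_y_given_x + 1e-12) - np.log(p_y + 1e-12)

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.05, 0.95]])
labels = np.array([0, 1, 0])                  # last label looks corrupted
scores = pmi_scores(probs, labels)
keep = scores > np.quantile(scores, 0.33)     # drop the lowest-PMI third
print(scores.round(2), keep)                  # corrupted sample is filtered
```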

[743] Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng

Main category: cs.LG

TL;DR: The paper reviews RL techniques for LLM reasoning, identifies challenges like lack of standardization, and proposes a minimalist combination of two techniques for improved performance.

DetailsMotivation: Address the absence of standardized guidelines and fragmented understanding of RL techniques in LLM reasoning, along with inconsistent experimental results.

Method: Systematic review and rigorous reproduction of RL techniques within a unified framework, analyzing mechanisms, scenarios, and principles through fine-grained experiments.

Result: A minimalist combination of two techniques outperforms existing strategies like GRPO and DAPO, unlocking critic-free policy learning with vanilla PPO loss.

Conclusion: Provides clear guidelines for selecting RL techniques in LLM reasoning and demonstrates the effectiveness of a simple, high-performing combination.

Abstract: Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating RL for LLMs. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

[744] Separation and Collaboration: Two-Level Routing Grouped Mixture-of-Experts for Multi-Domain Continual Learning

Jialu Zhou, Dianxi Shi, Shaowu Yang, Xinyu Wei, Mingyue Yang, Leqian Li, Mengzhu Wang, Chunping Qiu

Main category: cs.LG

TL;DR: TRGE method addresses catastrophic and forward forgetting in MDCL by dynamic expert expansion, intra-group routing, and inter-group collaboration, leveraging MLLMs for task identification and CLIP fusion.

DetailsMotivation: To tackle catastrophic and forward forgetting in Multi-Domain Continual Learning (MDCL) with shifting class sets and distributions.

Method: Proposes TRGE: dynamic expert expansion, intra-group routing, inter-group routing policy, MLLMs for task identification, and CLIP fusion.

Result: Outperforms advanced methods with fewer parameters in various settings.

Conclusion: TRGE effectively mitigates forgetting and enhances inter-task collaboration in MDCL.

Abstract: Multi-Domain Continual Learning (MDCL) acquires knowledge from sequential tasks with shifting class sets and distributions. Although Parameter-Efficient Fine-Tuning (PEFT) methods can adapt to this dual heterogeneity, they still suffer from catastrophic forgetting and forward forgetting. To address these challenges, we propose a Two-Level Routing Grouped Mixture-of-Experts (TRGE) method. Firstly, TRGE dynamically expands the pre-trained CLIP model, assigning a specific expert group to each task to mitigate catastrophic forgetting. As the number of experts continually grows in this process, TRGE keeps the expert count within each group static and introduces an intra-group router to alleviate the routing overfitting caused by increasing routing complexity. Meanwhile, we design an inter-group routing policy based on task identifiers and task prototype distance, which dynamically selects relevant expert groups and combines their outputs to enhance inter-task collaboration. Secondly, to obtain the correct task identifiers, we leverage Multimodal Large Language Models (MLLMs), whose powerful multimodal comprehension capabilities allow them to generate semantic task descriptions and recognize the correct task identifier. Finally, to mitigate forward forgetting, we dynamically fuse outputs for unseen samples from the frozen CLIP model and the TRGE adapter based on training progress, leveraging both pre-trained and learned knowledge. Through extensive experiments across various settings, our method outperforms other advanced methods with fewer trainable parameters.
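
A rough sketch of two-level routing: pick an expert group by nearest task prototype, then mix that group's experts with an intra-group softmax gate. The shapes, top-1 group choice, and toy experts are illustrative assumptions.

```python
# Toy two-level routing: inter-group by prototype distance, intra-group by gate.
import torch

def two_level_route(h, prototypes, intra_routers, expert_groups):
    """h: (d,) feature; prototypes: (G, d) task prototypes; intra_routers[g]:
    (E, d) gate weights for group g; expert_groups[g]: list of E callables."""
    g = int(torch.cdist(h[None], prototypes).argmin())   # nearest task prototype
    gate = torch.softmax(intra_routers[g] @ h, dim=0)    # intra-group routing
    return sum(w * expert(h) for w, expert in zip(gate, expert_groups[g]))

d, G, E = 8, 3, 2
prototypes = torch.randn(G, d)
intra = [torch.randn(E, d) for _ in range(G)]
groups = [[(lambda x, s=i + 1: x * s) for i in range(E)] for _ in range(G)]
print(two_level_route(torch.randn(d), prototypes, intra, groups).shape)
```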

[745] A Tutorial: An Intuitive Explanation of Offline Reinforcement Learning Theory

Fengdi Che

Main category: cs.LG

TL;DR: The paper surveys theoretical insights in offline RL, linking them to practical algorithm design, highlighting challenges like data coverage and function representation.

DetailsMotivation: To bridge the gap between theoretical advances in offline RL and practical algorithm design, exploring key intuitions and their implications.

Method: Analyzes conditions for proofs (function representation, data coverage), examines counterexamples, and discusses sufficient conditions for offline RL.

Result: Identifies inherent challenges and limitations of offline RL, emphasizing the need for novel solutions when theoretical conditions aren’t met.

Conclusion: Theoretical insights guide practical algorithm design but reveal limitations, urging further innovation in offline RL.

Abstract: Offline reinforcement learning (RL) aims to optimize the return given a fixed dataset of agent trajectories without additional interactions with the environment. While algorithm development has progressed rapidly, significant theoretical advances have also been made in understanding the fundamental challenges of offline RL. However, bridging these theoretical insights with practical algorithm design remains an ongoing challenge. In this survey, we explore key intuitions derived from theoretical work and their implications for offline RL algorithms. We begin by listing the conditions needed for the proofs, including function representation and data coverage assumptions. Function representation conditions tell us what to expect for generalization, and data coverage assumptions describe the quality requirement of the data. We then examine counterexamples, where offline RL is not solvable without an impractically large amount of data. These cases highlight what cannot be achieved for all algorithms and the inherent hardness of offline RL. Building on techniques to mitigate these challenges, we discuss the conditions that are sufficient for offline RL. These conditions are not merely assumptions for theoretical proofs, but they also reveal the limitations of these algorithms and remind us to search for novel solutions when the conditions cannot be satisfied.

[746] Sparse Probabilistic Graph Circuits

Martin Rektoris, Milan Papež, Václav Šmídl, Tomáš Pevný

Main category: cs.LG

TL;DR: The paper introduces Sparse Probabilistic Graph Circuits (SPGCs) to address scalability issues in tractable generative models for graphs, reducing complexity from O(n²) to O(n + m).

DetailsMotivation: Deep generative models (DGMs) for graphs are intractable due to non-linearities, while existing Probabilistic Graph Circuits (PGCs) are tractable but inefficient for sparse graphs.

Method: The authors propose Sparse PGCs, which operate directly on sparse graph representations, reducing complexity to O(n + m).

Result: SPGCs retain exact inference capabilities, improve memory efficiency and inference speed, and match intractable DGMs in performance for de novo drug design.

Conclusion: SPGCs offer a scalable and efficient solution for tractable probabilistic inference in sparse graphs, maintaining performance comparable to intractable models.

Abstract: Deep generative models (DGMs) for graphs achieve impressively high expressive power thanks to very efficient and scalable neural networks. However, these networks contain non-linearities that prevent analytical computation of many standard probabilistic inference queries, i.e., these DGMs are considered intractable. While recently proposed Probabilistic Graph Circuits (PGCs) address this issue by enabling tractable probabilistic inference, they operate on dense graph representations with $\mathcal{O}(n^2)$ complexity for graphs with $n$ nodes and $m$ edges. To address this scalability issue, we introduce Sparse PGCs, a new class of tractable generative models that operate directly on sparse graph representations, reducing the complexity to $\mathcal{O}(n + m)$, which is particularly beneficial for $m \ll n^2$. In the context of de novo drug design, we empirically demonstrate that SPGCs retain exact inference capabilities, improve memory efficiency and inference speed, and match the performance of intractable DGMs in key metrics.

[747] Topological Feature Compression for Molecular Graph Neural Networks

Rahul Khorana

Main category: cs.LG

TL;DR: A novel Graph Neural Network (GNN) architecture combines compressed higher-order topological signals with standard molecular features for improved accuracy, interpretability, and efficiency in molecular representation learning.

DetailsMotivation: Extracting general chemical insight while balancing predictive accuracy, interpretability, and computational efficiency remains a challenge in molecular representation learning.

Method: Introduces a GNN architecture that integrates compressed higher-order topological signals with standard molecular features to capture global geometric information.

Result: Achieves superior performance in accuracy and robustness across various benchmarks, including small-molecule and complex material datasets.

Conclusion: The proposed GNN architecture effectively balances accuracy, interpretability, and efficiency, with open-sourced code for broader use.

Abstract: Recent advances in molecular representation learning have produced highly effective encodings of molecules for numerous cheminformatics and bioinformatics tasks. However, extracting general chemical insight while balancing predictive accuracy, interpretability, and computational efficiency remains a major challenge. In this work, we introduce a novel Graph Neural Network (GNN) architecture that combines compressed higher-order topological signals with standard molecular features. Our approach captures global geometric information while preserving computational tractability and human-interpretable structure. We evaluate our model across a range of benchmarks, from small-molecule datasets to complex material datasets, and demonstrate superior performance using a parameter-efficient architecture. We achieve the best results in both accuracy and robustness across almost all benchmarks. We open-source all code; all code and results can be found on GitHub at https://github.com/rahulkhorana/TFC-PACT-Net.

[748] EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

Huanyu Liu, Jia Li, Chang Yu, Taozhi Chen, Yihong Dong, Lecheng Wang, Hu XiaoLong, Ge Li

Main category: cs.LG

TL;DR: EvoCoT is a self-evolving curriculum learning framework for improving LLM reasoning via two-stage CoT optimization, addressing sparse rewards in RLVR by controlled exploration.

DetailsMotivation: Overcoming sparse rewards and exploration bottlenecks in RLVR for LLMs, without relying on stronger models or filtering hard problems.

Method: EvoCoT uses self-generated and verified CoT trajectories, gradually shortening them to expand exploration space.

Result: Enables LLMs to solve previously unsolved problems, improves reasoning without external CoT supervision, and works with various RL methods.

Conclusion: EvoCoT is scalable, effective, and compatible, advancing LLM reasoning under sparse rewards.

Abstract: Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on stronger LLMs for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens them to expand the space in a controlled way. This enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
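
A schematic sketch of the described loop with stubbed helpers: self-generate a CoT, keep it only if it verifies, then train on progressively shorter prefixes so the constrained exploration space widens over rounds. Every helper below is hypothetical.

```python
# Schematic EvoCoT-style curriculum loop; all helpers are hypothetical stubs.
def generate_cot(model, problem):
    return model(problem)                     # returns a list of reasoning steps

def verify(problem, cot) -> bool:
    return cot[-1] == problem["answer"]       # toy answer check

def evocot_round(model, train_step, problems, shrink=0.8):
    for problem in problems:
        cot = generate_cot(model, problem)
        if not verify(problem, cot):
            continue                          # skip unverified trajectories
        hint = cot
        while len(hint) > 1:                  # curriculum: ever-shorter hints
            train_step(problem, hint)         # learn with partial guidance
            hint = hint[: max(1, int(len(hint) * shrink))]

demo = [{"question": "2+2?", "answer": "4"}]
evocot_round(lambda p: ["2+2", "4"], lambda p, h: None, demo)
```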

[749] Learning Satellite Attitude Dynamics with Physics-Informed Normalising Flow

Carlo Cena, Mauro Martini, Marcello Chiaberge

Main category: cs.LG

TL;DR: The paper explores using Physics-Informed Neural Networks (PINNs) for spacecraft attitude control, showing improved performance over purely data-driven methods when integrated with MPC.

DetailsMotivation: Traditional MPC relies on accurate physics models, which can be incomplete or costly. Machine learning offers flexibility but struggles with generalization and stability.

Method: Used Real NVP neural networks with self-attention, trained on Basilisk simulator data, comparing purely data-driven and physics-informed approaches.

Result: Physics-informed models reduced mean relative error by 27.08% and improved control accuracy (up to 42.86%) and robustness in MPC.

Conclusion: Incorporating physics into neural networks enhances spacecraft attitude control, outperforming purely data-driven methods in MPC frameworks.

Abstract: Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that the inclusion of physics-based information significantly improves performance, reducing the mean relative error of the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, yielding improvements of up to 42.86% in performance stability error and greater robustness to noise.
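
A condensed sketch of what a physics-informed term could look like for attitude dynamics, using Euler's rigid-body equations as the residual; the inertia values, network outputs, and loss weighting are assumptions, not the paper's setup.

```python
# Toy physics-informed loss: data term plus Euler rigid-body residual,
# w_dot = I^{-1} (tau - w x (I w)), computed row-wise over a batch.
import torch

I_mat = torch.diag(torch.tensor([0.9, 1.1, 1.3]))   # assumed inertia (kg m^2)
I_inv = torch.linalg.inv(I_mat)

def physics_residual(w, w_dot_pred, tau):
    rhs = (tau - torch.linalg.cross(w, w @ I_mat, dim=-1)) @ I_inv
    return w_dot_pred - rhs

def pinn_loss(w, w_dot_pred, w_dot_data, tau, lam=0.1):
    data_term = torch.mean((w_dot_pred - w_dot_data) ** 2)
    physics_term = torch.mean(physics_residual(w, w_dot_pred, tau) ** 2)
    return data_term + lam * physics_term

w, tau = torch.randn(4, 3), torch.randn(4, 3)
print(pinn_loss(w, torch.randn(4, 3), torch.randn(4, 3), tau))
```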

[750] Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant

Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick

Main category: cs.LG

TL;DR: The paper discusses the potential of participant simulators in behavioral sciences, focusing on Centaur, an LLM fine-tuned for human-like behavior. While Centaur shows predictive accuracy, its generative behavior diverges from human data, falling short of being a reliable simulator or cognitive model.

DetailsMotivation: To evaluate Centaur as a participant simulator in behavioral sciences, inspired by the success of simulators like AlphaFold in natural sciences.

Method: Review and assess Centaur’s performance against core criteria for a participant simulator, focusing on predictive accuracy and generative behavior.

Result: Centaur demonstrates strong predictive accuracy but fails in generative behavior, deviating systematically from human data.

Conclusion: Centaur is a step forward but does not yet meet the standards for a reliable participant simulator or accurate cognitive model.

Abstract: Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel-prize winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator - a system capable of producing human-like behavior across cognitive tasks - would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for “in silico prototyping of experimental studies”, e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior - a critical criterion for a participant simulator - systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.

[751] Score Augmentation for Diffusion Models

Liang Hou, Yuan Gao, Boyuan Jiang, Xin Tao, Qi Yan, Renjie Liao, Pengfei Wan, Di Zhang, Kun Gai

Main category: cs.LG

TL;DR: The paper introduces ScoreAug, a data augmentation framework for diffusion models to address overfitting in data-limited regimes by transforming noisy data and enforcing equivariant learning.

DetailsMotivation: To mitigate overfitting in diffusion models, especially when training data is limited, by leveraging the denoising mechanism inherent to diffusion models.

Method: Proposes ScoreAug, which applies transformations to noisy data and requires the denoiser to predict the augmentation of the original target, creating an equivariant learning objective.

Result: ScoreAug significantly improves performance on benchmarks like CIFAR-10, FFHQ, AFHQv2, and ImageNet, mitigating overfitting and ensuring stable convergence.

Conclusion: ScoreAug effectively addresses overfitting in diffusion models, outperforms traditional augmentation, and can be combined with standard techniques for further gains.

Abstract: Diffusion models have achieved remarkable success in generative modeling. However, this study confirms the existence of overfitting in diffusion model training, particularly in data-limited regimes. To address this challenge, we propose Score Augmentation (ScoreAug), a novel data augmentation framework specifically designed for diffusion models. Unlike conventional augmentation approaches that operate on clean data, ScoreAug applies transformations to noisy data, aligning with the inherent denoising mechanism of diffusion. Crucially, ScoreAug further requires the denoiser to predict the augmentation of the original target. This design establishes an equivariant learning objective, enabling the denoiser to learn scores across varied denoising spaces, thereby realizing what we term score augmentation. We also theoretically analyze the relationship between scores in different spaces under general transformations. In experiments, we extensively validate ScoreAug on multiple benchmarks including CIFAR-10, FFHQ, AFHQv2, and ImageNet, with results demonstrating significant performance improvements over baselines. Notably, ScoreAug effectively mitigates overfitting across diverse scenarios, such as varying data scales and model capacities, while exhibiting stable convergence properties. Another advantage of ScoreAug over standard data augmentation lies in its ability to circumvent data leakage issues under certain conditions. Furthermore, we show that ScoreAug can be synergistically combined with traditional data augmentation techniques to achieve additional performance gains.
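
The equivariant objective is compact enough to sketch: transform the noisy input and train the denoiser to predict the same transform of the clean target. A horizontal flip stands in for a general transformation and a single conv layer for the denoiser; this is a sketch of the idea, not the paper's training code.

```python
import torch

# Sketch of the ScoreAug objective as described in the abstract: transform the
# *noisy* input and ask the denoiser to predict the same transform of the clean
# target (an equivariance constraint). A toy conv net stands in for the model.

denoiser = torch.nn.Conv2d(3, 3, 3, padding=1)

def scoreaug_loss(x0, sigma):
    noise = torch.randn_like(x0)
    x_t = x0 + sigma * noise                      # forward diffusion (simplified)
    T = lambda img: torch.flip(img, dims=[-1])    # augmentation on noisy data
    pred = denoiser(T(x_t))
    target = T(x0)                                # denoiser must predict T(x0)
    return torch.mean((pred - target) ** 2)

x0 = torch.randn(4, 3, 32, 32)
print(scoreaug_loss(x0, sigma=0.5))
```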

[752] Adaptive Fine-Tuning via Pattern Specialization for Deep Time Series Forecasting

Amal Saadallah, Abdulaziz Al-Ademi

Main category: cs.LG

TL;DR: A novel framework improves DNN-based time series forecasting by adapting and selecting specialized models for evolving patterns, validated on diverse architectures.

DetailsMotivation: Addressing the challenge of non-stationary environments in time series forecasting where patterns change over time.

Method: Trains a base DNN offline, segments validation data to cluster dominant patterns, fine-tunes specialized models per cluster, and uses similarity measures for model selection at inference. Includes concept drift detection.

Result: Demonstrates significant performance gains on traditional and advanced DNN architectures in the GluonTS library.

Conclusion: The framework is generalizable and effective for enhancing DNN performance in non-stationary time series forecasting.

Abstract: Time series forecasting poses significant challenges in non-stationary environments where underlying patterns evolve over time. In this work, we propose a novel framework that enhances deep neural network (DNN) performance by leveraging specialized model adaptation and selection. Initially, a base DNN is trained offline on historical time series data. A reserved validation subset is then segmented to extract and cluster the most dominant patterns within the series, thereby identifying distinct regimes. For each identified cluster, the base DNN is fine-tuned to produce a specialized version that captures unique pattern characteristics. At inference, the most recent input is matched against the cluster centroids, and the corresponding fine-tuned version is deployed based on the closest similarity measure. Additionally, our approach integrates a concept drift detection mechanism to identify and adapt to emerging patterns caused by non-stationary behavior. The proposed framework is generalizable across various DNN architectures and has demonstrated significant performance gains on both traditional DNNs and recent advanced architectures implemented in the GluonTS library.
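
The cluster-and-route step can be sketched with off-the-shelf k-means. The `specialists` dict below holds stand-in callables where the real framework would keep fine-tuned copies of the base DNN; concept-drift detection is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the routing logic described in the abstract: cluster dominant
# patterns in a validation split, keep one fine-tuned specialist per cluster,
# and dispatch each new window to the specialist of its nearest centroid.

rng = np.random.default_rng(0)
val_windows = rng.normal(size=(200, 24))          # 200 windows of length 24

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(val_windows)

# One hypothetical specialist per regime (fine-tuned copies of a base DNN
# in the real framework; trivial callables here).
specialists = {c: (lambda x, c=c: x.mean() + c) for c in range(3)}

def forecast(window):
    cluster = km.predict(window.reshape(1, -1))[0]   # nearest-centroid routing
    return specialists[cluster](window)

print(forecast(rng.normal(size=24)))
```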

[753] Shapley-Inspired Feature Weighting in $k$-means with No Additional Hyperparameters

Richard J. Fawley, Renato Cordeiro de Amorim

Main category: cs.LG

TL;DR: SHARK is a feature-weighted clustering algorithm using Shapley values for relevance, outperforming existing methods without extra tuning.

DetailsMotivation: Traditional clustering assumes equal feature importance, which fails in high-dimensional or noisy data. Feature weighting methods often require additional tuning.

Method: SHARK uses Shapley values to quantify feature relevance, iteratively re-weighting features by their Shapley contribution, reducing computation time.

Result: SHARK matches or outperforms existing methods, showing robustness and accuracy, especially in noisy scenarios.

Conclusion: SHARK provides a parameter-free, efficient, and accurate solution for feature-weighted clustering.

Abstract: Clustering algorithms often assume all features contribute equally to the data structure, an assumption that usually fails in high-dimensional or noisy settings. Feature weighting methods can address this, but most require additional parameter tuning. We propose SHARK (Shapley Reweighted $k$-means), a feature-weighted clustering algorithm motivated by the use of Shapley values from cooperative game theory to quantify feature relevance, which requires no additional parameters beyond those in $k$-means. We prove that the $k$-means objective can be decomposed into a sum of per-feature Shapley values, providing an axiomatic foundation for unsupervised feature relevance and reducing Shapley computation from exponential to polynomial time. SHARK iteratively re-weights features by the inverse of their Shapley contribution, emphasising informative dimensions and down-weighting irrelevant ones. Experiments on synthetic and real-world data sets show that SHARK consistently matches or outperforms existing methods, achieving superior robustness and accuracy, particularly in scenarios where noise may be present. Software: https://github.com/rickfawley/shark.
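
A simplified reading of the re-weighting loop, under the assumption that each feature's share of the within-cluster sum of squares plays the role of its Shapley contribution: weight each feature by the inverse of that share and re-cluster. The normalization and iteration schedule here are illustrative, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Simplified sketch of the SHARK idea: the k-means objective decomposes into
# per-feature dispersion terms (interpreted as Shapley contributions), and
# features are re-weighted by the inverse of their contribution.

def shark_step(X, k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    phi = ((X - centers[labels]) ** 2).sum(axis=0)  # per-feature within-cluster SS
    w = 1.0 / (phi + 1e-12)
    return labels, w / w.sum()                      # inverse-contribution weights

rng = np.random.default_rng(1)
blocks = rng.integers(0, 3, (300, 1)) * 5.0         # three latent regimes
X = np.hstack([rng.normal(loc=blocks, size=(300, 2)),      # 2 informative dims
               rng.normal(scale=3.0, size=(300, 3))])      # 3 high-noise dims
labels, w = shark_step(X, k=3)
for _ in range(5):                                  # re-weight, then re-cluster
    labels, w = shark_step(X * np.sqrt(w), k=3)
print(np.round(w, 3))                               # noisy dims get low weight
```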

[754] WeChat-YATT: A Simple, Scalable and Balanced RLHF Trainer

Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Tingfeng Xian, Haoqiang Hong, Boqi Chen, Haotao Tian, Tao Yang, Yunsheng Shi, Feng Lin, Ting Yao

Main category: cs.LG

TL;DR: WeChat-YATT is a scalable RLHF training framework addressing controller scalability and workflow inefficiencies, improving throughput and GPU utilization.

DetailsMotivation: Challenges in scaling RLHF for complex workflows and dynamic workloads, including controller limitations and inefficiencies in resource allocation.

Method: Introduces WeChat-YATT with a parallel controller model and dynamic placement schema for efficient resource use.

Result: Achieves higher throughput and better GPU utilization compared to existing frameworks, successfully deployed in WeChat.

Conclusion: WeChat-YATT effectively addresses scalability and efficiency in RLHF training, proving robust in real-world applications.

Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent paradigm for training large language models and multimodal systems. Despite notable advances enabled by existing RLHF training frameworks, significant challenges remain in scaling to complex multimodal workflows and adapting to dynamic workloads. In particular, current systems often encounter limitations related to controller scalability when managing large models, as well as inefficiencies in orchestrating intricate RLHF pipelines, especially in scenarios that require dynamic sampling and resource allocation. In this paper, we introduce WeChat-YATT (Yet Another Transformer Trainer in WeChat), a simple, scalable, and balanced RLHF training framework specifically designed to address these challenges. WeChat-YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows, effectively mitigating the bottlenecks associated with centralized controller architectures and facilitating scalability in large-scale data scenarios. In addition, we propose a dynamic placement schema that adaptively partitions computational resources and schedules workloads, thereby significantly reducing hardware idle time and improving GPU utilization under variable training conditions. We evaluate WeChat-YATT across a range of experimental scenarios, demonstrating that it achieves substantial improvements in throughput compared to state-of-the-art RLHF training frameworks. Furthermore, WeChat-YATT has been successfully deployed to train models supporting WeChat product features for a large-scale user base, underscoring its effectiveness and robustness in real-world applications.

[755] A Physics-informed Deep Operator for Real-Time Freeway Traffic State Estimation

Hongxin Yu, Yibing Wang, Fengyue Jin, Meng Zhang, Anni Chen

Main category: cs.LG

TL;DR: The paper introduces a physics-informed deep operator network (PI-DeepONet) for real-time freeway traffic state estimation, outperforming baseline methods with high precision.

DetailsMotivation: To improve traffic state estimation (TSE) accuracy by combining model-driven and data-driven approaches, leveraging the strengths of both.

Method: Extended PI-DeepONet architecture with 2-D data input, nonlinear expansion, attention mechanism, MIMO, and adaptive parameter identification.

Result: Outperformed baseline methods in estimating flow and mean speed on NGSIM and a Chinese expressway.

Conclusion: The proposed PI-DeepONet-based TSE method is effective and superior to existing approaches.

Abstract: Traffic state estimation (TSE) falls methodologically into three categories: model-driven, data-driven, and model-data dual-driven. Model-driven TSE relies on macroscopic traffic flow models originating from hydrodynamics. Data-driven TSE leverages historical sensing data and employs statistical models or machine learning methods to infer traffic state. Model-data dual-driven traffic state estimation attempts to harness the strengths of both aspects to achieve more accurate TSE. From the perspective of mathematical operator theory, TSE can be viewed as an operator that maps available measurements of the traffic state of interest into unmeasured traffic state variables in real time. For the first time, this paper proposes to study real-time freeway TSE with a physics-informed deep operator network (PI-DeepONet), an operator-oriented architecture that embeds traffic flow models in deep neural networks. The paper develops an extended architecture based on the original PI-DeepONet, featuring: (1) acceptance of 2-D data input to support CNN-based computations; (2) a nonlinear expansion layer, an attention mechanism, and a MIMO mechanism; (3) a dedicated neural network design for adaptive identification of traffic flow model parameters. A traffic state estimator built on this extended PI-DeepONet architecture was evaluated on a short freeway stretch from NGSIM and a large-scale urban expressway in China, along with four other baseline TSE methods. The evaluation results demonstrate that this novel TSE method outperformed the baselines, producing high-precision estimates of flow and mean speed.

[756] Learning to Select MCP Algorithms: From Traditional ML to Dual-Channel GAT-MLP

Xiang Li, Shanshan Wang, Chenglong Xiao

Main category: cs.LG

TL;DR: A learning-based framework combining traditional ML and graph neural networks is proposed for selecting algorithms for the Maximum Clique Problem (MCP). RF performs best among conventional classifiers, while the dual-channel GAT-MLP model excels overall.

DetailsMotivation: No single algorithm consistently performs best for MCP, and there's a lack of research on algorithm selection for MCP.

Method: Constructed a labeled dataset using four exact MCP algorithms and extracted graph features. Evaluated conventional classifiers (SVM, RF, DT, KNN) and developed the GAT-MLP dual-channel model.

Result: RF was the best conventional classifier, and GAT-MLP outperformed all models, showing the promise of graph neural networks.

Conclusion: Dual-channel architectures and graph neural networks are effective for combinatorial algorithm selection, with GAT-MLP demonstrating strong performance.

Abstract: Extensive experiments and prior studies show that no single maximum clique algorithm consistently performs best across all instances, highlighting the importance of selecting suitable algorithms based on instance features. An extensive analysis of relevant studies reveals a lack of research on algorithm selection for the Maximum Clique Problem (MCP). In this work, we propose a learning-based framework that integrates both traditional machine learning and graph neural networks to address this gap. We construct a labeled dataset by running four exact MCP algorithms on a diverse collection of graph instances, accompanied by structural and global statistical features extracted from each graph. We first evaluate four conventional classifiers: Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and K-Nearest Neighbors (KNN), across multiple dataset variants. Experimental results show that RF consistently delivers strong performance across metrics and dataset variants, making it a reliable baseline. In addition, feature importance analysis indicates that connectivity and topological structure are strong predictors of algorithm performance. Building on these findings, we develop a dual-channel model named GAT-MLP, which combines a Graph Attention Network (GAT) for local structural encoding with a Multilayer Perceptron (MLP) for global feature modeling. The GAT-MLP model shows strong and consistent performance across all metrics. Our results highlight the effectiveness of dual-channel architectures and the promise of graph neural networks in combinatorial algorithm selection.

[757] Communication-Efficient Zero-Order and First-Order Federated Learning Methods over Wireless Networks

Mohamad Assaad, Zeinab Nehme, Merouane Debbah

Main category: cs.LG

TL;DR: The paper proposes two communication-efficient Federated Learning methods to reduce overhead by using scalar values and simultaneous transmissions, leveraging channel information without extra resources.

DetailsMotivation: FL faces high communication overhead due to data exchange in wireless systems with limited capacity.

Method: Two methods: zero-order optimization with two-point gradient estimator and first-order gradient computation, both leveraging channel info and supporting asynchronous devices.

Result: Analytical framework shows convergence guarantees and performance bounds for the methods.

Conclusion: The proposed methods effectively reduce FL communication overhead while maintaining performance.

Abstract: Federated Learning (FL) is an emerging learning framework that enables edge devices to collaboratively train ML models without sharing their local data. FL faces, however, a significant challenge due to the high amount of information that must be exchanged between the devices and the aggregator in the training phase, which can exceed the limited capacity of wireless systems. In this paper, two communication-efficient FL methods are considered, where communication overhead is reduced by communicating scalar values instead of long vectors and by allowing a high number of users to send information simultaneously. The first approach employs a zero-order optimization technique with a two-point gradient estimator, while the second involves a first-order gradient computation strategy. The novelty lies in leveraging channel information in the learning algorithms, hence eliminating the need for additional resources to acquire channel state information (CSI) or to remove its impact, as well as in considering asynchronous devices. We provide a rigorous analytical framework for the two methods, deriving convergence guarantees and establishing appropriate performance bounds.
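
The two-point estimator at the heart of the first method is small enough to show end-to-end: on a shared random direction, each client uploads a single scalar finite difference. The sketch below omits the paper's channel-aware weighting and asynchronous updates.

```python
import numpy as np

# Sketch of the two-point zero-order gradient estimator that lets clients send
# scalars instead of gradient vectors. Only the core estimator is shown.

rng = np.random.default_rng(0)

def local_loss(theta, data):
    X, y = data
    return np.mean((X @ theta - y) ** 2)

def zo_round(theta, clients, mu=1e-3, lr=0.02):
    u = rng.standard_normal(theta.shape)        # shared random direction
    # Each client uploads ONE scalar: a two-point finite difference.
    scalars = [(local_loss(theta + mu * u, d) - local_loss(theta - mu * u, d))
               / (2 * mu) for d in clients]
    g_hat = np.mean(scalars) * u                # aggregated gradient estimate
    return theta - lr * g_hat

d = 5
clients = [(rng.standard_normal((50, d)), rng.standard_normal(50))
           for _ in range(4)]
theta = np.zeros(d)
for _ in range(300):
    theta = zo_round(theta, clients)
print(round(float(np.mean([local_loss(theta, c) for c in clients])), 4))
```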

[758] Deep Learning-Based Analysis of Power Consumption in Gasoline, Electric, and Hybrid Vehicles

Roksana Yahyaabadi, Ghazal Farhani, Taufiq Rahman, Soodeh Nikan, Abdullah Jirjees, Fadi Araji

Main category: cs.LG

TL;DR: A scalable data-driven method for power consumption prediction in ICE, EV, and HEV platforms achieves high accuracy, with cumulative errors under 4.1%.

DetailsMotivation: Traditional methods for power consumption prediction are impractical for large-scale deployment, necessitating a more scalable and accurate approach.

Method: Uses powertrain dynamic feature sets with traditional machine learning and deep neural networks (Transformer and LSTM) for estimation.

Result: ICE models achieved high accuracy (cumulative errors under 3%), while Transformer and LSTM models performed best for EVs (4.1%) and HEVs (2.1%).

Conclusion: The method is effective across vehicle types, but robust models are needed for advanced powertrains due to dataset variability.

Abstract: Accurate power consumption prediction is crucial for improving efficiency and reducing environmental impact, yet traditional methods relying on specialized instruments or rigid physical models are impractical for large-scale, real-world deployment. This study introduces a scalable data-driven method using powertrain dynamic feature sets and both traditional machine learning and deep neural networks to estimate instantaneous and cumulative power consumption in internal combustion engine (ICE), electric vehicle (EV), and hybrid electric vehicle (HEV) platforms. ICE models achieved high instantaneous accuracy with mean absolute error and root mean squared error on the order of $10^{-3}$, and cumulative errors under 3%. Transformer and long short-term memory models performed best for EVs and HEVs, with cumulative errors below 4.1% and 2.1%, respectively. Results confirm the approach’s effectiveness across vehicles and models. Uncertainty analysis revealed greater variability in EV and HEV datasets than ICE, due to complex power management, emphasizing the need for robust models for advanced powertrains.

[759] BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models

Maozhen Zhang, Mengnan Zhao, Bo Wang

Main category: cs.LG

TL;DR: BadPromptFL is a backdoor attack targeting prompt-based federated learning in multimodal models, achieving high success rates without altering model parameters.

DetailsMotivation: The security risks of prompt-based aggregation in federated multimodal learning are unexplored, leaving vulnerabilities unaddressed.

Method: Compromised clients optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into global aggregation.

Result: BadPromptFL achieves >90% attack success rates with minimal visibility and limited client participation.

Conclusion: The attack raises concerns about the robustness of prompt-based federated learning in real-world deployments.

Abstract: Prompt-based tuning has emerged as a lightweight alternative to full fine-tuning in large vision-language models, enabling efficient adaptation via learned contextual prompts. This paradigm has recently been extended to federated learning settings (e.g., PromptFL), where clients collaboratively train prompts under data privacy constraints. However, the security implications of prompt-based aggregation in federated multimodal learning remain largely unexplored, leaving a critical attack surface unaddressed. In this paper, we introduce BadPromptFL, the first backdoor attack targeting prompt-based federated learning in multimodal contrastive models. In BadPromptFL, compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process. These prompts are then propagated to benign clients, enabling universal backdoor activation at inference without modifying model parameters. Leveraging the contextual learning behavior of CLIP-style architectures, BadPromptFL achieves high attack success rates (e.g., >90%) with minimal visibility and limited client participation. Extensive experiments across multiple datasets and aggregation protocols validate the effectiveness, stealth, and generalizability of our attack, raising critical concerns about the robustness of prompt-based federated learning in real-world deployments.

[760] On Understanding of the Dynamics of Model Capacity in Continual Learning

Supriyo Chakraborty, Krishnan Raghavan

Main category: cs.LG

TL;DR: The paper introduces CLEMC to address the stability-plasticity dilemma in continual learning, showing that effective capacity is non-stationary and diminishes with differing task distributions.

DetailsMotivation: To understand and quantify the dynamic behavior of the stability-plasticity balance in neural networks during continual learning.

Method: Developed a difference equation to model the interplay between NN, task data, and optimization, and conducted experiments across various architectures.

Result: Effective capacity is non-stationary; NN performance declines when task distributions differ from prior ones, regardless of architecture or optimization.

Conclusion: CLEMC provides a framework to analyze continual learning dynamics, highlighting inherent limitations in NN adaptability to new tasks.

Abstract: The stability-plasticity dilemma, closely related to a neural network’s (NN) capacity, i.e., its ability to represent tasks, is a fundamental challenge in continual learning (CL). Within this context, we introduce CL’s effective model capacity (CLEMC), which characterizes the dynamic behavior of the stability-plasticity balance point. We develop a difference equation to model the evolution of the interplay between the NN, task data, and optimization procedure. We then leverage CLEMC to demonstrate that the effective capacity, and by extension the stability-plasticity balance point, is inherently non-stationary. We show that regardless of the NN architecture or optimization method, a NN’s ability to represent new tasks diminishes when incoming task distributions differ from previous ones. We conduct extensive experiments to support our theoretical findings, spanning a range of architectures, from small feedforward and convolutional networks to medium-sized graph neural networks and transformer-based large language models with millions of parameters.

Yunqing Li, Zixiang Tang, Jiaying Zhuang, Zhenyu Yang, Farhad Ameri, Jianbang Zhang

Main category: cs.LG

TL;DR: PMGraph is a benchmark for supply-chain graphs, and C-MAG is a two-stage multimodal architecture improving link prediction between manufacturers and products.

DetailsMotivation: Traditional methods fail to capture the complexity of manufacturer profiles and supply-chain data, necessitating a more robust solution.

Method: C-MAG aligns and aggregates textual and visual attributes into group embeddings, then propagates them through a hetero-graph using multiscale message passing.

Result: The approach enhances link prediction accuracy and handles noisy real-world data effectively.

Conclusion: C-MAG provides a scalable and practical solution for multimodal supply-chain analysis.

Abstract: Connecting an ever-expanding catalogue of products with suitable manufacturers and suppliers is critical for resilient, efficient global supply chains, yet traditional methods struggle to capture complex capabilities, certifications, geographic constraints, and rich multimodal data of real-world manufacturer profiles. To address these gaps, we introduce PMGraph, a public benchmark of bipartite and heterogeneous multimodal supply-chain graphs linking 8,888 manufacturers, over 70k products, more than 110k manufacturer-product edges, and over 29k product images. Building on this benchmark, we propose the Cascade Multimodal Attributed Graph (C-MAG), a two-stage architecture that first aligns and aggregates textual and visual attributes into intermediate group embeddings, then propagates them through a manufacturer-product hetero-graph via multiscale message passing to enhance link prediction accuracy. C-MAG also provides practical guidelines for modality-aware fusion, preserving predictive performance in noisy, real-world settings.

[762] ELF: Efficient Logic Synthesis by Pruning Redundancy in Refactoring

Dimitris Tsaras, Xing Li, Lei Chen, Zhiyao Xie, Mingxuan Yuan

Main category: cs.LG

TL;DR: A classifier-based approach prunes unsuccessful cuts in logic optimization, achieving a 3.9x speedup over the state-of-the-art ABC implementation.

DetailsMotivation: High computational demands and inefficiency (98% failure rate) of conventional logic optimization operators like refactor motivate the need for a more efficient method.

Method: Leverages a classifier to preemptively prune unsuccessful cuts, eliminating unnecessary resynthesis operations.

Result: Experiments show a 3.9x average speedup in logic optimization compared to the ABC implementation.

Conclusion: The classifier-based approach significantly improves efficiency in logic optimization by reducing redundant computations.

Abstract: In electronic design automation, logic optimization operators play a crucial role in minimizing the gate count of logic circuits. However, their computational demands are high. Operators such as refactor conventionally form iterative cuts for each node, striving for a more compact representation - a task that fails 98% of the time on average. Prior research has sought to mitigate computational cost through parallelization. In contrast, our approach leverages a classifier to prune unsuccessful cuts preemptively, thus eliminating unnecessary resynthesis operations. Experiments on the refactor operator using the EPFL benchmark suite and 10 large industrial designs demonstrate that this technique can speed up logic optimization by 3.9x on average compared with the state-of-the-art ABC implementation.
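
The gating pattern reads roughly as below. The cut features, the stand-in logistic score, and the commented `resynthesize` call are all hypothetical, since the actual classifier is integrated inside ABC's refactor operator.

```python
# Sketch of the gating idea: score each candidate cut with a cheap classifier
# and only run the expensive resynthesis on cuts predicted to succeed. The
# feature set and classifier below are illustrative stand-ins.

def cut_features(cut):
    return [cut["num_leaves"], cut["volume"], cut["depth"]]

def predict_success(features, threshold=0.5):
    # Stand-in for a trained classifier (e.g., gradient-boosted trees).
    score = 1.0 / (1.0 + 2.718 ** -(0.8 * features[1] - features[0]))
    return score > threshold

def refactor(cuts):
    saved, attempted = 0, 0
    for cut in cuts:
        if not predict_success(cut_features(cut)):
            saved += 1           # skip: resynthesis predicted to fail (~98% of cuts)
            continue
        attempted += 1
        # resynthesize(cut)      # expensive call, only for promising cuts
    print(f"attempted {attempted}, pruned {saved}")

cuts = [{"num_leaves": n, "volume": v, "depth": 4} for n in (4, 6, 8) for v in (2, 9)]
refactor(cuts)
```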

[763] Symbolic Quantile Regression for the Interpretable Prediction of Conditional Quantiles

Cas Oude Hoekstra, Floris den Hengst

Main category: cs.LG

TL;DR: The paper introduces Symbolic Quantile Regression (SQR), extending Symbolic Regression (SR) to predict conditional quantiles, enhancing interpretability in high-stakes applications.

DetailsMotivation: Current SR methods focus on average outcomes, lacking understanding for other distribution points like medians or extremes, which are crucial for safety-critical domains.

Method: The study proposes SQR to predict conditional quantiles using SR, evaluated against transparent and black-box models.

Result: SQR outperforms transparent models and matches black-box performance while maintaining interpretability. A case study on airline fuel usage demonstrates its utility.

Conclusion: SQR effectively predicts conditional quantiles and provides insights into feature influences across quantiles, making it valuable for high-stakes applications.

Abstract: Symbolic Regression (SR) is a well-established framework for generating interpretable or white-box predictive models. Although SR has been successfully applied to create interpretable estimates of the average of the outcome, it is currently not well understood how it can be used to estimate the relationship between variables at other points in the distribution of the target variable. Such estimates of e.g. the median or an extreme value provide a fuller picture of how predictive variables affect the outcome and are necessary in high-stakes, safety-critical application domains. This study introduces Symbolic Quantile Regression (SQR), an approach to predict conditional quantiles with SR. In an extensive evaluation, we find that SQR outperforms transparent models and performs comparably to a strong black-box baseline without compromising transparency. We also show how SQR can be used to explain differences in the target distribution by comparing models that predict extreme and central outcomes in an airline fuel usage case study. We conclude that SQR is suitable for predicting conditional quantiles and understanding interesting feature influences at varying quantiles.
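
The essential change from standard SR is the fitness function: candidates are scored with the pinball (quantile) loss $L_\tau(y, \hat{y}) = \max(\tau(y - \hat{y}), (\tau - 1)(y - \hat{y}))$ rather than squared error. The toy search below only compares fixed candidate expressions; a real SR engine would evolve expression trees.

```python
import numpy as np

# Scoring candidate expressions with the pinball (quantile) loss, the key
# ingredient SQR adds on top of symbolic regression. The "search" here just
# ranks three fixed candidates; a real SR engine evolves expression trees.

def pinball(y, y_hat, tau):
    diff = y - y_hat
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 500)
y = 2 * x + rng.exponential(1.0, 500)        # skewed noise: mean != median

candidates = {"2*x":       lambda x: 2 * x,
              "2*x + 0.7": lambda x: 2 * x + 0.7,   # near the median offset
              "2*x + 2.3": lambda x: 2 * x + 2.3}   # near the 0.9 quantile

for tau in (0.5, 0.9):
    scores = {name: pinball(y, f(x), tau) for name, f in candidates.items()}
    print(f"tau={tau}: best expression = {min(scores, key=scores.get)}")
```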

[764] Fast and Generalizable parameter-embedded Neural Operators for Lithium-Ion Battery Simulation

Amir Ali Panahi, Daniel Luder, Billy Wu, Gregory Offer, Dirk Uwe Sauer, Weihan Li

Main category: cs.LG

TL;DR: The paper benchmarks three operator-learning surrogates for lithium-ion battery modeling, proposing a new PE-FNO that balances speed and accuracy, outperforming traditional methods.

DetailsMotivation: To achieve high-fidelity, real-time digital twins of lithium-ion batteries, addressing the need for speed and accuracy in dynamic load scenarios.

Method: Benchmarked DeepONets, FNOs, and proposed PE-FNO on simulated battery trajectories under various current loads and SOC ranges.

Result: PE-FNO achieved sub-millisecond speed, maintained low errors, and enabled generalization to varying parameters, outperforming other models.

Conclusion: PE-FNO offers a practical solution for high-speed, high-fidelity battery digital twins, suitable for real-time management and large-scale inference.

Abstract: Reliable digital twins of lithium-ion batteries must achieve high physical fidelity with sub-millisecond speed. In this work, we benchmark three operator-learning surrogates for the Single Particle Model (SPM): Deep Operator Networks (DeepONets), Fourier Neural Operators (FNOs) and a newly proposed parameter-embedded Fourier Neural Operator (PE-FNO), which conditions each spectral layer on particle radius and solid-phase diffusivity. Models are trained on simulated trajectories spanning four current families (constant, triangular, pulse-train, and Gaussian-random-field) and a full range of State-of-Charge (SOC) (0 % to 100 %). DeepONet accurately replicates constant-current behaviour but struggles with more dynamic loads. The basic FNO maintains mesh invariance and keeps concentration errors below 1 %, with voltage mean-absolute errors under 1.7 mV across all load types. Introducing parameter embedding marginally increases error, but enables generalisation to varying radii and diffusivities. PE-FNO executes approximately 200 times faster than a 16-thread SPM solver. Consequently, PE-FNO’s capabilities in inverse tasks are explored in a parameter estimation task with Bayesian optimisation, recovering anode and cathode diffusivities with 1.14 % and 8.4 % mean absolute percentage error, respectively, only 0.5918 percentage points higher than classical methods. These results pave the way for neural operators to meet the accuracy, speed and parametric flexibility demands of real-time battery management, design-of-experiments and large-scale inference. PE-FNO outperforms conventional neural surrogates, offering a practical path towards high-speed and high-fidelity electrochemical digital twins.

[765] Grid2Guide: A* Enabled Small Language Model for Indoor Navigation

Md. Wasiul Haque, Sagar Dasgupta, Mizanur Rahman

Main category: cs.LG

TL;DR: Grid2Guide combines A* search and a Small Language Model (SLM) for human-readable indoor navigation instructions without external signals.

DetailsMotivation: Addressing the challenge of reliable indoor navigation in environments lacking external positioning signals or infrastructure.

Method: Uses A* search on a binary occupancy matrix to compute optimal paths, then transforms steps into natural language via an SLM.

Result: Effective in producing accurate, timely navigation guidance across various indoor scenarios.

Conclusion: Validated as a lightweight, infrastructure-free solution for real-time indoor navigation.

Abstract: Reliable indoor navigation remains a significant challenge in complex environments, particularly where external positioning signals and dedicated infrastructures are unavailable. This research presents Grid2Guide, a hybrid navigation framework that combines the A* search algorithm with a Small Language Model (SLM) to generate clear, human-readable route instructions. The framework first constructs a binary occupancy matrix from a given indoor map. Using this matrix, the A* algorithm computes the optimal path between origin and destination, producing concise textual navigation steps. These steps are then transformed into natural language instructions by the SLM, enhancing interpretability for end users. Experimental evaluations across various indoor scenarios demonstrate the method’s effectiveness in producing accurate and timely navigation guidance. The results validate the proposed approach as a lightweight, infrastructure-free solution for real-time indoor navigation support.
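
The first stage is textbook A* over the binary occupancy matrix. A minimal version follows, with the SLM verbalization step reduced to compass-direction tokens that a language model could then rewrite as natural-language instructions.

```python
import heapq

# Minimal A* over a binary occupancy matrix (1 = blocked), mirroring the first
# stage of Grid2Guide; the SLM rewriting step is not reproduced here.

def astar(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_set = [(h(start), 0, start, [start])]
    seen = set()
    while open_set:
        _, g, pos, path = heapq.heappop(open_set)
        if pos == goal:
            return path
        if pos in seen:
            continue
        seen.add(pos)
        r, c = pos
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                heapq.heappush(open_set, (g + 1 + h((nr, nc)), g + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
# Compress moves into textual steps for a language model to verbalize.
moves = {(0, 1): "east", (0, -1): "west", (1, 0): "south", (-1, 0): "north"}
print([moves[(b[0] - a[0], b[1] - a[1])] for a, b in zip(path, path[1:])])
```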

[766] Vision-Based Localization and LLM-based Navigation for Indoor Environments

Keyan Rahimi, Md. Wasiul Haque, Sagar Dasgupta, Mizanur Rahman

Main category: cs.LG

TL;DR: The paper presents a vision-based indoor navigation system using ResNet-50 for localization and an LLM for navigation, achieving high accuracy and demonstrating scalability.

DetailsMotivation: Indoor navigation is challenging due to unreliable GPS and complex environments, necessitating infrastructure-free solutions.

Method: Combines ResNet-50 for vision-based localization with an LLM for navigation using floor plans.

Result: 96% localization accuracy; 75% navigation instruction accuracy with limitations in reasoning and speed.

Conclusion: The approach shows promise for scalable, infrastructure-free indoor navigation in resource-constrained settings.

Abstract: Indoor navigation remains a complex challenge due to the absence of reliable GPS signals and the architectural intricacies of large enclosed environments. This study presents an indoor localization and navigation approach that integrates vision-based localization with large language model (LLM)-based navigation. The localization system utilizes a ResNet-50 convolutional neural network fine-tuned through a two-stage process to identify the user’s position using smartphone camera input. To complement localization, the navigation module employs an LLM, guided by a carefully crafted system prompt, to interpret preprocessed floor plan images and generate step-by-step directions. Experimental evaluation was conducted in a realistic office corridor with repetitive features and limited visibility to test localization robustness. The model achieved high confidence and an accuracy of 96% across all tested waypoints, even under constrained viewing conditions and short-duration queries. Navigation tests using ChatGPT on real building floor maps yielded an average instruction accuracy of 75%, with observed limitations in zero-shot reasoning and inference time. This research demonstrates the potential for scalable, infrastructure-free indoor navigation using off-the-shelf cameras and publicly available floor plans, particularly in resource-constrained settings like hospitals, airports, and educational institutions.

[767] MemoryKT: An Integrative Memory-and-Forgetting Method for Knowledge Tracing

Mingrong Lin, Ke Deng, Zhengyang Wu, Zetao Zheng, Jie Li

Main category: cs.LG

TL;DR: The paper introduces memoryKT, a knowledge tracing model using a temporal variational autoencoder to simulate memory dynamics, outperforming existing methods.

DetailsMotivation: Existing knowledge tracing models overlook personalized forgetting patterns and other memory processes, limiting performance and interpretability.

Method: Proposes memoryKT, a three-stage model: (i) learning knowledge memory features, (ii) reconstructing exercise feedback, and (iii) embedding a personalized forgetting module.

Result: Outperforms state-of-the-art baselines on four public datasets.

Conclusion: memoryKT enhances perception of individual differences by modeling the full encoding-storage-retrieval cycle.

Abstract: Knowledge Tracing (KT) is committed to capturing students’ knowledge mastery from their historical interactions. Simulating students’ memory states is a promising approach to enhance both the performance and interpretability of knowledge tracing models. Memory consists of three fundamental processes: encoding, storage, and retrieval. Although forgetting primarily manifests during the storage stage, most existing studies rely on a single, undifferentiated forgetting mechanism, overlooking other memory processes as well as personalized forgetting patterns. To address this, this paper proposes memoryKT, a knowledge tracing model based on a novel temporal variational autoencoder. The model simulates memory dynamics through a three-stage process: (i) Learning the distribution of students’ knowledge memory features, (ii) Reconstructing their exercise feedback, while (iii) Embedding a personalized forgetting module within the temporal workflow to dynamically modulate memory storage strength. This jointly models the complete encoding-storage-retrieval cycle, significantly enhancing the model’s perception capability for individual differences. Extensive experiments on four public datasets demonstrate that our proposed approach significantly outperforms state-of-the-art baselines.

[768] NeuroDx-LM: A Clinical Large-Scale Model for EEG-based Neurological Disorder Detection

Guanghao Jin, Yuan Liang, Yihan Ma, Jingpei Wu, Guoyang Liu

Main category: cs.LG

TL;DR: NeuroDx-LM is a large-scale EEG model for neurological disorder detection, addressing challenges like limited labeled data and suboptimal performance with novel embedding and training strategies.

DetailsMotivation: To overcome the limitations of EEG-based large-scale models in clinical settings, such as scarce labeled data and performance issues.

Method: Introduces Selective Temporal-Frequency Embedding and Progressive Feature-Aware Training for adaptive EEG pattern capture and refined feature extraction.

Result: Achieves state-of-the-art performance on CHB-MIT and Schizophrenia datasets for seizure and schizophrenia detection.

Conclusion: Demonstrates EEG-based large-scale models’ potential for clinical applications, with code publicly available.

Abstract: Large-scale models pre-trained on Electroencephalography (EEG) have shown promise in clinical applications such as neurological disorder detection. However, the practical deployment of EEG-based large-scale models faces critical challenges such as limited labeled EEG data and suboptimal performance in clinical scenarios. To address these issues, we propose NeuroDx-LM, a novel large-scale model specifically designed for detecting EEG-based neurological disorders. Our key contributions include (i) a Selective Temporal-Frequency Embedding mechanism that adaptively captures complex temporal and spectral patterns in EEG signals; and (ii) a Progressive Feature-Aware Training strategy that refines feature representation in a two-stage process. In the first stage, our model learns the fundamental discriminative features of EEG activities; in the second stage, the model further extracts more specialized fine-grained features for accurate diagnostic performance. We evaluated NeuroDx-LM on the CHB-MIT and Schizophrenia datasets, achieving state-of-the-art performance in EEG-based seizure and schizophrenia detection, respectively. These results demonstrate the great potential of EEG-based large-scale models to advance clinical applicability. Our code is available at https://github.com/LetItBe12345/NeuroDx-LM.

[769] OFAL: An Oracle-Free Active Learning Framework

Hadi Khorsand, Vahid Pourahmadi

Main category: cs.LG

TL;DR: OFAL introduces an oracle-free active learning method using neural network uncertainty to generate informative samples, improving model accuracy without relying on an oracle.

DetailsMotivation: Labeling data with an oracle is costly and complex, especially with large unlabeled datasets. OFAL aims to bypass this by leveraging model uncertainty.

Method: OFAL quantifies uncertainty using Monte Carlo Dropouts and generates uncertain samples via a variational autoencoder, starting from confident samples. It integrates with other active learning methods.

Result: The method enhances model accuracy by generating informative samples without oracle dependency.

Conclusion: OFAL provides a viable oracle-free alternative for active learning, improving efficiency and reducing labeling costs.

Abstract: In the active learning paradigm, using an oracle to label data has always been a complex and expensive task, and with the emergence of large unlabeled data pools, it would be highly beneficial if we could achieve better results without relying on an oracle. This research introduces OFAL, an oracle-free active learning scheme that utilizes neural network uncertainty. OFAL uses the model’s own uncertainty to transform highly confident unlabeled samples into informative uncertain samples. First, we separate and quantify different parts of uncertainty and introduce Monte Carlo Dropout as an approximation of a Bayesian Neural Network. Secondly, by adding a variational autoencoder, we generate new uncertain samples by stepping toward the uncertain part of the latent space, starting from a confident seed sample. By generating these new informative samples, we can perform active learning and enhance the model’s accuracy. Lastly, we compare and integrate our method with other widely used active learning sampling methods.
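
The Monte Carlo Dropout machinery OFAL builds on can be sketched directly: keep dropout active at inference, average several stochastic forward passes, and read predictive entropy as uncertainty. The VAE latent-space stepping that turns confident seeds into uncertain samples is not reproduced here.

```python
import torch

# MC Dropout uncertainty: dropout stays ON at inference, several stochastic
# passes are averaged, and predictive entropy serves as the uncertainty score.

model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                            torch.nn.Dropout(p=0.3), torch.nn.Linear(64, 5))

def mc_dropout_uncertainty(x, passes=30):
    model.train()                          # keep dropout active at inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(passes)])
    mean_p = probs.mean(dim=0)             # approximate posterior predictive
    entropy = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_p, entropy

x = torch.randn(8, 20)
_, H = mc_dropout_uncertainty(x)
print(H)                                   # higher entropy = more uncertain sample
```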

[770] MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation

Pravallika Abbineni, Saoud Aldowaish, Colin Liechty, Soroosh Noorzad, Ali Ghazizadeh, Morteza Fayazi

Main category: cs.LG

TL;DR: MuaLLM is an open-source multimodal LLM agent for circuit design, combining RAG and adaptive vector databases for efficient, scalable assistance.

DetailsMotivation: Addressing challenges in circuit design literature review due to rapid research influx, inconsistent data, and complex optimization.

Method: Uses a hybrid RAG framework with adaptive vector databases, ReAct workflow for reasoning, and multimodal data processing.

Result: Achieves 90.1% recall on RAG-250 and 86.8% accuracy on Reas-100, with 10x lower cost and 1.6x speed.

Conclusion: MuaLLM offers scalable, efficient circuit design assistance, overcoming limitations of conventional LLMs.

Abstract: Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives make this task significantly challenging. In this paper, we propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal-setting, and multi-step information retrieval. It functions as a question-answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real-time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no-human-in-the-loop database generation, overcoming the bottleneck of simulation-based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom datasets: RAG-250, targeting retrieval and citation performance, and Reasoning-100 (Reas-100), focused on multistep reasoning in circuit design. MuaLLM achieves 90.1% recall on RAG-250, and 86.8% accuracy on Reas-100.

[771] FairFLRep: Fairness aware fault localization and repair of Deep Neural Networks

Moses Openja, Paolo Arcaini, Foutse Khomh, Fuyuki Ishikawa

Main category: cs.LG

TL;DR: FairFLRep is an automated technique to identify and fix bias-inducing neurons in DNNs, improving fairness without sacrificing accuracy.

DetailsMotivation: DNNs often reflect biases from training data, leading to unfair decisions. Addressing this bias efficiently is challenging.

Method: FairFLRep adjusts neuron weights linked to sensitive attributes (e.g., race, gender) by analyzing input-output relationships.

Result: FairFLRep outperforms existing methods in fairness improvement and efficiency, validated on multiple datasets.

Conclusion: FairFLRep effectively enhances fairness in DNNs while maintaining accuracy, proving its superiority over baseline approaches.

Abstract: Deep neural networks (DNNs) are being utilized in various aspects of our daily lives, including high-stakes decision-making applications that impact individuals. However, these systems reflect and amplify bias from the data used during training and testing, potentially resulting in biased behavior and inaccurate decisions. For instance, a model may have different misclassification rates between white and black sub-populations. Effectively and efficiently identifying and correcting biased behavior in DNNs remains a challenge. This paper introduces FairFLRep, an automated fairness-aware fault localization and repair technique that identifies and corrects potentially bias-inducing neurons in DNN classifiers. FairFLRep focuses on adjusting neuron weights associated with sensitive attributes, such as race or gender, that contribute to unfair decisions. By analyzing the input-output relationships within the network, FairFLRep corrects neurons responsible for disparities in predictive quality parity. We evaluate FairFLRep on four image classification datasets using two DNN classifiers, and on four tabular datasets with a DNN model. The results show that FairFLRep consistently outperforms existing methods in improving fairness while preserving accuracy. An ablation study confirms the importance of considering fairness during both the fault localization and repair stages. Our findings also show that FairFLRep is more efficient than the baseline approaches in repairing the network.

[772] Federated Learning for Epileptic Seizure Prediction Across Heterogeneous EEG Datasets

Cem Ata Baykara, Saurav Raj Pandey, Ali Burak Ünal, Harlin Lee, Mete Akgün

Main category: cs.LG

TL;DR: The paper explores Federated Learning (FL) for epileptic seizure prediction using EEG data from diverse datasets, addressing privacy and data heterogeneity. It proposes Random Subset Aggregation to improve fairness and performance, outperforming standard methods.

DetailsMotivation: To develop accurate and generalizable seizure prediction models across multiple clinical sites while respecting patient privacy and handling data heterogeneity.

Method: Uses FL with privacy-preserving global normalization and proposes Random Subset Aggregation, where clients train on fixed-size random subsets per round for balanced contributions.

Result: Random Subset Aggregation improves performance on under-represented clients (e.g., 81.7% accuracy on Helsinki) and achieves superior macro-average (77.1%) and pooled (80.0%) accuracy.

Conclusion: Balanced FL approaches like Random Subset Aggregation are effective for building robust and fair seizure prediction systems in heterogeneous multi-hospital settings.

Abstract: Developing accurate and generalizable epileptic seizure prediction models from electroencephalography (EEG) data across multiple clinical sites is hindered by patient privacy regulations and significant data heterogeneity (non-IID characteristics). Federated Learning (FL) offers a privacy-preserving framework for collaborative training, but standard aggregation methods like Federated Averaging (FedAvg) can be biased by dominant datasets in heterogeneous settings. This paper investigates FL for seizure prediction using a single EEG channel across four diverse public datasets (Siena, CHB-MIT, Helsinki, NCH), representing distinct patient populations (adult, pediatric, neonate) and recording conditions. We implement privacy-preserving global normalization and propose a Random Subset Aggregation strategy, where each client trains on a fixed-size random subset of its data per round, ensuring equal contribution during aggregation. Our results show that locally trained models fail to generalize across sites, and standard weighted FedAvg yields highly skewed performance (e.g., 89.0% accuracy on CHB-MIT but only 50.8% on Helsinki and 50.6% on NCH). In contrast, Random Subset Aggregation significantly improves performance on under-represented clients (accuracy increases to 81.7% on Helsinki and 68.7% on NCH) and achieves a superior macro-average accuracy of 77.1% and pooled accuracy of 80.0% across all sites, demonstrating a more robust and fair global model. This work highlights the potential of balanced FL approaches for building effective and generalizable seizure prediction systems in realistic, heterogeneous multi-hospital environments while respecting data privacy.
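
The aggregation rule itself fits in a few lines. In the sketch below, local training is reduced to a single gradient step on a fixed-size random subset, and the server averages updates with equal weights; the learning rate and subset size are illustrative choices.

```python
import numpy as np

# Sketch of Random Subset Aggregation: each round, every client trains on a
# fixed-size random subset of its local data, so contributions are balanced
# across sites, and the server averages with equal weights (unlike FedAvg's
# dataset-size weighting).

rng = np.random.default_rng(0)

def local_update(theta, X, y, subset_size, lr=0.05):
    idx = rng.choice(len(X), size=subset_size, replace=False)  # fixed-size subset
    Xs, ys = X[idx], y[idx]
    grad = 2 * Xs.T @ (Xs @ theta - ys) / subset_size
    return theta - lr * grad

# Heterogeneous clients: one large "dominant" site and two small ones.
clients = [(rng.standard_normal((n, 5)), rng.standard_normal(n))
           for n in (5000, 120, 80)]
subset_size = 64                       # same for every client, every round

theta = np.zeros(5)
for _ in range(100):
    updates = [local_update(theta, X, y, subset_size) for X, y in clients]
    theta = np.mean(updates, axis=0)   # equal-weight aggregation
print(np.round(theta, 3))
```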

[773] Neural Logic Networks for Interpretable Classification

Vincent Perreault, Katsumi Inoue, Richard Labib, Alain Hertz

Main category: cs.LG

TL;DR: The paper introduces Neural Logic Networks with NOT operations and biases for interpretable logical learning, improving Boolean network discovery and interpretability in classification tasks.

DetailsMotivation: To address the lack of interpretability in traditional neural networks by developing a logical and probabilistic model that can be inspected and verified.

Method: Generalizes Neural Logic Networks with NOT operations and biases, introduces a factorized IF-THEN rule structure, and proposes a modified learning algorithm.

Result: Achieves state-of-the-art performance in Boolean network discovery and learns interpretable rules, demonstrated in medical classification tasks.

Conclusion: The method enhances interpretability and performance in logical learning, proving valuable in fields like medicine where transparency is crucial.

Abstract: Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks on the other hand have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean networks discovery and is able to learn relevant, interpretable rules in tabular classification, notably on an example from the medical field where interpretability has tangible value.
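
One common way to make such gates differentiable, offered as a generic sketch rather than the paper's exact parameterization: product t-norms with learnable membership weights selecting which inputs each gate uses. The paper's biases and factorized IF-THEN rule structure are not reproduced.

```python
import torch

# Generic sketch of differentiable logic gates: product t-norms with learnable
# membership weights m in [0, 1] choosing which inputs each gate uses.

def soft_not(x):
    return 1.0 - x

def soft_and(x, m):            # AND over selected inputs: prod(1 - m*(1 - x))
    return torch.prod(1.0 - m * (1.0 - x), dim=-1)

def soft_or(x, m):             # OR via De Morgan's law: 1 - prod(1 - m*x)
    return 1.0 - torch.prod(1.0 - m * x, dim=-1)

x = torch.tensor([[1.0, 0.0, 1.0]])                 # a batch of fuzzy inputs
raw = torch.randn(2, 3, requires_grad=True)         # one membership row per rule
m = torch.sigmoid(raw)

rule1 = soft_and(x, m[0])                           # IF (selected literals) ...
rule2 = soft_and(soft_not(x), m[1])                 # ... with negated literals
out = soft_or(torch.stack([rule1, rule2], dim=-1), torch.ones(2))  # THEN output
out.sum().backward()                                # memberships are trainable
print(out, raw.grad is not None)
```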

[774] Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion

Nicole Lai-Tan, Xiao Gu, Marios G. Philiastides, Fani Deligianni

Main category: cs.LG

TL;DR: The paper proposes Individual Tangent Space Alignment (ITSA) to improve the generalizability of Brain-Computer Interfaces (BCIs) for personalized music-based motor rehabilitation, addressing EEG variability and movement artifacts.

DetailsMotivation: To overcome inter-subject variability in EEG signals and movement-induced artifacts that hinder BCI generalizability in motor rehabilitation.

Method: Introduces ITSA, a pre-alignment strategy with subject-specific recentering, distribution matching, and supervised rotational alignment. Combines Regularised Common Spatial Patterns (RCSP) with Riemannian geometry in parallel and sequential configurations.

Result: ITSA significantly improves performance across subjects and conditions, with parallel fusion outperforming sequential fusion. Robust performance is maintained under varying data conditions and electrode setups.

Conclusion: ITSA enhances cross-subject BCI generalizability for motor rehabilitation, with parallel fusion being the most effective. The code will be publicly available.

Abstract: Personalised music-based interventions offer a powerful means of supporting motor rehabilitation by dynamically tailoring auditory stimuli to provide external timekeeping cues, modulate affective states, and stabilise gait patterns. Generalisable Brain-Computer Interfaces (BCIs) thus hold promise for adapting these interventions across individuals. However, inter-subject variability in EEG signals, further compounded by movement-induced artefacts and motor planning differences, hinders the generalisability of BCIs and results in lengthy calibration processes. We propose Individual Tangent Space Alignment (ITSA), a novel pre-alignment strategy incorporating subject-specific recentering, distribution matching, and supervised rotational alignment to enhance cross-subject generalisation. Our hybrid architecture fuses Regularised Common Spatial Patterns (RCSP) with Riemannian geometry in parallel and sequential configurations, improving class separability while maintaining the geometric structure of covariance matrices for robust statistical computation. Using leave-one-subject-out cross-validation, ITSA demonstrates significant performance improvements across subjects and conditions. The parallel fusion approach shows the greatest enhancement over its sequential counterpart, with robust performance maintained across varying data conditions and electrode configurations. The code will be made publicly available at the time of publication.
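
The recentering step can be sketched as a whitening transform that maps each subject's mean covariance to the identity, giving all subjects a common reference point. The sketch uses the arithmetic mean for brevity where Riemannian pipelines typically use the geometric mean; the distribution-matching and supervised-rotation stages are omitted.

```python
import numpy as np

# Sketch of subject-level recentering: whiten each subject's trial covariance
# matrices by the inverse square root of that subject's mean covariance.

def inv_sqrtm(C):
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def recenter(covs):
    M = covs.mean(axis=0)                   # arithmetic mean for brevity
    W = inv_sqrtm(M)
    return np.array([W @ C @ W.T for C in covs])

rng = np.random.default_rng(0)
def random_spd(n=4):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

subject_covs = np.array([random_spd() for _ in range(20)])
aligned = recenter(subject_covs)
print(np.round(aligned.mean(axis=0), 2))    # approximately the identity matrix
```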

[775] Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

Main category: cs.LG

TL;DR: The paper explores how transformers learn symbolic multi-step reasoning tasks, focusing on path-finding in trees, and provides theoretical guarantees for their performance.

DetailsMotivation: To understand the mechanisms behind transformers' ability to perform multi-step reasoning tasks, particularly in chain-of-thought processes, and to bridge the gap in theoretical understanding.

Method: Analyzes backward and forward reasoning tasks in path-finding using one-layer transformers, grounded in gradient descent dynamics.

Result: Trained one-layer transformers can provably solve both tasks with generalization to unseen trees, demonstrating autonomous specialization and coordination of attention heads.

Conclusion: Shallow multi-head transformers can effectively solve complex reasoning tasks when structured with intermediate steps, offering insights into the emergence of reasoning abilities.

Abstract: Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understandings of the underlying mechanisms by which they acquire these abilities through training remain limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures.
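
The two symbolic tasks are easy to make concrete: with the tree stored as a child-to-parent map, backward reasoning emits the goal-to-root path directly, while forward reasoning produces that same path and then reverses it, which is the two-stage chain of thought the analysis studies. A toy sketch (the paper's token-level encoding is not reproduced):

```python
def backward_path(parent, goal):
    """Backward reasoning: emit the path from the goal node to the root."""
    path = [goal]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def forward_path(parent, goal):
    """Forward reasoning: find the goal-to-root path, then reverse it,
    mirroring the two-stage procedure analyzed in the paper."""
    return list(reversed(backward_path(parent, goal)))

parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}  # node -> parent
print(backward_path(parent, 4))  # [4, 1, 0]
print(forward_path(parent, 4))   # [0, 1, 4]
```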

[776] ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

Main category: cs.LG

TL;DR: ComPEFT compresses PEFT-based expert models using sparsification and ternary quantization, achieving high compression ratios (8x-50x) without retraining, while preserving or enhancing performance.

DetailsMotivation: Addressing the inefficiency of retrieving or serving large expert models over high-latency networks or limited GPU resources.

Method: Uses sparsification and ternary quantization to compress PEFT residuals (task vectors) without additional retraining.

Result: Achieves 8x-50x compression ratios, improves performance (e.g., 4.16% better than QLoRA on MMLU), and maintains few-shot generalization.

Conclusion: ComPEFT efficiently compresses expert models, enhances performance, and supports scalable deployment, making it practical for real-world applications.

Abstract: Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create “expert” models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or serve multiple experts on a single GPU. To address these issues, we present ComPEFT, a novel method for compressing fine-tuning residuals (task vectors) of PEFT-based models. ComPEFT employs sparsification and ternary quantization to reduce the size of the PEFT module without performing any additional retraining while preserving or enhancing model performance. In extensive evaluation across T5, T0, and LLaMA-based models with 200M - 65B parameters, ComPEFT achieves compression ratios of 8x - 50x. In particular, we show that ComPEFT improves with scale: stronger models exhibit higher compressibility and better performance. For example, we show that ComPEFT applied to LLaMA outperforms QLoRA by 4.16% on MMLU with a storage size reduction of up to 26x. In addition, we show that the compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged. Lastly, we provide an analysis of different method components, compare ComPEFT with other PEFT methods, and test ComPEFT’s efficacy for compressing the residual of full fine-tuning. Our code is available at https://github.com/prateeky2806/compeft.
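
The compression recipe itself fits in a few lines: keep only the largest-magnitude entries of the task vector, then replace the survivors with their signs times one shared scale. A hedged sketch, where the density and the scale estimator are illustrative assumptions rather than the paper's tuned choices:

```python
import torch

def compress_task_vector(task_vector: torch.Tensor, density: float = 0.05):
    """Sparsify + ternary-quantize a fine-tuning residual: the result is
    alpha * t with t in {-1, 0, +1}, so only a bitmask, signs, and one
    float need to be communicated."""
    flat = task_vector.flatten()
    k = max(1, int(density * flat.numel()))
    idx = flat.abs().topk(k).indices
    mask = torch.zeros_like(flat)
    mask[idx] = 1.0
    alpha = flat[idx].abs().mean()                # shared rescaling constant
    return (torch.sign(flat) * mask * alpha).reshape(task_vector.shape)

# The expert is reconstructed as w_base + compress_task_vector(w_expert - w_base).
```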

[777] A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni

Main category: cs.LG

TL;DR: A survey on MoErging methods, providing a taxonomy, comparing techniques, and discussing applications and tools.

DetailsMotivation: The rapid development of MoErging methods has made comparisons difficult due to varied experimental setups and lack of unified evaluation.

Method: The paper presents a comprehensive survey, including a taxonomy for categorizing MoErging methods, comparing them, and discussing applications and tools.

Result: A unified overview of MoErging methods is provided, clarifying design choices and suitable applications for each method.

Conclusion: The survey establishes a foundation for future research in MoErging, addressing gaps in comparison and evaluation.

Abstract: The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging has spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.

[778] On the Duality between Gradient Transformations and Adapters

Lucas Torroba-Hennigen, Hunter Lang, Han Guo, Yoon Kim

Main category: cs.LG

TL;DR: The paper explores memory-efficient neural network optimization using linear gradient transformations, linking gradient transformations to adapter-based reparameterizations and unifying existing methods.

DetailsMotivation: To reduce memory usage in training neural networks, especially language models, by transforming gradients into a lower-dimensional space.

Method: Uses linear gradient transformations to map gradients to a lower-dimensional space, updates parameters there, and maps back. Shows equivalence to adapter-based reparameterization.

Result: Demonstrates duality between gradient transformations and adapter-based methods, unifying approaches like GaLore and LoRA.

Conclusion: This duality offers a framework for improving training efficiency and memory use, suggesting new techniques.

Abstract: We study memory-efficient optimization of neural networks (in particular language models) with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map’s transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter’s parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
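
For plain gradient descent the duality is a one-line calculation: projecting the gradient, stepping in the low-dimensional space, and mapping back through the transpose moves the weights by exactly the same amount as optimizing a zero-initialized linear adapter w = w0 + Sᵀθ. A minimal numerical check on a toy quadratic loss (stateful optimizers such as Adam make the correspondence subtler):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, lr = 8, 3, 0.1
S = rng.normal(size=(k, n))          # linear gradient transformation
w0 = rng.normal(size=n)
grad = lambda w: 2 * (w - 1.0)       # gradient of ||w - 1||^2

# (a) transformed-gradient step: project, update, map back via S^T
w_a = w0 - lr * S.T @ (S @ grad(w0))

# (b) adapter view: w = w0 + S^T theta, optimize theta only
theta = np.zeros(k)
theta -= lr * S @ grad(w0 + S.T @ theta)  # chain rule: d/dtheta = S grad(w)
w_b = w0 + S.T @ theta

assert np.allclose(w_a, w_b)         # the two views take identical steps
```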

[779] Learning to Reason without External Rewards

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song

Main category: cs.LG

TL;DR: Intuitor, an RLIF method, uses self-certainty as a reward signal for unsupervised learning, matching GRPO’s performance and improving generalization without external rewards.

DetailsMotivation: To overcome the limitations of costly, domain-specific supervision in RLVR by leveraging intrinsic signals for learning.

Method: Intuitor replaces external rewards in GRPO with self-certainty scores, enabling fully unsupervised learning.

Result: Intuitor matches GRPO’s performance on math tasks and generalizes better to out-of-domain tasks like code generation.

Conclusion: Intrinsic signals like self-certainty can effectively drive learning, offering a scalable alternative to RLVR.

Abstract: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model’s own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO’s performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
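
One plausible instantiation of self-certainty scores each next-token distribution by its divergence from uniform and averages over the sequence; treat the exact formula below, KL(p‖U) = log V − H(p), as an assumption rather than the paper's precise estimator:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Intrinsic confidence proxy for a generated sequence.
    logits: (seq_len, vocab_size). Higher values mean the model's
    next-token distributions are more peaked (farther from uniform)."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)       # H(p) per position
    return (math.log(logits.shape[-1]) - entropy).mean()

# In GRPO, this scalar would replace the external reward of each sampled
# completion before group-relative advantages are computed.
```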

[780] Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

Tao Wu, Jingyuan Chen, Wang Lin, Mengze Li, Yumeng Zhu, Ang Li, Kun Kuang, Fei Wu

Main category: cs.LG

TL;DR: A training-free framework using cognitive prototypes and beam search improves student simulation accuracy by 100%, addressing LLMs’ limitations in mimicking diverse learning patterns.

DetailsMotivation: Current LLMs, trained as 'helpful assistants,' fail to simulate diverse student cognitive abilities due to overly advanced responses, leading to unrealistic simulations.

Method: Constructs cognitive prototypes from knowledge graphs, maps them to tasks, predicts performance, and refines solutions using beam search to replicate realistic mistakes.

Result: Achieves 100% improvement in simulation accuracy on the Student_100 dataset with 100 students and 5,000 learning records.

Conclusion: The proposed framework effectively simulates diverse student learning patterns, outperforming baseline models.

Abstract: Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as “helpful assistants”, aim to generate perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the Student_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving a 100% improvement in simulation accuracy.

[781] Efficient Contextual Preferential Bayesian Optimization with Historical Examples

Farha A. Khan, Tanmay Chakraborty, Jörg P. Dietrich, Christian Wirth

Main category: cs.LG

TL;DR: Proposes an offline, interpretable utility learning method to reduce expert involvement in multi-objective optimization by leveraging expert knowledge, historical examples, and coarse utility space info.

DetailsMotivation: Real-world problems involve implicit preferences hard to formalize, requiring costly expert input. The goal is to reduce this dependency.

Method: Uses expert knowledge, historical examples, and coarse utility space info to learn utility offline. Models uncertainty via Bayesian posterior and propagates it in optimization.

Result: Outperforms standard Gaussian processes and BOPE across four domains, even with biased samples and limited expert input.

Conclusion: The method effectively reduces expert involvement while maintaining strong performance in real-world scenarios.

Abstract: State-of-the-art multi-objective optimization often assumes a known utility function, learns it interactively, or computes the full Pareto front, each requiring costly expert input. Real-world problems, however, involve implicit preferences that are hard to formalize. To reduce expert involvement, we propose an offline, interpretable utility learning method that uses expert knowledge, historical examples, and coarse information about the utility space to reduce sample requirements. We model uncertainty via a full Bayesian posterior and propagate it throughout the optimization process. Our method outperforms standard Gaussian processes and BOPE across four domains, showing strong performance even with biased samples, as encountered in the real world, and limited expert input.

[782] Active Policy Improvement from Multiple Black-box Oracles

Xuefeng Liu, Takuma Yoneda, Chaoqi Wang, Matthew R. Walter, Yuxin Chen

Main category: cs.LG

TL;DR: MAPS and MAPS-SE are policy improvement algorithms for imitation learning from multiple suboptimal experts, offering sample efficiency and accelerated policy optimization.

DetailsMotivation: RL requires extensive exploration; imitation learning with multiple suboptimal experts addresses this but poses challenges in selecting which expert to imitate.

Method: MAPS actively selects oracles to imitate and improves value function estimates; MAPS-SE adds active state exploration.

Result: Theoretical and empirical results show MAPS and MAPS-SE outperform state-of-the-art algorithms in sample efficiency and policy optimization.

Conclusion: MAPS-SE accelerates policy optimization via state-wise imitation learning from multiple oracles, validated across control tasks.

Abstract: Reinforcement learning (RL) has made significant strides in various complex domains. However, identifying an effective policy via RL often necessitates extensive exploration. Imitation learning aims to mitigate this issue by using expert demonstrations to guide exploration. In real-world scenarios, one often has access to multiple suboptimal black-box experts, rather than a single optimal oracle. These experts do not universally outperform each other across all states, presenting a challenge in actively deciding which oracle to use and in which state. We introduce MAPS and MAPS-SE, a class of policy improvement algorithms that perform imitation learning from multiple suboptimal oracles. In particular, MAPS actively selects which of the oracles to imitate and improve their value function estimates, and MAPS-SE additionally leverages an active state exploration criterion to determine which states one should explore. We provide a comprehensive theoretical analysis and demonstrate that MAPS and MAPS-SE enjoy sample efficiency advantage over the state-of-the-art policy improvement algorithms. Empirical results show that MAPS-SE significantly accelerates policy optimization via state-wise imitation learning from multiple oracles across a broad spectrum of control tasks in the DeepMind Control Suite. Our code is publicly available at: https://github.com/ripl/maps.

[783] Probabilistic Optimality for Inference-time Scaling

Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei

Main category: cs.LG

TL;DR: The paper introduces OptScale, a probabilistic framework for efficient inference-time scaling of LLMs, reducing sampling overhead while maintaining performance.

DetailsMotivation: Existing heuristic approaches for parallel sampling in LLMs lack a principled foundation, prompting the need for a formalized method.

Method: The authors propose a probabilistic framework assuming i.i.d. parallel samples, derive a theoretical lower bound for sample efficiency, and develop OptScale, an algorithm that dynamically determines optimal sample counts.

Result: OptScale significantly reduces sampling overhead while matching or outperforming state-of-the-art reasoning performance on benchmarks like MATH-500, GSM8K, AIME, and AMC.

Conclusion: The work provides a theoretical and practical solution for efficient inference-time scaling in LLMs, addressing a critical deployment gap.

Abstract: Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling it to determine the minimal number of samples that satisfies predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while matching or exceeding state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning. The source code is publicly available at https://github.com/Albertwyk/OptScale.
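
Under the abstract's i.i.d. assumption, the flavor of the sample-count bound is easy to see in the idealized case of a perfect verifier: if a single sample solves the problem with probability p, then N samples succeed with probability 1 − (1 − p)^N, which can be inverted for the minimal N. The paper's actual bound additionally models the Best-of-N selection distribution, so this is a simplified special case:

```python
import math

def min_samples(p_correct: float, target_conf: float) -> int:
    """Smallest N such that P(at least one of N i.i.d. samples is correct)
    >= target_conf, assuming selection always picks a correct sample when
    one exists."""
    assert 0.0 < p_correct < 1.0 and 0.0 < target_conf < 1.0
    return math.ceil(math.log(1.0 - target_conf) / math.log(1.0 - p_correct))

print(min_samples(0.3, 0.95))  # 9 samples suffice at p = 0.3
```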

[784] Blending Imitation and Reinforcement Learning for Robust Policy Improvement

Xuefeng Liu, Takuma Yoneda, Rick L. Stevens, Matthew R. Walter, Yuxin Chen

Main category: cs.LG

TL;DR: RPI combines imitation learning (IL) and reinforcement learning (RL) to improve sample efficiency, transitioning from IL to RL as learning progresses. It outperforms existing methods.

DetailsMotivation: Sample complexity in RL limits its application, and IL depends on oracle quality. RPI aims to leverage both while mitigating their weaknesses.

Method: RPI interleaves IL and RL, using Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG) to decide between imitation or RL based on performance.

Result: RPI outperforms state-of-the-art methods in empirical evaluations and theoretical analysis.

Conclusion: RPI effectively combines IL and RL, improving sample efficiency and performance, validated by benchmarks.

Abstract: While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. To address this, we introduce Robust Policy Improvement (RPI), which actively interleaves between IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration, an aspect that is notably challenging in sparse-reward RL, particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles. Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason over whether to perform state-wise imitation from the oracles or learn from its own value function when the learner’s performance surpasses that of the oracles in a specific state. Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methodologies, demonstrating superior performance across various benchmark domains.

[785] Sparse Variational Student-t Processes

Jian Xu, Delu Zeng

Main category: cs.LG

TL;DR: The paper introduces a sparse representation framework for Student-t Processes to handle heavy-tailed data and outliers, reducing computational complexity using Bayesian methods and variational inference.

DetailsMotivation: Address the lack of sparse representation for Student-t Processes, making them more practical for real-world datasets with outliers.

Method: Leverage conditional distribution of Student-t Processes with sparse inducing points, using Bayesian methods and variational inference for optimization. Two approaches: Monte Carlo sampling and Jensen’s inequality for KL regularization.

Result: Proposed methods outperform baselines in computational efficiency and accuracy, showing robustness to outliers on synthetic and real-world datasets.

Conclusion: The sparse Student-t Process framework is a viable alternative to Gaussian Processes for outlier-prone or heavy-tailed data, with practical recommendations provided.

Abstract: The theory of Bayesian learning incorporates the use of Student-t Processes to model heavy-tailed distributions and datasets with outliers. However, despite Student-t Processes having computational complexity similar to that of Gaussian Processes, there has been limited emphasis on the sparse representation of this model. This is mainly due to the increased difficulty in modeling and computation compared to previous sparse Gaussian Processes. Our motivation is to address the need for a sparse representation framework that reduces computational complexity, allowing Student-t Processes to be more flexible for real-world datasets. To achieve this, we leverage the conditional distribution of Student-t Processes to introduce sparse inducing points. Bayesian methods and variational inference are then utilized to derive a well-defined lower bound, facilitating more efficient optimization of our model through stochastic gradient descent. We propose two methods for computing the variational lower bound, one utilizing Monte Carlo sampling and the other employing Jensen’s inequality to compute the KL regularization term in the loss function. We propose adopting these approaches as viable alternatives to Gaussian processes when the data might contain outliers or exhibit heavy-tailed behavior, and we provide specific recommendations for their applicability. We evaluate the two proposed approaches on various synthetic and real-world datasets from UCI and Kaggle, demonstrating their effectiveness compared to baseline methods in terms of computational complexity and accuracy, as well as their robustness to outliers.

[786] Fed-TGAN: Federated Learning Framework for Synthesizing Tabular Data

Zilong Zhao, Robert Birke, Aditya Kunar, Lydia Y. Chen

Main category: cs.LG

TL;DR: Fed-TGAN is the first federated learning framework for tabular GANs, addressing privacy and data skew challenges with novel encoding and weighting strategies.

DetailsMotivation: GANs for tabular data in federated learning (FL) are unexplored, and existing methods risk privacy due to required prior knowledge on data distribution.

Method: Fed-TGAN introduces privacy-preserving multi-source feature encoding and table similarity aware weighting for model aggregation.

Result: Fed-TGAN accelerates training by 200%, stabilizes loss, and improves data similarity compared to alternatives.

Conclusion: Fed-TGAN effectively trains tabular GANs in FL, balancing privacy and performance.

Abstract: Generative Adversarial Networks (GANs) are typically trained to synthesize data, from images and more recently tabular data, under the assumption of directly accessible training data. Recently, federated learning (FL) is an emerging paradigm that features decentralized learning on clients’ local data with a privacy-preserving capability. While learning GANs to synthesize images on FL systems has just been demonstrated, it is unknown if GANs for tabular data can be learned from decentralized data sources. Moreover, it remains unclear which distributed architecture suits them best. Different from image GANs, state-of-the-art tabular GANs require prior knowledge on the data distribution of each (discrete and continuous) column to agree on a common encoding – risking privacy guarantees. In this paper, we propose Fed-TGAN, the first Federated learning framework for Tabular GANs. To effectively learn a complex tabular GAN on non-identical participants, Fed-TGAN designs two novel features: (i) a privacy-preserving multi-source feature encoding for model initialization; and (ii) table similarity aware weighting strategies to aggregate local models for countering data skew. We extensively evaluate the proposed Fed-TGAN against variants of decentralized learning architectures on four widely used datasets. Results show that Fed-TGAN accelerates training time per epoch up to 200% compared to the alternative architectures, for both IID and Non-IID data. Overall, Fed-TGAN not only stabilizes the training loss, but also achieves better similarity between generated and original data. Our code is released at https://github.com/zhao-zilong/Fed-TGAN.
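
The similarity-aware aggregation can be pictured as federated averaging with data-dependent weights: clients whose local tables better match the shared feature encoding contribute more to the merged generator and discriminator. A sketch assuming per-client similarity scores are already computed (the paper's exact metric is not reproduced here):

```python
import numpy as np

def aggregate(client_weights, similarities):
    """Similarity-weighted model aggregation.
    client_weights: one list of layer arrays per client.
    similarities: nonnegative per-client table-similarity scores."""
    w = np.asarray(similarities, dtype=float)
    w = w / w.sum()                               # normalize to a convex combination
    return [sum(wi * layer for wi, layer in zip(w, layers))
            for layers in zip(*client_weights)]   # aggregate layer by layer
```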

[787] SOInter: A Novel Deep Energy Based Interpretation Method for Explaining Structured Output Models

S. Fatemeh Seyyedsalehi, Mahdieh Soleymani, Hamid R. Rabiee

Main category: cs.LG

TL;DR: A novel technique to explain structured output models by focusing on target outputs and leveraging correlations between variables, using an energy-based training process for the interpreter function.

DetailsMotivation: To improve the interpretability of structured output models by accounting for the complex relationships and correlations between output variables.

Method: Proposes an energy-based training process for an interpreter function that considers structural information in the model.

Result: Demonstrated effectiveness on simulated and real datasets.

Conclusion: The method enhances explanation performance by incorporating structural correlations in the model.

Abstract: We propose a novel interpretation technique to explain the behavior of structured output models, which learn mappings between an input vector to a set of output variables simultaneously. Because of the complex relationship between the computational path of output variables in structured models, a feature can affect the value of output through other ones. We focus on one of the outputs as the target and try to find the most important features utilized by the structured model to decide on the target in each locality of the input space. In this paper, we assume an arbitrary structured output model is available as a black box and argue how considering the correlations between output variables can improve the explanation performance. The goal is to train a function as an interpreter for the target output variable over the input space. We introduce an energy-based training process for the interpreter function, which effectively considers the structural information incorporated into the model to be explained. The effectiveness of the proposed method is confirmed using a variety of simulated and real data sets.

[788] On the Sample Efficiency of Abstractions and Potential-Based Reward Shaping in Reinforcement Learning

Giuseppe Canonaco, Leo Ardon, Alberto Pozanco, Daniel Borrajo

Main category: cs.LG

TL;DR: The paper explores Potential-Based Reward Shaping (PBRS) in Reinforcement Learning (RL), focusing on selecting the potential function and addressing finite horizon bias. It shows performance advantages using the optimal value function and evaluates PBRS in four environments.

DetailsMotivation: To address sample inefficiency in RL and the challenge of selecting the right potential function for PBRS, while also mitigating bias from finite horizons.

Method: Theoretical analysis of potential function selection as the optimal value function, investigation of finite horizon bias, and empirical evaluation using abstractions to approximate the optimal value function in four environments.

Result: Achieved comparable performance to CNN-based solutions with a simple fully-connected network in tested environments.

Conclusion: PBRS with the optimal value function as the potential function improves performance and sample efficiency, even with simpler architectures.

Abstract: The use of Potential-Based Reward Shaping (PBRS) has shown great promise in the ongoing research effort to tackle sample inefficiency in Reinforcement Learning (RL). However, choosing the right potential function remains an open challenge. Additionally, RL techniques are usually constrained to use a finite horizon for computational limitations, which introduces a bias when using PBRS. In this paper, we first build some theoretically-grounded intuition on why selecting the potential function as the optimal value function of the task at hand produces performance advantages. We then analyse the bias induced by finite horizons in the context of PBRS producing novel insights. Finally, leveraging abstractions as a way to approximate the optimal value function of the given task, we assess the sample efficiency and performance impact of PBRS on four environments including a goal-oriented navigation task and three Arcade Learning Environments (ALE) games. Remarkably, experimental results show that we can reach the same level of performance as CNN-based solutions with a simple fully-connected network.
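
The shaping rule itself is compact: the agent is trained on r' = r + γΦ(s') − Φ(s), and the paper's recommended choice makes the potential Φ approximate the optimal value function, here obtained from an abstraction of the task. A sketch, with the terminal-state convention spelled out because it is exactly where the finite-horizon bias enters:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).
    phi should approximate V* (e.g., a value function computed on an
    abstraction). The terminal potential is conventionally zeroed; under
    a truncated (finite) horizon this convention introduces the bias the
    paper analyzes."""
    next_potential = 0.0 if done else phi(s_next)
    return r + gamma * next_potential - phi(s)
```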

[789] AdaBoost is not an Optimal Weak to Strong Learner

Mikael Møller Høgsgaard, Kasper Green Larsen, Martin Ritzert

Main category: cs.LG

TL;DR: AdaBoost and its variants are sub-optimal in sample complexity by at least a logarithmic factor compared to the optimal weak-to-strong learner.

DetailsMotivation: To determine if AdaBoost, a classic boosting algorithm, optimally uses training samples for achieving high accuracy.

Method: Analyzing the sample complexity of AdaBoost and its variations compared to the provably optimal weak-to-strong learner.

Result: AdaBoost and its variants are sub-optimal by at least one logarithmic factor in the desired accuracy.

Conclusion: AdaBoost does not make optimal use of training samples, highlighting a gap between classic and optimal boosting methods.

Abstract: AdaBoost is a classic boosting algorithm for combining multiple inaccurate classifiers produced by a weak learner, to produce a strong learner with arbitrarily high accuracy when given enough training data. Determining the optimal number of samples necessary to obtain a given accuracy of the strong learner, is a basic learning theoretic question. Larsen and Ritzert (NeurIPS'22) recently presented the first provably optimal weak-to-strong learner. However, their algorithm is somewhat complicated and it remains an intriguing question whether the prototypical boosting algorithm AdaBoost also makes optimal use of training samples. In this work, we answer this question in the negative. Concretely, we show that the sample complexity of AdaBoost, and other classic variations thereof, are sub-optimal by at least one logarithmic factor in the desired accuracy of the strong learner.

[790] Runtime Monitoring and Enforcement of Conditional Fairness in Generative AIs

Chih-Hong Cheng, Changshun Wu, Xingyu Zhao, Saddek Bensalem, Harald Ruess

Main category: cs.LG

TL;DR: The paper addresses fairness concerns in generative AI (GenAI) by introducing conditional fairness tailored to context, defining two fairness levels, and proposing a worst-case bounding approach with prompt injection for enforcement.

DetailsMotivation: To tackle fairness issues in GenAI, which differ from standard AI due to its broad functionality, requiring context-specific fairness measures.

Method: Defines two fairness levels, bounds worst-case unfairness, and uses combinatorial testing and prompt injection in an agent-based framework.

Result: Develops a method to enforce conditional fairness with minimal intervention, validated on state-of-the-art GenAI systems.

Conclusion: The approach effectively addresses fairness in GenAI by focusing on worst-case scenarios and context-specific enforcement.

Abstract: The deployment of generative AI (GenAI) models raises significant fairness concerns, addressed in this paper through novel characterization and enforcement techniques specific to GenAI. Unlike standard AI performing specific tasks, GenAI’s broad functionality requires “conditional fairness” tailored to the context being generated, such as demographic fairness in generating images of poor people versus successful business leaders. We define two fairness levels: the first evaluates fairness in generated outputs, independent of prompts and models; the second assesses inherent fairness with neutral prompts. Given the complexity of GenAI and challenges in fairness specifications, we focus on bounding the worst case, considering a GenAI system unfair if the distance between appearances of a specific group exceeds preset thresholds. We also explore combinatorial testing for assessing relative completeness in intersectional fairness. By bounding the worst case, we develop a prompt injection scheme within an agent-based framework to enforce conditional fairness with minimal intervention, validated on state-of-the-art GenAI systems.
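
The worst-case bounding view suggests a simple runtime monitor: track group frequencies over a window of recent generations and flag the system once the spread exceeds the preset threshold, at which point the agent framework injects a corrective instruction into subsequent prompts. A sketch, assuming group labels are supplied by an external classifier:

```python
from collections import Counter

def exceeds_fairness_bound(group_labels, threshold=0.2):
    """Return True when the gap between the most and least frequent
    demographic group in recent outputs exceeds the preset threshold,
    signaling that prompt injection should be triggered.
    (Groups that never appear at all would need explicit handling.)"""
    counts = Counter(group_labels)
    freqs = [c / len(group_labels) for c in counts.values()]
    return max(freqs) - min(freqs) > threshold

print(exceeds_fairness_bound(["A"] * 9 + ["B"]))  # True -> intervene
```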

[791] Optimal Multi-Distribution Learning

Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S. Du, Jason D. Lee

Main category: cs.LG

TL;DR: The paper introduces a novel algorithm for multi-distribution learning (MDL) that matches the best-known lower bounds on sample complexity, addressing gaps in current bounds and resolving open problems from COLT 2023.

DetailsMotivation: The need for robust, fair, and collaborative learning across diverse data distributions drives the study of MDL, with adaptive sampling being key to data efficiency.

Method: The authors propose an oracle-efficient algorithm for VC dimension hypothesis classes, achieving optimal sample complexity, and extend it to Rademacher classes.

Result: The algorithm achieves ε-optimal randomized hypotheses with sample complexity (d+k)/ε², matching lower bounds, and demonstrates the necessity of randomization.

Conclusion: The work resolves open problems in MDL, providing efficient algorithms and theoretical insights into the role of randomization.

Abstract: Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory are further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient, accessing the hypothesis class solely through an empirical risk minimization oracle. Additionally, we establish the necessity of randomization, revealing a large sample size barrier when only deterministic hypotheses are permitted. These findings resolve three open problems presented in COLT 2023 (i.e., Problems 1, 3 and 4 of Awasthi et al., 2023).

[792] From Spikes to Heavy Tails: Unveiling the Spectral Evolution of Neural Networks

Vignesh Kothapalli, Tianyu Pang, Shenyang Deng, Zongmin Liu, Yaoqing Yang

Main category: cs.LG

TL;DR: The paper explores the emergence of heavy-tailed spectral density in neural network weights, linking it to generalization, and analyzes its occurrence in a noise-free setting with large learning rates.

DetailsMotivation: To theoretically explain the heavy-tailed spectral density phenomenon in neural networks and understand its connection to generalization.

Method: A theory-informed setup for crafting heavy tails in the spectral density of two-layer neural networks, analyzing the phenomenon without gradient noise and incorporating large learning rates.

Result: Learning rates influence the Bulk+Spike and heavy-tailed shape of spectral densities early in training, aiding generalization in two-layer networks.

Conclusion: The study provides insights into the behavior of large-scale neural networks by simplifying the analysis of heavy-tailed spectral density emergence.

Abstract: Training strategies for modern deep neural networks (NNs) tend to induce a heavy-tailed (HT) empirical spectral density (ESD) in the layer weights. While previous efforts have shown that the HT phenomenon correlates with good generalization in large NNs, a theoretical explanation of its occurrence is still lacking. Especially, understanding the conditions which lead to this phenomenon can shed light on the interplay between generalization and weight spectra. Our work aims to bridge this gap by presenting a simple, rich setting to model the emergence of HT ESD. In particular, we present a theory-informed setup for ‘crafting’ heavy tails in the ESD of two-layer NNs and present a systematic analysis of the HT ESD emergence without any gradient noise. This is the first work to analyze a noise-free setting, and we also incorporate optimizer (GD/Adam) dependent (large) learning rates into the HT ESD analysis. Our results highlight the role of learning rates on the Bulk+Spike and HT shape of the ESDs in the early phase of training, which can facilitate generalization in the two-layer NN. These observations shed light on the behavior of large-scale NNs, albeit in a much simpler setting.
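
The object under study is cheap to compute: the ESD is the eigenvalue distribution of a layer's weight correlation matrix, and "Bulk+Spike" versus heavy-tailed refers to how its largest eigenvalues separate from the bulk. A sketch using one common normalization convention:

```python
import numpy as np

def empirical_spectral_density(W):
    """Eigenvalues of the correlation matrix W^T W / n for one layer.
    A Bulk+Spike ESD shows a random-matrix-like bulk plus a few isolated
    outliers; a heavy-tailed ESD decays like a power law instead."""
    n = W.shape[0]                     # one common normalization convention
    return np.sort(np.linalg.eigvalsh(W.T @ W / n))[::-1]
```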

[793] Monte Carlo with kernel-based Gibbs measures: Guarantees for probabilistic herding

Martin Rouault, Rémi Bardenet, Mylène Maïda

Main category: cs.LG

TL;DR: The paper explores kernel herding, a deterministic quadrature method, and proves it outperforms i.i.d. Monte Carlo in worst-case integration error, suggesting faster convergence rates.

DetailsMotivation: The study aims to provide theoretical support for the observed faster convergence rates of kernel herding in RKHS, which lacks rigorous proof despite practical success.

Method: The authors analyze a joint probability distribution over quadrature nodes, comparing its worst-case error concentration to i.i.d. Monte Carlo.

Result: The paper proves tighter concentration bounds for kernel herding, indicating improved performance over i.i.d. methods.

Conclusion: The findings suggest kernel herding’s potential for faster convergence, supported by Gibbs measures, and highlight computational challenges for practical applications.

Abstract: Kernel herding belongs to a family of deterministic quadratures that seek to minimize the worst-case integration error over a reproducing kernel Hilbert space (RKHS). These quadrature rules come with strong experimental evidence that this worst-case error decreases at a faster rate than the standard square root of the number of quadrature nodes. This conjectured fast rate is key for integrating expensive-to-evaluate functions, as in Bayesian inference of expensive models, and makes up for the increased computational cost of sampling, compared to i.i.d. or MCMC quadratures. However, there is little theoretical support for this faster-than-square-root rate, at least in the usual case where the RKHS is infinite-dimensional, while recent progress on distribution compression suggests that results on the direct minimization of worst-case integration are possible. In this paper, we study a joint probability distribution over quadrature nodes, whose support tends to minimize the same worst-case error as kernel herding. Our main contribution is to prove that it does outperform i.i.d. Monte Carlo, in the sense of coming with a tighter concentration inequality on the worst-case integration error. This first step towards proving a fast error decay demonstrates that the mathematical toolbox developed around Gibbs measures can help understand to what extent kernel herding and its variants improve on computationally cheaper methods. Moreover, we investigate the computational bottlenecks of approximately sampling our quadrature, and we demonstrate on toy examples that a faster rate of convergence, though not worst-case, is likely.
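
For context, the deterministic rule that the studied Gibbs measure relaxes: kernel herding greedily picks each new node to maximize the kernel mean embedding minus the average similarity to nodes already chosen. A finite-candidate-set sketch, with the mean embedding values assumed precomputed:

```python
import numpy as np

def kernel_herding(candidates, mu, n_nodes, kernel):
    """Greedy herding over a finite candidate set.
    mu[i] ~ E_{X~p}[k(candidates[i], X)]: kernel mean embedding at each
    candidate, assumed precomputed for the target distribution p."""
    nodes = []
    for t in range(n_nodes):
        penalty = (sum(kernel(candidates, x) for x in nodes) / (t + 1)
                   if nodes else 0.0)
        nodes.append(candidates[np.argmax(mu - penalty)])
    return np.array(nodes)

rbf = lambda X, x: np.exp(-0.5 * np.sum((X - x) ** 2, axis=-1))  # example kernel
```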

[794] MS-IMAP – A Multi-Scale Graph Embedding Approach for Interpretable Manifold Learning

Shay Deutsch, Lionel Yelibi, Alex Tong Lin, Arjun Ravi Kannan

Main category: cs.LG

TL;DR: A framework for multi-scale graph network embedding using spectral graph wavelets and contrastive learning, offering flexibility and feature importance insights.

DetailsMotivation: To derive meaningful representations from high-dimensional data in unsupervised settings, addressing limitations of traditional operators like the Laplacian.

Method: Uses spectral graph wavelets in a contrastive learning framework to create embeddings with controlled smoothness and feature correspondence.

Result: Validated on public datasets, showing effectiveness in tasks like clustering and unsupervised feature importance.

Conclusion: The framework provides flexible, interpretable embeddings with proven utility in diverse applications.

Abstract: Deriving meaningful representations from complex, high-dimensional data in unsupervised settings is crucial across diverse machine learning applications. This paper introduces a framework for multi-scale graph network embedding based on spectral graph wavelets that employs a contrastive learning approach. We theoretically show that in Paley-Wiener spaces on combinatorial graphs, the spectral graph wavelets operator provides greater flexibility and control over smoothness compared to the Laplacian operator, motivating our approach. A key advantage of the proposed embedding is its ability to establish a correspondence between the embedding and input feature spaces, enabling the derivation of feature importance. We validate the effectiveness of our graph embedding framework on multiple public datasets across various downstream tasks, including clustering and unsupervised feature importance.

[795] Reward-Directed Score-Based Diffusion Models via q-Learning

Xuefeng Gao, Jiale Zha, Xun Yu Zhou

Main category: cs.LG

TL;DR: A novel RL approach for training continuous-time score-based diffusion models in generative AI, focusing on reward maximization while staying close to target data distributions, without relying on pretrained models or score function learning.

DetailsMotivation: To address limitations of existing methods that depend on pretrained models or score function learning, proposing a more flexible and efficient RL-based solution.

Method: Formulates the problem as entropy-regularized continuous-time RL, deriving optimal Gaussian policies and developing an actor-critic q-learning algorithm with noisy score function observations.

Result: Demonstrates effectiveness through comparisons with state-of-the-art RL methods on high-dimensional generative tasks like image generation.

Conclusion: The approach is versatile, extending to probability flow ODE and conditional diffusion models, offering a robust alternative to existing techniques.

Abstract: We propose a new reinforcement learning (RL) formulation for training continuous-time score-based diffusion models for generative AI to generate samples that maximize reward functions while keeping the generated distributions close to the unknown target data distributions. Different from most existing studies, ours does not involve any pretrained model for the unknown score functions of the noise-perturbed data distributions, nor does it attempt to learn the score functions. Instead, we formulate the problem as entropy-regularized continuous-time RL and show that the optimal stochastic policy has a Gaussian distribution with a known covariance matrix. Based on this result, we parameterize the mean of Gaussian policies and develop an actor–critic type (little) q-learning algorithm to solve the RL problem. A key ingredient in our algorithm design is to obtain noisy observations from the unknown score function via a ratio estimator. Our formulation can also be adapted to solve pure score-matching and fine-tuning pretrained models. Numerically, we show the effectiveness of our approach by comparing its performance with two state-of-the-art RL methods that fine-tune pretrained models on several generative tasks including high-dimensional image generations. Finally, we discuss extensions of our RL formulation to probability flow ODE implementation of diffusion models and to conditional diffusion models.

[796] MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection

Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania

Main category: cs.LG

TL;DR: MaCP is a lightweight adaptation method using cosine projection for efficient fine-tuning of large models, improving accuracy and reducing resource usage.

DetailsMotivation: To enhance model efficiency and accuracy with minimal parameters and memory for fine-tuning large foundation models.

Method: Projects weight changes into discrete cosine space, partitions them by frequency, and selects critical components.

Result: Outperforms alternatives in accuracy, computational complexity, and memory usage across various tasks.

Conclusion: MaCP is a highly effective, resource-efficient method for adapting large models.

Abstract: We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition’s most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.
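
The core mechanism is a transform-domain projection: move the low-rank weight update into the discrete cosine domain, partition coefficients into frequency bands, and retain only the most energetic components of each band. A sketch in which the band geometry and the per-band selection rule are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dctn, idctn

def cosine_project(delta_w, keep_frac=0.01, n_bands=3):
    """Keep the largest-magnitude DCT coefficients of a weight update,
    selected separately within coarse frequency bands."""
    C = dctn(delta_w, norm="ortho")
    freq = sum(np.indices(C.shape))               # 'frequency' = index sum
    edges = np.linspace(freq.min(), freq.max() + 1, n_bands + 1)
    mask = np.zeros(C.shape, dtype=bool)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = (freq >= lo) & (freq < hi)
        vals = np.abs(C[band])
        if vals.size:
            k = max(1, int(keep_frac * vals.size))
            thresh = np.partition(vals, -k)[-k]
            mask |= band & (np.abs(C) >= thresh)
    return idctn(np.where(mask, C, 0.0), norm="ortho")
```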

[797] UoMo: A Universal Model of Mobile Traffic Forecasting for Wireless Network Optimization

Haoye Chai, Shiyuan Zhang, Xiaoqian Qi, Baohua Qiu, Yong Li

Main category: cs.LG

TL;DR: The paper introduces FoMo, a foundation model for mobile traffic forecasting, combining diffusion models and transformers to handle diverse tasks across cities, outperforming existing models.

DetailsMotivation: Existing mobile traffic forecasting models are task-specific and lack generalization across diverse tasks and urban environments, limiting their effectiveness.

Method: FoMo integrates diffusion models and transformers, uses spatio-temporal masks for task-specific learning, and employs contrastive learning to link traffic and urban contexts.

Result: FoMo outperforms current models in diverse forecasting tasks and zero/few-shot learning, demonstrating strong universality across 9 real-world datasets.

Conclusion: FoMo is a versatile and effective foundation model for mobile traffic forecasting, enhancing generalization and performance across various tasks and environments.

Abstract: Mobile traffic forecasting allows operators to anticipate network dynamics and performance in advance, offering substantial potential for enhancing service quality and improving user experience. However, existing models are often task-oriented and are trained with tailored data, which limits their effectiveness in diverse mobile network tasks such as Base Station (BS) deployment, resource allocation, and energy optimization, and hinders generalization across different urban environments. Foundation models have made remarkable strides across various domains of NLP and CV due to their multi-task adaptation and zero/few-shot learning capabilities. In this paper, we propose an innovative Foundation model for Mobile traffic forecasting (FoMo), aiming to handle diverse forecasting tasks of short/long-term predictions and distribution generation across multiple cities to support network planning and optimization. FoMo combines diffusion models and transformers, where various spatio-temporal masks are proposed to enable FoMo to learn intrinsic features of different tasks, and a contrastive learning strategy is developed to capture the correlations between mobile traffic and urban contexts, thereby improving its transfer learning capability. Extensive experiments on 9 real-world datasets demonstrate that FoMo outperforms current models concerning diverse forecasting tasks and zero/few-shot learning, showcasing a strong universality.

[798] ADAM-SINDy: An Efficient Optimization Framework for Parameterized Nonlinear Dynamical System Identification

Siva Viknesh, Younes Tatari, Chase Christenson, Amirhossein Arzani

Main category: cs.LG

TL;DR: ADAM-SINDy is a novel method combining SINDy and ADAM optimization to improve nonlinear dynamical system identification by optimizing parameters and coefficients simultaneously.

DetailsMotivation: Traditional methods like SINDy and symbolic regression have limitations in handling nonlinear parameters, prompting the need for a more adaptive approach.

Method: ADAM-SINDy integrates ADAM optimization within SINDy to dynamically adjust unknown variables, reducing sensitivity to candidate function libraries.

Result: The method shows significant improvements in identifying parameterized dynamical systems across various benchmarks.

Conclusion: ADAM-SINDy enhances the SINDy framework’s applicability for complex dynamical system identification challenges.

Abstract: Identifying dynamical systems characterized by nonlinear parameters presents significant challenges in deriving mathematical models that enhance understanding of physics. Traditional methods, such as Sparse Identification of Nonlinear Dynamics (SINDy) and symbolic regression, can extract governing equations from observational data; however, they also come with distinct advantages and disadvantages. This paper introduces a novel method within the SINDy framework, termed ADAM-SINDy, which synthesizes the strengths of established approaches by employing the ADAM optimization algorithm. This facilitates the simultaneous optimization of nonlinear parameters and coefficients associated with nonlinear candidate functions, enabling precise parameter estimation without requiring prior knowledge of nonlinear characteristics such as trigonometric frequencies, exponential bandwidths, or polynomial exponents, thereby addressing a key limitation of SINDy. Through an integrated global optimization, ADAM-SINDy dynamically adjusts all unknown variables in response to data, resulting in an adaptive identification procedure that reduces the sensitivity to the library of candidate functions. The performance of the ADAM-SINDy methodology is demonstrated across a spectrum of dynamical systems, including benchmark coupled nonlinear ordinary differential equations such as oscillators, chaotic fluid flows, reaction kinetics, pharmacokinetics, as well as nonlinear partial differential equations (wildfire transport). The results demonstrate significant improvements in identifying parameterized dynamical systems and underscore the importance of concurrently optimizing all parameters, particularly those characterized by nonlinear parameters. These findings highlight the potential of ADAM-SINDy to extend the applicability of the SINDy framework in addressing more complex challenges in dynamical system identification.
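
The key departure from vanilla SINDy is that library-internal nonlinear parameters (such as a trigonometric frequency) are optimized jointly with the sparse coefficients by gradient descent. A minimal sketch with an illustrative three-term library and L1 penalty, both assumptions:

```python
import torch

def adam_sindy(x, dxdt, steps=2000, lam=1e-3):
    """Jointly fit sparse coefficients xi and a nonlinear library
    parameter omega so that Theta(x; omega) @ xi approximates dx/dt."""
    xi = torch.zeros(3, requires_grad=True)        # library coefficients
    omega = torch.tensor(1.0, requires_grad=True)  # nonlinear parameter
    opt = torch.optim.Adam([xi, omega], lr=1e-2)
    for _ in range(steps):
        theta = torch.stack([x, torch.sin(omega * x), x ** 2], dim=1)
        loss = ((theta @ xi - dxdt) ** 2).mean() + lam * xi.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return xi.detach(), omega.detach()
```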

[799] An information-matching approach to optimal experimental design and active learning

Yonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis, Alex M. Stankovic, Mingjian Wen, Ilia Nikiforov, Ellad B. Tadmor, Vasily V. Bulatov, Vincenzo Lordi, Mark K. Transtrum

Main category: cs.LG

TL;DR: The paper introduces an information-matching criterion to select optimal training data for models, ensuring efficient learning of relevant parameters for downstream predictions.

DetailsMotivation: Collecting high-quality training data is expensive and challenging, and many models contain unidentifiable parameters. The goal is to efficiently learn only the parameters needed for accurate predictions.

Method: An information-matching criterion based on the Fisher Information Matrix is used to select the most informative training data, formulated as a convex optimization problem.

Result: The method is effective across diverse fields, showing that a small set of optimal data suffices for precise predictions.

Conclusion: The approach is scalable and promising for active learning in large models, with broad applications in scientific fields.

Abstract: The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.

[800] sbi reloaded: a toolkit for simulation-based inference workflows

Jan Boelts, Michael Deistler, Manuel Gloeckler, Álvaro Tejero-Cantero, Jan-Matthis Lueckmann, Guy Moss, Peter Steinbach, Thomas Moreau, Fabio Muratore, Julia Linhart, Conor Durkan, Julius Vetter, Benjamin Kurt Miller, Maternus Herold, Abolfazl Ziaeemehr, Matthijs Pals, Theo Gruner, Sebastian Bischoff, Nastya Krouglova, Richard Gao, Janne K. Lappalainen, Bálint Mucsányi, Felix Pei, Auguste Schulz, Zinovia Stefanidi, Pedro Rodrigues, Cornelius Schröder, Faried Abu Zaid, Jonas Beck, Jaivardhan Kapoor, David S. Greenberg, Pedro J. Gonçalves, Jakob H. Macke

Main category: cs.LG

TL;DR: The paper discusses simulation-based inference (SBI) as a solution for tuning simulator parameters to match observed data without requiring likelihood evaluations. It introduces the sbi toolkit, a PyTorch-based package offering diverse SBI methods and tools for black-box simulators.

DetailsMotivation: The challenge of tuning simulator parameters to align with observed data motivates the need for SBI, which bypasses traditional Bayesian inference limitations.

Method: The sbi toolkit implements neural network-based SBI algorithms, enabling parameter inference without likelihood evaluations or gradients, and supports parallelization and amortization.

Result: The toolkit provides a flexible, state-of-the-art solution for SBI, allowing customization and application to black-box simulators.

Conclusion: The sbi toolkit advances SBI by making it accessible and adaptable, facilitating better alignment of simulations with empirical data.

Abstract: Scientists and engineers use simulators to model empirically observed phenomena. However, tuning the parameters of a simulator to ensure its outputs match observed data presents a significant challenge. Simulation-based inference (SBI) addresses this by enabling Bayesian inference for simulators, identifying parameters that match observed data and align with prior knowledge. Unlike traditional Bayesian inference, SBI only needs access to simulations from the model and does not require evaluations of the likelihood function. In addition, SBI algorithms do not require gradients through the simulator, allow for massive parallelization of simulations, and can perform inference for different observations without further simulations or training, thereby amortizing inference. Over the past years, we have developed, maintained, and extended sbi, a PyTorch-based package that implements Bayesian SBI algorithms based on neural networks. The sbi toolkit implements a wide range of inference methods, neural network architectures, sampling methods, and diagnostic tools. In addition, it provides well-tested default settings, but also offers flexibility to fully customize every step of the simulation-based inference workflow. Taken together, the sbi toolkit enables scientists and engineers to apply state-of-the-art SBI methods to black-box simulators, opening up new possibilities for aligning simulations with empirically observed data.
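
A minimal usage sketch following the package's documented workflow (API names vary across versions; recent releases expose NPE alongside SNPE, so treat the exact imports as a snapshot):

```python
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

prior = BoxUniform(low=-2 * torch.ones(2), high=2 * torch.ones(2))

def simulator(theta: torch.Tensor) -> torch.Tensor:
    # Black-box simulator: noisy identity; sbi never needs its likelihood.
    return theta + 0.1 * torch.randn_like(theta)

theta = prior.sample((2000,))
x = simulator(theta)

inference = SNPE(prior=prior)
inference.append_simulations(theta, x).train()
posterior = inference.build_posterior()

x_o = torch.tensor([0.5, -0.3])               # an "observed" data point
samples = posterior.sample((1000,), x=x_o)    # amortized: no new simulations
print(samples.mean(dim=0))
```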

[801] Understanding and Mitigating Memorization in Generative Models via Sharpness of Probability Landscapes

Dongjae Jeon, Dueun Kim, Albert No

Main category: cs.LG

TL;DR: A geometric framework analyzes memorization in diffusion models using log probability density sharpness, validates a score-difference metric, and introduces a new metric for early-stage memorization detection with a mitigation strategy.

DetailsMotivation: To understand and quantify memorization in diffusion models, focusing on sharpness in log probability density and early-stage detection.

Method: Proposes two metrics: one validating a score-difference-based measure and another for initial-stage sharpness in latent diffusion models, plus a mitigation strategy using sharpness-aware regularization.

Result: Demonstrates effectiveness of the score-difference metric and introduces a novel metric for early memorization detection, along with a mitigation approach.

Conclusion: The framework provides tools to analyze and mitigate memorization in diffusion models, enhancing their reliability.

Abstract: In this paper, we introduce a geometric framework to analyze memorization in diffusion models through the sharpness of the log probability density. We mathematically justify a previously proposed score-difference-based memorization metric by demonstrating its effectiveness in quantifying sharpness. Additionally, we propose a novel memorization metric that captures sharpness at the initial stage of image generation in latent diffusion models, offering early insights into potential memorization. Leveraging this metric, we develop a mitigation strategy that optimizes the initial noise of the generation process using a sharpness-aware regularization term.

[802] $\ell_0$-Regularized Quadratic Surface Support Vector Machines

Ahmad Mousavi, Ramin Zandvakili

Main category: cs.LG

TL;DR: A sparse variant of kernel-free quadratic surface SVMs (QSVM) is proposed to address overfitting and interpretability issues by enforcing a cardinality constraint on model parameters. A penalty decomposition algorithm ensures computational efficiency and optimality.

DetailsMotivation: To overcome the challenges of overfitting and poor interpretability in kernel-free QSVMs due to the high number of parameters.

Method: Introduces a sparse QSVM with an ℓ0-norm constraint, solved via a penalty decomposition algorithm, accommodating hinge and quadratic loss functions.

Result: The method reduces overfitting while maintaining classification performance, with efficient subproblem solutions and proven convergence.

Conclusion: The proposed sparse QSVM is effective and computationally feasible, validated on real-world datasets.

Abstract: Kernel-free quadratic surface support vector machines have recently gained traction due to their flexibility in modeling nonlinear decision boundaries without relying on kernel functions. However, the introduction of a full quadratic classifier significantly increases the number of model parameters, scaling quadratically with data dimensionality, which often leads to overfitting and makes interpretation difficult. To address these challenges, we propose a sparse variant of the QSVM by enforcing a cardinality constraint on the model parameters. While enhancing generalization and promoting sparsity, leveraging the $\ell_0$-norm inevitably incurs additional computational complexity. To tackle this, we develop a penalty decomposition algorithm capable of producing solutions that provably satisfy the first-order Lu-Zhang optimality conditions. Our approach accommodates both hinge and quadratic loss functions. In both cases, we demonstrate that the subproblems arising within the algorithm either admit closed-form solutions or can be solved efficiently through dual formulations, which contributes to the method’s overall effectiveness. We also analyze the convergence behavior of the algorithm under both loss settings. Finally, we validate our approach on several real-world datasets, demonstrating its ability to reduce overfitting while maintaining strong classification performance. The complete implementation and experimental code are publicly available at https://github.com/raminzandvakili/L0-QSVM.
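
The sketch below is our simplified rendition of a penalty-decomposition loop for an $\ell_0$-constrained classifier over a full quadratic feature map: it alternates a gradient step on the penalized hinge loss with a hard-thresholding projection onto the cardinality ball. All shapes and step sizes are illustrative, not the paper's.

```python
import numpy as np

def quad_features(X):
    # [x, all pairwise products x_i x_j, i <= j]: a full quadratic map.
    n, d = X.shape
    cross = np.stack([X[:, i] * X[:, j] for i in range(d) for j in range(i, d)], 1)
    return np.hstack([X, cross])

def hard_threshold(v, k):
    # Projection onto {z : ||z||_0 <= k}: keep the k largest-magnitude entries.
    z = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    z[idx] = v[idx]
    return z

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] ** 2 - X[:, 1] + 0.1 * rng.normal(size=200))
Phi = quad_features(X)

k, rho, lr = 3, 1.0, 1e-2
w = np.zeros(Phi.shape[1]); z = w.copy()
for it in range(500):
    margin = y * (Phi @ w)
    grad = -(Phi * y[:, None])[margin < 1].sum(0) / len(y)  # hinge subgradient
    w -= lr * (grad + rho * (w - z))   # w-step: smooth penalized problem
    z = hard_threshold(w, k)           # z-step: projection onto the l0 ball
print("support:", np.flatnonzero(z))
```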

[803] Ehrenfeucht-Haussler Rank and Chain of Thought

Pablo Barceló, Alexander Kozachinskiy, Tomasz Steifer

Main category: cs.LG

TL;DR: The paper links the rank of Boolean functions to the number of Chain of Thought (CoT) steps in Transformers, providing tight bounds for specific problems and introducing multi-head rank for PAC-learnability analysis.

DetailsMotivation: To bridge the gap between PAC learning theory and Transformer architectures by characterizing Boolean function rank in terms of CoT steps.

Method: The study uses a single-layer Transformer with hard attention to compute Boolean functions, analyzing CoT steps for function composition and sequence problems.

Result: Shows that ℓ-fold function composition requires ℓ CoT steps, and identifying the k-th 1 in a sequence requires k steps. Introduces multi-head rank for broader analysis.

Conclusion: The work connects theoretical learning bounds with practical Transformer computations, offering insights into PAC-learnability for functions with bounded multi-head rank.

Abstract: The notion of \emph{rank} of a Boolean function has been a cornerstone in PAC learning theory, enabling quasipolynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function $f$ corresponds to the minimum number of \emph{Chain of Thought} (CoT) steps required by a single-layer Transformer with hard attention to compute $f$. Based on this characterization, we establish tight bounds on the number of CoT steps required for specific problems, showing that $\ell$-fold function composition necessitates exactly $\ell$ CoT steps. Furthermore, we analyze the problem of identifying the position of the $k$-th occurrence of 1 in a Boolean sequence, proving that it requires $k$ CoT steps. Finally, we introduce the notion of the multi-head rank that captures multi-head single-layer transformers, and analyze the PAC-learnability of the classes of functions with bounded multi-head rank.

[804] chebgreen: Learning and Interpolating Continuous Empirical Green’s Functions from Data

Harshwardhan Praveen, Jacob Brown, Christopher Earls

Main category: cs.LG

TL;DR: A data-driven library, chebgreen, models 1D systems with unknown PDEs using Empirical Green’s Functions via Rational Neural Networks and Chebyshev basis interpolation.

DetailsMotivation: To address the challenge of modeling systems with unknown governing PDEs and control parameters.

Method: Learns Empirical Green’s Functions as Rational Neural Networks, interpolates singular functions and values in a Chebyshev basis.

Result: Successfully uncovers Green’s functions for unseen control parameters by interpolation on a manifold of Quasimatrices.

Conclusion: chebgreen provides a mesh-independent, data-driven approach for modeling systems with hidden PDEs.

Abstract: In this work, we present a mesh-independent, data-driven library, chebgreen, to mathematically model one-dimensional systems, possessing an associated control parameter, and whose governing partial differential equation is unknown. The proposed method learns an Empirical Green’s Function for the associated, but hidden, boundary value problem, in the form of a Rational Neural Network from which we subsequently construct a bivariate representation in a Chebyshev basis. We uncover the Green’s function, at an unseen control parameter value, by interpolating the left and right singular functions within a suitable library, expressed as points on a manifold of Quasimatrices, while the associated singular values are interpolated with Lagrange polynomials.

[805] Covering Multiple Objectives with a Small Set of Solutions Using Bayesian Optimization

Natalie Maus, Kyurae Kim, Yimeng Zeng, Haydn Thomas Jones, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Jacob R. Gardner

Main category: cs.LG

TL;DR: The paper introduces MOCOBO, an algorithm for finding a small set of solutions that collectively cover multiple objectives in black-box optimization, with applications in drug design.

DetailsMotivation: The motivation stems from scenarios like drug design, where a small set of solutions (e.g., antibiotics) must cover multiple objectives (e.g., treating various pathogens).

Method: Proposes Multi-Objective Coverage Bayesian Optimization (MOCOBO) to efficiently find a covering set of K < T solutions for T objectives.

Result: MOCOBO achieves coverage comparable to optimizing each objective individually, and in vitro experiments confirm its effectiveness in drug discovery.

Conclusion: MOCOBO is a promising approach for multi-objective coverage problems, particularly in drug design, with demonstrated practical success.

Abstract: In multi-objective black-box optimization, the goal is typically to find solutions that optimize a set of $T$ black-box objective functions, $f_1$, …, $f_T$, simultaneously. Traditional approaches often seek a single Pareto-optimal set that balances trade-offs among all objectives. In this work, we consider a problem setting that departs from this paradigm: finding a small set of $K < T$ solutions that collectively “covers” the $T$ objectives. A set of solutions is defined as “covering” if, for each objective $f_1$, …, $f_T$, there is at least one good solution. A motivating example for this problem setting occurs in drug design. For example, we may have $T$ pathogens and aim to identify a set of $K < T$ antibiotics such that at least one antibiotic can be used to treat each pathogen. To address this problem, we propose Multi-Objective Coverage Bayesian Optimization (MOCOBO), a principled algorithm designed to efficiently find a covering set. We validate our approach through experiments on challenging high-dimensional tasks, including applications in peptide and molecular design, where MOCOBO is shown to find high-performing covering sets of solutions. The results show that the coverage of the $K < T$ solutions found by MOCOBO matches or nearly matches the coverage of $T$ solutions obtained by optimizing each objective individually. Furthermore, in in vitro experiments, the peptides found by MOCOBO exhibited high potency against drug-resistant pathogens, further demonstrating the potential of MOCOBO for drug discovery. We make code available here: https://github.com/nataliemaus/mocobo.
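
To illustrate the coverage objective (our toy example, not MOCOBO itself, which replaces the exhaustive scoring below with Bayesian optimization over expensive black boxes):

```python
import numpy as np

# coverage(S) = sum_t max_{s in S} f_t(s): every objective should have at
# least one good solution in the set S of size K < T.
rng = np.random.default_rng(0)
n_candidates, T, K = 100, 8, 3
F = rng.normal(size=(n_candidates, T))   # F[s, t] = f_t(candidate s)

def coverage(S):
    return F[S].max(axis=0).sum()        # best-in-set score per objective

S = []
for _ in range(K):                       # greedy set construction
    gains = [coverage(S + [s]) for s in range(n_candidates)]
    S.append(int(np.argmax(gains)))
print("covering set:", S, "coverage:", round(coverage(S), 3))
```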

[806] On the Emergence of Position Bias in Transformers

Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie

Main category: cs.LG

TL;DR: The paper introduces a graph-theoretic framework to analyze position bias in multi-layer attention, revealing how causal masking and positional encodings shape biases and their trade-offs.

DetailsMotivation: To comprehensively understand how attention masks and positional encodings contribute to position bias in transformers, which remains theoretically unclear despite observed phenomena like 'lost-in-the-middle' and attention sinks.

Method: Modeling attention masks as directed graphs to quantify token interactions based on sequential positions, analyzing the effects of causal masking and relative positional encodings (e.g., decay mask, RoPE).

Result: Causal masking biases attention toward earlier positions, while positional encodings introduce distance-based decay. Their combined effect creates a trade-off between long-term decay and early position importance, validated through experiments.

Conclusion: The framework provides a principled understanding of positional biases, clarifying the interplay of attention components and aiding better transformer design.

Abstract: Recent studies have revealed various manifestations of position bias in transformer architectures, from the “lost-in-the-middle” phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers – coupled with the causal mask – leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.

[807] Chaos into Order: Neural Framework for Expected Value Estimation of Stochastic Partial Differential Equations

Ísak Pétursson, María Óskarsdóttir

Main category: cs.LG

TL;DR: The paper introduces the Learned Expectation Collapser (LEC), a physics-informed neural framework to approximate the expected value of linear SPDE solutions without domain discretization, showing accuracy in lower dimensions and predictable performance decline in higher dimensions.

DetailsMotivation: SPDE solutions are analytically intractable and computationally expensive; LEC aims to provide a scalable, simulator-free approximation method.

Method: LEC uses randomized sampling of space-time coordinates and noise realizations to train feedforward neural networks, minimizing residual loss across stochastic samples.

Result: LEC accurately approximates expected SPDE solutions in lower dimensions, with predictable accuracy decline in higher dimensions and improved stability under increased sampling.

Conclusion: LEC demonstrates neural networks’ ability to learn statistical structure from stochastic operators, offering a pathway to scalable SPDE solvers.

Abstract: Stochastic partial differential equations (SPDEs) describe the evolution of random processes over space and time, but their solutions are often analytically intractable and computationally expensive to estimate. In this paper, we propose the Learned Expectation Collapser (LEC), a physics-informed neural framework designed to approximate the expected value of linear SPDE solutions without requiring domain discretization. By leveraging randomized sampling of both space-time coordinates and noise realizations during training, LEC trains standard feedforward neural networks to minimize residual loss across multiple stochastic samples. We hypothesize and empirically confirm that this training regime drives the network to converge toward the expected value of the solution of the SPDE. Using the stochastic heat equation as a testbed, we evaluate performance across a diverse set of 144 experimental configurations that span multiple spatial dimensions, noise models, and forcing functions. The results show that the model consistently learns accurate approximations of the expected value of the solution in lower dimensions and a predictable decrease in accuracy with increased spatial dimensions, with improved stability and robustness under increased Monte Carlo sampling. Our findings offer new insight into how neural networks implicitly learn statistical structure from stochastic differential operators and suggest a pathway toward scalable, simulator-free SPDE solvers.
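
A rough sketch of the training signal as we read the abstract: a plain MLP is trained on the residual of a stochastically forced heat equation with fresh noise each step, so minimizing the mean residual pushes the network toward the expected solution. Boundary and initial terms, and the paper's exact setup, are omitted.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
nu = 0.1

for step in range(2000):
    t = torch.rand(256, 1, requires_grad=True)
    x = torch.rand(256, 1, requires_grad=True)
    u = net(torch.cat([t, x], dim=1))
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    forcing = torch.sin(torch.pi * x)
    noise = torch.randn_like(x)          # zero-mean noise, resampled each step
    residual = u_t - nu * u_xx - forcing - noise
    loss = residual.pow(2).mean()        # + boundary/initial terms in practice
    opt.zero_grad()
    loss.backward()
    opt.step()
```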

[808] Active Learning of Model Discrepancy with Bayesian Experimental Design

Huchen Yang, Chuanqi Chen, Jin-Long Wu

Main category: cs.LG

TL;DR: The paper proposes an efficient method to iteratively learn model discrepancy in digital twins using sequential Bayesian experimental design (BED), validated by a convection-diffusion example.

DetailsMotivation: Model discrepancy in digital twins impacts performance, and data-driven approaches lack systematic data gathering. BED is hindered by model discrepancy.

Method: Sequential BED is used to iteratively learn model discrepancy, with an ensemble-based approximation for data informativeness.

Result: The method is efficient and robust for high-dimensional model discrepancy, compatible with classical and modern solvers.

Conclusion: The approach enhances active learning of model discrepancy, validated by numerical examples.

Abstract: Digital twins have been actively explored in many engineering applications, such as manufacturing and autonomous systems. However, model discrepancy is ubiquitous in most digital twin models and has significant impacts on the performance of using those models. In recent years, data-driven modeling techniques have shown promise in characterizing the model discrepancy in existing models, but the training data for learning the model discrepancy are often gathered empirically, and an active approach to gathering informative data can potentially benefit this learning. On the other hand, Bayesian experimental design (BED) provides a systematic approach to gathering the most informative data, but its performance is often negatively impacted by the model discrepancy. In this work, we build on sequential BED and propose an efficient approach to iteratively learn the model discrepancy based on the data from the BED. The performance of the proposed method is validated by a classical numerical example governed by a convection-diffusion equation, for which full BED is still feasible. The proposed method is then further studied in the same numerical example with a high-dimensional model discrepancy, which serves as a demonstration for scenarios where full BED is no longer practical. An ensemble-based approximation of information gain is further utilized to assess data informativeness and to enhance the learning of the model discrepancy. The results show that the proposed method is efficient and robust for the active learning of high-dimensional model discrepancy, using data suggested by the sequential BED. We also demonstrate that the proposed method is compatible with both classical numerical solvers and modern auto-differentiable solvers.

[809] Optimistic Interior Point Methods for Sequential Hypothesis Testing by Betting

Can Chen, Jun-Kun Wang

Main category: cs.LG

TL;DR: The paper introduces an interior-point method for nonparametric sequential hypothesis testing, enabling faster wealth accumulation and null rejection without gradient explosion risks.

DetailsMotivation: Existing methods like Online Newton Step (ONS) are conservative due to halved decision spaces, limiting rapid wealth accumulation for hypothesis testing.

Method: Proposes a novel strategy using interior-point methods to update across the entire decision space, avoiding gradient explosion while maintaining computational efficiency.

Result: Achieves faster null hypothesis rejection under alternative hypotheses while preserving statistical guarantees.

Conclusion: The new method outperforms ONS in speed and efficiency without sacrificing reliability.

Abstract: The technique of “testing by betting” frames nonparametric sequential hypothesis testing as a multiple-round game, where a player bets on future observations that arrive in a streaming fashion, accumulates wealth that quantifies evidence against the null hypothesis, and rejects the null once the wealth exceeds a specified threshold while controlling the false positive error. Designing an online learning algorithm that achieves a small regret in the game can help rapidly accumulate the bettor’s wealth, which in turn can shorten the time to reject the null hypothesis under the alternative $H_1$. However, many of the existing works employ the Online Newton Step (ONS) to update within a halved decision space to avoid a gradient explosion issue, which is potentially conservative for rapid wealth accumulation. In this paper, we introduce a novel strategy utilizing interior-point methods in optimization that allows updates across the entire interior of the decision space without the risk of gradient explosion. Our approach not only maintains strong statistical guarantees but also facilitates faster null hypothesis rejection, while being as computationally lightweight as ONS thanks to its closed-form updates.
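
A toy sketch of the betting mechanics with an interior-point-flavored update (our illustration; the paper's algorithm, decision space, and barrier are more carefully specified): the bettor's wealth multiplies by $1 + \lambda_t g_t$, and $\lambda$ is adjusted by a log-wealth gradient plus a log-barrier that keeps it strictly inside the full interval rather than a halved one.

```python
import numpy as np

rng = np.random.default_rng(0)
x_stream = rng.binomial(1, 0.7, size=2000)   # the alternative is true here

# H0: E[x] = 0.5, payoff g = x - 0.5 in [-0.5, 0.5]; lam constrained to (-2, 2).
wealth, lam, lr, mu = 1.0, 0.0, 0.5, 1e-3
threshold = 1.0 / 0.05                        # reject when wealth >= 1/alpha
for t, x in enumerate(x_stream, 1):
    g = x - 0.5
    wealth *= 1.0 + lam * g
    if wealth >= threshold:
        print(f"reject H0 at t={t}, wealth={wealth:.1f}")
        break
    # Gradient of log-wealth plus a log-barrier on the interval (-2, 2).
    grad = g / (1.0 + lam * g) + mu * (1.0 / (2.0 - lam) - 1.0 / (2.0 + lam))
    lam = float(np.clip(lam + lr * grad, -1.99, 1.99))  # clip as a safeguard
```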

[810] Active Advantage-Aligned Online Reinforcement Learning with Offline Data

Xuefeng Liu, Hung T. C. Le, Siyu Chen, Rick Stevens, Zhuoran Yang, Matthew R. Walter, Yuxin Chen

Main category: cs.LG

TL;DR: A3RL introduces a confidence-aware Active Advantage Aligned (A3) sampling strategy to dynamically prioritize data from online and offline sources, optimizing policy improvement in reinforcement learning.

DetailsMotivation: The paper addresses challenges in combining online and offline RL, such as catastrophic forgetting, data quality robustness, and sample efficiency, aiming to harness the strengths of both approaches.

Method: A3RL employs a novel A3 sampling strategy to dynamically prioritize data aligned with the policy’s needs from online and offline sources, supported by theoretical insights.

Result: Empirical experiments and ablation studies show A3RL outperforms competing online RL techniques leveraging offline data.

Conclusion: A3RL effectively integrates online and offline RL, offering improved policy optimization through dynamic data prioritization.

Abstract: Online reinforcement learning (RL) enhances policies through direct interactions with the environment, but faces challenges related to sample efficiency. In contrast, offline RL leverages extensive pre-collected data to learn policies, but often produces suboptimal results due to limited data coverage. Recent efforts integrate offline and online RL in order to harness the advantages of both approaches. However, effectively combining online and offline RL remains challenging due to issues that include catastrophic forgetting, lack of robustness to data quality, and limited sample efficiency in data utilization. In an effort to address these challenges, we introduce A3RL, which incorporates a novel confidence-aware Active Advantage Aligned (A3) sampling strategy that dynamically prioritizes data aligned with the policy’s evolving needs from both online and offline sources, optimizing policy improvement. Moreover, we provide theoretical insights into the effectiveness of our active sampling strategy and conduct diverse empirical experiments and ablation studies, demonstrating that our method outperforms competing online RL techniques that leverage offline data.

[811] Fenchel-Young Variational Learning

Sophia Sklaviadis, Andre Martins, Mario Figueiredo

Main category: cs.LG

TL;DR: The paper introduces Fenchel-Young (FY) variational learning, a new class of methods generalizing classical variational approaches using FY losses. It includes novel concepts like FY free energy and FY posterior, with algorithms like FYEM and FYVAE, showing competitive performance and unique features like adaptive sparsity.

DetailsMotivation: To generalize classical variational learning by introducing FY losses as divergences, enabling broader model classes and novel algorithmic features.

Method: Proposes FY variational learning with FY free energy, evidence, and posterior. Develops alternating minimization and gradient backpropagation algorithms, leading to FYEM and FYVAE.

Result: Empirically competitive, often outperforming classical methods, with novel features like adaptive sparsity in FYEM and sparse observations in FYVAE.

Conclusion: FY variational learning extends classical methods, offering improved performance and unique capabilities, such as sparsity, in statistical learning tasks.

Abstract: From a variational perspective, many statistical learning criteria involve seeking a distribution that balances empirical risk and regularization. In this paper, we broaden this perspective by introducing a new general class of variational methods based on Fenchel-Young (FY) losses, treated as divergences that generalize (and encompass) the familiar Kullback-Leibler divergence at the core of classical variational learning. Our proposed formulation – FY variational learning – includes as key ingredients new notions of FY free energy, FY evidence, FY evidence lower bound, and FY posterior. We derive alternating minimization and gradient backpropagation algorithms to compute (or lower bound) the FY evidence, which enables learning a wider class of models than previous variational formulations. This leads to generalized FY variants of classical algorithms, such as an FY expectation-maximization (FYEM) algorithm, and latent-variable models, such as an FY variational autoencoder (FYVAE). Our new methods are shown to be empirically competitive, often outperforming their classical counterparts, and most importantly, to have qualitatively novel features. For example, FYEM has an adaptively sparse E-step, while the FYVAE can support models with sparse observations and sparse posteriors.

[812] ElementaryNet: A Non-Strategic Neural Network for Predicting Human Behavior in Normal-Form Games

Greg d’Eon, Hala Murad, Kevin Leyton-Brown, James R. Wright

Main category: cs.LG

TL;DR: GameNet predicts human behavior in games but may emulate strategic reasoning. ElementaryNet, a restricted neural network, matches GameNet’s performance and provides interpretable insights into human behavior.

DetailsMotivation: To address the potential overgeneralization of GameNet's level-0 model and ensure interpretability while maintaining predictive accuracy.

Method: Introduces ElementaryNet, a neural network provably incapable of strategic behavior, and compares its performance with GameNet.

Result: ElementaryNet performs as well as GameNet and offers interpretable insights into human iterative reasoning and learning.

Conclusion: ElementaryNet balances interpretability and performance, demonstrating the value of restricted models for behavioral insights.

Abstract: Behavioral game theory models serve two purposes: yielding insights into how human decision-making works, and predicting how people would behave in novel strategic settings. A system called GameNet represents the state of the art for predicting human behavior in the setting of unrepeated simultaneous-move games, combining a simple “level-k” model of strategic reasoning with a complex neural network model of non-strategic “level-0” behavior. Although this reliance on well-established ideas from cognitive science ought to make GameNet interpretable, the flexibility of its level-0 model raises the possibility that it is able to emulate strategic reasoning. In this work, we prove that GameNet’s level-0 model is indeed too general. We then introduce ElementaryNet, a novel neural network that is provably incapable of expressing strategic behavior. We show that these additional restrictions are empirically harmless, with ElementaryNet achieving predictive performance statistically indistinguishable from GameNet’s. We then show how it is possible to derive insights about human behavior by varying ElementaryNet’s features and interpreting its parameters, finding evidence of iterative reasoning, learning about the depth of this reasoning process, and showing the value of a rich level-0 specification.

[813] Navigating Demand Uncertainty in Container Shipping: Deep Reinforcement Learning for Enabling Adaptive and Feasible Master Stowage Planning

Jaike van Twiller, Yossiri Adulyasak, Erick Delage, Djordje Grbic, Rune Møller Jensen

Main category: cs.LG

TL;DR: A deep RL framework with encoder-decoder and feasibility layers solves stochastic sequential dynamic decision-making problems with state-dependent constraints, outperforming baselines in constrained RL and stochastic programming.

DetailsMotivation: Address challenges of conventional RL in handling complex, real-world constraints, particularly in state-dependent action space feasibility, using the master stowage planning problem as a case study.

Method: Proposes a deep RL framework with encoder-decoder model and feasibility layers to embed problem instances, current solutions, and demand uncertainty, ensuring convex constraints and unbiased gradient flow.

Result: The model efficiently finds adaptive, feasible solutions, generalizes across distributions, scales to larger instances, and outperforms state-of-the-art baselines.

Conclusion: The framework bridges AI and operations research, enabling adaptive, uncertainty-aware decisions for resilient and sustainable planning.

Abstract: Reinforcement learning (RL) has shown promise in solving various combinatorial optimization problems. However, conventional RL faces challenges when dealing with complex, real-world constraints, especially when action space feasibility is explicit and dependent on the corresponding state or trajectory. In this work, we address stochastic sequential dynamic decision-making problems with state-dependent constraints. As a relevant and real-world case study, we focus on the master stowage planning problem in container shipping, which aims to optimize revenue and operational costs under demand uncertainty and operational constraints. We propose a deep RL framework with an encoder-decoder model and feasibility layers that satisfy convex constraints and maintain unbiased gradient flow, which embed problem instances, current solutions, and demand uncertainty to guide learning. Experiments show that our model efficiently finds adaptive, feasible solutions that generalize across varying distributions and scale to larger instances, outperforming state-of-the-art baselines in constrained RL and stochastic programming. By uniting artificial intelligence and operations research, our policy empowers humans to make adaptive, uncertainty-aware decisions for resilient and sustainable planning.

[814] Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun

Main category: cs.LG

TL;DR: The paper compares RL and control-based methods for AI agents learning from offline data without rewards, highlighting their strengths in different data scenarios.

DetailsMotivation: To understand the relative strengths of RL and optimal control in offline learning settings without reward annotations.

Method: Systematic analysis of RL (goal-conditioned, zero-shot) and control-based (JEPA-trained latent dynamics model) methods under varying dataset qualities.

Result: Model-free RL performs best with high-quality data, while model-based planning excels in generalization, trajectory stitching, and data-efficiency.

Conclusion: Latent dynamics model planning is promising for zero-shot generalization from suboptimal data.

Abstract: A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations. In this work, we systematically analyze the performance of different RL and control-based methods under datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties-such as data diversity, trajectory quality, and environment variability-affect the performance of these approaches. Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels in generalization to novel environment layouts, trajectory stitching, and data-efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.

[815] Real-Time Moving Flock Detection in Pedestrian Trajectories Using Sequential Deep Learning Models

Amartaivan Sanjjamts, Hiroshi Morita, Togootogtokh Enkhtogtokh

Main category: cs.LG

TL;DR: The paper proposes a two-stage deep learning approach using RNNs, LSTMs, and Transformers for real-time pedestrian flock detection, validated on real-world datasets with high accuracy.

DetailsMotivation: Understanding collective pedestrian movement is essential for crowd management, autonomous navigation, and human-robot interaction.

Method: A two-stage process: pre-trained binary classification for pairwise trajectory classification, followed by dynamic multi-agent flock identification using learned representations.

Result: The model achieves high accuracy and stability in detecting pedestrian flocks, even in dynamic and noisy environments, and extends to other collective motions like convoys and swarms.

Conclusion: The approach enables robust real-time flock detection and broader multi-agent behavior analysis, with potential applications in diverse fields.

Abstract: Understanding collective pedestrian movement is crucial for applications in crowd management, autonomous navigation, and human-robot interaction. This paper investigates the use of sequential deep learning models, including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers, for real-time flock detection in multi-pedestrian trajectories. Our proposed approach consists of a two-stage process: first, a pre-trained binary classification model is used for pairwise trajectory classification, and second, the learned representations are applied to identify multi-agent flocks dynamically. We validate our method using real-world group movement datasets, demonstrating its robustness across varying sequence lengths and diverse movement patterns. Experimental results indicate that our model consistently detects pedestrian flocks with high accuracy and stability, even in dynamic and noisy environments. Furthermore, we extend our approach to identify other forms of collective motion, such as convoys and swarms, paving the way for more comprehensive multi-agent behavior analysis.

[816] Robustness to Geographic Distribution Shift Using Location Encoders

Ruth Crasto

Main category: cs.LG

TL;DR: The paper proposes using location encoders to address geographic distribution shift, improving domain adaptation by leveraging geographic coordinates.

DetailsMotivation: Geographic distribution shift is often ignored by treating regions as separate domains without using geographic metadata. This work aims to better model continuous domain assignment.

Method: The paper introduces non-parametric sine-cosine encoders and pre-trained location encoders, integrating them with standard domain adaptation methods.

Result: The methods achieve state-of-the-art results on geo-tagged remote sensing datasets from the WILDS benchmark.

Conclusion: Location encoders enhance robustness to geographic distribution shift, with publicly available code for reproducibility.

Abstract: Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at test time. The most common approaches to tackling geographic distribution shift treat regions delimited by administrative boundaries such as countries or continents as separate domains and apply standard domain adaptation methods, ignoring geographic coordinates that are often available as metadata. This paper proposes the use of location encoders for modeling continuous, learnable domain assignment. We show how both non-parametric sine-cosine encoders and pre-trained location encoders can be used in conjunction with standard domain adaptation methods for improved robustness to geographic distribution shift. Our proposed methods achieve new state-of-the-art results on two geo-tagged remote sensing datasets from the WILDS benchmark. We have made our code publicly available at: https://github.com/crastoru/wilds-geoshift.
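
A minimal non-parametric sine-cosine encoder (a standard recipe consistent with the abstract; the pre-trained encoders the paper also studies are not shown):

```python
import numpy as np

def sincos_location_encoding(lon_deg, lat_deg, num_scales=4):
    # Multi-scale Fourier features of longitude/latitude, usable as extra
    # inputs or as a continuous "domain index" for domain-adaptation methods.
    lon = np.radians(np.asarray(lon_deg))[..., None]
    lat = np.radians(np.asarray(lat_deg))[..., None]
    scales = 2.0 ** np.arange(num_scales)          # frequencies 1, 2, 4, 8
    feats = [np.sin(scales * lon), np.cos(scales * lon),
             np.sin(scales * lat), np.cos(scales * lat)]
    return np.concatenate(feats, axis=-1)          # shape (..., 4 * num_scales)

enc = sincos_location_encoding([12.5, -122.3], [41.9, 47.6])
print(enc.shape)   # (2, 16)
```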

[817] Average-DICE: Stationary Distribution Correction by Regression

Fengdi Che, Bryan Chan, Chen Ma, A. Rupam Mahmood

Main category: cs.LG

TL;DR: AVG-DICE is a simple Monte Carlo estimator for density ratios in off-policy evaluation, offering unbiased and consistent correction, often outperforming state-of-the-art methods.

DetailsMotivation: Address the challenge of stationary state distribution mismatch in off-policy evaluation, which undermines stability and accuracy.

Method: Introduces AVG-DICE, a Monte Carlo estimator averaging discounted importance sampling ratios, extended to nonlinear function approximation via regression.

Result: AVG-DICE matches or outperforms state-of-the-art estimators in accuracy, sometimes with significant improvements, though hyperparameter sensitivity is noted.

Conclusion: AVG-DICE is a promising, simpler alternative for OPE, but hyperparameter tuning is crucial for optimal performance across different settings.

Abstract: Off-policy policy evaluation (OPE), an essential component of reinforcement learning, has long suffered from stationary state distribution mismatch, undermining both stability and accuracy of OPE estimates. While existing methods correct distribution shifts by estimating density ratios, they often rely on expensive optimization or backward Bellman-based updates and struggle to outperform simpler baselines. We introduce AVG-DICE, a computationally simple Monte Carlo estimator for the density ratio that averages discounted importance sampling ratios, providing an unbiased and consistent correction. AVG-DICE extends naturally to nonlinear function approximation using regression, which we roughly tune and test on OPE tasks based on Mujoco Gym environments and compare with state-of-the-art density-ratio estimators using their reported hyperparameters. In our experiments, AVG-DICE is at least as accurate as state-of-the-art estimators and sometimes offers orders-of-magnitude improvements. However, a sensitivity analysis shows that best-performing hyperparameters may vary substantially across different discount factors, so a re-tuning is suggested.
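
As a hedged tabular sketch of the averaging idea (our reading; the exact indexing conventions and the paper's regression-based extension to function approximation may differ):

```python
import numpy as np

def avg_dice_tabular(trajectories, ratio, gamma, n_states):
    # Average discounted products of per-step importance ratios at each
    # visited state to estimate the correction d_pi / d_mu.
    d_pi = np.zeros(n_states)    # discounted visitation under pi (IS estimate)
    d_mu = np.zeros(n_states)    # empirical state frequency under mu
    for traj in trajectories:    # traj: list of (state, action) pairs
        w = 1.0                  # running product of importance ratios
        for t, (s, a) in enumerate(traj):
            d_pi[s] += (1.0 - gamma) * gamma ** t * w
            w *= ratio[s, a]     # pi(a|s) / mu(a|s), applied after visiting s
            d_mu[s] += 1.0
    d_pi /= len(trajectories)
    d_mu /= d_mu.sum()
    return np.divide(d_pi, d_mu, out=np.zeros(n_states), where=d_mu > 0)
```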

[818] Gradient Extrapolation for Debiased Representation Learning

Ihab Asaad, Maha Shadaydeh, Joachim Denzler

Main category: cs.LG

TL;DR: GERNE is a novel method for debiasing machine learning models by using gradient extrapolation to reduce reliance on spurious correlations, outperforming existing baselines.

DetailsMotivation: ERM-trained models often rely on spurious correlations, leading to poor generalization when such correlations are absent in test data.

Method: GERNE uses gradient extrapolation from two batches with varying spurious correlations to guide debiased representation learning.

Result: GERNE achieves competitive or superior performance on vision and NLP benchmarks, adapting to maximize GBA or WGA.

Conclusion: GERNE provides a general debiasing framework, with theoretical bounds for its extrapolation factor, and demonstrates effectiveness across tasks.

Abstract: Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations and defines the target gradient as a linear extrapolation of the gradients computed from each batch’s loss. Our analysis shows that when the extrapolated gradient points toward the batch gradient with fewer spurious correlations, it effectively guides training toward learning a debiased model. GERNE serves as a general framework for debiasing, encompassing ERM and Resampling methods as special cases. We derive the theoretical upper and lower bounds of the extrapolation factor employed by GERNE. By tuning this factor, GERNE can adapt to maximize either Group-Balanced Accuracy (GBA) or Worst-Group Accuracy (WGA). We validate GERNE on five vision and one NLP benchmarks, demonstrating competitive and often superior performance compared to state-of-the-art baselines. The project page is available at: https://gerne-debias.github.io/.
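
The extrapolation step itself is simple; below is our sketch of one update under the stated assumptions (batch construction and the tuning of the extrapolation factor c follow the paper and are not shown):

```python
import torch

def gerne_step(model, loss_fn, batch_more, batch_less, optimizer, c=1.0):
    # batch_more / batch_less: (x, y) batches with more / fewer spurious
    # correlations. The target gradient g_less + c * (g_less - g_more)
    # points away from the spurious signal for c > 0.
    params = list(model.parameters())
    grads = []
    for x, y in (batch_more, batch_less):
        loss = loss_fn(model(x), y)
        grads.append(torch.autograd.grad(loss, params))
    optimizer.zero_grad()
    for p, g_more, g_less in zip(params, *grads):
        p.grad = (g_less + c * (g_less - g_more)).detach()
    optimizer.step()
```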

[819] Bidirectional Hierarchical Protein Multi-Modal Representation Learning

Xuefeng Liu, Songhao Jiang, Chih-chan Tien, Jinbo Xu, Rick Stevens

Main category: cs.LG

TL;DR: A multimodal bidirectional hierarchical fusion framework combines protein sequence and structural data, outperforming existing methods in protein representation learning tasks.

DetailsMotivation: Protein language models (pLMs) lack structural context, while graph neural networks (GNNs) struggle with limited labeled structural data. Combining these complementary modalities can enhance protein representation.

Method: The framework uses attention and gating mechanisms to fuse pLMs-generated sequence representations and GNN-extracted structural features, enabling bidirectional and hierarchical (Bi-Hierarchical) interaction.

Result: The method outperforms baselines in tasks like enzyme classification, protein-ligand binding affinity prediction, and epitope prediction, setting a new state-of-the-art.

Conclusion: Bi-Hierarchical Fusion effectively bridges sequence and structural modalities, improving protein representation learning.

Abstract: Protein representation learning is critical for numerous biological tasks. Recently, large transformer-based protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. However, pLMs lack structural context. Conversely, graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks, but their effectiveness is often constrained by the scarcity of labeled structural data. Recognizing that sequence and structural representations are complementary perspectives of the same protein entity, we propose a multimodal bidirectional hierarchical fusion framework to effectively merge these modalities. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs-generated sequential representations and GNN-extracted structural features, improving information exchange and enhancement across layers of the neural network. This bidirectional and hierarchical (Bi-Hierarchical) fusion approach leverages the strengths of both modalities to capture richer and more comprehensive protein representations. Based on the framework, we further introduce local Bi-Hierarchical Fusion with gating and global Bi-Hierarchical Fusion with multihead self-attention approaches. Our method demonstrates consistent improvements over strong baselines and existing fusion techniques in a variety of protein representation learning benchmarks, including enzyme EC classification, model quality assessment, protein-ligand binding affinity prediction, protein-protein binding site prediction, and B cell epitopes prediction. Our method establishes a new state-of-the-art for multimodal protein representation learning, emphasizing the efficacy of Bi-Hierarchical Fusion in bridging sequence and structural modalities.
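
A minimal gated cross-attention fusion block in the spirit of the local Bi-Hierarchical Fusion (our simplification; dimensions, layer counts, and the global multihead self-attention variant are illustrative):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.seq_to_struct = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.struct_to_seq = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate_seq = nn.Linear(2 * dim, dim)
        self.gate_struct = nn.Linear(2 * dim, dim)

    def forward(self, h_seq, h_struct):
        # h_seq: (B, L, D) residue embeddings from a pLM
        # h_struct: (B, L, D) per-residue features from a structure GNN
        attn_s, _ = self.struct_to_seq(h_seq, h_struct, h_struct)
        attn_q, _ = self.seq_to_struct(h_struct, h_seq, h_seq)
        g_seq = torch.sigmoid(self.gate_seq(torch.cat([h_seq, attn_s], -1)))
        g_struct = torch.sigmoid(self.gate_struct(torch.cat([h_struct, attn_q], -1)))
        fused_seq = g_seq * h_seq + (1 - g_seq) * attn_s
        fused_struct = g_struct * h_struct + (1 - g_struct) * attn_q
        return fused_seq, fused_struct

fused_seq, fused_struct = GatedFusion(64)(torch.randn(2, 10, 64),
                                          torch.randn(2, 10, 64))
```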

[820] Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI

Dawood Wasif, Dian Chen, Sindhuja Madabushi, Nithin Alluru, Terrence J. Moore, Jin-Hee Cho

Main category: cs.LG

TL;DR: A study on privacy-fairness-utility trade-offs in Federated Learning (FL) compares Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Multi-Party Computation (SMC) with fairness-aware optimizers, revealing HE and SMC outperform DP in fairness but at higher costs.

DetailsMotivation: To address the challenge of balancing privacy preservation and fairness in FL, advancing responsible AI deployment.

Method: Systematic comparison of DP, HE, and SMC with fairness-aware optimizers (q-FedAvg, q-MAML, Ditto) under IID and non-IID scenarios using benchmark and real-world datasets.

Result: HE and SMC outperform DP in fairness under data skew, but with higher computational costs. DP can harm fairness, and fairness optimizers may reduce privacy effectiveness.

Conclusion: Practical guidelines are provided for designing FL systems that ensure equitable, privacy-preserving, and accurate outcomes.

Abstract: Federated Learning (FL) enables collaborative model training while preserving data privacy; however, balancing privacy preservation (PP) and fairness poses significant challenges. In this paper, we present the first unified large-scale empirical study of privacy-fairness-utility trade-offs in FL, advancing toward responsible AI deployment. Specifically, we systematically compare Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Multi-Party Computation (SMC) with fairness-aware optimizers including q-FedAvg, q-MAML, Ditto, evaluating their performance under IID and non-IID scenarios using benchmark (MNIST, Fashion-MNIST) and real-world datasets (Alzheimer’s MRI, credit-card fraud detection). Our analysis reveals HE and SMC significantly outperform DP in achieving equitable outcomes under data skew, although at higher computational costs. Remarkably, we uncover unexpected interactions: DP mechanisms can negatively impact fairness, and fairness-aware optimizers can inadvertently reduce privacy effectiveness. We conclude with practical guidelines for designing robust FL systems that deliver equitable, privacy-preserving, and accurate outcomes.

[821] Uncertainty propagation in feed-forward neural network models

Jeremy Diamzon, Daniele Venturi

Main category: cs.LG

TL;DR: New methods for uncertainty propagation in neural networks with leaky ReLU activations, deriving PDFs and moments analytically. Linearization of leaky ReLU proves accurate for large input perturbations. Gaussian copula models approximate joint PDFs. Validated via Monte Carlo simulations.

DetailsMotivation: To address uncertainty propagation in neural networks with leaky ReLU activations under random input perturbations, providing analytical tools for PDF and moment calculations.

Method: Derived analytical expressions for PDF and moments of network output, linearized leaky ReLU, and proposed Gaussian copula surrogate models. Validated with Monte Carlo simulations.

Result: Accurate statistical results even for large input perturbations, with excellent agreement between theory and simulations.

Conclusion: The proposed methods effectively propagate uncertainty in neural networks, validated by simulations, offering practical tools for uncertainty quantification.

Abstract: We develop new uncertainty propagation methods for feed-forward neural network architectures with leaky ReLU activation functions subject to random perturbations in the input vectors. In particular, we derive analytical expressions for the probability density function (PDF) of the neural network output and its statistical moments as a function of the input uncertainty and the parameters of the network, i.e., weights and biases. A key finding is that an appropriate linearization of the leaky ReLU activation function yields accurate statistical results even for large perturbations in the input vectors. This can be attributed to the way information propagates through the network. We also propose new analytically tractable Gaussian copula surrogate models to approximate the full joint PDF of the neural network output. To validate our theoretical results, we conduct Monte Carlo simulations and a thorough error analysis on a multi-layer neural network representing a nonlinear integro-differential operator between two polynomial function spaces. Our findings demonstrate excellent agreement between the theoretical predictions and Monte Carlo simulations.
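
The linearization device is easy to demonstrate; the sketch below (our minimal version) pushes the mean through leaky ReLU exactly and the covariance through the local diagonal Jacobian, then checks the propagated mean against Monte Carlo:

```python
import numpy as np

alpha = 0.1
def lrelu(z):
    return np.where(z > 0, z, alpha * z)

def propagate(mean, cov, weights, biases):
    for W, b in zip(weights, biases):
        mean, cov = W @ mean + b, W @ cov @ W.T      # exact linear step
        D = np.where(mean > 0, 1.0, alpha)           # lrelu Jacobian at the mean
        mean, cov = lrelu(mean), (D[:, None] * cov) * D[None, :]
    return mean, cov

rng = np.random.default_rng(0)
sizes = [3, 16, 16, 2]
Ws = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [rng.normal(scale=0.1, size=m) for m in sizes[1:]]

mu0, cov0 = np.zeros(3), 0.2 * np.eye(3)
mu_th, cov_th = propagate(mu0, cov0, Ws, bs)

xs = rng.multivariate_normal(mu0, cov0, size=100_000)   # Monte Carlo check
for W, b in zip(Ws, bs):
    xs = lrelu(xs @ W.T + b)
print(np.round(mu_th, 3), np.round(xs.mean(0), 3))      # should be close
```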

[822] Model-Agnostic Policy Explanations with Large Language Models

Zhang Xi-Jia, Yue Guo, Shufei Chen, Simon Stepputtis, Matthew Gombolay, Katia Sycara, Joseph Campbell

Main category: cs.LG

TL;DR: A method for generating natural language explanations of agent behavior without accessing the agent’s internal model, improving interpretability and human understanding.

DetailsMotivation: To enhance trust and meet ethical standards, intelligent agents need explainable behavior, but black-box models like deep neural networks limit interpretability.

Method: Learns a locally interpretable surrogate model from observations to guide a large language model in generating plausible explanations with minimal hallucination.

Result: Produces more comprehensible and correct explanations than baselines, validated by language models and human evaluators. User study shows improved prediction of agent actions.

Conclusion: The method effectively enhances interpretability and human understanding of agent behavior, fostering trust and compliance with ethical standards.

Abstract: Intelligent agents, such as robots, are increasingly deployed in real-world, human-centric environments. To foster appropriate human trust and meet legal and ethical standards, these agents must be able to explain their behavior. However, state-of-the-art agents are typically driven by black-box models like deep neural networks, limiting their interpretability. We propose a method for generating natural language explanations of agent behavior based only on observed states and actions – without access to the agent’s underlying model. Our approach learns a locally interpretable surrogate model of the agent’s behavior from observations, which then guides a large language model to generate plausible explanations with minimal hallucination. Empirical results show that our method produces explanations that are more comprehensible and correct than those from baselines, as judged by both language models and human evaluators. Furthermore, we find that participants in a user study more accurately predicted the agent’s future actions when given our explanations, suggesting improved understanding of agent behavior.

[823] Uniform Loss vs. Specialized Optimization: A Comparative Analysis in Multi-Task Learning

Gabriel S. Gama, Valdir Grassi Jr

Main category: cs.LG

TL;DR: SMTOs balance task learning in Multi-Task Learning, but critiques suggest equal-weighted tasks can match SMTOs. This paper evaluates SMTOs empirically, finding they perform well, though fixed weights can also compete.

DetailsMotivation: To clarify whether SMTOs outperform equal-weighted tasks, addressing critiques about hyperparameter optimization and regularization.

Method: Extensive empirical evaluation of SMTOs on complex multi-task problems, comparing them to uniform loss and fixed weights.

Result: SMTOs perform well, but fixed weights can achieve competitive performance. Uniform loss sometimes matches SMTOs.

Conclusion: SMTOs are effective, but fixed weights can also perform well, with uniform loss occasionally matching SMTOs.

Abstract: Specialized Multi-Task Optimizers (SMTOs) balance task learning in Multi-Task Learning by addressing issues like conflicting gradients and differing gradient norms, which hinder equal-weighted task training. However, recent critiques suggest that equally weighted tasks can achieve competitive results compared to SMTOs, arguing that previous SMTO results were influenced by poor hyperparameter optimization and lack of regularization. In this work, we evaluate these claims through an extensive empirical evaluation of SMTOs, including some of the latest methods, on more complex multi-task problems to clarify this behavior. Our findings indicate that SMTOs perform well compared to uniform loss and that fixed weights can achieve competitive performance compared to SMTOs. Furthermore, we demonstrate why uniform loss performs similarly to SMTOs in some instances. The source code is available at https://github.com/Gabriel-SGama/UnitScal_vs_SMTOs.

[824] Resource-efficient Inference with Foundation Model Programs

Lunyiu Nie, Zhimin Ding, Kevin Yu, Marco Cheung, Chris Jermaine, Swarat Chaudhuri

Main category: cs.LG

TL;DR: The paper proposes foundation model programs to reduce inference-time resource costs in large language and vision models by dynamically selecting backends based on input complexity.

DetailsMotivation: Addressing the challenge of high resource costs in deploying large language and vision models in production.

Method: Translates tasks into programs and learns a policy to allocate resources by selecting appropriate foundation model backends for each module.

Result: Achieves up to 98% resource savings with minimal accuracy loss on streaming visual question-answering tasks.

Conclusion: Demonstrates scalable and resource-efficient multi-modal inference by leveraging smaller backends for simpler tasks and larger ones for complex ones.

Abstract: The inference-time resource costs of large language and vision models present a growing challenge in production deployments. We propose the use of foundation model programs, i.e., programs that can invoke foundation models with varying resource costs and performance, as an approach to this problem. Specifically, we present a method that translates a task into a program, then learns a policy for resource allocation that, on each input, selects foundation model “backends” for each program module. The policy uses smaller, cheaper backends to handle simpler subtasks, while allowing more complex subtasks to leverage larger, more capable models. We evaluate the method on two new “streaming” visual question-answering tasks in which a system answers a question on a sequence of inputs, receiving ground-truth feedback after each answer. Compared to monolithic multi-modal models, our implementation achieves up to 98% resource savings with minimal accuracy loss, demonstrating its potential for scalable and resource-efficient multi-modal inference.
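
A minimal sketch of the backend-selection idea, assuming a toy program of two modules: a policy picks a cheap or an expensive backend for each module, and the run accumulates cost. The `Backend` type, the cost numbers, and the length-based policy below are illustrative stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Backend:
    name: str
    cost: float
    run: Callable[[str], str]

def run_program(modules: List[str], policy: Callable[[str, str], str],
                backends: Dict[str, Backend], x: str):
    """Execute each module on the backend the policy picks for this input."""
    total_cost, out = 0.0, x
    for module in modules:
        backend = backends[policy(module, out)]
        out = backend.run(out)
        total_cost += backend.cost
    return out, total_cost

backends = {
    "small": Backend("small", 1.0, lambda s: s.lower()),
    "large": Backend("large", 10.0, lambda s: s.lower() + " [detailed answer]"),
}
# Toy policy: escalate to the large backend only for long inputs.
policy = lambda module, text: "large" if len(text) > 40 else "small"
print(run_program(["parse", "answer"], policy, backends, "What colour is the sky?"))
```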

[825] Self-Supervised Autoencoder Network for Robust Heart Rate Extraction from Noisy Photoplethysmogram: Applying Blind Source Separation to Biosignal Analysis

Matthew B. Webster, Dongheon Lee, Joonnyong Lee

Main category: cs.LG

TL;DR: The paper proposes a self-supervised multi-encoder autoencoder (MEAE) for blind source separation (BSS) in biosignals, specifically to extract heartbeat-related signals from PPG data, improving heart rate detection in noisy conditions.

DetailsMotivation: Biosignals like PPG are mixtures of physiological events, and BSS can extract underlying sources. The goal is to enhance heart rate detection in noisy PPG data without pre-processing.

Method: A self-supervised MEAE is trained on PPG signals from a polysomnography database. The network is then applied to noisy PPG data from daily activities of nine subjects.

Result: The MEAE-extracted heartbeat-related signal significantly improves heart rate detection compared to the original PPG.

Conclusion: The self-supervised MEAE shows strong potential for BSS in biosignal analysis, especially for noisy PPG data.

Abstract: Biosignals can be viewed as mixtures measuring particular physiological events, and blind source separation (BSS) aims to extract underlying source signals from mixtures. This paper proposes a self-supervised multi-encoder autoencoder (MEAE) to separate heartbeat-related source signals from photoplethysmogram (PPG), enhancing heart rate (HR) detection in noisy PPG data. The MEAE is trained on PPG signals from a large open polysomnography database without any pre-processing or data selection. The trained network is then applied to a noisy PPG dataset collected during the daily activities of nine subjects. The extracted heartbeat-related source signal significantly improves HR detection as compared to the original PPG. The absence of pre-processing and the self-supervised nature of the proposed method, combined with its strong performance, highlight the potential of MEAE for BSS in biosignal analysis.
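
A minimal sketch of a multi-encoder autoencoder in the spirit of MEAE, assuming fixed-length signal windows and fully connected layers: each encoder produces one latent source, a shared decoder reconstructs the mixture, and the reconstruction loss is the self-supervised objective. Layer sizes and the number of sources are illustrative.

```python
import torch
import torch.nn as nn

class MultiEncoderAE(nn.Module):
    """Several encoders, one latent per hypothesized source; shared decoder."""
    def __init__(self, sig_len=512, latent=32, n_sources=3):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(sig_len, 128), nn.ReLU(), nn.Linear(128, latent))
            for _ in range(n_sources))
        self.decoder = nn.Sequential(
            nn.Linear(latent * n_sources, 128), nn.ReLU(), nn.Linear(128, sig_len))

    def forward(self, x):
        zs = [enc(x) for enc in self.encoders]       # one latent code per source
        recon = self.decoder(torch.cat(zs, dim=-1))  # reconstruct the mixed signal
        return recon, zs

model = MultiEncoderAE()
ppg = torch.randn(8, 512)                            # a batch of PPG windows
recon, sources = model(ppg)
loss = nn.functional.mse_loss(recon, ppg)            # self-supervised objective
```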

[826] Time Marching Neural Operator FE Coupling: AI Accelerated Physics Modeling

Wei Wang, Maryam Hakimzadeh, Haihui Ruan, Somdatta Goswami

Main category: cs.LG

TL;DR: A hybrid framework combining physics-informed DeepONet and FEM via domain decomposition improves accuracy and reduces computational costs in multiscale PDE simulations.

DetailsMotivation: Addressing the challenges of neural operators in PDE simulations, such as data dependency, error accumulation, and poor generalization, by integrating physics-based and machine learning methods.

Method: A hybrid solver coupling FEM and DeepONet via Schwarz method, embedding time stepping in DeepONet, and using adaptive subdomain evolution.

Result: Achieves 20% faster convergence, error margins below 3%, and reduced computational costs by eliminating fine mesh requirements.

Conclusion: The framework offers a scalable, reliable solution for high-fidelity multiscale simulations by combining FEM and DeepONet effectively.

Abstract: Numerical solvers for PDEs often struggle to balance computational cost with accuracy, especially in multiscale and time-dependent systems. Neural operators offer a promising way to accelerate simulations, but their practical deployment is hindered by several challenges: they typically require large volumes of training data generated from high-fidelity solvers, tend to accumulate errors over time in dynamical settings, and often exhibit poor generalization in multiphysics scenarios. This work introduces a novel hybrid framework that integrates a physics-informed deep operator network with FEM through domain decomposition and leverages numerical analysis for time marching. Our innovation lies in efficiently coupling FE and DeepONet subdomains via a Schwarz method, with complex and nonlinear regions solved by a pretrained DeepONet while the remainder is handled by conventional FE. To address the challenges of dynamic systems, we embed a time stepping scheme directly into the DeepONet, substantially reducing long-term error propagation. Furthermore, an adaptive subdomain evolution strategy enables the ML-resolved region to expand dynamically, capturing fine-scale features without remeshing. Our framework shows accelerated convergence (up to 20% improvement in convergence rate compared to conventional FE coupling approaches) while preserving solution fidelity with error margins consistently below 3%. Our study shows that our proposed hybrid solver: (1) reduces computational costs by eliminating fine mesh requirements, (2) mitigates error accumulation in time-dependent simulations, and (3) enables automatic adaptation to evolving physical phenomena. This work establishes a new paradigm for coupling state-of-the-art physics-based and machine learning solvers in a unified framework, offering a robust, reliable, and scalable pathway for high-fidelity multiscale simulations.

[827] FP4 All the Way: Fully Quantized Training of LLMs

Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

Main category: cs.LG

TL;DR: First demonstration of fully quantized training (FQT) for large language models (LLMs) using 4-bit floating-point (FP4) precision, achieving performance comparable to BF16 baselines.

DetailsMotivation: To explore the feasibility and efficiency of training LLMs with ultra-low precision (FP4) for weights, activations, and gradients.

Method: Investigates FP4 design choices (block sizes, scaling formats, rounding methods), uses NVFP4 format with stochastic rounding for backward passes, and identifies a threshold for effective quantized training.

Result: Successfully trains a 7B-parameter model on 256 Intel Gaudi2 accelerators, achieving comparable downstream task performance to BF16 baselines.

Conclusion: FP4 training is practical and highly efficient for large-scale LLM training, with a reference implementation provided.

Abstract: We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied at https://github.com/Anonymous1252022/fp4-all-the-way.
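
A minimal NumPy sketch of NVFP4-style block quantization with stochastic rounding, as described in the abstract: a block of 16 values shares one scale, the backward/update passes round stochastically, and the forward pass rounds to nearest. The E4M3 quantization of the scale itself is omitted; this is an illustration, not the paper's kernel.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block, stochastic=True, rng=None):
    """Quantize a block of 16 values to FP4 with one shared scale."""
    rng = rng or np.random.default_rng()
    scale = np.abs(block).max() / E2M1_GRID[-1] + 1e-12   # shared block scale
    scaled = np.abs(block) / scale
    hi_idx = np.searchsorted(E2M1_GRID, scaled).clip(1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]     # bracketing grid points
    if stochastic:   # backward/update passes: round up w.p. equal to the remainder
        q = np.where(rng.random(block.shape) < (scaled - lo) / (hi - lo), hi, lo)
    else:            # forward pass: round to nearest
        q = np.where(scaled - lo < hi - scaled, lo, hi)
    return np.sign(block) * q * scale

x = np.random.randn(16).astype(np.float32)
print(quantize_block_fp4(x))
```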

[828] CAOTE: KV Cache Eviction for LLMs via Attention Output Error-Based Token Selection

Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, Chris Lott

Main category: cs.LG

TL;DR: The paper introduces CAOTE, a token eviction method that improves efficiency in large language models by considering token contributions to attention outputs, outperforming traditional attention score-based methods.

DetailsMotivation: Addressing the inefficiency of attention scores as token importance metrics in resource-restricted devices by incorporating token contributions to attention outputs.

Method: Proposes CAOTE, a token eviction criterion integrating attention scores and value vectors to minimize eviction error.

Result: CAOTE enhances accuracy in downstream tasks when combined with state-of-the-art attention score-based methods.

Conclusion: Leveraging value vector information alongside attention scores improves token eviction, highlighting CAOTE’s effectiveness and flexibility.

Abstract: While long-context support in large language models has extended their abilities, it also incurs challenges in memory and compute, which become crucial bottlenecks on resource-restricted devices. Token eviction, a widely adopted post-training methodology designed to alleviate these bottlenecks by evicting less important tokens from the cache, typically uses attention scores as proxy metrics for token importance. However, one major limitation of the attention score as a token-wise importance metric is that it lacks information about the contribution of tokens to the attention output. In this paper, we propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. Our method, CAOTE, optimizes for the error induced by token eviction by seamlessly integrating attention scores and value vectors. This is the first method to use value vector information on top of attention-based eviction scores. Additionally, CAOTE can act as a meta-heuristic method that can be flexibly combined with any token eviction method. We show that CAOTE, when combined with state-of-the-art attention score-based methods, always improves accuracy on downstream tasks, indicating the importance of leveraging information from value vectors during the token eviction process.
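
The abstract specifies only that CAOTE combines attention scores with value vectors to score eviction error; the exact criterion is the paper's contribution. The sketch below uses one plausible proxy for that error, the shift in the attention output if token i were removed and its probability mass redistributed, purely for illustration.

```python
import torch

def caote_style_scores(attn, values):
    """attn: (n_queries, n_kv) softmax weights; values: (n_kv, d)."""
    out = attn @ values                        # current attention outputs
    a = attn.mean(dim=0)                       # mean weight per cached token
    o = out.mean(dim=0)                        # mean attention output
    # Proxy for output error if token i is evicted and its mass redistributed.
    return a / (1.0 - a).clamp_min(1e-6) * (values - o).norm(dim=-1)

attn = torch.softmax(torch.randn(4, 10), dim=-1)
values = torch.randn(10, 8)
scores = caote_style_scores(attn, values)
print(scores.argsort()[:3])                    # candidate tokens to evict first
```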

[829] Unveiling 3D Ocean Biogeochemical Provinces in the North Atlantic: A Systematic Comparison and Validation of Clustering Methods

Yvonne Jenniges, Maike Sonnewald, Sebastian Maneth, Are Olsen, Boris P. Koch

Main category: cs.LG

TL;DR: The paper objectively defines North Atlantic regions using clustering methods within the NEMI framework, identifying UMAP-DBSCAN as the best method. It highlights unreliable internal validation metrics and achieves high reproducibility, offering detailed regionalization compared to previous concepts.

DetailsMotivation: To objectively define ocean regions and water masses, avoiding subjective decisions that lead to misleading outcomes, and to support downstream tasks like marine protected area definition.

Method: Used clustering methods (k-Means, Ward, DBSCAN) on 300 million salinity, temperature, and nutrient measurements, with UMAP for dimensionality reduction. Validated methods systematically and aggregated results from 100 UMAP-DBSCAN runs.

Result: UMAP-DBSCAN best represented the data, with high reproducibility (88.81% ensemble overlap). Case studies aligned with known water mass definitions, revealing more detailed regionalization than Longhurst provinces.

Conclusion: The method is objective, efficient, and reproducible, providing a foundation for future research on biogeochemical differences and oceanic changes.

Abstract: Defining ocean regions and water masses helps to understand marine processes and can serve downstream tasks such as defining marine protected areas. However, such definitions often result from subjective decisions potentially producing misleading, unreproducible outcomes. Here, the aim was to objectively define regions of the North Atlantic through systematic comparison of clustering methods within the Native Emergent Manifold Interrogation (NEMI) framework (Sonnewald, 2023). About 300 million measured salinity, temperature, and oxygen, nitrate, phosphate and silicate concentration values served as input for various clustering methods (k-Means, agglomerative Ward, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN)). Uniform Manifold Approximation and Projection (UMAP) emphasised (dis-)similarities in the data while reducing dimensionality. Based on systematic validation of clustering methods and their hyperparameters using internal, external and relative validation techniques, results showed that UMAP-DBSCAN best represented the data. Strikingly, internal validation metrics proved systematically unreliable for comparing clustering methods. To address stochastic variability, 100 UMAP-DBSCAN clustering runs were conducted and aggregated following NEMI, yielding a final set of 321 clusters. Reproducibility was evaluated via ensemble overlap ($88.81\pm1.8\%$) and mean grid cell-wise uncertainty ($15.49\pm20\%$). Case studies of the Mediterranean Sea, deep Atlantic waters and Labrador Sea showed strong agreement with common water mass definitions. This study revealed a more detailed regionalisation compared to previous concepts such as the Longhurst provinces through systematic clustering method comparison. The applied method is objective, efficient and reproducible and will support future research on biogeochemical differences and changes in oceanic regions.
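
A minimal sketch of the UMAP-to-DBSCAN pipeline the study found to best represent the data, using the standard umap-learn and scikit-learn APIs; the random stand-in matrix and all hyperparameters are illustrative, not the paper's settings.

```python
import numpy as np
import umap                                     # pip install umap-learn
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Stand-in for the salinity/temperature/oxygen/nutrient measurement matrix.
X = np.random.rand(2000, 6)

X_std = StandardScaler().fit_transform(X)       # put variables on comparable scales
emb = umap.UMAP(n_components=3, n_neighbors=20).fit_transform(X_std)
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(emb)   # -1 marks noise
print(np.unique(labels))
```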

[830] MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection

Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania

Main category: cs.LG

TL;DR: MaCP is a lightweight adaptation method using cosine projection to enhance efficiency and accuracy in fine-tuning large models.

DetailsMotivation: To improve model efficiency and accuracy while minimizing parameters and memory usage.

Method: Projects weight changes into discrete cosine space, partitions them, and selects critical frequency components.

Result: Superior accuracy, reduced computational complexity, and lower memory requirements across various tasks.

Conclusion: MaCP is a highly effective, efficient, and versatile adaptation method for foundation models.

Abstract: We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition’s most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.
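
A rough sketch of the cosine-projection idea behind MaCP, assuming a 2-D weight update: project into DCT space, split the spectrum into levels, keep the largest coefficients per level, and reconstruct. The banding rule and keep-ratio below are invented for illustration; the paper's partitioning and selection scheme are its own.

```python
import numpy as np
from scipy.fft import dctn, idctn

def macp_style_compress(delta_w, n_levels=3, keep=0.05):
    """Keep the largest `keep` fraction of DCT coefficients within each band."""
    c = dctn(delta_w, norm="ortho")
    i, j = np.indices(c.shape)
    # Band index grows with frequency (anti-diagonal distance from DC).
    level = ((i + j) * n_levels // (sum(c.shape) - 1)).clip(max=n_levels - 1)
    mask = np.zeros_like(c, dtype=bool)
    for l in range(n_levels):
        band = np.abs(np.where(level == l, c, 0.0))
        k = max(1, int(keep * (level == l).sum()))
        thresh = np.partition(band.ravel(), -k)[-k]
        mask |= (level == l) & (np.abs(c) >= thresh)
    return idctn(np.where(mask, c, 0.0), norm="ortho")

dw = np.random.randn(64, 64)                    # a toy weight update
approx = macp_style_compress(dw)
print(np.linalg.norm(dw - approx) / np.linalg.norm(dw))
```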

[831] DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering

Rong Cheng, Jinyi Liu, Yan Zheng, Fei Ni, Jiazhen Du, Hangyu Mao, Fuzheng Zhang, Bo Wang, Jianye Hao

Main category: cs.LG

TL;DR: DualRAG is a dual-process framework combining reasoning and retrieval to improve multi-hop question answering, outperforming existing methods in accuracy and coherence.

DetailsMotivation: Existing methods for multi-hop question answering struggle with dynamic knowledge organization, prompting the need for a better integration of reasoning and retrieval.

Method: DualRAG uses Reasoning-augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA) to iteratively refine reasoning and integrate knowledge.

Result: DualRAG significantly enhances answer accuracy and coherence, sometimes surpassing oracle knowledge access performance.

Conclusion: DualRAG is a versatile and efficient solution for complex multi-hop reasoning tasks.

Abstract: Multi-Hop Question Answering (MHQA) tasks permeate real-world applications, posing challenges in orchestrating multi-step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this, we propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval. DualRAG operates through two tightly coupled processes: Reasoning-augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA). They work in concert: as RaQ navigates the reasoning path and generates targeted queries, pKA ensures that newly acquired knowledge is systematically integrated to support coherent reasoning. This creates a virtuous cycle of knowledge enrichment and reasoning refinement. Through targeted fine-tuning, DualRAG preserves its sophisticated reasoning and retrieval capabilities even in smaller-scale models, demonstrating its versatility and core advantages across different scales. Extensive experiments demonstrate that this dual-process approach substantially improves answer accuracy and coherence, approaching, and in some cases surpassing, the performance achieved with oracle knowledge access. These results establish DualRAG as a robust and efficient solution for complex multi-hop reasoning tasks.

[832] Winner-takes-all for Multivariate Probabilistic Time Series Forecasting

Adrien Cortés, Rémi Rehm, Victor Letzelter

Main category: cs.LG

TL;DR: TimeMCL uses Multiple Choice Learning (MCL) to forecast diverse time series futures with a neural network and WTA loss, showing promising results.

DetailsMotivation: Addressing the need for predicting multiple plausible futures in time-series forecasting, especially for ambiguous tasks.

Method: Adapts MCL with a neural network (multi-head) and WTA loss to promote prediction diversity, linking it to implicit quantization.

Result: Demonstrates promising performance on synthetic and real-world data with low computational cost.

Conclusion: TimeMCL is an efficient method for diverse time-series forecasting, validated by experiments.

Abstract: We introduce TimeMCL, a method leveraging the Multiple Choice Learning (MCL) paradigm to forecast multiple plausible time series futures. Our approach employs a neural network with multiple heads and utilizes the Winner-Takes-All (WTA) loss to promote diversity among predictions. MCL has recently gained attention due to its simplicity and ability to address ill-posed and ambiguous tasks. We propose an adaptation of this framework for time-series forecasting, presenting it as an efficient method to predict diverse futures, which we relate to its implicit quantization objective. We provide insights into our approach using synthetic data and evaluate it on real-world time series, demonstrating its promising performance at a light computational cost.
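
A minimal sketch of the Winner-Takes-All loss at the core of TimeMCL: every head produces a forecast, but only the best head per sample receives gradient, which is what encourages the heads to specialize on different plausible futures. Shapes and the squared-error choice are illustrative.

```python
import torch

def wta_loss(preds, target):
    """preds: (batch, n_heads, horizon); target: (batch, horizon)."""
    errors = ((preds - target.unsqueeze(1)) ** 2).mean(dim=-1)  # per-head error
    return errors.min(dim=1).values.mean()   # only each sample's winner contributes

preds = torch.randn(32, 4, 24, requires_grad=True)   # 4 hypothesis heads
target = torch.randn(32, 24)
loss = wta_loss(preds, target)
loss.backward()                                      # losing heads get zero grad
```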

[833] Forecasting at Full Spectrum: Holistic Multi-Granular Traffic Modeling under High-Throughput Inference Regimes

Zhaoyan Wang, Xiangchi Song, In-Young Ko

Main category: cs.LG

TL;DR: The paper introduces MultiGran-STGCNFog, a fog-based system for efficient traffic forecasting using multi-granular spatiotemporal feature fusion, addressing limitations of existing GCN methods.

DetailsMotivation: Existing GCN-based traffic forecasting methods fail to fully extract and fuse multi-granular spatiotemporal features, leading to less accurate results and slow inference times.

Method: Proposes MultiGran-STGCNFog, a model with multi-granular feature fusion on dynamic traffic graphs, and GA-DPHDS, a scheduling algorithm for optimizing inference throughput.

Result: Experiments on real-world datasets show the method outperforms GCN baselines in accuracy and efficiency.

Conclusion: The proposed system effectively captures traffic dynamics and improves inference speed, making it suitable for intelligent transportation systems.

Abstract: Current intelligent transportation systems rely heavily on accurate traffic forecasting and swift inference to make timely decisions. While Graph Convolutional Networks (GCNs) have shown benefits in modeling complex traffic dependencies, existing GCN-based approaches cannot fully extract and fuse multi-granular spatiotemporal features across spatial and temporal scales, which has been shown to yield less accurate results. Besides, as extracting multi-granular features across scales has been a promising strategy in domains such as computer vision, natural language processing, and time-series forecasting, pioneering studies have attempted to leverage a similar mechanism for spatiotemporal traffic data mining. However, the additional feature extraction branches introduced in prior studies critically increased model complexity and extended inference time, making it challenging to provide fast forecasts. In this paper, we propose MultiGran-STGCNFog, an efficient fog-distributed inference system with a novel traffic forecasting model that employs multi-granular spatiotemporal feature fusion on generated dynamic traffic graphs to fully capture interdependent traffic dynamics. The proposed scheduling algorithm GA-DPHDS, which optimizes the layer execution order and the layer-device scheduling scheme simultaneously, contributes considerable inference throughput improvement by coordinating heterogeneous fog devices in a pipelined manner. Extensive experiments on real-world datasets demonstrate the superiority of the proposed method over selected GCN baselines.

[834] FedSDAF: Leveraging Source Domain Awareness for Enhanced Federated Domain Generalization

Hongze Li, Zesheng Zhou, Zhenbiao Cao, Xinhui Li, Wei Chen, Xiaojin Zhang

Main category: cs.LG

TL;DR: FedSDAF enhances FedDG by leveraging source domain-aware features, outperforming existing methods through a dual-adapter architecture and bidirectional knowledge distillation.

DetailsMotivation: Traditional FedDG methods overlook unique source domain knowledge, while FedSDAF exploits it for better generalization.

Method: Uses a dual-adapter architecture (Domain-Aware and Domain-Invariant Adapters) with bidirectional knowledge distillation for knowledge exchange.

Result: Outperforms existing FedDG methods on benchmark datasets (OfficeHome, PACS, VLCS, DomainNet).

Conclusion: FedSDAF effectively leverages source domain knowledge, improving generalization in federated learning.

Abstract: Traditional Federated Domain Generalization (FedDG) methods focus on learning domain-invariant features or adapting to unseen target domains, often overlooking the unique knowledge embedded within the source domain, especially in strictly isolated federated learning environments. Through experimentation, we discovered a counterintuitive phenomenon: features learned from a complete source domain have superior generalization capabilities compared to those learned directly from the target domain. This insight leads us to propose the Federated Source Domain Awareness Framework (FedSDAF), the first systematic approach to enhance FedDG by leveraging source domain-aware features. FedSDAF employs a dual-adapter architecture that decouples “local expertise” from “global generalization consensus”. A Domain-Aware Adapter, retained locally, extracts and protects the unique discriminative knowledge of each source domain, while a Domain-Invariant Adapter, shared across clients, builds a robust global consensus. To enable knowledge exchange, we introduce a Bidirectional Knowledge Distillation mechanism that facilitates efficient dialogue between the adapters. Extensive experiments on four benchmark datasets (OfficeHome, PACS, VLCS, DomainNet) show that FedSDAF significantly outperforms existing FedDG methods. The source code is available at https://github.com/pizzareapers/FedSDAF.

[835] Rethinking Irregular Time Series Forecasting: A Simple yet Effective Baseline

Xvyuan Liu, Xiangfei Qiu, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Jilin Hu, Bin Yang

Main category: cs.LG

TL;DR: APN introduces a Time-Aware Patch Aggregation (TAPA) module for efficient and accurate forecasting of irregular multivariate time series, outperforming existing methods.

DetailsMotivation: Addressing the challenges of non-uniformity, missing data, and computational inefficiency in forecasting irregular multivariate time series.

Method: APN uses TAPA to dynamically define segments and compute patch representations via time-aware weighted aggregation, avoiding interpolation.

Result: APN achieves state-of-the-art performance in prediction accuracy and computational efficiency on real-world datasets.

Conclusion: APN provides a general, efficient framework for IMTS forecasting, preserving data fidelity and improving performance.

Abstract: The forecasting of irregular multivariate time series (IMTS) is a critical task in domains like healthcare and climate science. However, this task faces two significant hurdles: 1) the inherent non-uniformity and missing data in IMTS complicate the modeling of temporal dynamics, and 2) existing methods often rely on computationally expensive architectures. To address these dual challenges, we introduce APN, a general and efficient forecasting framework. At the core of APN is a novel Time-Aware Patch Aggregation (TAPA) module that introduces an aggregation-based paradigm for adaptive patching, moving beyond the limitations of fixed-span segmentation and interpolation-based methods. TAPA first learns dynamic temporal boundaries to define data-driven segments. Crucially, instead of resampling or interpolating, it directly computes patch representations via a time-aware weighted aggregation of all raw observations, where weights are determined by each observation’s temporal relevance to the segment. This approach provides two key advantages: it preserves data fidelity by avoiding the introduction of artificial data points and ensures complete information coverage by design. The resulting regularized and information-rich patch representations enable the use of a lightweight query module for historical context aggregation and a simple MLP for final prediction. Extensive experiments on multiple real-world datasets demonstrate that APN establishes a new state-of-the-art, significantly outperforming existing methods in both prediction accuracy and computational efficiency.
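
A minimal sketch of time-aware weighted aggregation in the spirit of TAPA: each patch representation is a weighted sum over all raw observations, with weights set by temporal relevance to the segment, so no artificial interpolated points are introduced. APN's learned dynamic boundaries are replaced here by fixed centres and a softmax kernel, both illustrative.

```python
import torch

def time_aware_patches(values, times, centres, width=0.1):
    """values: (n_obs, d); times: (n_obs,); centres: (n_patches,)."""
    dist = (times.unsqueeze(0) - centres.unsqueeze(1)).abs()  # (n_patches, n_obs)
    weights = torch.softmax(-dist / width, dim=-1)            # temporal relevance
    return weights @ values                                   # (n_patches, d)

times = torch.sort(torch.rand(50)).values      # irregular timestamps in [0, 1]
obs = torch.randn(50, 8)                       # raw multivariate observations
patches = time_aware_patches(obs, times, torch.linspace(0.1, 0.9, 5))
print(patches.shape)                           # torch.Size([5, 8])
```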

[836] A Square Peg in a Square Hole: Meta-Expert for Long-Tailed Semi-Supervised Learning

Yaxin Hou, Yuheng Jia

Main category: cs.LG

TL;DR: The paper addresses long-tailed semi-supervised learning with distribution mismatch by proposing dynamic expert assignment and multi-depth feature fusion modules to improve pseudo-label quality and mitigate model bias.

DetailsMotivation: Existing methods for LTSSL under distribution mismatch fail to fully utilize the expertise of auxiliary classifiers, leading to suboptimal performance.

Method: Introduces dynamic expert assignment to estimate sample class membership and assign suitable experts, and multi-depth feature fusion to balance bias and discriminative ability.

Result: The method outperforms baselines on CIFAR-10-LT, STL-10-LT, and SVHN-LT datasets, showing improved generalization.

Conclusion: Dynamic expert assignment and feature fusion effectively address LTSSL challenges, supported by theoretical and empirical evidence.

Abstract: This paper studies long-tailed semi-supervised learning (LTSSL) with distribution mismatch, where the class distribution of the labeled training data follows a long-tailed distribution and mismatches with that of the unlabeled training data. Most existing methods introduce auxiliary classifiers (experts) to model various unlabeled data distributions and produce pseudo-labels, but the expertise of these experts is not fully utilized. We observe that different experts are good at predicting different intervals of samples, e.g., a long-tailed expert is skilled at samples in the head interval while a uniform expert excels at samples in the medium interval. Therefore, we propose a dynamic expert assignment module that can estimate the class membership (i.e., head, medium, or tail class) of samples, and dynamically assigns a suitable expert to each sample based on the estimated membership to produce high-quality pseudo-labels in the training phase and predictions in the testing phase. We also theoretically reveal that integrating different experts’ strengths leads to a smaller generalization error bound. Moreover, we find that deeper features are more biased toward the head class but more discriminative, while shallower features are less biased but also less discriminative. We therefore propose a multi-depth feature fusion module that utilizes features at different depths to mitigate the model bias. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR-10-LT, STL-10-LT, and SVHN-LT datasets across various settings.
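
A minimal sketch of dynamic expert assignment, assuming the class-membership estimate (head/medium/tail) is already available: each sample's logits are taken from the expert matched to its interval. The membership estimator and the experts themselves are stand-ins.

```python
import torch

def route_to_experts(expert_logits, membership):
    """expert_logits: (n_experts, batch, n_classes); membership: (batch,)
    with 0=head, 1=medium, 2=tail indexing the matching expert."""
    idx = membership.view(1, -1, 1).expand(1, -1, expert_logits.size(-1))
    return torch.gather(expert_logits, 0, idx).squeeze(0)

logits = torch.randn(3, 16, 10)            # long-tailed / uniform / inverse experts
membership = torch.randint(0, 3, (16,))    # estimated interval per sample
pseudo_logits = route_to_experts(logits, membership)
print(pseudo_logits.shape)                 # torch.Size([16, 10])
```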

[837] MMET: A Multi-Input and Multi-Scale Transformer for Efficient PDEs Solving

Yichen Luo, Jia Wang, Dapeng Lan, Yu Liu, Zhibo Pang

Main category: cs.LG

TL;DR: The paper introduces MMET, a transformer-based framework for solving PDEs efficiently, addressing multi-input and multi-scale challenges with reduced computational costs.

DetailsMotivation: Solving PDEs with machine learning is challenging due to generalization and computational issues.

Method: MMET decouples mesh and query points, uses GCE for embedding, and employs Hilbert curve-based reserialization to reduce input length.

Result: MMET outperforms SOTA methods in accuracy and efficiency across diverse benchmarks.

Conclusion: MMET is a scalable solution for real-time PDE solving, with potential for future domain-specific pre-trained models.

Abstract: Partial Differential Equations (PDEs) are fundamental for modeling physical systems, yet solving them in a generic and efficient manner using machine learning-based approaches remains challenging due to limited multi-input and multi-scale generalization capabilities, as well as high computational costs. This paper proposes the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework designed to address the above challenges. MMET decouples mesh and query points as two sequences and feeds them into the encoder and decoder, respectively, and uses a Gated Condition Embedding (GCE) layer to embed input variables or functions with varying dimensions, enabling effective solutions for multi-scale and multi-input problems. Additionally, a Hilbert curve-based reserialization and patch embedding mechanism decrease the input length. This significantly reduces the computational cost when dealing with large-scale geometric models. These innovations enable efficient representations and support multi-scale resolution queries for large-scale and multi-input PDE problems. Experimental evaluations on diverse benchmarks spanning different physical fields demonstrate that MMET outperforms SOTA methods in both accuracy and computational efficiency. This work highlights the potential of MMET as a robust and scalable solution for real-time PDE solving in engineering and physics-based applications, paving the way for future explorations into pre-trained large-scale models in specific domains. This work is open-sourced at https://github.com/YichenLuo-0/MMET.

[838] Wasserstein Barycenter Soft Actor-Critic

Zahra Shahrooei, Ali Baheri

Main category: cs.LG

TL;DR: WBSAC improves sample efficiency in reinforcement learning by combining pessimistic and optimistic policies using Wasserstein barycenter for directed exploration.

DetailsMotivation: Addressing poor sample efficiency in deep off-policy actor-critic algorithms, especially in sparse-reward environments.

Method: Proposes WBSAC, which uses Wasserstein barycenter of pessimistic and optimistic policies for exploration, adjusting exploration dynamically.

Result: WBSAC outperforms state-of-the-art off-policy actor-critic algorithms in MuJoCo tasks.

Conclusion: WBSAC offers a principled and effective exploration strategy for improving sample efficiency in continuous control domains.

Abstract: Deep off-policy actor-critic algorithms have emerged as the leading framework for reinforcement learning in continuous control domains. However, most of these algorithms suffer from poor sample efficiency, especially in environments with sparse rewards. In this paper, we take a step towards addressing this issue by providing a principled directed exploration strategy. We propose Wasserstein Barycenter Soft Actor-Critic (WBSAC) algorithm, which benefits from a pessimistic actor for temporal difference learning and an optimistic actor to promote exploration. This is achieved by using the Wasserstein barycenter of the pessimistic and optimistic policies as the exploration policy and adjusting the degree of exploration throughout the learning process. We compare WBSAC with state-of-the-art off-policy actor-critic algorithms and show that WBSAC is more sample-efficient on MuJoCo continuous control tasks.
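
For Gaussian policies the 2-Wasserstein barycenter has a closed form: means and standard deviations interpolate linearly. The sketch below builds the exploration policy from pessimistic and optimistic heads this way; WBSAC's actors and its schedule for the interpolation weight are not reproduced, and the numbers are illustrative.

```python
import torch
from torch.distributions import Normal

def gaussian_w2_barycenter(mu_a, std_a, mu_b, std_b, lam=0.5):
    """Barycenter of two diagonal Gaussians with weights (1 - lam, lam)."""
    return Normal((1 - lam) * mu_a + lam * mu_b,
                  (1 - lam) * std_a + lam * std_b)

pessimistic = (torch.tensor([0.0]), torch.tensor([0.5]))
optimistic = (torch.tensor([1.0]), torch.tensor([1.0]))
explore = gaussian_w2_barycenter(*pessimistic, *optimistic, lam=0.3)
action = explore.sample()                 # exploration action
```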

[839] Granular-Ball-Induced Multiple Kernel K-Means

Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang

Main category: cs.LG

TL;DR: The paper introduces a granular-ball computing approach to enhance multi-kernel clustering, improving efficiency and robustness by using adaptive ball-based data descriptions.

DetailsMotivation: Existing multi-kernel clustering methods face challenges in computational efficiency and robustness due to reliance on point-to-point relationships and complex kernel interactions.

Method: The authors propose granular-ball computing, introducing granular-ball kernels (GBK) and a granular-ball multi-kernel K-means framework (GB-MKKM) for clustering.

Result: GB-MKKM demonstrates superior efficiency and clustering performance in empirical evaluations.

Conclusion: Granular-ball computing effectively addresses limitations of traditional multi-kernel clustering, offering a more efficient and robust solution.

Abstract: Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can make it difficult to accurately capture a data set’s inherent structure and diversity. Additionally, the intricate interplay between multiple kernels in such algorithms can further exacerbate these issues, adversely impacting their ability to cluster data points in high-dimensional spaces. In this paper, we leverage granular-ball computing to improve the multi-kernel clustering framework. The core of granular-ball computing is to adaptively fit the data distribution with balls, from coarse to acceptable levels. Each ball can enclose data points based on a density consistency measurement. Such a ball-based data description thus improves computational efficiency and robustness to unknown noise. Specifically, based on granular-ball representations, we introduce the granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel K-means framework (GB-MKKM) for efficient clustering. Using granular-ball relationships in multiple kernel spaces, the proposed GB-MKKM framework shows its superiority in efficiency and clustering performance in the empirical evaluation of various clustering tasks.

[840] Robust Behavior Cloning Via Global Lipschitz Regularization

Shili Wu, Yizhao Jin, Puhua Niu, Aniruddha Datta, Sean B. Andersson

Main category: cs.LG

TL;DR: The paper proposes using global Lipschitz regularization to enhance the robustness of Behavior Cloning (BC) policies against observation perturbations, providing a theoretical robustness certificate and empirical validation.

DetailsMotivation: BC policies are vulnerable to measurement errors or adversarial disturbances in observations, leading to sub-optimal actions. Ensuring robustness is critical for safety-critical applications like autonomous vehicles.

Method: The authors employ global Lipschitz regularization to train a robust policy network, ensuring the policy’s resilience to bounded norm perturbations. They also propose a method to construct Lipschitz neural networks.

Result: The approach provides a robustness certificate for the policy and is empirically validated across various Gymnasium environments.

Conclusion: Lipschitz regularization effectively enhances BC policy robustness, making it more reliable in real-world deployments with noisy or adversarial observations.

Abstract: Behavior Cloning (BC) is an effective imitation learning technique and has even been adopted in some safety-critical domains such as autonomous vehicles. BC trains a policy to mimic the behavior of an expert by using a dataset composed of only state-action pairs demonstrated by the expert, without any additional interaction with the environment. However, during deployment, the policy observations may contain measurement errors or adversarial disturbances. Since the observations may deviate from the true states, they can mislead the agent into making sub-optimal actions. In this work, we use a global Lipschitz regularization approach to enhance the robustness of the learned policy network. We then show that the resulting global Lipschitz property provides a robustness certificate to the policy with respect to different bounded-norm perturbations. Then, we propose a way to construct a Lipschitz neural network that ensures policy robustness. We empirically validate our theory across various environments in Gymnasium. Keywords: Robust Reinforcement Learning; Behavior Cloning; Lipschitz Neural Network
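
One standard way to obtain a network with a certified global Lipschitz bound is to spectrally normalize every linear layer and use 1-Lipschitz activations, so the product of per-layer constants bounds the whole map; the paper's exact construction may differ. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

def lipschitz_policy(obs_dim, act_dim, hidden=64):
    dims = [obs_dim, hidden, hidden, act_dim]
    layers = []
    for i in range(len(dims) - 1):
        # Spectral norm constrains each layer's Lipschitz constant to ~1.
        layers.append(spectral_norm(nn.Linear(dims[i], dims[i + 1])))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())      # ReLU is 1-Lipschitz
    return nn.Sequential(*layers)         # product of layer constants bounds the net

policy = lipschitz_policy(obs_dim=8, act_dim=2)
print(policy(torch.randn(1, 8)))
```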

[841] PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning

Xingke Yang, Liang Li, Zhiyi Wan, Sicong Li, Xiaoqi Qi, Jiang Liu, Tomoaki Ohtsuki, Xin Fu, Miao Pan

Main category: cs.LG

TL;DR: PAE MobiLLM is a privacy-aware and efficient method for fine-tuning large language models on mobile devices, reducing communication costs and protecting data privacy.

DetailsMotivation: Addressing the gap between mobile device resource limitations and the demand for on-device LLM fine-tuning, while mitigating privacy and communication issues in server-assisted methods.

Method: Uses server-assisted additive side-tuning with activation caching, activation shortcuts, and additive adapter side-network design to improve efficiency and privacy.

Result: Demonstrates superior performance in reducing communication costs and protecting data privacy while enabling efficient LLM fine-tuning.

Conclusion: PAE MobiLLM effectively bridges the gap for on-device LLM fine-tuning with enhanced privacy and efficiency.

Abstract: There is a huge gap between the numerous intriguing applications fostered by on-device large language model (LLM) fine-tuning (FT) from fresh mobile data and the limited resources of a mobile device. While existing server-assisted methods (e.g., split learning or side-tuning) may enable LLM FT on the local mobile device, they suffer from heavy communication burdens of activation transmissions, and may disclose data and labels to the server. To address those issues, we develop PAE MobiLLM, a privacy-aware and efficient LLM FT method which can be deployed on the mobile device via server-assisted additive side-tuning. To further accelerate FT convergence and improve computing efficiency, PAE MobiLLM integrates activation caching on the server side, which allows the server to reuse historical activations and saves the mobile device from repeatedly computing forward passes for recurring data samples. Besides, to reduce communication cost, PAE MobiLLM develops an activation shortcut that transmits only the token involved in the loss calculation instead of full activation matrices to guide the side-network tuning. Last but not least, PAE MobiLLM introduces an additive adapter side-network design which makes the server train the adapter modules based on device-defined prediction differences rather than raw ground-truth labels. In this way, the server can only assist device-defined side-network computing, and learns nothing about data and labels. Extensive experimental results demonstrate PAE MobiLLM’s superiority.

[842] Bridging the Last Mile of Prediction: Enhancing Time Series Forecasting with Conditional Guided Flow Matching

Huibo Xu, Runlong Yu, Likang Wu, Xianquan Wang, Qi Liu

Main category: cs.LG

TL;DR: CGFM improves time series forecasting by integrating auxiliary model predictions and leveraging residual patterns, outperforming existing methods.

DetailsMotivation: Existing generative models fail to capture temporal dependencies and residual patterns, limiting accuracy.

Method: CGFM extends flow matching by using auxiliary model outputs, historical data as conditions, and affine paths to enhance learning.

Result: CGFM consistently outperforms state-of-the-art models in experiments.

Conclusion: CGFM advances forecasting by refining predictions and preserving temporal consistency.

Abstract: Existing generative models for time series forecasting often transform simple priors (typically Gaussian) into complex data distributions. However, their sampling initialization, independent of historical data, hinders the capture of temporal dependencies, limiting predictive accuracy. They also treat residuals merely as optimization targets, ignoring that residuals often exhibit meaningful patterns like systematic biases or nontrivial distributional structures. To address these, we propose Conditional Guided Flow Matching (CGFM), a novel model-agnostic framework that extends flow matching by integrating outputs from an auxiliary predictive model. This enables learning from the probabilistic structure of prediction residuals, leveraging the auxiliary model’s prediction distribution as a source to reduce learning difficulty and refine forecasts. CGFM incorporates historical data as both conditions and guidance, uses two-sided conditional paths (with source and target conditioned on the same history), and employs affine paths to expand the path space, avoiding path crossing without complex mechanisms, preserving temporal consistency, and strengthening distribution alignment. Experiments across datasets and baselines show CGFM consistently outperforms state-of-the-art models, advancing forecasting.
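
A minimal sketch of conditional flow matching with an affine path whose source is an auxiliary model's forecast and whose target is the ground truth, with the same history conditioning both ends, as the abstract describes. The velocity network, dimensions, and random stand-in data are illustrative.

```python
import torch
import torch.nn as nn

horizon = 24
vel_net = nn.Sequential(nn.Linear(horizon + 1 + horizon, 128), nn.ReLU(),
                        nn.Linear(128, horizon))

def cgfm_style_loss(x_src, x_tgt, history):
    t = torch.rand(x_src.size(0), 1)
    x_t = (1 - t) * x_src + t * x_tgt          # affine path between the two ends
    target_v = x_tgt - x_src                   # constant velocity along that path
    v = vel_net(torch.cat([x_t, t, history], dim=-1))
    return ((v - target_v) ** 2).mean()

x_src = torch.randn(32, horizon)               # auxiliary model's forecast (source)
x_tgt = torch.randn(32, horizon)               # ground-truth future (target)
history = torch.randn(32, horizon)             # shared conditioning history
cgfm_style_loss(x_src, x_tgt, history).backward()
```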

[843] ZClassifier: Temperature Tuning and Manifold Approximation via KL Divergence on Logit Space

Shim Soon Yong

Main category: cs.LG

TL;DR: ZClassifier replaces deterministic logits with Gaussian-distributed ones, improving robustness, calibration, and latent separation in classification tasks.

DetailsMotivation: To address temperature scaling and manifold approximation in classification by unifying uncertainty calibration and latent control probabilistically.

Method: Uses diagonal Gaussian-distributed logits, minimizing KL divergence between predicted Gaussians and a unit isotropic Gaussian.

Result: Outperforms softmax classifiers on CIFAR-10 and CIFAR-100 in robustness, calibration, and latent separation.

Conclusion: ZClassifier provides a principled probabilistic framework for classification, enhancing interpretability and performance.

Abstract: We introduce a novel classification framework, ZClassifier, that replaces conventional deterministic logits with diagonal Gaussian-distributed logits. Our method simultaneously addresses temperature scaling and manifold approximation by minimizing the KL divergence between the predicted Gaussian distributions and a unit isotropic Gaussian. This unifies uncertainty calibration and latent control in a principled probabilistic manner, enabling a natural interpretation of class confidence and geometric consistency. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ZClassifier improves over softmax classifiers in robustness, calibration, and latent separation, with consistent benefits across small-scale and large-scale classification settings.
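
The KL divergence between a diagonal Gaussian and the unit isotropic Gaussian has a standard closed form, so ZClassifier's regularizer can be written in a few lines; the classifier network around it is omitted, and the reparameterized sampling shown is an assumption about how logits would be drawn.

```python
import torch

def gaussian_logit_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch."""
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=-1).mean()

mu, log_var = torch.randn(32, 10), torch.randn(32, 10)      # per-class Gaussians
logits = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterized draw
kl_term = gaussian_logit_kl(mu, log_var)     # added to the classification loss
```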

[844] A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model

Andris Ambainis, Joao F. Doriguello, Debbie Lim

Main category: cs.LG

TL;DR: Novel classical and quantum online algorithms for learning MDPs, leveraging hybrid exploration-generative RL and quantum advantages for improved regret bounds.

DetailsMotivation: To improve regret bounds in RL for MDPs by avoiding traditional paradigms like optimism and posterior sampling, and leveraging quantum algorithms for better performance.

Method: Hybrid exploration-generative RL model with classical and quantum algorithms for approximating optimal policies under a generative model.

Result: Quantum algorithms achieve logarithmic regret in finite-horizon MDPs and poly-logarithmic regret in infinite-horizon MDPs, outperforming classical bounds.

Conclusion: The proposed methods offer significant improvements in regret bounds, especially for quantum algorithms, and generalize to compact state spaces.

Abstract: We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning (RL) model wherein the agent can, from time to time, freely interact with the environment in a generative sampling fashion, i.e., by having access to a “simulator”. By employing known classical and new quantum algorithms for approximating optimal policies under a generative model within our learning algorithms, we show that it is possible to avoid several paradigms from RL like “optimism in the face of uncertainty” and “posterior sampling” and instead compute and use optimal policies directly, which yields better regret bounds compared to previous works. For finite-horizon MDPs, our quantum algorithms obtain regret bounds which only depend logarithmically on the number of time steps $T$, thus breaking the $O(\sqrt{T})$ classical barrier. This matches the time dependence of the prior quantum works of Ganguly et al. (arXiv'23) and Zhong et al. (ICML'24), but with improved dependence on other parameters like state space size $S$ and action space size $A$. For infinite-horizon MDPs, our classical and quantum bounds still maintain the $O(\sqrt{T})$ dependence but with better $S$ and $A$ factors. Nonetheless, we propose a novel measure of regret for infinite-horizon MDPs with respect to which our quantum algorithms have $\operatorname{poly}\log{T}$ regret, exponentially better compared to classical algorithms. Finally, we generalise all of our results to compact state spaces.

[845] Multi-Treatment-DML: Causal Estimation for Multi-Dimensional Continuous Treatments with Monotonicity Constraints in Personal Loan Risk Optimization

Kexin Zhao, Bo Wang, Cuiying Zhao, Tongyao Wan

Main category: cs.LG

TL;DR: The paper introduces Multi-Treatment-DML, a framework using Double Machine Learning to optimize continuous, multi-dimensional loan treatments (credit limits, interest rates, etc.) while addressing biases in observational data and enforcing domain-specific monotonic constraints.

DetailsMotivation: Existing causal methods fail to handle continuous, multi-dimensional treatments in loan optimization, and observational data is often biased. Financial domain also requires provably monotonic treatment-outcome relationships.

Method: Proposes Multi-Treatment-DML, leveraging Double Machine Learning to debias observational data, handle continuous treatments, and enforce monotonic constraints.

Result: Demonstrates effectiveness on public benchmarks and real-world datasets, with online A/B testing confirming practical superiority in loan operations.

Conclusion: Multi-Treatment-DML successfully addresses gaps in existing methods, proving effective for real-world loan optimization.

Abstract: Optimizing credit limits, interest rates, and loan terms is crucial for managing borrower risk and lifetime value (LTV) in personal loan platforms. However, counterfactual estimation of these continuous, multi-dimensional treatments faces significant challenges: randomized trials are often prohibited by risk controls and long repayment cycles, forcing reliance on biased observational data. Existing causal methods primarily handle binary/discrete treatments and struggle with continuous, multi-dimensional settings. Furthermore, financial domain knowledge mandates provably monotonic treatment-outcome relationships (e.g., risk increases with credit limit). To address these gaps, we propose Multi-Treatment-DML, a novel framework leveraging Double Machine Learning (DML) to: (i) debias observational data for causal effect estimation; (ii) handle arbitrary-dimensional continuous treatments; and (iii) enforce monotonic constraints between treatments and outcomes, guaranteeing adherence to domain requirements. Extensive experiments on public benchmarks and real-world industrial datasets demonstrate the effectiveness of our approach. Furthermore, online A/B testing conducted on a real-world personal loan platform confirms the practical superiority of Multi-Treatment-DML in real-world loan operations.
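
A minimal sketch of the double machine learning "partialling out" step that Multi-Treatment-DML builds on, for a single continuous treatment: residualize outcome and treatment on covariates with cross-fitting, then regress residual on residual. The paper's multi-dimensional treatments and monotonicity constraints are omitted; the data here are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                  # borrower covariates
T = X[:, 0] + rng.normal(size=2000)             # continuous treatment (e.g. limit)
Y = 0.7 * T + X[:, 1] + rng.normal(size=2000)   # outcome; true effect is 0.7

# Cross-fitted nuisance estimates, then residual-on-residual regression.
res_T = T - cross_val_predict(RandomForestRegressor(), X, T, cv=5)
res_Y = Y - cross_val_predict(RandomForestRegressor(), X, Y, cv=5)
effect = LinearRegression().fit(res_T.reshape(-1, 1), res_Y).coef_[0]
print(f"estimated effect ~ {effect:.2f}")
```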

[846] BoostTransformer: Enhancing Transformer Models with Subgrid Selection and Importance Sampling

Biyi Fang, Jean Utke, Truong Vo, Diego Klabjan

Main category: cs.LG

TL;DR: BoostTransformer enhances transformers with boosting principles for efficiency and performance, achieving faster convergence and higher accuracy in text classification.

DetailsMotivation: To address the computational and hyperparameter challenges of standard transformers.

Method: Incorporates boosting principles via subgrid token selection and importance-weighted sampling, with a least square boosting objective.

Result: Demonstrates faster convergence and higher accuracy on fine-grained text classification benchmarks.

Conclusion: BoostTransformer outperforms standard transformers while reducing architectural search overhead.

Abstract: Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.

[847] HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

Mengting Pan, Fan Li, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin

Main category: cs.LG

TL;DR: HiTeC introduces a two-stage hierarchical contrastive learning framework for text-attributed hypergraphs, addressing limitations of prior methods with semantic-aware augmentation and multi-scale contrastive loss.

DetailsMotivation: Existing CL methods for hypergraphs overlook textual information and suffer from noise, suboptimal representations, and scalability issues.

Method: HiTeC uses a two-stage approach: (1) structure-aware contrastive pretraining of the text encoder, and (2) semantic-aware augmentation and multi-scale contrastive loss.

Result: HiTeC improves scalability and representation quality, outperforming prior methods.

Conclusion: HiTeC effectively addresses the limitations of CL for text-attributed hypergraphs, offering a scalable and high-quality solution.

Abstract: Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which is overlooked in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders overlooks the correlations between textual content and hypergraph topology, resulting in suboptimal representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive objective. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for expressive representation learning. Although HyperBERT pioneers CL on TAHGs, its co-training paradigm suffers from poor scalability. To fill the research gap, we introduce HiTeC, a two-stage hierarchical contrastive learning framework with semantic-aware augmentation for scalable and effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we introduce two semantic-aware augmentation strategies, including prompt-enhanced text augmentation and semantic-aware hyperedge drop, to facilitate informative view generation. Furthermore, we propose a multi-scale contrastive loss that extends existing objectives with an $s$-walk-based subgraph-level contrast to better capture long-range dependencies. By decoupling text encoder pretraining from hypergraph contrastive learning, this two-stage design enhances scalability without compromising representation quality. Extensive experiments confirm the effectiveness of HiTeC.

[848] Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms

Jie Xiao, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai, Shaoduo Gan

Main category: cs.LG

TL;DR: Echo decouples RL-based post-training for LLMs into inference and training phases, using lightweight sync protocols to maintain efficiency and performance on heterogeneous hardware.

DetailsMotivation: Current RL systems for LLMs co-locate trajectory sampling and policy optimization, violating SPMD assumptions and causing inefficiencies.

Method: Echo introduces sequential pull and asynchronous push-pull synchronization protocols to decouple phases across heterogeneous swarms.

Result: Echo matches co-located baselines in convergence and reward while utilizing decentralized, commodity hardware.

Conclusion: Echo enables large-scale RL for LLMs with datacentre-grade performance using heterogeneous resources.

Abstract: Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today’s distributed training systems. We present Echo, an RL system that cleanly decouples these two phases across heterogeneous “inference” and “training” swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes policy weights on each API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training three representative RL workloads with Qwen3-4B, Qwen2.5-7B and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs can achieve datacentre-grade performance using decentralised, heterogeneous resources.
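
A minimal sketch of the asynchronous push-pull idea: inference workers push version-tagged rollouts into a replay buffer, and the trainer pulls only those fresh enough to use. The staleness rule, in-process queue, and data shapes are illustrative; Echo's actual protocol operates across distributed swarms.

```python
import queue
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int
    trajectory: list

buffer: "queue.Queue[Rollout]" = queue.Queue(maxsize=1024)

def push_rollout(trajectory, version):                 # inference-swarm side
    buffer.put(Rollout(version, trajectory))

def pull_batch(current_version, max_staleness=2, batch_size=4):  # trainer side
    batch = []
    while len(batch) < batch_size and not buffer.empty():
        r = buffer.get()
        if current_version - r.policy_version <= max_staleness:
            batch.append(r)                            # fresh enough to train on
    return batch

push_rollout(["s0", "a0", "r0"], version=7)
print(pull_batch(current_version=8))
```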

[849] Diagrams-to-Dynamics (D2D): Exploring Causal Loop Diagram Leverage Points under Uncertainty

Jeroen F. Uleman, Loes Crielaard, Leonie K. Elsenburg, Guido A. Veldhuis, Karien Stronks, Naja Hulvej Rod, Rick Quax, Vítor V. Vasconcelos

Main category: cs.LG

TL;DR: D2D converts causal loop diagrams (CLDs) into exploratory system dynamics models (SDMs) for dynamic analysis, outperforming network centrality analysis and aiding in identifying leverage points.

DetailsMotivation: CLDs are limited in dynamic analysis and intervention strategy support, and quantitative methods like network centrality analysis often lead to false inferences.

Method: D2D transforms CLDs into SDMs using structural information (link existence and polarity) with minimal user input (labeling variables as stocks, flows, auxiliaries, or constants).

Result: D2D distinguishes high- and low-ranked leverage points, shows consistency with data-driven models, and provides uncertainty estimates.

Conclusion: D2D is a promising tool for dynamic modeling, implemented in an open-source Python package and web app, with potential for broader validation and application.

Abstract: Causal loop diagrams (CLDs) are widely used in health and environmental research to represent hypothesized causal structures underlying complex problems. However, as qualitative and static representations, CLDs are limited in their ability to support dynamic analysis and inform intervention strategies. Additionally, quantitative CLD analysis methods like network centrality analysis often lead to false inference. We propose Diagrams-to-Dynamics (D2D), a method for converting CLDs into exploratory system dynamics models (SDMs) in the absence of empirical data. With minimal user input - following a protocol to label variables as stocks, flows, auxiliaries, or constants - D2D leverages the structural information already encoded in CLDs, namely, link existence and polarity, to simulate hypothetical interventions and explore potential leverage points under uncertainty. Results suggest that D2D helps distinguish between high- and low-ranked leverage points. We compare D2D to a data-driven SDM constructed from the same CLD and variable labels. D2D showed greater consistency with the data-driven model than network centrality analysis, while providing uncertainty estimates and guidance for future data collection. The method is implemented in an open-source Python package and a web-based application to support further testing and lower the barrier to dynamic modeling for researchers working with CLDs. We expect additional validation will further establish the approach’s utility across a broad range of cases and domains.
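
The core idea, simulating a CLD's links and polarities under sampled weights, can be sketched in a few lines. The variable names, the linear rate equation, and the intervention below are illustrative assumptions; D2D's actual protocol and equations may differ.

```python
import numpy as np

# A toy CLD: directed links with polarity (+1 reinforcing, -1 opposing).
links = [("stress", "sleep", -1), ("sleep", "health", +1), ("health", "stress", -1)]
variables = ["stress", "sleep", "health"]
idx = {v: i for i, v in enumerate(variables)}

def simulate(weights, x0, steps=100, dt=0.05):
    """Euler-integrate dx_dst/dt = sum over links of polarity * weight * x_src."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        dx = np.zeros_like(x)
        for (src, dst, pol), w in zip(links, weights):
            dx[idx[dst]] += pol * w * x[idx[src]]
        x = x + dt * dx
    return x

# Explore a hypothetical intervention (boosting "sleep") under weight uncertainty,
# ranking it by its average effect on "health" across sampled weightings.
rng = np.random.default_rng(0)
effects = []
for _ in range(100):
    w = rng.uniform(0.1, 1.0, size=len(links))
    base = simulate(w, [1.0, 1.0, 1.0])
    bumped = simulate(w, [1.0, 1.5, 1.0])
    effects.append(bumped[idx["health"]] - base[idx["health"]])
print(f"mean effect on health: {np.mean(effects):.3f} +/- {np.std(effects):.3f}")
```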

[850] End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation

Anurag Tripathi, Vaibhav Patle, Abhinav Jain, Ayush Pundir, Sairam Menon, Ajeet Kumar Singh, Dorien Herremans

Main category: cs.LG

TL;DR: A three-stage text-to-SQL framework improves database intent prediction and SQL generation by leveraging LLMs, prompt engineering, and critic agents.

DetailsMotivation: Traditional text-to-SQL methods assume a pre-specified database, which is problematic for multiple extensive databases. Identifying the correct database is crucial but often overlooked.

Method: Proposes a three-stage framework: 1) Uses LLMs and prompt engineering to extract rules from NLQs, 2) Trains a RoBERTa-based model for db_id prediction, 3) Refines SQL with critic agents.

Result: Outperforms state-of-the-art models in database intent prediction and SQL generation accuracy.

Conclusion: The framework effectively addresses the challenge of identifying the correct database and improves SQL generation.

Abstract: Text-to-SQL bridges the gap between natural language and structured database language, thus allowing non-technical users to easily query databases. Traditional approaches model text-to-SQL as a direct translation task, where a given Natural Language Query (NLQ) is mapped to an SQL command. Recent advances in large language models (LLMs) have significantly improved translation accuracy; however, these methods all require that the target database is pre-specified. This becomes problematic in scenarios with multiple extensive databases, where identifying the correct database becomes a crucial yet overlooked step. In this paper, we propose a three-stage end-to-end text-to-SQL framework to identify the user’s intended database before generating SQL queries. Our approach leverages LLMs and prompt engineering to extract implicit information from natural language queries (NLQs) in the form of a ruleset. We then train a large db_id prediction model, which includes a fine-tuned RoBERTa-based encoder, to predict the correct database identifier (db_id) based on both the NLQ and the LLM-generated rules. Finally, we refine the generated SQL by using critic agents to correct errors. Experimental results demonstrate that our framework outperforms the current state-of-the-art models in both database intent prediction and SQL generation accuracy.
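
The db_id prediction step can be sketched as sequence-pair classification over the NLQ and the LLM-extracted rules. The model checkpoint, number of databases, and input format below are illustrative assumptions, not the paper's actual setup.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_DATABASES = 20  # assumed number of candidate databases

# A RoBERTa encoder classifying (NLQ, rules) pairs into db_id indices.
tok = AutoTokenizer.from_pretrained("roberta-base")
clf = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_DATABASES)

nlq = "How many flights left JFK after 6pm yesterday?"
rules = "domain: aviation; entities: airport, departure_time"  # hypothetical LLM output
inputs = tok(nlq, rules, return_tensors="pt", truncation=True)  # encoded as a text pair
db_id = clf(**inputs).logits.argmax(dim=-1).item()
print(f"predicted db_id index: {db_id}")  # untrained here; fine-tuning is required
```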

[851] A New Lens on Homelessness: Daily Tent Monitoring with 311 Calls and Street Images

Wooyong Jung, Sola Kim, Dongwook Kim, Maryam Tabar, Dongwon Lee

Main category: cs.LG

TL;DR: The study introduces a novel method using crowdsourced data (311 Service Calls and street-level imagery) to track and forecast homeless tent trends in San Francisco, offering more detailed and timely insights than traditional point-in-time counts.

DetailsMotivation: Existing homelessness monitoring methods like PIT counts lack frequency, consistency, and spatial detail, limiting their effectiveness for policy and intervention evaluation.

Method: The study leverages publicly available, crowdsourced data (311 Service Calls and street-level imagery) to develop a predictive model for tracking and forecasting homeless tent trends at daily and neighborhood levels.

Result: The model captures fine-grained variations, revealing rapid fluctuations during COVID-19 and spatial shifts in tent locations, which traditional methods often miss.

Conclusion: This approach provides timely, localized, and cost-effective data, enhancing policy responses and intervention evaluations for unsheltered homelessness.

Abstract: Homelessness in the United States has surged to levels unseen since the Great Depression. However, existing methods for monitoring it, such as point-in-time (PIT) counts, have limitations in terms of frequency, consistency, and spatial detail. This study proposes a new approach using publicly available, crowdsourced data, specifically 311 Service Calls and street-level imagery, to track and forecast homeless tent trends in San Francisco. Our predictive model captures fine-grained daily and neighborhood-level variations, uncovering patterns that traditional counts often overlook, such as rapid fluctuations during the COVID-19 pandemic and spatial shifts in tent locations over time. By providing more timely, localized, and cost-effective information, this approach serves as a valuable tool for guiding policy responses and evaluating interventions aimed at reducing unsheltered homelessness.

cs.MA

[852] Energy Efficient Task Offloading in UAV-Enabled MEC Using a Fully Decentralized Deep Reinforcement Learning Approach

Hamidreza Asadian-Rad, Hossein Soleimani, Shahrokh Farahmand

Main category: cs.MA

TL;DR: The paper proposes a decentralized approach using deep reinforcement learning (DRL) for UAV trajectory and user assignment in MEC, addressing challenges of non-convex optimization, user mobility, and centralized bottlenecks.

DetailsMotivation: To overcome the limitations of centralized and semi-centralized methods in UAV-assisted MEC, such as communication overhead, lack of scalability, and robustness issues.

Method: A fully decentralized setup where UAVs use local observations and neighbor communications, employing Graph Attention Layers (GAT) and Experience and Parameter Sharing Proximal Policy Optimization (EPS-PPO).

Result: The proposed method outperforms existing MADDPG and IPPO algorithms, achieving better performance with local communications only.

Conclusion: The decentralized DRL approach is effective for UAV trajectory optimization in MEC, offering scalability, flexibility, and robustness.

Abstract: Unmanned aerial vehicles (UAVs) have recently been utilized in multi-access edge computing (MEC) as edge servers. It is desirable to design UAVs’ trajectories and user-to-UAV assignments to ensure satisfactory service to the users and energy-efficient operation simultaneously. The posed optimization problem is challenging to solve because: (i) The formulated problem is non-convex, (ii) Due to the mobility of ground users, their future positions and channel gains are not known in advance, (iii) Local UAVs’ observations should be communicated to a central entity that solves the optimization problem. The (semi-) centralized processing leads to communication overhead, communication/processing bottlenecks, lack of flexibility and scalability, and loss of robustness to system failures. To simultaneously address all these limitations, we advocate a fully decentralized setup with no centralized entity. Each UAV obtains its local observation and then communicates with its immediate neighbors only. After sharing information with neighbors, each UAV determines its next position via a locally run deep reinforcement learning (DRL) algorithm. None of the UAVs need to know the global communication graph. Two main components of our proposed solution are (i) Graph attention layers (GAT), and (ii) Experience and parameter sharing proximal policy optimization (EPS-PPO). Our proposed approach eliminates all the limitations of semi-centralized MADRL methods such as MAPPO and MA deep deterministic policy gradient (MADDPG), while guaranteeing better performance than independent local DRLs such as IPPO. Numerical results reveal notable performance gains in several different criteria compared to the existing MADDPG algorithm, demonstrating the potential to offer better performance while utilizing only local communications.

[853] A Survey on Agentic Service Ecosystems: Measurement, Analysis, and Optimization

Xuwen Zhang, Xiao Xue, Xia Xie, Qun Ma, Xiangning Yu, Deyu Zhou, Yifan Wang, Ming Zhang

Main category: cs.MA

TL;DR: The paper proposes a framework to analyze swarm intelligence emergence in Agentic Service Ecosystems, addressing gaps in current research.

DetailsMotivation: Traditional methods fail to capture the complexity of autonomous agents in ecosystems, necessitating a new approach.

Method: A three-step framework (measurement, analysis, optimization) is introduced to study swarm intelligence emergence.

Result: The framework offers theoretical support and practical methods for optimizing agentic ecosystems.

Conclusion: The proposed approach addresses research gaps and provides actionable insights for real-world applications.

Abstract: The Agentic Service Ecosystem consists of heterogeneous autonomous agents (e.g., intelligent machines, humans, and human-machine hybrid systems) that interact through resource exchange and service co-creation. These agents, with distinct behaviors and motivations, exhibit autonomous perception, reasoning, and action capabilities, which increase system complexity and make traditional linear analysis methods inadequate. Swarm intelligence, characterized by decentralization, self-organization, emergence, and dynamic adaptability, offers a novel theoretical lens and methodology for understanding and optimizing such ecosystems. However, current research, owing to fragmented perspectives and cross-ecosystem differences, fails to comprehensively capture the complexity of swarm-intelligence emergence in agentic contexts. The lack of a unified methodology further limits the depth and systematic treatment of the research. This paper proposes a framework for analyzing the emergence of swarm intelligence in Agentic Service Ecosystems, with three steps: measurement, analysis, and optimization, to reveal the cyclical mechanisms and quantitative criteria that foster emergence. By reviewing existing technologies, the paper analyzes their strengths and limitations, identifies unresolved challenges, and shows how this framework provides both theoretical support and actionable methods for real-world applications.

[854] Retrieval-Augmented Multi-Agent System for Rapid Statement of Work Generation

Amulya Suravarjhula, Rashi Chandrashekhar Agrawal, Sakshi Jayesh Patel, Rahul Gupta

Main category: cs.MA

TL;DR: An AI-driven system automates drafting Statements of Work (SOW), improving speed, accuracy, and customization compared to manual methods.

DetailsMotivation: Manual SOW drafting is slow, complex, and error-prone, necessitating a more efficient solution.

Method: The system uses three AI agents: one drafts, one checks legal compliance, and one handles formatting.

Result: The system drafts SOWs in under three minutes with high accuracy and quality, tested on real business cases.

Conclusion: AI can streamline legal and business processes, reducing risks and saving time.

Abstract: Drafting a Statement of Work (SOW) is a vital part of business and legal projects. It outlines key details like deliverables, timelines, responsibilities, and legal terms. However, creating these documents is often a slow and complex process. It usually involves multiple people, takes several days, and leaves room for errors or outdated content. This paper introduces a new AI-driven automation system that makes the entire SOW drafting process faster, easier, and more accurate. Instead of relying completely on humans, the system uses three intelligent components or ‘agents’ that each handle a part of the job. One agent writes the first draft, another checks if everything is legally correct, and the third agent formats the document and ensures everything is in order. Unlike basic online tools that just fill in templates, this system understands the meaning behind the content and customizes the SOW to match the needs of the project. It also checks legal compliance and formatting so that users can trust the result. The system was tested using real business examples. It was able to create a full SOW in under three minutes, compared to several hours or days using manual methods. It also performed well in accuracy and quality, showing that it can reduce legal risks and save a lot of time. This solution shows how artificial intelligence can be used to support legal and business professionals by taking care of routine work and helping them focus on more important decisions. It’s a step toward making legal processes smarter, faster, and more reliable.
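
The three-agent pipeline can be sketched as a simple composition of roles. The `call_llm` function, prompts, and agent split below are hypothetical stand-ins for whatever LLM backend and prompt design the system actually uses.

```python
def call_llm(system_prompt: str, user_content: str) -> str:
    """Hypothetical LLM client; plug in an actual API call here."""
    raise NotImplementedError("connect your LLM endpoint")

def draft_agent(project_brief: str) -> str:
    return call_llm("You draft Statements of Work: deliverables, timelines, "
                    "responsibilities, and legal terms.", project_brief)

def compliance_agent(draft: str) -> str:
    return call_llm("You are a legal reviewer. Flag and fix non-compliant or "
                    "outdated clauses; return the corrected SOW.", draft)

def formatting_agent(reviewed: str) -> str:
    return call_llm("Format this SOW with numbered sections and consistent "
                    "headings; do not change its meaning.", reviewed)

def generate_sow(project_brief: str) -> str:
    # Draft -> legal compliance check -> formatting, as described above.
    return formatting_agent(compliance_agent(draft_agent(project_brief)))
```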

[855] Toward Goal-Oriented Communication in Multi-Agent Systems: An overview

Themistoklis Charalambous, Nikolaos Pappas, Nikolaos Nomikos, Risto Wichman

Main category: cs.MA

TL;DR: A survey on goal-oriented communication in multi-agent systems (MAS), bridging information theory, communication theory, and machine learning, with a focus on task relevance and applications like swarm robotics and federated learning.

DetailsMotivation: Efficient communication under resource constraints in MAS is critical, but traditional paradigms overlook task relevance. Goal-oriented communication addresses this gap by prioritizing information importance for shared objectives.

Method: The paper reviews foundational concepts, learning-based approaches, and emergent protocols in goal-oriented communication, with emphasis on coordination under constraints.

Result: The survey highlights applications in swarm robotics, federated learning, and edge computing, showcasing the effectiveness of goal-oriented communication.

Conclusion: Open challenges and future research directions are identified at the intersection of communication theory, machine learning, and multi-agent decision making.

Abstract: As multi-agent systems (MAS) become increasingly prevalent in autonomous systems, distributed control, and edge intelligence, efficient communication under resource constraints has emerged as a critical challenge. Traditional communication paradigms often emphasize message fidelity or bandwidth optimization, overlooking the task relevance of the exchanged information. In contrast, goal-oriented communication prioritizes the importance of information with respect to the agents’ shared objectives. This review provides a comprehensive survey of goal-oriented communication in MAS, bridging perspectives from information theory, communication theory, and machine learning. We examine foundational concepts alongside learning-based approaches and emergent protocols. Special attention is given to coordination under communication constraints, as well as applications in domains such as swarm robotics, federated learning, and edge computing. The paper concludes with a discussion of open challenges and future research directions at the intersection of communication theory, machine learning, and multi-agent decision making.

[856] Multi-agent systems for chemical engineering: A review and perspective

Sophia Rupprecht, Qinghe Gao, Tanuj Karia, Artur M. Schweidtmann

Main category: cs.MA

TL;DR: LLM-based multi-agent systems (MASs) are transforming chemical engineering by breaking down complex workflows into collaborative agents. This review highlights current advancements, challenges, and future opportunities.

DetailsMotivation: To explore the potential of MASs in revolutionizing chemical engineering workflows through specialized, collaborative agents.

Method: Survey of state-of-the-art MAS applications in chemical engineering, identifying key advancements and challenges.

Result: Early studies show promise, but challenges like tailored architectures, data integration, and safety remain.

Conclusion: MASs present exciting opportunities to rethink chemical engineering, though further research is needed to address existing challenges.

Abstract: Large language model (LLM)-based multi-agent systems (MASs) are a recent but rapidly evolving technology with the potential to transform chemical engineering by decomposing complex workflows into teams of collaborative agents with specialized knowledge and tools. This review surveys the state-of-the-art of MAS within chemical engineering. While early studies demonstrate promising results, scientific challenges remain, including the design of tailored architectures, integration of heterogeneous data modalities, development of foundation models with domain-specific modalities, and strategies for ensuring transparency, safety, and environmental impact. As a young but fast-moving field, MAS research offers exciting opportunities to rethink chemical engineering workflows.

[857] AI-Generated Compromises for Coalition Formation

Eyal Briman, Ehud Shapiro, Nimrod Talmon

Main category: cs.MA

TL;DR: The paper addresses the challenge of finding compromise proposals in coalition formation by formalizing a model with bounded rationality and uncertainty, using NLP and large language models to suggest supported compromises in collaborative document writing.

DetailsMotivation: The need to identify majority-supported compromise proposals in coalition formation, especially in collaborative tasks like democratic document drafting, where traditional tools fall short.

Method: Formalizes a model incorporating bounded rationality and uncertainty, uses NLP and large language models to create a semantic metric space over text, and designs algorithms for compromise proposals.

Result: AI methods effectively generate compromise proposals, facilitating large-scale democratic text editing, as demonstrated in simulations.

Conclusion: AI can successfully address the challenge of finding compromise proposals in coalition formation, particularly in collaborative document writing.

Abstract: The challenge of finding compromises between agent proposals is fundamental to AI subfields such as argumentation, mediation, and negotiation. Building on this tradition, Elkind et al. (2021) introduced a process for coalition formation that seeks majority-supported proposals preferable to the status quo, using a metric space where each agent has an ideal point. A crucial step in this process involves identifying compromise proposals around which agent coalitions can unite. How to effectively find such compromise proposals remains an open question. We address this gap by formalizing a model that incorporates agent bounded rationality and uncertainty, and by developing AI methods to generate compromise proposals. We focus on the domain of collaborative document writing, such as the democratic drafting of a community constitution. Our approach uses natural language processing techniques and large language models to induce a semantic metric space over text. Based on this space, we design algorithms to suggest compromise points likely to receive broad support. To evaluate our methods, we simulate coalition formation processes and show that AI can facilitate large-scale democratic text editing, a domain where traditional tools are limited.
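
A minimal sketch of the compromise-finding step follows: score each candidate proposal by how many agents sit within a support radius of it in the induced semantic metric space. The tolerance rule, dimensions, and Euclidean metric are illustrative assumptions, not the paper's model.

```python
import numpy as np

def support_counts(agent_points, candidates, tolerance):
    """Count, for each candidate, how many agents sit within `tolerance` of it
    (a simple bounded-rationality support rule, assumed for illustration)."""
    # dists[i, j] = distance from agent i's ideal point to candidate j
    dists = np.linalg.norm(agent_points[:, None, :] - candidates[None, :, :], axis=-1)
    return (dists <= tolerance).sum(axis=0)

rng = np.random.default_rng(1)
agents = rng.normal(size=(11, 8))   # 11 agents' ideal points (e.g. text embeddings)
cands = rng.normal(size=(5, 8))     # 5 candidate compromise texts, embedded
support = support_counts(agents, cands, tolerance=3.5)
best = int(np.argmax(support))
print(f"candidate {best} is supported by {support[best]} of 11 agents")
```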

cs.MM

[858] Narrative Memory in Machines: Multi-Agent Arc Extraction in Serialized TV

Roberto Balestri, Guglielmo Pescatore

Main category: cs.MM

TL;DR: A multi-agent system (MAS) uses computational memory architectures to analyze serialized TV narratives, identifying arc types and storing them for structured analysis. It combines AI and human oversight but faces challenges with overlapping arcs.

DetailsMotivation: To address the complexity of serialized TV narratives by leveraging computational memory architectures for structured analysis and semantic comparison.

Method: The MAS uses LLMs for semantic memory, a vector database for episodic memories, and a multi-agent workflow for integration. Tested on Grey’s Anatomy.

Result: Identified three arc types (Anthology, Soap, Genre-Specific) but struggled with overlapping arcs. Demonstrated potential for AI-human collaboration.

Conclusion: The memory-centric approach shows promise for serialized narratives, with future work focusing on multimodal inputs and broader testing.

Abstract: Serialized television narratives present significant analytical challenges due to their complex, temporally distributed storylines that necessitate sophisticated information management. This paper introduces a multi-agent system (MAS) designed to extract and analyze narrative arcs by implementing principles of computational memory architectures. The system conceptualizes narrative understanding through analogues of human memory: Large Language Models (LLMs) provide a form of semantic memory for general narrative patterns, while a vector database stores specific arc progressions as episodic memories. A multi-agent workflow simulates working memory processes to integrate these information types. Tested on the first season of Grey’s Anatomy (ABC 2005-), the MAS identifies three arc types: Anthology (self-contained), Soap (relationship-focused), and Genre-Specific. These arcs and their episodic developments are stored in a vector database, facilitating structured analysis and semantic comparison. To bridge automation with critical interpretation, a graphical interface enables human oversight and refinement of the system’s narrative memory. While demonstrating strong performance in identifying Anthology Arcs and character entities, the system’s reliance on textual paratexts (episode summaries) revealed limitations in discerning overlapping arcs and opaque dynamics, underscoring the challenges in computational memory consolidation versus human holistic understanding. This memory-centric approach highlights the potential of combining AI-driven memory processing with human expertise. Beyond television, it offers promise for serialized written formats where narrative is entirely text-based. Future work will focus on integrating multimodal inputs to enrich episodic memory, refining memory integration mechanisms within the MAS, and expanding testing across diverse genres.

[859] Reversible Video Steganography Using Quick Response Codes and Modified ElGamal Cryptosystem

Ramadhan J. Mstafa

Main category: cs.MM

TL;DR: A novel reversible video steganography method using DWT and QR codes, enhanced with ElGamal encryption, ensures high security, invisibility, and robustness against noise attacks.

DetailsMotivation: Addressing challenges in digital video steganography like visual imperceptibility, robustness, and embedding capacity, while enhancing data security.

Method: Combines DWT for video frame decomposition, ElGamal encryption for QR codes, and LSB embedding in specific sub-bands and components.

Result: Achieves high security, invisibility (SSIM > 0.91), robustness against noise, PSNR of 52.143 dB, and embedding capacity of 1 bpp.

Conclusion: The proposed method outperforms existing techniques in security, imperceptibility, and capacity, making it a robust solution for video steganography.

Abstract: The rapid transmission of multimedia information has been enabled mainly by recent advancements in the Internet’s speed and information technology. At the same time, these advancements have resulted in breaches of privacy and data security. When it comes to protecting private information in today’s Internet era, digital steganography is vital. Many academics are interested in digital video because it has a great capability for concealing important data. A vast number of video steganography solutions have been developed lately to guard against the theft of confidential data. The visual imperceptibility, robustness, and embedding capacity of these approaches are all challenges that must be addressed. In this paper, a novel solution to reversible video steganography based on DWT and QR codes is proposed to address these concerns. To increase the security level of the suggested method, an enhanced ElGamal cryptosystem has also been proposed. Prior to the embedding stage, the suggested method uses the modified ElGamal algorithm to encrypt secret QR codes. Concurrently, it applies a two-dimensional DWT to the Y-component of each video frame, yielding LL, LH, HL, and HH sub-bands. Then, the encrypted Low (L), Medium (M), Quartile (Q), and High (H) QR codes are embedded into the HL sub-band, HH sub-band, U-component, and V-component of video frames, respectively, using the LSB technique. Extensive testing showed the approach to be very secure and highly invisible, as well as highly resistant to Salt & Pepper, Gaussian, Poisson, and Speckle noise attacks, with an average SSIM of more than 0.91. Aside from visual imperceptibility, the suggested method exceeds current methods with an average PSNR of 52.143 dB and an embedding capacity of 1 bpp.
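
The DWT-plus-LSB embedding step can be sketched with PyWavelets (`pywt`). This is a toy illustration under assumptions: the frame, payload, and rounding scheme are stand-ins, the ElGamal encryption is omitted, and exact reversibility would require an integer-preserving transform.

```python
import numpy as np
import pywt  # PyWavelets

def embed_bits_lsb(band, bits):
    """Embed a bitstream into the LSBs of rounded wavelet coefficients."""
    flat = np.rint(band).astype(np.int32).ravel()
    flat[: bits.size] = (flat[: bits.size] & ~1) | bits
    return flat.reshape(band.shape).astype(band.dtype)

# Toy Y-channel frame; real frames come from the video's YUV decomposition.
y = np.random.randint(0, 256, (64, 64)).astype(np.float64)
LL, (LH, HL, HH) = pywt.dwt2(y, "haar")           # single-level 2-D DWT

payload = np.random.randint(0, 2, 128)             # stand-in for an encrypted QR code
HL = embed_bits_lsb(HL, payload)                   # the paper embeds one QR level per band
stego_y = pywt.idwt2((LL, (LH, HL, HH)), "haar")   # reconstruct the stego frame
print(stego_y.shape)
```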

[860] FineBadminton: A Multi-Level Dataset for Fine-Grained Badminton Video Understanding

Xusheng He, Wei Liu, Shanshan Ma, Qian Liu, Chenghao Ma, Jianlong Wu

Main category: cs.MM

TL;DR: The paper introduces FineBadminton, a dataset with multi-level semantic annotations for badminton analysis, and FBBench, a benchmark to evaluate MLLMs. It proposes optimized baseline methods, showing performance gains despite challenges.

DetailsMotivation: Address the scarcity of domain-specific datasets for fine-grained sports analysis, particularly in badminton, to advance MLLMs in sports intelligence.

Method: Developed FineBadminton with a multi-level semantic annotation hierarchy and an annotation pipeline combining MLLM proposals and human refinement. Proposed Hit-Centric Keyframe Selection and Coordinate-Guided Condensation for analysis.

Result: Current MLLMs struggle with deep sports video analysis, but the proposed methods achieve notable performance improvements on FBBench.

Conclusion: FineBadminton and FBBench provide a foundation for advancing fine-grained video understanding and MLLM capabilities in sports intelligence.

Abstract: Fine-grained analysis of complex and high-speed sports like badminton presents a significant challenge for Multimodal Large Language Models (MLLMs), despite their notable advancements in general video understanding. This difficulty arises primarily from the scarcity of datasets with sufficiently rich and domain-specific annotations. To bridge this gap, we introduce FineBadminton, a novel and large-scale dataset featuring a unique multi-level semantic annotation hierarchy (Foundational Actions, Tactical Semantics, and Decision Evaluation) for comprehensive badminton understanding. The construction of FineBadminton is powered by an innovative annotation pipeline that synergistically combines MLLM-generated proposals with human refinement. We also present FBBench, a challenging benchmark derived from FineBadminton, to rigorously evaluate MLLMs on nuanced spatio-temporal reasoning and tactical comprehension. Together, FineBadminton and FBBench provide a crucial ecosystem to catalyze research in fine-grained video understanding and advance the development of MLLMs in sports intelligence. Furthermore, we propose an optimized baseline approach incorporating Hit-Centric Keyframe Selection to focus on pivotal moments and Coordinate-Guided Condensation to distill salient visual information. The results on FBBench reveal that while current MLLMs still face significant challenges in deep sports video analysis, our proposed strategies nonetheless achieve substantial performance gains. The project homepage is available at https://finebadminton.github.io/FineBadminton/.

[861] MSPT: A Lightweight Face Image Quality Assessment Method with Multi-stage Progressive Training

Xiongwei Xiao, Baoying Chen, Jishen Zeng, Jianquan Yang

Main category: cs.MM

TL;DR: A lightweight face quality assessment network (MSPT) is proposed, using multi-stage progressive training to balance performance and efficiency.

DetailsMotivation: Traditional methods struggle with face image quality assessment, and learning-based approaches are too complex for practical use.

Method: The MSPT network uses a three-stage progressive training strategy, gradually increasing data diversity and resolution.

Result: MSPT achieved the second highest score on the VQualA 2025 benchmark, matching or outperforming state-of-the-art methods efficiently.

Conclusion: MSPT offers a lightweight yet high-performance solution for face image quality assessment, addressing practical deployment challenges.

Abstract: Accurately assessing the perceptual quality of face images is crucial, especially with the rapid progress in face restoration and generation. Traditional quality assessment methods often struggle with the unique characteristics of face images, limiting their generalizability. While learning-based approaches demonstrate superior performance due to their strong fitting capabilities, their high complexity typically incurs significant computational and storage costs, hindering practical deployment. To address this, we propose a lightweight face quality assessment network with Multi-Stage Progressive Training (MSPT). Our network employs a three-stage progressive training strategy that gradually introduces more diverse data samples and increases input image resolution. This novel approach enables lightweight networks to achieve high performance by effectively learning complex quality features while significantly mitigating catastrophic forgetting. Our MSPT achieved the second highest score on the VQualA 2025 face image quality assessment benchmark dataset, demonstrating that MSPT achieves comparable or better performance than state-of-the-art methods while maintaining efficient inference.
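
A minimal sketch of a three-stage progressive schedule is shown below: each stage raises the input resolution and (in a real loader) the amount of training data. The resolutions, sample counts, and tiny regressor are toy assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

stages = [
    {"resolution": 96,  "num_samples": 1_000,  "epochs": 1},
    {"resolution": 160, "num_samples": 5_000,  "epochs": 1},
    {"resolution": 224, "num_samples": 20_000, "epochs": 1},
]

# Tiny resolution-agnostic quality regressor standing in for the real network.
model = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                            torch.nn.Linear(3, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for stage in stages:
    for _ in range(stage["epochs"]):
        # One toy batch per "epoch"; a real loader would stream stage["num_samples"] faces.
        imgs = torch.rand(8, 3, stage["resolution"], stage["resolution"])
        mos = torch.rand(8, 1)                      # mean-opinion quality scores
        loss = F.mse_loss(model(imgs), mos)
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"stage@{stage['resolution']}px done, loss={loss.item():.3f}")
```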

[862] AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition

Junxiao Xue, Xiaozhen Liu, Xuecheng Wu, Xinyi Yin, Danlei Huang, Fei Yu

Main category: cs.MM

TL;DR: AD-AVSR introduces bidirectional modality enhancement for AVSR, improving noise robustness and performance by refining audio and visual representations collaboratively.

DetailsMotivation: Existing AVSR methods lack effective handling of asymmetric information and heterogeneous correlations between audio-visual data.

Method: Proposes bidirectional enhancement with audio dual-stream encoding, Audio-aware Visual Refinement, and Cross-modal Noise Suppression Masking, plus a threshold-based selection mechanism.

Result: Outperforms SOTA methods on LRS2 and LRS3 datasets in performance and noise robustness.

Conclusion: AD-AVSR’s bidirectional approach effectively addresses asymmetric information challenges, enhancing AVSR performance.

Abstract: Audio-visual speech recognition (AVSR) combines audio-visual modalities to improve speech recognition, especially in noisy environments. However, most existing methods adopt unidirectional enhancement or symmetric fusion, which limits their capability to capture the heterogeneous and complementary correlations of audio-visual data, especially under asymmetric information conditions. To tackle these gaps, we introduce a new AVSR framework termed AD-AVSR based on bidirectional modality enhancement. Specifically, we first introduce the audio dual-stream encoding strategy to enrich audio representations from multiple perspectives and intentionally establish asymmetry to support subsequent cross-modal interactions. The enhancement process involves two key components, the Audio-aware Visual Refinement Module for enhanced visual representations under audio guidance, and the Cross-modal Noise Suppression Masking Module which refines audio representations using visual cues, collaboratively leading to a closed-loop, bidirectional information flow. To further enhance correlation robustness, we adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs. Extensive experimental results on the LRS2 and LRS3 datasets indicate that our AD-AVSR consistently surpasses SOTA methods in both performance and noise robustness, highlighting the effectiveness of our model design.

[863] Towards Multimodal Sentiment Analysis via Contrastive Cross-modal Retrieval Augmentation and Hierachical Prompts

Xianbing Zhao, Shengzun Yang, Buzhou Tang, Ronghuan Jiang

Main category: cs.MM

TL;DR: The paper proposes a multimodal retrieval-augmented framework to improve sentiment analysis by leveraging both intra-sample modality-level and cross-sample sample-level reference contexts.

DetailsMotivation: Current cross-modal approaches lack sufficient reference context, particularly cross-sample relationships, limiting feature enhancement in multimodal sentiment analysis.

Method: The framework includes a contrastive cross-modal retrieval module, modality-level and sample-level prompts, and a cross-modal retrieval-augmented encoder.

Result: Experiments show the model’s effectiveness and superiority on two public datasets.

Conclusion: The proposed framework successfully addresses the challenge of insufficient reference context in multimodal sentiment analysis.

Abstract: Multimodal sentiment analysis is a fundamental problem in the field of affective computing. Although significant progress has been made in cross-modal interaction, insufficient reference context remains a challenge. Current cross-modal approaches primarily focus on leveraging modality-level reference context within an individual sample for cross-modal feature enhancement, neglecting the potential cross-sample relationships that can serve as sample-level reference context. To address this issue, we propose a novel multimodal retrieval-augmented framework that simultaneously incorporates intra-sample modality-level reference context and cross-sample sample-level reference context to enhance the multimodal features. In particular, we first design a contrastive cross-modal retrieval module to retrieve semantically similar samples and enhance the target modality. To enable the model to capture both inter-sample and intra-sample information, we integrate two different types of prompts, modality-level prompts and sample-level prompts, to generate modality-level and sample-level reference contexts, respectively. Finally, we design a cross-modal retrieval-augmented encoder that simultaneously leverages modality-level and sample-level reference contexts to enhance the target modality. Extensive experiments demonstrate the effectiveness and superiority of our model on two publicly available datasets.
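
The retrieval step, fetching semantically similar training samples to serve as sample-level reference context, reduces to nearest-neighbour search over embeddings. The cosine-similarity sketch below is an illustration of the idea, not the paper's contrastive retrieval module.

```python
import numpy as np

def retrieve_similar(query_emb, bank_embs, k=3):
    """Return indices of the k most cosine-similar samples in the bank."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    return np.argsort(b @ q)[::-1][:k]

rng = np.random.default_rng(3)
bank = rng.normal(size=(100, 32))              # embeddings of training samples
query = bank[17] + 0.05 * rng.normal(size=32)  # a near-duplicate of sample 17
print(retrieve_similar(query, bank))           # index 17 should rank first
```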

[864] Mining the Social Fabric: Unveiling Communities for Fake News Detection in Short Videos

Haisong Gong, Bolan Su, Xinrong Zhang, Jing Li, Qiang Liu, Shu Wu, Liang Wang

Main category: cs.MM

TL;DR: DugFND enhances fake news detection in short videos by modeling uploader and event-driven communities using a heterogeneous graph and time-aware attention network.

DetailsMotivation: Short videos' rapid spread and multi-modal nature make fake news detection challenging, with existing methods overlooking implicit relationships among videos, uploaders, and events.

Method: Proposes DugFND, a method using a heterogeneous graph (uploader, video, event nodes) and time-aware graph attention network, with reconstruction-based pretraining.

Result: Experiments show significant performance gains, proving the value of dual-community modeling.

Conclusion: DugFND effectively improves fake news detection in short videos by leveraging community patterns.

Abstract: Short video platforms have become a major medium for information sharing, but their rapid content generation and algorithmic amplification also enable the widespread dissemination of fake news. Detecting misinformation in short videos is challenging due to their multi-modal nature and the limited context of individual videos. While recent methods focus on analyzing content signals-visual, textual, and audio-they often overlook implicit relationships among videos, uploaders, and events. To address this gap, we propose DugFND (Dual-community graph for fake news detection), a novel method that enhances existing video classifiers by modeling two key community patterns: (1) uploader communities, where uploaders with shared interests or similar content creation patterns group together, and (2) event-driven communities, where videos related to the same or semantically similar public events form localized clusters. We construct a heterogeneous graph connecting uploader, video, and event nodes, and design a time-aware heterogeneous graph attention network to enable effective message passing. A reconstruction-based pretraining phase further improves node representation learning. DugFND can be applied to any pre-trained classifier. Experiments on public datasets show that our method achieves significant performance gains, demonstrating the value of dual-community modeling for fake news detection in short videos.
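
The heterogeneous graph construction can be sketched with networkx; the node and edge type names below are illustrative, and the paper's time-aware heterogeneous graph attention network is not reproduced here.

```python
import networkx as nx

G = nx.Graph()
G.add_node("uploader:42", ntype="uploader")
G.add_node("video:a", ntype="video", timestamp=1700000000)
G.add_node("video:b", ntype="video", timestamp=1700050000)
G.add_node("event:storm_hoax", ntype="event")

G.add_edge("uploader:42", "video:a", etype="uploads")
G.add_edge("uploader:42", "video:b", etype="uploads")
G.add_edge("video:a", "event:storm_hoax", etype="about")
G.add_edge("video:b", "event:storm_hoax", etype="about")

# Message passing would then aggregate neighbours per node type, e.g.:
neighbours = {G.nodes[n]["ntype"]: n for n in G.neighbors("video:a")}
print(neighbours)  # {'uploader': 'uploader:42', 'event': 'event:storm_hoax'}
```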

[865] VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

Main category: cs.MM

TL;DR: VGGSounder, a comprehensive re-annotation of VGGSound, addresses the original benchmark’s flaws and provides a more reliable evaluation of audio-visual foundation models.

DetailsMotivation: Existing benchmarks like VGGSound have flaws (incomplete labels, overlapping classes, misaligned modalities) that distort model evaluations.

Method: Introduces VGGSounder, a re-annotated multi-label test set with detailed modality annotations and a new modality confusion metric.

Result: Enables precise analysis of modality-specific performance and reveals model limitations when adding input modalities.

Conclusion: VGGSounder improves evaluation reliability for audio-visual foundation models.

Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
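
The abstract does not spell out the metric, so the sketch below encodes one plausible reading: the fraction of samples a model gets right with a single modality but wrong once the other modality is added. Treat the definition as an assumption, not the paper's formula.

```python
import numpy as np

def modality_confusion(correct_uni, correct_multi):
    """Fraction of unimodally-correct samples that flip to wrong when the
    second modality is added (an assumed reading of the metric)."""
    correct_uni = np.asarray(correct_uni, dtype=bool)
    correct_multi = np.asarray(correct_multi, dtype=bool)
    flipped = correct_uni & ~correct_multi
    return flipped.sum() / max(correct_uni.sum(), 1)

audio_only = [1, 1, 0, 1, 1]        # per-sample correctness, audio input only
audio_plus_video = [1, 0, 0, 1, 0]  # same samples, both modalities
print(f"confusion: {modality_confusion(audio_only, audio_plus_video):.2f}")
```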

[866] Iola Walker: A Mobile Footfall Detection System for Music Composition

William B. James

Main category: cs.MM

TL;DR: A music playback system, Iola Walker, enhances music via wearable tech, adapting to the listener’s gait, aiming for societal impact in the music industry.

DetailsMotivation: To improve music experience through wearable devices and address societal issues in the entertainment industry.

Method: Developed Iola Walker, a system where music adapts to the listener’s gait using hardware and software.

Result: Potential for a preferred new medium of music playback, with societal benefits.

Conclusion: Iola Walker represents a step toward prosocial reform in music technology.

Abstract: This outing is part of a larger music technology research project. The objective is to find a method for materially enhancing music using hardware and software. There is a strong likelihood that there exists a new medium for experiencing music via a wearable device that ordinary listeners prefer over the current state of the art. If such a medium is discovered, it is a step towards altruistic, prosocial reform in the music industry. A new playback system infrastructure has a chance to soothe some of the societal problems tied to the larger entertainment industry ecosystem. Iola Walker is a music playback system that allows musicians to compose music that changes in accordance with the listener’s gait. Artifacts are available here: https://github.com/willbjames/iolawalker

[867] How Far Are We from Generating Missing Modalities with Foundation Models?

Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He

Main category: cs.MM

TL;DR: The paper explores multimodal foundation models for missing modality reconstruction, identifies limitations, and proposes an agentic framework with self-refinement to improve accuracy and adaptability.

DetailsMotivation: To address the underexplored potential of multimodal foundation models in reconstructing missing modalities and their limitations in semantic extraction and validation.

Method: Proposes an agentic framework for dynamic modality-aware mining and a self-refinement mechanism for iterative quality enhancement.

Result: Reduces FID for missing image reconstruction by 14% and MER for missing text reconstruction by 10% compared to baselines.

Conclusion: The proposed framework effectively addresses current limitations, improving reconstruction accuracy and downstream task adaptability.

Abstract: Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.
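
The self-refinement mechanism amounts to a generate-verify-refine loop. The sketch below uses hypothetical stand-ins (`generate`, `verify`, `refine`, the score threshold) rather than the framework's actual components.

```python
def generate(available):
    """Hypothetical generator drafting the missing modality."""
    return f"draft conditioned on {available!r}"

def verify(candidate, available):
    """Hypothetical internal validator returning (score, feedback)."""
    return 0.9, "consistent with available modalities"

def refine(candidate, feedback, available):
    """Hypothetical refiner incorporating the validator's feedback."""
    return candidate + f" [revised per: {feedback}]"

def reconstruct_missing_modality(available, max_rounds=3, threshold=0.8):
    """Generate, then iteratively verify and refine until the check passes."""
    candidate = generate(available)
    for _ in range(max_rounds):
        score, feedback = verify(candidate, available)
        if score >= threshold:
            break
        candidate = refine(candidate, feedback, available)
    return candidate

print(reconstruct_missing_modality({"text": "a dog on a beach"}))
```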

eess.AS

[868] Differentiable Grouped Feedback Delay Networks for Learning Coupled Volume Acoustics

Orchisama Das, Gloria Dal Santo, Sebastian J. Schlecht, Vesa Valimaki, Zoran Cvetkovic

Main category: eess.AS

TL;DR: The paper introduces DiffGFDNs, a differentiable version of Grouped Feedback Delay Networks, optimized for rendering multi-slope late reverberation in XR applications with low computational and memory costs.

DetailsMotivation: Rendering dynamic reverberation for moving sources and listeners in XR is challenging due to high costs and impracticality of capturing spatially varying RIRs, and computational demands of dynamic convolution.

Method: Proposes DiffGFDNs with tunable parameters optimized to match late reverberation profiles of RIRs, using a parallel processing pipeline for octave bands.

Result: DiffGFDNs achieve better EDR error and comparable EDC errors to the CS model, with significantly lower computational requirements.

Conclusion: DiffGFDNs offer an efficient, generalizable solution for rendering multi-slope reverberation in XR, outperforming the CS model in computational efficiency.

Abstract: Rendering dynamic reverberation in a complicated acoustic space for moving sources and listeners is challenging but crucial for enhancing user immersion in extended-reality (XR) applications. Capturing spatially varying room impulse responses (RIRs) is costly and often impractical. Moreover, dynamic convolution with measured RIRs is computationally expensive with high memory demands, typically not available on wearable computing devices. Grouped Feedback Delay Networks (GFDNs), on the other hand, allow efficient rendering of coupled room acoustics. However, its parameters need to be tuned to match the reverberation profile of a coupled space. In this work, we propose the concept of Differentiable GFDNs (DiffGFDNs), which have tunable parameters that are optimised to match the late reverberation profile of a set of RIRs captured from a space that exhibits multi-slope decay. Once trained on a finite set of measurements, the DiffGFDN generalises to unmeasured locations in the space. We propose a parallel processing pipeline that has multiple DiffGFDNs with frequency-independent parameters processing each octave band. The parameters of the DiffGFDN can be updated rapidly during inferencing as sources and listeners move. We evaluate the proposed architecture against the Common Slopes (CS) model on a dataset of RIRs for three coupled rooms. The proposed architecture generates multi-slope late reverberation with low memory and computational requirements, achieving better energy decay relief (EDR) error and slightly worse octave-band energy decay curve (EDC) errors compared to the CS model. Furthermore, DiffGFDN requires an order of magnitude fewer floating-point operations per sample than the CS renderer.
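
For readers unfamiliar with the building block, here is a minimal plain feedback delay network: N delay lines coupled by an orthogonal feedback matrix with a scalar decay gain. This illustrates the FDN structure only; the grouped, frequency-banded, differentiable parameterisation of DiffGFDN is not reproduced.

```python
import numpy as np

def fdn(x, delays, g=0.97):
    """Minimal single-group FDN: delay lines + orthogonal feedback mixing."""
    N = len(delays)
    A = np.linalg.qr(np.random.default_rng(0).normal(size=(N, N)))[0]  # orthogonal
    lines = [np.zeros(d) for d in delays]   # circular delay-line buffers
    ptrs = [0] * N
    y = np.zeros_like(x)
    for n in range(len(x)):
        outs = np.array([lines[i][ptrs[i]] for i in range(N)])  # delay outputs
        y[n] = outs.sum()
        feedback = g * (A @ outs)                               # mix and attenuate
        for i in range(N):
            lines[i][ptrs[i]] = x[n] + feedback[i]              # write new samples
            ptrs[i] = (ptrs[i] + 1) % delays[i]
    return y

impulse = np.zeros(16000); impulse[0] = 1.0
rir = fdn(impulse, delays=[149, 211, 263, 293])  # co-prime delays reduce coloration
print(np.abs(rir).max())
```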

[869] FlowSE: Flow Matching-based Speech Enhancement

Seonggyu Lee, Sein Cheong, Sangwook Han, Jong Won Shin

Main category: eess.AS

TL;DR: The paper proposes a speech enhancement method using conditional flow matching, achieving performance comparable to diffusion models with fewer function evaluations (NFE) and no fine-tuning.

DetailsMotivation: Diffusion models for speech enhancement are computationally heavy due to high NFE. The goal is to reduce complexity while maintaining performance.

Method: The method uses conditional flow matching to train continuous normalizing flows, modeling probability paths from known to unknown distributions.

Result: The proposed method matched diffusion model performance with NFE of 60 using only NFE of 5, and showed similar results to fine-tuned diffusion models at NFE 1-5.

Conclusion: Conditional flow matching is an efficient alternative to diffusion models for speech enhancement, reducing computational cost without performance loss.

Abstract: Diffusion probabilistic models have shown impressive performance for speech enhancement, but they typically require 25 to 60 function evaluations in the inference phase, resulting in heavy computational complexity. Recently, a fine-tuning method was proposed to correct the reverse process, which significantly lowered the number of function evaluations (NFE). Flow matching is a method to train continuous normalizing flows that model probability paths from known distributions to unknown distributions, including those described by diffusion processes. In this paper, we propose a speech enhancement method based on conditional flow matching. With an NFE of 5, the proposed method achieved performance comparable to that of diffusion-based speech enhancement with an NFE of 60, and it matched the reverse-process-corrected diffusion model at the same NFEs from 1 to 5 without any additional fine-tuning. We also show that the corresponding diffusion model, derived from the conditional probability path with a modified optimal-transport conditional vector field, achieves similar performance at an NFE of 5 without any fine-tuning.
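
The conditional flow matching objective is compact enough to sketch directly. Below is the standard straight-path (optimal-transport style) formulation, conditioned on the noisy speech; the network, feature shapes, and path choice are assumptions, not necessarily the paper's exact setup.

```python
import torch

def cfm_loss(model, x1, y):
    """Conditional flow matching on a straight path: x_t = (1-t) x0 + t x1,
    with x0 ~ N(0, I); the regression target is the velocity x1 - x0."""
    x0 = torch.randn_like(x1)                        # sample from the known base
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1                       # point on the path
    v_target = x1 - x0                               # constant along this path
    v_pred = model(xt, t, y)                         # conditioned on noisy speech y
    return ((v_pred - v_target) ** 2).mean()

# Toy usage with a stand-in network over spectrogram-like tensors.
net = lambda xt, t, y: torch.zeros_like(xt)
clean = torch.randn(4, 1, 64, 64)                    # clean-speech features x1
noisy = torch.randn(4, 1, 64, 64)                    # conditioning variable y
print(cfm_loss(net, clean, noisy))
```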

[870] Speech Enhancement based on cascaded two flow

Seonggyu Lee, Sein Cheong, Sangwook Han, Kihyuk Kim, Jong Won Shin

Main category: eess.AS

TL;DR: The paper proposes a unified flow matching model for speech enhancement (SE) and generating enhanced speech as an initial point, reducing function evaluations while maintaining or improving performance.

DetailsMotivation: Existing SE methods using diffusion models require high computational cost (NFE), while flow matching shows promise but lacks integration of predictive and generative models.

Method: The authors use a single flow matching model for both SE and generating enhanced speech as a conditioning variable, eliminating the need for a separate predictive model.

Result: The method achieves equivalent or better performance than baselines with the same or fewer NFEs, even with cascaded generative steps.

Conclusion: The proposed unified flow matching model is efficient and effective for SE, reducing computational overhead without sacrificing performance.

Abstract: Speech enhancement (SE) based on diffusion probabilistic models has exhibited impressive performance, while requiring a relatively high number of function evaluations (NFE). Recently, SE based on flow matching has been proposed, which showed competitive performance with a small NFE. Early approaches adopted the noisy speech as the only conditioning variable. Other approaches additionally use speech enhanced by a predictive model as a conditioning variable and as an initial sample, but they require a separate predictive model on top of the generative SE model. In this work, we propose to employ an identical flow matching model both for SE itself and for generating the enhanced speech used as the initial starting point and conditioning variable. Experimental results showed that the proposed method required the same or fewer NFEs, even with two cascaded generative steps, while achieving performance equivalent or superior to the previous baselines.

[871] Head-steered channel selection method for hearing aid applications using remote microphones

Vasudha Sathyapriyan, Michael S. Pedersen, Mike Brookes, Jan Østergaard, Patrick A. Naylor, Jesper Jensen

Main category: eess.AS

TL;DR: A channel selection method for hearing aids using remote microphones, leveraging head-steering direction to identify the target talker signal, outperforming existing methods.

DetailsMotivation: Improve hearing aid performance in environments with multiple competing talkers by accurately selecting the target talker's channel without additional sensors.

Method: Poses channel selection as a multiple hypothesis testing problem, deriving a maximum likelihood solution based on weighted squared absolute correlation coefficients.

Result: Simulations show the method consistently outperforms existing techniques in identifying the target talker’s channel.

Conclusion: The proposed method effectively enhances hearing aid functionality in multi-talker scenarios without extra hardware.

Abstract: We propose a channel selection method for hearing aid applications using remote microphones, in the presence of multiple competing talkers. The proposed channel selection method uses the hearing aid user’s head-steering direction to identify the remote channel originating from the frontal direction of the hearing aid user, which captures the target talker signal. We pose the channel selection task as a multiple hypothesis testing problem, and derive a maximum likelihood solution. Under realistic, simplifying assumptions, the solution selects the remote channel which has the highest weighted squared absolute correlation coefficient with the output of the head-steered hearing aid beamformer. We analyze the performance of the proposed channel selection method using close-talking remote microphones and table microphone arrays. Through simulations using realistic acoustic scenes, we show that the proposed channel selection method consistently outperforms existing methods in accurately finding the remote channel that captures the target talker signal, in the presence of multiple competing talkers, without the use of any additional sensors.
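
The selection rule from the abstract, picking the remote channel with the highest weighted squared absolute correlation coefficient with the beamformer output, can be sketched directly. Uniform frame weights stand in for the paper's actual weighting.

```python
import numpy as np

def select_channel(remote, beamformer_out, weights=None):
    """Index of the remote channel maximising the weighted squared absolute
    correlation coefficient with the head-steered beamformer output.
    `remote` is (num_channels, num_frames)."""
    if weights is None:
        weights = np.ones(remote.shape[1])
    scores = []
    for ch in remote:
        num = np.sum(weights * ch * beamformer_out)
        den = np.sqrt(np.sum(weights * ch ** 2) * np.sum(weights * beamformer_out ** 2))
        scores.append(np.abs(num / (den + 1e-12)) ** 2)
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
target = rng.normal(size=4000)                     # beamformer output (frontal talker)
channels = np.stack([0.9 * target + 0.1 * rng.normal(size=4000),  # target talker's mic
                     rng.normal(size=4000)])                       # competing talker
print(select_channel(channels, target))            # -> 0
```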

[872] TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree

Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg

Main category: eess.AS

TL;DR: A universal ASR context-biasing framework is proposed, supporting major ASR types without speed degradation, outperforming existing methods in accuracy and speed.

DetailsMotivation: Existing context-biasing approaches require additional training, slow decoding, or limit ASR system choices.

Method: Uses a GPU-accelerated word boosting tree for shallow fusion with greedy/beam search, handling up to 20K key phrases.

Result: High efficiency, surpassing open-source methods in accuracy and speed.

Conclusion: The framework is open-sourced in the NeMo toolkit, offering a versatile and efficient solution.

Abstract: Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches require additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.
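
A word-level boosting trie is easy to sketch on the CPU; the per-word bonus, greedy matching, and scoring rule below are illustrative assumptions, and the actual framework batches this on the GPU inside beam search.

```python
class BoostTree:
    """Trie over key phrases: each hypothesis word that extends a phrase match
    earns `bonus` added to the hypothesis score (shallow fusion)."""

    def __init__(self, phrases, bonus=2.0):
        self.root, self.bonus = {}, bonus
        for phrase in phrases:
            node = self.root
            for word in phrase.lower().split():
                node = node.setdefault(word, {})

    def rescore(self, words):
        total, node = 0.0, self.root
        for w in words:
            if w in node:              # continue the current phrase match
                node = node[w]
                total += self.bonus
            elif w in self.root:       # start a new phrase match
                node = self.root[w]
                total += self.bonus
            else:                      # no match: reset to the trie root
                node = self.root
        return total

tree = BoostTree(["new york city", "san jose"])
print(tree.rescore("fly to new york city".split()),  # 6.0: three boosted words
      tree.rescore("fly to newark".split()))         # 0.0: no phrase match
```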

[873] ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction

Minu Kim, Kangwook Jang, Hoirin Kim

Main category: eess.AS

TL;DR: ParaNoise-SV improves noise-robust speaker verification by explicitly modeling noise and refining speech with dual U-Nets, achieving an 8.4% lower EER.

DetailsMotivation: Existing joint SE-SV methods struggle with implicit noise suppression, failing to distinguish noise from speaker characteristics effectively.

Method: Proposes ParaNoise-SV with dual U-Nets: a noise extraction (NE) network to model noise explicitly and a speech enhancement (SE) network to refine speech, guided by NE.

Result: ParaNoise-SV achieves an 8.4% lower equal error rate (EER) compared to previous joint SE-SV models.

Conclusion: Explicit noise modeling in ParaNoise-SV enhances noise resilience and speaker verification performance.

Abstract: Noise-robust speaker verification leverages joint learning of speech enhancement (SE) and speaker verification (SV) to improve robustness. However, prevailing approaches rely on implicit noise suppression, which struggles to separate noise from speaker characteristics as they do not explicitly distinguish noise from speech during training. Although integrating SE and SV helps, it remains limited in handling noise effectively. Meanwhile, recent SE studies suggest that explicitly modeling noise, rather than merely suppressing it, enhances noise resilience. Reflecting this, we propose ParaNoise-SV, with dual U-Nets combining a noise extraction (NE) network and a speech enhancement (SE) network. The NE U-Net explicitly models noise, while the SE U-Net refines speech with guidance from NE through parallel connections, preserving speaker-relevant features. Experimental results show that ParaNoise-SV achieves a relative 8.4% reduction in equal error rate (EER) compared to previous joint SE-SV models.

[874] Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild

Jing-Tong Tzeng, Bo-Hao Su, Ya-Tse Wu, Hsing-Hang Chou, Chi-Chun Lee

Main category: eess.AS

TL;DR: Simple training strategies like balancing, activation functions, and fine-tuning improve speech emotion recognition (SER) without complex architectures. A fusion model achieves top valence performance.

DetailsMotivation: To enhance SER in naturalistic conditions by revisiting overlooked training strategies rather than deepening models.

Method: Explored balancing strategies, activation functions, and fine-tuning techniques. Used a multi-modal fusion model with RoBERTa and WavLM, optimized separately and fused.

Result: Achieved valence CCC of 0.6953, the best in Task 2. Focal loss and activation functions improved performance without added complexity.

Conclusion: Refining core components, not deepening models, leads to robust SER in-the-wild.

Abstract: In this study, we revisit key training strategies in machine learning often overlooked in favor of deeper architectures. Specifically, we explore balancing strategies, activation functions, and fine-tuning techniques to enhance speech emotion recognition (SER) in naturalistic conditions. Our findings show that simple modifications improve generalization with minimal architectural changes. Our multi-modal fusion model, integrating these optimizations, achieves a valence CCC of 0.6953, the best valence score in Task 2: Emotional Attribute Regression. Notably, fine-tuning RoBERTa and WavLM separately in a single-modality setting, followed by feature fusion without training the backbone extractor, yields the highest valence performance. Additionally, focal loss and activation functions significantly enhance performance without increasing complexity. These results suggest that refining core components, rather than deepening models, leads to more robust SER in-the-wild.
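
Focal loss, one of the components credited above, down-weights well-classified examples so training concentrates on hard ones. A standard PyTorch sketch (the paper's exact variant and gamma value are not specified here):

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Classification focal loss: (1 - p_t)^gamma scales the usual
    cross-entropy so easy examples contribute less to the gradient."""
    log_p = F.log_softmax(logits, dim=-1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    return -(((1.0 - log_p_t.exp()) ** gamma) * log_p_t).mean()
```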

[875] A Survey on Non-Intrusive ASR Refinement: From Output-Level Correction to Full-Model Distillation

Mohammad Reza Peyghan, Fatemeh Rajabi, Saman Soleimani Roudi, Saeedreza Zouashkiani, Sajjad Amini, Shahrokh Ghaemmaghami

Main category: eess.AS

TL;DR: The paper surveys non-intrusive refinement techniques for improving ASR systems, classifying them into five categories, and discusses evaluation methods and future research directions.

DetailsMotivation: ASR systems struggle with speech variability and domain-specific terminology, leading to errors. Non-intrusive refinement techniques are needed to improve accuracy without costly model redesigns.

Method: Systematically reviews and classifies non-intrusive refinement approaches into five classes: fusion, re-scoring, correction, distillation, and training adjustment. Also surveys adaptation techniques, evaluation datasets, and proposes standardized metrics.

Result: Provides a structured overview of refinement methods, their advantages, drawbacks, and ideal scenarios, along with evaluation insights.

Conclusion: The survey aims to guide researchers and practitioners in developing more robust ASR refinement pipelines and identifies open research gaps for future work.

Abstract: Automatic Speech Recognition (ASR) has become an integral component of modern technology, powering applications such as voice-activated assistants, transcription services, and accessibility tools. Yet ASR systems continue to struggle with the inherent variability of human speech, such as accents, dialects, and speaking styles, as well as environmental interference, including background noise. Moreover, domain-specific conversations often employ specialized terminology, which can exacerbate transcription errors. These shortcomings not only degrade raw ASR accuracy but also propagate mistakes through subsequent natural language processing pipelines. Because redesigning an ASR model is costly and time-consuming, non-intrusive refinement techniques that leave the model’s architecture unchanged have become increasingly popular. In this survey, we systematically review current non-intrusive refinement approaches and group them into five classes: fusion, re-scoring, correction, distillation, and training adjustment. For each class, we outline the main methods, advantages, drawbacks, and ideal application scenarios. Beyond method classification, this work surveys adaptation techniques aimed at refining ASR in domain-specific contexts, reviews commonly used evaluation datasets along with their construction processes, and proposes a standardized set of metrics to facilitate fair comparisons. Finally, we identify open research gaps and suggest promising directions for future work. By providing this structured overview, we aim to equip researchers and practitioners with a clear foundation for developing more robust, accurate ASR refinement pipelines.
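
As a concrete instance of the re-scoring class, here is a toy log-linear N-best re-ranker using an external language model; the interfaces and weights are illustrative assumptions:

```python
def rescore_nbest(hypotheses, lm_logprob, am_weight=1.0, lm_weight=0.5):
    """Re-rank ASR N-best hypotheses without touching the ASR model.

    hypotheses: list of (text, acoustic_logprob) pairs.
    lm_logprob: callable mapping text to an external LM log-probability.
    """
    scored = [(text, am_weight * am + lm_weight * lm_logprob(text))
              for text, am in hypotheses]
    return max(scored, key=lambda pair: pair[1])[0]
```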

[876] XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation

Tianlun Zuo, Jingbin Hu, Yuke Li, Xinfa Zhu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

Main category: eess.AS

TL;DR: XEmoRAG enables zero-shot emotion transfer from Chinese to Thai speech without parallel data, using LLM-based embeddings and flow-matching alignment for natural prosody.

DetailsMotivation: The challenge lies in transferring emotion across languages without parallel emotional corpora, avoiding foreign accents, and separating emotion from language-specific prosody.

Method: XEmoRAG extracts language-agnostic emotional embeddings from Chinese speech, retrieves matched Thai utterances, and aligns pitch/duration via flow-matching. It blends Chinese timbre into Thai synthesis.

Result: XEmoRAG synthesizes expressive Thai speech using Chinese references, without explicit emotion labels, achieving natural prosody and emotional consistency.

Conclusion: XEmoRAG demonstrates flexible, low-resource emotion transfer across languages, validated by expressive and natural Thai speech synthesis.

Abstract: Zero-shot emotion transfer in cross-lingual speech synthesis refers to generating speech in a target language, where the emotion is expressed based on reference speech from a different source language. However, this task remains challenging due to the scarcity of parallel multilingual emotional corpora, the presence of foreign accent artifacts, and the difficulty of separating emotion from language-specific prosodic features. In this paper, we propose XEmoRAG, a novel framework to enable zero-shot emotion transfer from Chinese to Thai using a large language model (LLM)-based model, without relying on parallel emotional data. XEmoRAG extracts language-agnostic emotional embeddings from Chinese speech and retrieves emotionally matched Thai utterances from a curated emotional database, enabling controllable emotion transfer without explicit emotion labels. Additionally, a flow-matching alignment module minimizes pitch and duration mismatches, ensuring natural prosody. It also blends Chinese timbre into the Thai synthesis, enhancing rhythmic accuracy and emotional expression, while preserving speaker characteristics and emotional consistency. Experimental results show that XEmoRAG synthesizes expressive and natural Thai speech using only Chinese reference audio, without requiring explicit emotion labels. These results highlight XEmoRAG’s capability to achieve flexible and low-resource emotional transfer across languages. Our demo is available at https://tlzuo-lesley.github.io/Demo-page/.
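
The retrieval step can be pictured as a nearest-neighbor search over emotion embeddings. A minimal sketch, assuming cosine similarity and precomputed database embeddings:

```python
import torch
import torch.nn.functional as F

def retrieve_matches(ref_emb, db_embs, k=5):
    """Return indices of the k database utterances whose emotion
    embeddings are most similar to the reference embedding.

    ref_emb: (D,) reference embedding; db_embs: (N, D) database."""
    sims = F.cosine_similarity(ref_emb.unsqueeze(0), db_embs, dim=-1)
    return sims.topk(k).indices
```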

[877] FlexCTC: GPU-powered CTC Beam Decoding with advanced Contextual Abilities

Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg

Main category: eess.AS

TL;DR: FlexCTC is a GPU-based beam decoding toolkit for CTC models, offering fast, batched GPU implementation with advanced contextualization techniques.

DetailsMotivation: Standard beam search implementations are slow and CPU-bound, limiting hardware utilization.

Method: Developed in Python/PyTorch, FlexCTC eliminates CPU-GPU sync and minimizes kernel overhead via CUDA Graphs, supporting N-gram LM fusion and phrase boosting.

Result: The toolkit provides accurate, efficient decoding suitable for research and production.

Conclusion: FlexCTC is a high-performance, user-friendly alternative to traditional decoders.

Abstract: While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present a novel open-source FlexCTC toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation with eliminated CPU-GPU synchronization and minimized kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making the toolkit suitable for both research and production use.
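
The CUDA Graphs technique mentioned above follows a standard PyTorch pattern: warm up on a side stream, capture the step once, then replay it with new inputs copied into static buffers. A generic sketch of that pattern (FlexCTC's internals may differ):

```python
import torch

model = torch.nn.Linear(128, 64).cuda()
static_in = torch.randn(8, 128, device="cuda")

# Warm-up on a side stream before capture, as CUDA Graphs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)  # captured, not executed eagerly

static_in.copy_(torch.randn(8, 128, device="cuda"))
g.replay()  # re-launches the captured kernels with minimal overhead
```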

[878] KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features

Ivan Kukanov, Jun Wah Ng

Main category: eess.AS

TL;DR: A multimodal approach for detecting and localizing deepfakes, combining handcrafted visual features and self-supervised audio learning, achieves high performance on the AV-Deepfake1M++ dataset.

DetailsMotivation: The rise of advanced deepfake techniques necessitates robust, adaptable, and interpretable detection methods, as current detectors struggle with generalization and computational efficiency.

Method: Proposes a multimodal system: handcrafted visual features for interpretability and a self-supervised learning backbone with graph attention networks for rich audio representations.

Result: Achieves 92.78% AUC for classification and 0.3536 IoU for temporal localization using audio alone on AV-Deepfake1M++.

Conclusion: The approach balances performance and deployability, offering resilience and interpretability for deepfake detection.

Abstract: The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state-of-the-art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and real-world deployment, focusing on resilience and potential interpretability. On the AV-Deepfake1M++ dataset, our multimodal system achieves AUC of 92.78% for deepfake classification task and IoU of 0.3536 for temporal localization using only the audio modality.

[879] Scalable Controllable Accented TTS

Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

Main category: eess.AS

TL;DR: The paper addresses scaling accented TTS systems by using geolocation for accent label discovery and timbre augmentation via kNN voice conversion, outperforming existing benchmarks.

DetailsMotivation: To enhance accented TTS systems by expanding training data and improving accent label diversity, especially for underrepresented accents.

Method: 1. Accent label discovery via a speech geolocation model. 2. Timbre augmentation using kNN voice conversion. Validated on CommonVoice with XTTS-v2 fine-tuning.

Result: The model outperforms XTTS-v2 fine-tuned on self-reported labels and existing accented TTS benchmarks.

Conclusion: The proposed strategies effectively scale accented TTS systems, improving performance and robustness.

Abstract: We tackle the challenge of scaling accented TTS systems, expanding their capabilities to include much larger amounts of training data and a wider variety of accent labels, even for accents that are poorly represented or unlabeled in traditional TTS datasets. To achieve this, we employ two strategies: 1. Accent label discovery via a speech geolocation model, which automatically infers accent labels from raw speech data without relying solely on human annotation; 2. Timbre augmentation through kNN voice conversion to increase data diversity and model robustness. These strategies are validated on CommonVoice, where we fine-tune XTTS-v2 for accented TTS with accent labels discovered or enhanced using geolocation. We demonstrate that the resulting accented TTS model not only outperforms XTTS-v2 fine-tuned on self-reported accent labels in CommonVoice, but also existing accented TTS benchmarks.
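
The timbre-augmentation step relies on the core kNN voice conversion idea: replace each source frame feature with an average of its nearest neighbors from a target speaker's feature pool. A minimal sketch (the feature type and k are assumptions):

```python
import torch

def knn_voice_conversion(src_feats, tgt_pool, k=4):
    """src_feats: (T_src, D) source frames; tgt_pool: (T_tgt, D) frames
    from the target speaker. Returns converted (T_src, D) features."""
    dists = torch.cdist(src_feats, tgt_pool)      # pairwise distances
    idx = dists.topk(k, largest=False).indices    # k nearest per frame
    return tgt_pool[idx].mean(dim=1)              # average the neighbors
```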

[880] Real-time CARFAC Cochlea Model Acceleration on FPGA for Underwater Acoustic Sensing Systems

Bram Bremer, Matthew Bigelow, Stuart Anstee, Gregory Cohen, Andre van Schaik, Ying Xu

Main category: eess.AS

TL;DR: A real-time, energy-efficient embedded system using CARFAC cochlea models for underwater sound analysis, implemented on AMD Kria KV260 SoM with Rust software and FPGA acceleration.

DetailsMotivation: To improve scalability, processing speed, and resource efficiency in underwater sound analysis using CARFAC models.

Method: Combines Rust-based software for real-time interfacing with hydrophones and FPGA-accelerated CARFAC models using optimized time-multiplexing and pipelined design.

Result: Achieves 13.5% hardware utilization for a 64-channel CARFAC and 3.11 W power consumption at 256 kHz real-time processing.

Conclusion: The system demonstrates efficient real-time underwater sound analysis with improved performance and reduced resource usage.

Abstract: This paper presents a real-time, energy-efficient embedded system implementing an array of Cascade of Asymmetric Resonators with Fast-Acting Compression (CARFAC) cochlea models for underwater sound analysis. Built on the AMD Kria KV260 System-on-Module (SoM), the system integrates a Rust-based software framework on the processor for real-time interfacing and synchronization with multiple hydrophone inputs, and a hardware-accelerated implementation of the CARFAC models on a Field-Programmable Gate Array (FPGA) for real-time sound pre-processing. Compared to prior work, the CARFAC accelerator achieves improved scalability and processing speed while reducing resource usage through optimized time-multiplexing, pipelined design, and elimination of costly division circuits. Experimental results demonstrate 13.5% hardware utilization for a single 64-channel CARFAC instance and a whole board power consumption of 3.11 W when processing a 256 kHz input signal in real time.

[881] UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling

Ziqian Wang, Zikai Liu, Yike Zhu, Xingchen Li, Boyi Kang, Jixun Yao, Xianjun Xia, Chuanzeng Huang, Lei Xie

Main category: eess.AS

TL;DR: UniFlow is a unified framework using generative modeling to address diverse speech front-end tasks in a shared latent space, outperforming task-specific solutions.

DetailsMotivation: Current speech front-end tasks are tackled by disparate, task-specific methods, leading to redundancy and inefficiency. UniFlow aims to unify these tasks.

Method: UniFlow employs a waveform VAE for latent representation and a Diffusion Transformer for updates, using task-specific embeddings for adaptability.

Result: UniFlow achieves consistent gains over state-of-the-art baselines on multiple benchmarks.

Conclusion: UniFlow provides an extensible, unified foundation for generative speech processing, with plans to open-source the codebase.

Abstract: Generative modeling has recently achieved remarkable success across image, video, and audio domains, demonstrating powerful capabilities for unified representation learning. Yet speech front-end tasks such as speech enhancement (SE), target speaker extraction (TSE), acoustic echo cancellation (AEC), and language-queried source separation (LASS) remain largely tackled by disparate, task-specific solutions. This fragmentation leads to redundant engineering effort, inconsistent performance, and limited extensibility. To address this gap, we introduce UniFlow, a unified framework that employs continuous generative modeling to tackle diverse speech front-end tasks in a shared latent space. Specifically, UniFlow utilizes a waveform variational autoencoder (VAE) to learn a compact latent representation of raw audio, coupled with a Diffusion Transformer (DiT) that predicts latent updates. To differentiate the speech processing task during the training, learnable condition embeddings indexed by a task ID are employed to enable maximal parameter sharing while preserving task-specific adaptability. To balance model performance and computational efficiency, we investigate and compare three generative objectives: denoising diffusion, flow matching, and mean flow within the latent domain. We validate UniFlow on multiple public benchmarks, demonstrating consistent gains over state-of-the-art baselines. UniFlow’s unified latent formulation and conditional design make it readily extensible to new tasks, providing an integrated foundation for building and scaling generative speech processing pipelines. To foster future research, we will open-source our codebase.
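
The task-ID conditioning can be sketched as a learnable embedding combined with the latent before the update predictor. This is a schematic stand-in for the DiT, with all dimensions assumed:

```python
import torch
import torch.nn as nn

class TaskConditionedUpdate(nn.Module):
    """Predicts a latent update conditioned on a per-task embedding, so
    one network can serve SE, TSE, AEC, and LASS."""
    def __init__(self, latent_dim=64, n_tasks=4):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, latent_dim)
        self.net = nn.Sequential(nn.Linear(2 * latent_dim, 256),
                                 nn.GELU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, z, task_id):  # z: (B, latent_dim), task_id: (B,)
        cond = self.task_emb(task_id)
        return self.net(torch.cat([z, cond], dim=-1))
```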

[882] Is GAN Necessary for Mel-Spectrogram-based Neural Vocoder?

Hui-Peng Du, Yang Ai, Rui-Chen Zheng, Ye-Xin Lu, Zhen-Hua Ling

Main category: eess.AS

TL;DR: FreeGAN vocoder eliminates GAN training, using amplitude-phase prediction and novel techniques to match GAN-based vocoders in quality while improving efficiency.

DetailsMotivation: To explore if GAN is necessary for mel-spectrogram-based neural vocoders, addressing GAN's training inefficiency and complexity.

Method: Uses amplitude-phase serial prediction, amplitude prior input, SNAKE-ConvNeXt v2 backbone, and frequency-weighted anti-wrapping phase loss.

Result: FreeGAN matches GAN-based vocoders in speech quality, with better training efficiency and reduced complexity.

Conclusion: GAN is not essential for mel-spectrogram vocoders; FreeGAN’s framework offers a viable alternative.

Abstract: Recently, mainstream mel-spectrogram-based neural vocoders rely on generative adversarial network (GAN) for high-fidelity speech generation, e.g., HiFi-GAN and BigVGAN. However, the use of GAN restricts training efficiency and model complexity. Therefore, this paper proposes a novel FreeGAN vocoder, aiming to answer the question of whether GAN is necessary for mel-spectrogram-based neural vocoders. The FreeGAN employs an amplitude-phase serial prediction framework, eliminating the need for GAN training. It incorporates amplitude prior input, SNAKE-ConvNeXt v2 backbone and frequency-weighted anti-wrapping phase loss to compensate for the performance loss caused by the absence of GAN. Experimental results confirm that the speech quality of FreeGAN is comparable to that of advanced GAN-based vocoders, while significantly improving training efficiency and complexity. Other explicit-phase-prediction-based neural vocoders can also work without GAN, leveraging our proposed methods.
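
An anti-wrapping phase loss measures error modulo 2π, so phases near ±π are treated as close. A minimal sketch (the paper additionally applies frequency weighting, omitted here):

```python
import torch

def anti_wrapping_phase_loss(pred_phase, true_phase):
    """Wrap the phase difference into (-pi, pi] before penalizing, so a
    prediction of -pi + eps is close to a target of pi - eps."""
    diff = pred_phase - true_phase
    wrapped = torch.atan2(torch.sin(diff), torch.cos(diff))
    return wrapped.abs().mean()
```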

[883] Score-Informed BiLSTM Correction for Refining MIDI Velocity in Automatic Piano Transcription

Zhanhong He, Roberto Togneri, Defeng Huang

Main category: eess.AS

TL;DR: The paper proposes a BiLSTM correction module to refine MIDI velocity estimates from automatic music transcription (AMT), focusing on loudness correction rather than timing.

DetailsMotivation: MIDI velocity (loudness control) in AMT outputs often requires correction. Existing methods replace AMT estimates, but this paper aims to refine them.

Method: A BiLSTM correction module is introduced to refine AMT-estimated MIDI velocity, tested on the high-resolution piano transcription (HPT) system.

Result: The method achieved significant improvements, though not state-of-the-art, validating its effectiveness.

Conclusion: The BiLSTM correction module successfully refines AMT velocity estimates, demonstrating potential for practical use.

Abstract: MIDI is a modern standard for storing music, recording how musical notes are played. Many piano performances have corresponding MIDI scores available online. Some of these are created by the original performer, recording on an electric piano alongside the audio, while others are through manual transcription. In recent years, automatic music transcription (AMT) has rapidly advanced, enabling machines to transcribe MIDI from audio. However, these transcriptions often require further correction. Assuming a perfect timing correction, we focus on the loudness correction in terms of MIDI velocity (a parameter in MIDI for loudness control). This task can be approached through score-informed MIDI velocity estimation, which has undergone several developments. While previous approaches introduced specifically built models to re-estimate MIDI velocity, thereby replacing AMT estimates, we propose a BiLSTM correction module to refine AMT-estimated velocity. Although we did not reach state-of-the-art performance, we validated our method on the well-known AMT system, the high-resolution piano transcription (HPT), and achieved significant improvements.
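
A plausible shape for the correction module is a BiLSTM over per-note features (the AMT velocity estimate plus score context) that regresses a refined velocity. The feature set and sizes below are assumptions:

```python
import torch.nn as nn

class VelocityCorrector(nn.Module):
    """BiLSTM that refines AMT-estimated MIDI velocities note by note."""
    def __init__(self, in_dim=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, note_feats):          # (batch, n_notes, in_dim)
        out, _ = self.lstm(note_feats)
        return self.head(out).squeeze(-1)   # refined velocity per note
```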

[884] Auditory Intelligence: Understanding the World Through Sound

Hyeonuk Nam

Main category: eess.AS

TL;DR: The paper proposes a reframing of auditory intelligence to include deeper reasoning and interaction, introducing four task paradigms (ASPIRE, SODA, AUX, AUGMENT) for layered auditory understanding.

DetailsMotivation: Current auditory intelligence systems focus on surface-level recognition, lacking deeper understanding of context, causality, or implications. The paper aims to address this gap.

Method: Introduces four cognitively inspired task paradigms (ASPIRE, SODA, AUX, AUGMENT) for layered auditory understanding, covering pattern captioning, hierarchical description, causal explanation, and goal-driven interpretation.

Result: The proposed paradigms provide a roadmap for more generalizable, explainable, and human-aligned auditory intelligence.

Conclusion: The paper advocates for a broader discussion on machine understanding of sound, emphasizing deeper reasoning and interaction.

Abstract: Recent progress in auditory intelligence has yielded high-performing systems for sound event detection (SED), acoustic scene classification (ASC), automated audio captioning (AAC), and audio question answering (AQA). Yet these tasks remain largely constrained to surface-level recognition, capturing what happened but not why, what it implies, or how it unfolds in context. I propose a conceptual reframing of auditory intelligence as a layered, situated process that encompasses perception, reasoning, and interaction. To instantiate this view, I introduce four cognitively inspired task paradigms (ASPIRE, SODA, AUX, and AUGMENT) that structure auditory understanding across time-frequency pattern captioning, hierarchical event/scene description, causal explanation, and goal-driven interpretation, respectively. Together, these paradigms provide a roadmap toward more generalizable, explainable, and human-aligned auditory intelligence, and are intended to catalyze a broader discussion of what it means for machines to understand sound.

[885] G-IFT: A Gated Linear Unit adapter with Iterative Fine-Tuning for Low-Resource Children’s Speaker Verification

Vishwas M. Shetty, Jiusi Zheng, Abeer Alwan

Main category: eess.AS

TL;DR: The paper proposes G-IFT, a framework using a Gated Linear Unit adapter and iterative fine-tuning to improve children’s speaker verification by efficiently transferring knowledge from adult speech models.

DetailsMotivation: Speaker verification systems perform poorly on children's speech due to acoustic mismatch and limited data. Fine-tuning alone is ineffective.

Method: G-IFT inserts a Gated Linear Unit adapter between a pre-trained speaker embedding model and classifier, optimizing them iteratively.

Result: Experiments show consistent reductions in Equal Error Rates across ECAPA-TDNN, ResNet, and X-vector architectures.

Conclusion: G-IFT effectively bridges the gap between adult and children’s speech domains, improving speaker verification performance.

Abstract: Speaker Verification (SV) systems trained on adult speech often underperform on children’s SV due to the acoustic mismatch, and limited children’s speech data makes fine-tuning not very effective. In this paper, we propose an innovative framework, a Gated Linear Unit adapter with Iterative Fine-Tuning (G-IFT), to enhance knowledge transfer efficiency between the high-resource adult speech domain and the low-resource children’s speech domain. In this framework, a Gated Linear Unit adapter is first inserted between the pre-trained speaker embedding model and the classifier. Then the classifier, adapter, and pre-trained speaker embedding model are optimized sequentially in an iterative way. This framework is agnostic to the type of the underlying architecture of the SV system. Our experiments on ECAPA-TDNN, ResNet, and X-vector architectures using the OGI and MyST datasets demonstrate that the G-IFT framework yields consistent reductions in Equal Error Rates compared to baseline methods.
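
The adapter itself is compact. A minimal GLU sketch (the embedding dimension is an assumption):

```python
import torch.nn as nn

class GLUAdapter(nn.Module):
    """Gated Linear Unit adapter placed between a pre-trained speaker
    embedding model and the classifier."""
    def __init__(self, dim=192):
        super().__init__()
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, emb):
        # GLU: a linear projection modulated by a learned sigmoid gate.
        return self.value(emb) * self.gate(emb).sigmoid()
```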

[886] MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios

Shuai Wang, Zhaokai Sun, Zhennan Lin, Chengyou Wang, Zhou Pan, Lei Xie

Main category: eess.AS

TL;DR: MSU-Bench is a new benchmark for evaluating multi-speaker conversational SLU, highlighting performance gaps as task complexity increases.

DetailsMotivation: Existing SLU benchmarks overlook multi-speaker challenges, which are common in real-world scenarios.

Method: A hierarchical framework with four tiers evaluates speaker-centric understanding, from single-speaker to multi-speaker interactions.

Result: All models show declining performance with increasing task complexity, with open-source models lagging behind closed-source ones.

Conclusion: MSU-Bench effectively assesses conversational understanding in multi-speaker environments, revealing gaps in current models.

Abstract: Spoken Language Understanding (SLU) has progressed from traditional single-task methods to large audio language model (LALM) solutions. Yet, most existing speech benchmarks focus on single-speaker or isolated tasks, overlooking the challenges posed by multi-speaker conversations that are common in real-world scenarios. We introduce MSU-Bench, a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. Our hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers. By evaluating state-of-the-art models on MSU-Bench, we demonstrate that as task complexity increases across the benchmark’s tiers, all models exhibit a significant performance decline. We also observe a persistent capability gap between open-source models and closed-source commercial ones, particularly in multi-speaker interaction reasoning. These findings validate the effectiveness of MSU-Bench for assessing and advancing conversational understanding in realistic multi-speaker environments. Demos can be found in the supplementary material.

[887] Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis

Yifan Yang, Shujie Liu, Jinyu Li, Hui Wang, Lingwei Meng, Haiyang Sun, Yuzhe Liang, Ziyang Ma, Yuxuan Hu, Rui Zhao, Jianwei Yu, Yan Lu, Xie Chen

Main category: eess.AS

TL;DR: IST-LM is a zero-shot streaming TTS model trained on interleaved text-speech sequences, achieving near-non-streaming performance with minimal overhead.

DetailsMotivation: To simplify streaming TTS by eliminating complex designs like forced alignment and enabling real-time integration with text streams.

Method: Train IST-LM on interleaved text-speech sequences with a fixed ratio, analyzing key factors like token distance and accessibility.

Result: Optimal streaming TTS performance with minimal performance gap compared to non-streaming systems.

Conclusion: IST-LM is simple, effective, and scalable for real-time TTS applications.

Abstract: This paper introduces Interleaved Speech-Text Language Model (IST-LM) for zero-shot streaming Text-to-Speech (TTS). Unlike many previous approaches, IST-LM is directly trained on interleaved sequences of text and speech tokens with a fixed ratio, eliminating the need for additional efforts like forced alignment or complex designs. The ratio of text chunk size to speech chunk size is crucial for the performance of IST-LM. To explore this, we conducted a comprehensive series of statistical analyses on the training data and performed correlation analysis with the final performance, uncovering several key factors: 1) the distance between speech tokens and their corresponding text tokens, 2) the number of future text tokens accessible to each speech token, and 3) how frequently speech tokens precede their corresponding text tokens. Experimental results demonstrate how to achieve an optimal streaming TTS system with a limited performance gap compared to its non-streaming counterpart. IST-LM is conceptually simple and empirically powerful, enabling streaming TTS with minimal overhead while largely preserving performance, and offering broad potential for integration with real-time text streams from large language models.
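
The core data preparation reduces to chunk-wise interleaving at a fixed ratio. A sketch (the chunk sizes below are illustrative, not the paper's tuned ratio):

```python
def interleave(text_tokens, speech_tokens, text_chunk=1, speech_chunk=4):
    """Build one training sequence by alternating fixed-size chunks of
    text and speech tokens."""
    out, t, s = [], 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        out.extend(text_tokens[t:t + text_chunk])
        out.extend(speech_tokens[s:s + speech_chunk])
        t += text_chunk
        s += speech_chunk
    return out
```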

[888] Enhancing Lung Disease Diagnosis via Semi-Supervised Machine Learning

Xiaoran Xu, In-Ho Ra, Ravi Sankar

Main category: eess.AS

TL;DR: The study explores semi-supervised learning (MixMatch, Co-Refinement, Co-Refurbishing) with MFCC+CNN for lung sound detection, achieving 92.9% accuracy (+3.8% over baseline), reducing reliance on manual annotations.

DetailsMotivation: Traditional lung disease diagnostics are costly and invasive; this research aims to improve detection using semi-supervised learning to address limited labeled data.

Method: Combines MFCC+CNN with semi-supervised modules (MixMatch, Co-Refinement, Co-Refurbishing) to enhance lung sound signal detection.

Result: Achieved 92.9% accuracy, a 3.8% improvement over the baseline model.

Conclusion: Semi-supervised learning effectively improves lung sound detection, addressing challenges like individual differences and insufficient labeled data.

Abstract: Lung diseases, including lung cancer and COPD, are significant health concerns globally. Traditional diagnostic methods can be costly, time-consuming, and invasive. This study investigates the use of semi-supervised learning methods for lung sound signal detection using a model combination of MFCC+CNN. By introducing semi-supervised learning modules such as MixMatch, Co-Refinement, and Co-Refurbishing, we aim to enhance the detection performance while reducing dependence on manual annotations. With the add-on semi-supervised modules, the accuracy rate of the MFCC+CNN model is 92.9%, an increase of 3.8% over the baseline model. The research contributes to the field of lung disease sound detection by addressing challenges such as individual differences and insufficient labeled data.
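
The MFCC front end of MFCC+CNN is standard. A minimal extraction sketch with librosa (sampling and normalization choices are assumptions):

```python
import librosa

def mfcc_features(path, n_mfcc=40):
    """Extract a normalized MFCC matrix from a lung sound recording,
    ready to be fed to a CNN classifier."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True)
    return (mfcc - mean) / (std + 1e-8)  # per-coefficient normalization
```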

[889] Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS

Anuprabha M, Krishna Gurugubelli, Anil Kumar Vuppala

Main category: eess.AS

TL;DR: The paper evaluates F5-TTS for dysarthric speech synthesis, revealing biases toward intelligibility over speaker and prosody preservation, and suggests fairness-aware improvements.

DetailsMotivation: Addressing biases in synthetic dysarthric speech generation to improve assistive technologies.

Method: Uses F5-TTS and TORGO dataset to analyze intelligibility, speaker similarity, and prosody, with fairness metrics like Disparate Impact and Parity Difference.

Result: F5-TTS prioritizes intelligibility, neglecting speaker and prosody preservation, showing biases across dysarthric severity levels.

Conclusion: Fairness-aware synthesis can enhance inclusivity in dysarthric speech technologies.

Abstract: Dysarthric speech poses significant challenges in developing assistive technologies, primarily due to the limited availability of data. Recent advances in neural speech synthesis, especially zero-shot voice cloning, facilitate synthetic speech generation for data augmentation; however, they may introduce biases towards dysarthric speech. In this paper, we investigate the effectiveness of state-of-the-art F5-TTS in cloning dysarthric speech using TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. We also analyze potential biases using fairness metrics like Disparate Impact and Parity Difference to assess disparities across dysarthric severity levels. Results show that F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation in dysarthric speech synthesis. Insights from this study can help integrate fairness-aware dysarthric speech synthesis, fostering the advancement of more inclusive speech technologies.
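
Both fairness metrics have simple closed forms over per-group favorable-outcome rates. A sketch with made-up example rates:

```python
def disparate_impact(rate_group, rate_reference):
    """Ratio of favorable-outcome rates; 1.0 indicates parity."""
    return rate_group / rate_reference

def parity_difference(rate_group, rate_reference):
    """Difference of favorable-outcome rates; 0.0 indicates parity."""
    return rate_group - rate_reference

# Hypothetical intelligibility-success rates: severe vs. mild dysarthria.
print(disparate_impact(0.62, 0.91))   # ~0.68 -> notable disparity
print(parity_difference(0.62, 0.91))  # -0.29
```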

[890] Privacy Disclosure of Similarity Rank in Speech and Language Processing

Tom Bäckström, Mohammad Hassan Vali, My Nguyen, Silas Rech

Main category: eess.AS

TL;DR: The paper proposes a method to quantify privacy disclosure in biometric identification by analyzing similarity rank distributions, using entropy to measure information leakage.

DetailsMotivation: To address the unreliability of similarity measures in biometric identification and the potential privacy risks posed by rank-based disclosures.

Method: Estimates the probability distribution of similarity ranks using histograms or beta-binomial models, measuring disclosure in terms of entropy.

Result: All tested biometric features contain personally identifying information (PII), with speaker recognition embeddings being the most informative. Disclosure increases with sample length but is bounded by template length.

Conclusion: The proposed similarity rank disclosure metric helps compare and merge PII risks across biometric features, aiding privacy threat evaluation in biometric technologies.

Abstract: Speaker, author, and other biometric identification applications often compare a sample’s similarity to a database of templates to determine the identity. Given that data may be noisy and similarity measures can be inaccurate, such a comparison may not reliably identify the true identity as the most similar. Still, even the similarity rank based on an inaccurate similarity measure can disclose private information about the true identity. We propose a methodology for quantifying the privacy disclosure of such a similarity rank by estimating its probability distribution. It is based on determining the histogram of the similarity rank of the true speaker, or when data is scarce, modeling the histogram with the beta-binomial distribution. We express the disclosure in terms of entropy (bits), such that the disclosures from independent features are additive. Our experiments demonstrate that all tested speaker and author characterizations contain personally identifying information (PII) that can aid in identification, with embeddings from speaker recognition algorithms containing the most information, followed by phone embeddings, linguistic embeddings, and fundamental frequency. Our initial experiments show that the disclosure of PII increases with the length of test samples, but it is bounded by the length of database templates. The provided metric, similarity rank disclosure, provides a way to compare the disclosure of PII between biometric features and merge them to aid identification. It can thus aid in the holistic evaluation of threats to privacy in speech and other biometric technologies.
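
One way to turn the modeled rank distribution into bits of disclosure is to compare its entropy to that of a uniform rank distribution. A sketch with SciPy's beta-binomial (the parameters are hypothetical and the paper's exact estimator may differ):

```python
import numpy as np
from scipy.stats import betabinom

def rank_disclosure_bits(n_templates, a, b):
    """Entropy gap (bits) between a uniform similarity-rank distribution
    and a beta-binomial model of the true identity's rank."""
    ranks = np.arange(n_templates)
    p = betabinom.pmf(ranks, n_templates - 1, a, b)
    h_model = -(p * np.log2(p + 1e-30)).sum()
    return np.log2(n_templates) - h_model

print(rank_disclosure_bits(100, a=0.5, b=5.0))  # hypothetical fit
```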

eess.IV

[891] Sea-Undistort: A Dataset for Through-Water Image Restoration in High Resolution Airborne Bathymetric Mapping

Maximilian Kromer, Panagiotis Agrafiotis, Begüm Demir

Main category: eess.IV

TL;DR: Sea-Undistort is a synthetic dataset for training models to correct optical distortions in shallow water bathymetric mapping, improving accuracy in real-world applications.

DetailsMotivation: Challenges in shallow water bathymetric mapping due to optical distortions like waves, scattering, and sunglint necessitate a synthetic dataset for supervised training.

Method: A dataset of 1200 paired distorted and distortion-free images with metadata is created using Blender. It benchmarks image restoration methods, including a diffusion-based framework with a sun-glint mask.

Result: The enhanced diffusion model improves seabed mapping accuracy, reduces errors, and restores fine details in real aerial data.

Conclusion: Sea-Undistort enables effective training for distortion correction, advancing shallow water bathymetric mapping.

Abstract: Accurate image-based bathymetric mapping in shallow waters remains challenging due to the complex optical distortions such as wave induced patterns, scattering and sunglint, introduced by the dynamic water surface, the water column properties, and solar illumination. In this work, we introduce Sea-Undistort, a comprehensive synthetic dataset of 1200 paired 512x512 through-water scenes rendered in Blender. Each pair comprises a distortion-free and a distorted view, featuring realistic water effects such as sun glint, waves, and scattering over diverse seabeds. Accompanied by per-image metadata such as camera parameters, sun position, and average depth, Sea-Undistort enables supervised training that is otherwise infeasible in real environments. We use Sea-Undistort to benchmark two state-of-the-art image restoration methods alongside an enhanced lightweight diffusion-based framework with an early-fusion sun-glint mask. When applied to real aerial data, the enhanced diffusion model delivers more complete Digital Surface Models (DSMs) of the seabed, especially in deeper areas, reduces bathymetric errors, suppresses glint and scattering, and crisply restores fine seabed details. Dataset, weights, and code are publicly available at https://www.magicbathy.eu/Sea-Undistort.html.

[892] PCA-Guided Autoencoding for Structured Dimensionality Reduction in Active Infrared Thermography

Mohammed Salah, Numan Saeed, Davor Svetinovic, Stefano Sfarra, Mohammed Omar, Yusra Abdulrahman

Main category: eess.IV

TL;DR: The paper proposes a PCA-guided autoencoder framework for structured dimensionality reduction in Active Infrared Thermography (AIRT) data, improving defect characterization.

DetailsMotivation: Current autoencoders for AIRT data lack structured latent spaces, limiting their effectiveness in defect characterization.

Method: Introduces a PCA-guided autoencoder with a novel PCA distillation loss to align latent representations with structured PCA components while capturing non-linear features.

Result: The proposed method outperforms state-of-the-art techniques on PVC, CFRP, and PLA samples in contrast, SNR, and neural network-based metrics.

Conclusion: The PCA-guided autoencoder enhances defect characterization by enforcing a structured latent space, validated by superior performance metrics.

Abstract: Active Infrared thermography (AIRT) is a widely adopted non-destructive testing (NDT) technique for detecting subsurface anomalies in industrial components. Due to the high dimensionality of AIRT data, current approaches employ non-linear autoencoders (AEs) for dimensionality reduction. However, the latent space learned by AIRT AEs lacks structure, limiting their effectiveness in downstream defect characterization tasks. To address this limitation, this paper proposes a principal component analysis guided (PCA-guided) autoencoding framework for structured dimensionality reduction to capture intricate, non-linear features in thermographic signals while enforcing a structured latent space. A novel loss function, PCA distillation loss, is introduced to guide AIRT AEs to align the latent representation with structured PCA components while capturing the intricate, non-linear patterns in thermographic signals. To evaluate the utility of the learned, structured latent space, we propose a neural network-based evaluation metric that assesses its suitability for defect characterization. Experimental results show that the proposed PCA-guided AE outperforms state-of-the-art dimensionality reduction methods on PVC, CFRP, and PLA samples in terms of contrast, signal-to-noise ratio (SNR), and neural network-based metrics.
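
A guess at the form of the PCA distillation loss: penalize the distance between the AE latent and the leading PCA scores of the same input (the paper's exact definition may differ):

```python
import torch.nn.functional as F

def pca_distillation_loss(latent, x, pca_components, pca_mean):
    """latent: (B, K) AE codes; pca_components: (K, D) leading PCs fit
    offline; x: (B, D) flattened thermographic signals."""
    pca_scores = (x - pca_mean) @ pca_components.T  # structured target
    return F.mse_loss(latent, pca_scores)
```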

[893] Deep Learning-Based Desikan-Killiany Parcellation of the Brain Using Diffusion MRI

Yousef Sadegheih, Dorit Merhof

Main category: eess.IV

TL;DR: A deep learning framework for direct brain parcellation in dMRI space using the DK atlas, outperforming existing methods in accuracy and robustness.

DetailsMotivation: Existing brain parcellation methods rely on anatomical MRI, introducing errors and limiting versatility. This study aims to provide a direct, registration-free solution using only dMRI data.

Method: A hierarchical, two-stage segmentation network: coarse parcellation followed by detailed subregion delineation. Evaluated diffusion-derived parameter maps (FA, trace, sphericity, max eigenvalue) for optimal accuracy.

Result: Achieved superior Dice Similarity Coefficients on HCP and CNP datasets, with robust generalization across resolutions and protocols.

Conclusion: The method advances dMRI-based segmentation, offering precise, reliable, and registration-free parcellation for research and clinical use.

Abstract: Accurate brain parcellation in diffusion MRI (dMRI) space is essential for advanced neuroimaging analyses. However, most existing approaches rely on anatomical MRI for segmentation and inter-modality registration, a process that can introduce errors and limit the versatility of the technique. In this study, we present a novel deep learning-based framework for direct parcellation based on the Desikan-Killiany (DK) atlas using only diffusion MRI data. Our method utilizes a hierarchical, two-stage segmentation network: the first stage performs coarse parcellation into broad brain regions, and the second stage refines the segmentation to delineate more detailed subregions within each coarse category. We conduct an extensive ablation study to evaluate various diffusion-derived parameter maps, identifying an optimal combination of fractional anisotropy, trace, sphericity, and maximum eigenvalue that enhances parcellation accuracy. When evaluated on the Human Connectome Project and Consortium for Neuropsychiatric Phenomics datasets, our approach achieves superior Dice Similarity Coefficients compared to existing state-of-the-art models. Additionally, our method demonstrates robust generalization across different image resolutions and acquisition protocols, producing more homogeneous parcellations as measured by the relative standard deviation within regions. This work represents a significant advancement in dMRI-based brain segmentation, providing a precise, reliable, and registration-free solution that is critical for improved structural connectivity and microstructural analyses in both research and clinical applications. The implementation of our method is publicly available on github.com/xmindflow/DKParcellationdMRI.
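
Two of the input maps named above, fractional anisotropy and trace, follow directly from the diffusion tensor eigenvalues. A NumPy sketch (array conventions assumed):

```python
import numpy as np

def fa_and_trace(evals):
    """evals: (..., 3) diffusion tensor eigenvalues per voxel.
    Returns fractional anisotropy and trace maps."""
    md = evals.mean(axis=-1, keepdims=True)            # mean diffusivity
    fa = np.sqrt(1.5 * ((evals - md) ** 2).sum(-1)
                 / ((evals ** 2).sum(-1) + 1e-12))
    return fa, evals.sum(-1)                           # trace = 3 * MD
```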

[894] MIND: A Noise-Adaptive Denoising Framework for Medical Images Integrating Multi-Scale Transformer

Tao Tang, Chengxu Yang

Main category: eess.IV

TL;DR: A medical image denoising model (MI-ND) combining multi-scale convolution and Transformer architecture, with noise level estimation and adaptive attention, outperforms existing methods in quality and diagnostic tasks.

DetailsMotivation: Medical images often suffer from non-uniform noise, impacting diagnosis accuracy. Current methods lack adaptability to noise variations.

Method: Proposes MI-ND with noise level estimator (NLE) and noise adaptive attention module (NAAB) for channel-spatial regulation and cross-modal fusion.

Result: Outperforms others in PSNR, SSIM, LPIPS, and improves F1 score and ROC-AUC in diagnostic tasks.

Conclusion: MI-ND enhances image quality, diagnostic sensitivity, and cross-modal robustness, offering practical value for AI-assisted diagnosis.

Abstract: The core role of medical images in disease diagnosis makes their quality directly affect the accuracy of clinical judgment. However, due to factors such as low-dose scanning, equipment limitations and imaging artifacts, medical images are often accompanied by non-uniform noise interference, which seriously affects structure recognition and lesion detection. This paper proposes a medical image adaptive denoising model (MI-ND) that integrates multi-scale convolutional and Transformer architecture, introduces a noise level estimator (NLE) and a noise adaptive attention module (NAAB), and realizes channel-spatial attention regulation and cross-modal feature fusion driven by noise perception. Systematic testing is carried out on multimodal public datasets. Experiments show that this method significantly outperforms the comparative methods in image quality indicators such as PSNR, SSIM, and LPIPS, and improves the F1 score and ROC-AUC in downstream diagnostic tasks, showing strong practical value and promotional potential. The model has outstanding benefits in structural recovery, diagnostic sensitivity, and cross-modal robustness, and provides an effective solution for medical image enhancement and AI-assisted diagnosis and treatment.

[895] Learned Regularization for Microwave Tomography

Bowen Tong, Hao Chen, Shaorui Guo, Dong Liu

Main category: eess.IV

TL;DR: A physics-informed hybrid framework, SSD-Reg, integrates diffusion models for regularization in Microwave Tomography, improving reconstruction without paired data.

DetailsMotivation: Conventional methods fail to recover fine details in MWT, and deep learning approaches require large datasets and struggle with generalization.

Method: Proposes SSD-Reg, embedding diffusion priors into iterative reconstruction, combining physics and learned structural distributions.

Result: SSD-Reg enhances accuracy, stability, and robustness, demonstrated through extensive experiments.

Conclusion: SSD-Reg offers a flexible, effective solution for ill-posed functional image reconstruction.

Abstract: Microwave Tomography (MWT) aims to reconstruct the dielectric properties of tissues from measured scattered electromagnetic fields. This inverse problem is highly nonlinear and ill-posed, posing significant challenges for conventional optimization-based methods, which, despite being grounded in physical models, often fail to recover fine structural details. Recent deep learning strategies, including end-to-end and post-processing networks, have improved reconstruction quality but typically require large paired training datasets and may struggle to generalize. To overcome these limitations, we propose a physics-informed hybrid framework that integrates diffusion models as learned regularization within a data-consistency-driven variational scheme. Specifically, we introduce Single-Step Diffusion Regularization (SSD-Reg), a novel approach that embeds diffusion priors into the iterative reconstruction process, enabling the recovery of complex anatomical structures without the need for paired data. SSD-Reg maintains fidelity to both the governing physics and learned structural distributions, improving accuracy, stability, and robustness. Extensive experiments demonstrate that SSD-Reg, implemented as a Plug-and-Play (PnP) module, provides a flexible and effective solution for tackling the ill-posedness inherent in functional image reconstruction.
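
Schematically, a Plug-and-Play iteration alternates a data-consistency gradient step with a learned denoising step. A generic PnP sketch (SSD-Reg replaces the generic denoiser with a single diffusion step; details here are assumptions):

```python
import torch

def pnp_step(x, forward_op, y, denoiser, step=0.1):
    """One PnP iteration: gradient step on the data-fidelity term,
    then projection through a learned prior (the denoiser)."""
    x = x.detach().requires_grad_(True)
    data_fit = 0.5 * (forward_op(x) - y).pow(2).sum()
    grad = torch.autograd.grad(data_fit, x)[0]
    return denoiser((x - step * grad).detach())
```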

[896] Mamba-FCS: Joint Spatio- Frequency Feature Fusion, Change-Guided Attention, and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing

Buddhi Wijenayake, Athulya Ratnayake, Praveen Sumanasekara, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath, Nichula Wasalathilaka

Main category: eess.IV

TL;DR: Mamba-FCS, a new SCD framework, combines Visual State Space Models with innovative components like Joint Spatio-Frequency Fusion and Change-Guided Attention to achieve state-of-the-art performance in semantic change detection.

DetailsMotivation: Addressing the need for models that balance spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions in remote sensing imagery.

Method: Introduces Mamba-FCS with a Visual State Space Model backbone, Joint Spatio-Frequency Fusion, Change-Guided Attention, and Separated Kappa loss.

Result: Achieves 88.62% OA, 65.78% F_scd on SECOND and 96.25% OA, 89.27% F_scd on Landsat-SCD, setting new benchmarks.

Conclusion: Mamba-FCS demonstrates the potential of Mamba architectures for scalable and effective semantic change detection, with all components contributing significantly.

Abstract: Semantic Change Detection (SCD) from remote sensing imagery requires models balancing extensive spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions. Convolutional Neural Networks excel at local feature extraction but lack global context, while Transformers provide global modeling at high computational cost. Recent Mamba architectures based on state-space models offer compelling solutions through linear complexity and efficient long-range modeling. In this study, we introduce Mamba-FCS, a SCD framework built upon a Visual State Space Model backbone that incorporates: a Joint Spatio-Frequency Fusion block using log-amplitude frequency-domain features to enhance edge clarity and suppress illumination artifacts; a Change-Guided Attention (CGA) module that explicitly links the naturally intertwined BCD and SCD tasks; and a Separated Kappa (SeK) loss tailored for class-imbalanced performance optimization. Extensive evaluation on SECOND and Landsat-SCD datasets shows that Mamba-FCS achieves state-of-the-art metrics: 88.62% Overall Accuracy, 65.78% F_scd, and 25.50% SeK on SECOND; 96.25% Overall Accuracy, 89.27% F_scd, and 60.26% SeK on Landsat-SCD. Ablation analyses confirm distinct contributions of each novel component, with qualitative assessments highlighting significant improvements in SCD. Our results underline the substantial potential of Mamba architectures, enhanced by proposed techniques, setting a new benchmark for effective and scalable semantic change detection in remote sensing applications. The complete source code, configuration files, and pre-trained models will be publicly available upon publication.
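
At its simplest, the frequency-domain branch reduces to a log-amplitude spectrum of a feature map. A sketch of that ingredient alone (the fusion itself is omitted):

```python
import torch

def log_amplitude(x):
    """Log-amplitude 2-D spectrum of a (B, C, H, W) feature map, the
    kind of frequency-domain input used by a spatio-frequency fusion
    block."""
    spec = torch.fft.fft2(x, norm="ortho")
    return torch.log1p(spec.abs())
```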

[897] Transfer Learning with EfficientNet for Accurate Leukemia Cell Classification

Faisal Ahmed

Main category: eess.IV

TL;DR: The paper explores transfer learning with pretrained CNNs for classifying Acute Lymphoblastic Leukemia (ALL) from blood smear images, achieving high accuracy with EfficientNet-B3.

DetailsMotivation: Accurate classification of ALL is crucial for early diagnosis and treatment, motivating the use of advanced deep learning techniques.

Method: The study used transfer learning with models like ResNet50, ResNet101, and EfficientNet variants, alongside data augmentation to balance the dataset.

Result: EfficientNet-B3 performed best with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%, surpassing prior methods.

Conclusion: Combining data augmentation with transfer learning, especially EfficientNet-B3, enhances diagnostic accuracy for hematologic malignancies.

Abstract: Accurate classification of Acute Lymphoblastic Leukemia (ALL) from peripheral blood smear images is essential for early diagnosis and effective treatment planning. This study investigates the use of transfer learning with pretrained convolutional neural networks (CNNs) to improve diagnostic performance. To address the class imbalance in the dataset of 3,631 Hematologic and 7,644 ALL images, we applied extensive data augmentation techniques to create a balanced training set of 10,000 images per class. We evaluated several models, including ResNet50, ResNet101, and EfficientNet variants B0, B1, and B3. EfficientNet-B3 achieved the best results, with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%, outperforming previously reported methods in the C-NMC Challenge. These findings demonstrate the effectiveness of combining data augmentation with advanced transfer learning models, particularly EfficientNet-B3, in developing accurate and robust diagnostic tools for hematologic malignancy detection.
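
The transfer-learning recipe is the standard torchvision one: load pretrained EfficientNet-B3 weights and swap the classification head. A sketch (the two-class head matches the ALL-vs-Hematologic task; other training details are assumed):

```python
import torch.nn as nn
from torchvision import models

# Pretrained EfficientNet-B3 with a new 2-class head (ALL vs. Hematologic).
model = models.efficientnet_b3(weights=models.EfficientNet_B3_Weights.DEFAULT)
in_features = model.classifier[1].in_features  # 1536 for B3
model.classifier[1] = nn.Linear(in_features, 2)
# Fine-tune on the augmented, class-balanced training set described above.
```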

[898] Accurate Measles Rash Detection via Vision Transformer Fine-Tuning

Qingguo Wang

Main category: eess.IV

TL;DR: A transfer learning approach using DeiT achieved high accuracy (96.38%) in diagnosing measles rashes, aiding outbreak control.

DetailsMotivation: Measles resurgence in 2025 highlighted the need for fast, reliable diagnostic systems to prevent spread.

Method: Fine-tuned a pretrained DeiT model on a diverse skin rash dataset, comparing it with ResNet-50.

Result: DeiT achieved 96.38% accuracy, 96.24% precision, 96.38% recall, and 96.23% F1-score.

Conclusion: DeiT is highly effective for measles rash detection, with potential for future research improvements.

Abstract: Measles, a highly contagious disease declared eliminated in the United States in 2000 after decades of successful vaccination campaigns, resurged in 2025, with 1,356 confirmed cases reported as of August 5, 2025. Given its rapid spread among susceptible individuals, fast and reliable diagnostic systems are critical for early prevention and containment. In this work, we applied transfer learning to fine-tune a pretrained Data-efficient Image Transformer (DeiT) model for distinguishing measles rashes from other skin conditions. Trained on a diverse, curated skin rash image dataset, the DeiT model achieved a median classification accuracy of 96.38%, precision of 96.24%, recall of 96.38%, and an F1-score of 96.23%, demonstrating high effectiveness in accurate detection to aid outbreak control. We also compared the DeiT model with a convolutional neural network, ResNet-50, and discussed the directions for future research.

[899] Expectation-maximization for structure determination directly from cryo-EM micrographs

Shay Kreymer, Amit Singer, Tamir Bendory

Main category: eess.IV

TL;DR: The paper introduces an expectation-maximization algorithm to directly estimate 3-D molecular structures from low-SNR cryo-EM micrographs, bypassing the need for image extraction.

DetailsMotivation: Existing cryo-EM methods fail for small molecular structures due to low SNR, making projection image detection unreliable.

Method: An approximate expectation-maximization algorithm is devised to estimate the 3-D structure directly from the micrograph without locating projections.

Result: Successful structure recoveries from simulated noisy measurements are demonstrated.

Conclusion: The proposed method enables reconstruction of small molecular structures in low-SNR regimes where standard techniques fail.

Abstract: A single-particle cryo-electron microscopy (cryo-EM) measurement, called a micrograph, consists of multiple two-dimensional tomographic projections of a three-dimensional (3-D) molecular structure at unknown locations, taken under unknown viewing directions. All existing cryo-EM algorithmic pipelines first locate and extract the projection images, and then reconstruct the structure from the extracted images. However, if the molecular structure is small, the signal-to-noise ratio (SNR) of the data is very low, making it challenging to accurately detect projection images within the micrograph. Consequently, all standard techniques fail in low-SNR regimes. To recover molecular structures from measurements of low SNR, and in particular small molecular structures, we devise an approximate expectation-maximization algorithm to estimate the 3-D structure directly from the micrograph, bypassing the need to locate the projection images. We corroborate our computational scheme with numerical experiments and present successful structure recoveries from simulated noisy measurements.
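
A 1-D toy analogue conveys the idea: estimate a signal from noisy copies at unknown cyclic shifts by marginalizing over the shifts with EM, never explicitly locating the copies. This is vastly simpler than the paper's 3-D tomographic setting:

```python
import numpy as np

def em_shift_estimate(observations, init, n_iter=20, sigma=1.0):
    """EM for 1-D multireference alignment: observations are noisy,
    cyclically shifted copies of an unknown signal."""
    x = init.copy()
    L = len(x)
    for _ in range(n_iter):
        acc = np.zeros(L)
        for y in observations:
            shifted = np.stack([np.roll(x, s) for s in range(L)])
            logw = -((y - shifted) ** 2).sum(axis=1) / (2 * sigma**2)
            w = np.exp(logw - logw.max())
            w /= w.sum()                     # E-step: posterior over shifts
            for s, ws in enumerate(w):       # M-step accumulation
                acc += ws * np.roll(y, -s)   # align y back by shift s
        x = acc / len(observations)
    return x
```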

[900] Less is More: Skim Transformer for Light Field Image Super-resolution

Zeke Zexi Hu, Haodong Chen, Hui Ye, Xiaoming Chen, Vera Yuk Ying Chung, Yiran Shen, Weidong Cai

Main category: eess.IV

TL;DR: The paper introduces the Skim Transformer for light field image processing, focusing on disparity-aware attention to reduce redundancy and improve efficiency. SkimLFSR, based on this, achieves state-of-the-art super-resolution results with fewer parameters.

DetailsMotivation: Existing light field methods inefficiently use all sub-aperture images (SAIs), causing disparity entanglement. The goal is to address this by selectively utilizing SAIs based on disparity significance.

Method: Proposes the Skim Transformer, a multi-branch architecture where each branch focuses on a specific disparity range by attending to a skimmed subset of SAIs. SkimLFSR is built on this for light field super-resolution.

Result: SkimLFSR outperforms existing methods by 0.59 dB (2x) and 0.35 dB (4x) in PSNR, using only 67% of parameters. It shows disparity-aware behavior in attending to visual cues.

Conclusion: The Skim Transformer and SkimLFSR offer an efficient, disparity-aware paradigm for light field image processing, achieving superior performance with reduced computational cost.

Abstract: A light field image captures scenes through an array of micro-lenses, providing a rich representation that encompasses spatial and angular information. While this richness comes at the cost of significant data redundancy, most existing light field methods still tend to indiscriminately utilize all the information from sub-aperture images (SAIs) in an attempt to harness every visual cue regardless of their disparity significance. However, this paradigm inevitably leads to disparity entanglement, a fundamental cause of inefficiency in light field image processing. To address this limitation, we introduce the Skim Transformer, a novel architecture inspired by the “less is more” philosophy. Unlike conventional light field Transformers, our Skim Transformer features a multi-branch structure where each branch is dedicated to a specific disparity range by constructing its attention score matrix over a skimmed subset of SAIs, rather than all of them. Building upon this core component, we present SkimLFSR, an efficient yet powerful network for light field super-resolution (LFSR). Requiring only 67% of parameters, SkimLFSR achieves state-of-the-art results surpassing the best existing method by an average of 0.59 dB and 0.35 dB in PSNR at the 2x and 4x tasks, respectively. Through in-depth analyses, we reveal that SkimLFSR, guided by the predefined skimmed SAI sets as prior knowledge, demonstrates distinct disparity-aware behaviors in attending to visual cues. These findings highlight its effectiveness and adaptability as a promising paradigm for light field image processing.
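
A minimal sketch of our reading of the skim idea: each branch builds attention only over a predefined subset of SAI tokens rather than all of them. The subsets, feature dimension, and fusion by averaging below are illustrative assumptions, not the paper's architecture.

```python
import torch
from torch import nn

class SkimBranch(nn.Module):
    def __init__(self, dim, sai_idx):
        super().__init__()
        self.register_buffer("sai_idx", torch.tensor(sai_idx))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens):              # tokens: (B, num_SAIs, dim)
        skimmed = tokens[:, self.sai_idx]   # attend over the skimmed subset only
        out, _ = self.attn(skimmed, skimmed, skimmed)
        return out.mean(dim=1)              # pooled branch feature

dim, tokens = 64, torch.randn(2, 25, 64)     # 5x5 light field -> 25 SAIs
branches = nn.ModuleList([
    SkimBranch(dim, [12]),                   # central view: near-zero disparity
    SkimBranch(dim, [10, 11, 12, 13, 14]),   # one row: moderate disparity
    SkimBranch(dim, list(range(25))),        # full set: large-disparity cues
])
fused = torch.stack([b(tokens) for b in branches], dim=1).mean(dim=1)
```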

[901] FQGA-single: Towards Fewer Training Epochs and Fewer Model Parameters for Image-to-Image Translation Tasks

Cho Yang

Main category: eess.IV

TL;DR: FQGA-single, a novel model inspired by CycleGAN, efficiently produces high-quality synthetic CT images, outperforming CycleGAN in both single- and multi-epoch training.

DetailsMotivation: To improve the efficiency and quality of synthetic CT (sCT) generation in medical imaging, particularly for CBCT-to-sCT conversion.

Method: Proposes FQGA-single, evaluates it on the SynthRAD dataset, and compares it with CycleGAN using quantitative and qualitative metrics. Also explores single-epoch training inspired by “One Epoch Is All You Need.”

Result: FQGA-single trained for a single epoch outperforms its own multi-epoch version and surpasses CycleGAN in every configuration tested, including a modified CycleGAN.

Conclusion: FQGA-single is a highly efficient and superior model for sCT generation, challenging traditional multi-epoch training paradigms.

Abstract: This paper proposes a novel model inspired by CycleGAN: FQGA-single to produce high quality medical synthetic CT (sCT) generated images more efficiently. Evaluations were done on the SynthRAD Grand Challenge dataset with the CycleGAN model used for benchmarking and for comparing the quality of CBCT-to-sCT generated images from both a quantitative and qualitative perspective. Finally, this paper also explores ideas from the paper “One Epoch Is All You Need” to compare models trained on a single epoch versus multiple epochs. Astonishing results from FQGA-single were obtained during this exploratory experiment, which show that the performance of FQGA-single when trained on a single epoch surpasses itself when trained on multiple epochs. More surprising is that its performance also surpasses CycleGAN’s multiple-epoch and single-epoch models, and even a modified version of CycleGAN.

[902] Rethinking Theoretical Illumination for Efficient Low-Light Image Enhancement

Shyang-En Weng, Cheng-Yen Hsiao, Li-Wei Lu, Yu-Shen Huang, Tzu-Han Chen, Shaou-Gang Miaou, Ricky Christanto

Main category: eess.IV

TL;DR: CPGA-Net+ enhances low-light image processing with lightweight and strong versions, balancing performance and computational efficiency.

DetailsMotivation: Addressing the challenge of low-light image enhancement and the need for lightweight models for edge devices.

Method: Extended CPGA-Net with attention mechanisms for local and global processing, tested in ultra-lightweight and stronger versions.

Result: The lightweight version cuts computational costs by over two-thirds; the stronger version balances local and global processing capability. Both outperform recent lightweight methods.

Conclusion: CPGA-Net+ offers scalable, efficient solutions for low-light image enhancement with limited resources.

Abstract: Enhancing low-light images remains a critical challenge in computer vision, as does designing lightweight models for edge devices that can handle the computational demands of deep learning. This article introduces an extended version of the Channel-Prior and Gamma-Estimation Network (CPGA-Net), termed CPGA-Net+, which incorporates theoretically grounded attention mechanisms for illumination in local and global processing. Additionally, we assess our approach through a theoretical analysis of the block design, introducing both an ultra-lightweight and a stronger version that follow the same design principles. The lightweight version significantly reduces computational costs by over two-thirds by utilizing the local branch as an auxiliary component. Meanwhile, the stronger version achieves an impressive balance by maximizing local and global processing capabilities. Our proposed methods have been validated as effective compared to recent lightweight approaches, offering superior performance and scalable solutions with limited computational resources.

[903] A Plug-and-Play Method for Guided Multi-contrast MRI Reconstruction based on Content/Style Modeling

Chinmay Rao, Matthias van Osch, Nicola Pezzotti, Jeroen de Bresser, Mark van Buchem, Laurens Beljaards, Jakob Meineke, Elwin de Weerdt, Huangling Lu, Mariya Doneva, Marius Staring

Main category: eess.IV

TL;DR: A modular two-stage approach, PnP-CoSMo, is proposed for guided MRI reconstruction using shared and non-shared generative factors, improving generalizability and acceleration over existing methods.

DetailsMotivation: Addressing the challenge of requiring large paired datasets for learning-based guided MRI reconstruction by leveraging unpaired data and disentangling content/style factors.

Method: A content/style model is learned from unpaired image data, applied as a plug-and-play operator in iterative reconstruction, combining data consistency and corrective processes.

Result: Improved generalizability on the NYU fastMRI dataset and up to 32.6% more acceleration on in-house datasets compared to non-guided reconstruction.

Conclusion: PnP-CoSMo offers a practical, interpretable solution for multi-contrast MRI reconstruction, reducing reliance on paired data and enhancing performance.

Abstract: Since multiple MRI contrasts of the same anatomy contain redundant information, one contrast can guide the reconstruction of an undersampled subsequent contrast. To this end, several end-to-end learning-based guided reconstruction methods have been proposed. However, a key challenge is the requirement of large paired training datasets comprising raw data and aligned reference images. We propose a modular two-stage approach addressing this issue, additionally providing an explanatory framework for the multi-contrast problem based on the shared and non-shared generative factors underlying two given contrasts. A content/style model of two-contrast image data is learned from a largely unpaired image-domain dataset and is subsequently applied as a plug-and-play operator in iterative reconstruction. The disentanglement of content and style allows explicit representation of contrast-independent and contrast-specific factors. Consequently, incorporating prior information into the reconstruction reduces to a simple replacement of the aliased content of the reconstruction iterate with high-quality content derived from the reference scan. Combining this component with a data consistency step and introducing a general corrective process for the content yields an iterative scheme. We name this novel approach PnP-CoSMo. Various aspects like interpretability and convergence are explored via simulations. Furthermore, its practicality is demonstrated on the public NYU fastMRI DICOM dataset, showing improved generalizability compared to end-to-end methods, and on two in-house multi-coil raw datasets, offering up to 32.6% more acceleration over learning-based non-guided reconstruction for a given SSIM.
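
The iterative scheme can be sketched as alternating content replacement with a data-consistency step. All operators below (content/style encoders, decoder, Fourier transforms) are hypothetical placeholders standing in for the learned model and the MRI forward model, not the paper's implementation.

```python
def pnp_cosmo(y, mask, ref_image, content_of, style_of, decode,
              fft2, ifft2, n_iters=20):
    """y: undersampled k-space; mask: sampled locations; ref_image: guide contrast."""
    x = ifft2(y)                              # zero-filled initialization
    ref_content = content_of(ref_image)       # contrast-independent factors
    for _ in range(n_iters):
        # Plug-and-play prior: swap in high-quality content from the reference
        x = decode(ref_content, style_of(x))
        # Data consistency: restore the measured k-space samples
        k = fft2(x)
        k[mask] = y[mask]
        x = ifft2(k)
    return x
```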

[904] L-FUSION: Laplacian Fetal Ultrasound Segmentation & Uncertainty Estimation

Johanna P. Müller, Robert Wright, Thomas G. Day, Lorenzo Venturini, Samuel F. Budd, Hadrien Reynaud, Joseph V. Hajnal, Reza Razavi, Bernhard Kainz

Main category: eess.IV

TL;DR: L-FUSION integrates uncertainty quantification and foundation models for robust fetal US segmentation, improving accuracy and diagnostic feedback.

DetailsMotivation: Operator dependency and technical limitations in prenatal US complicate image interpretation and the assessment of diagnostic uncertainty.

Method: Uses aleatoric logit distributions, Laplace approximations, and Dropout for uncertainty quantification and segmentation.

Result: Superior segmentation accuracy and reliable uncertainty quantification in fetal US, aiding clinical decision-making.

Conclusion: L-FUSION offers a scalable, automated solution for advancing fetal ultrasound analysis in clinical settings.

Abstract: Accurate analysis of prenatal ultrasound (US) is essential for early detection of developmental anomalies. However, operator dependency and technical limitations (e.g. intrinsic artefacts and effects, setting errors) can complicate image interpretation and the assessment of diagnostic uncertainty. We present L-FUSION (Laplacian Fetal US Segmentation with Integrated FoundatiON models), a framework that integrates uncertainty quantification through unsupervised, normative learning and large-scale foundation models for robust segmentation of fetal structures in normal and pathological scans. We propose to utilise the aleatoric logit distributions of Stochastic Segmentation Networks and Laplace approximations with fast Hessian estimations to estimate epistemic uncertainty only from the segmentation head. This enables us to achieve reliable abnormality quantification for instant diagnostic feedback. Combined with an integrated Dropout component, L-FUSION enables reliable differentiation of lesions from normal fetal anatomy with enhanced uncertainty maps and segmentation counterfactuals in US imaging. It improves epistemic and aleatoric uncertainty interpretation and removes the need for manual disease-labelling. Evaluations across multiple datasets show that L-FUSION achieves superior segmentation accuracy and consistent uncertainty quantification, supporting on-site decision-making and offering a scalable solution for advancing fetal ultrasound analysis in clinical settings.
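
One ingredient, the Dropout component, can be sketched as Monte Carlo dropout over the segmentation head: keep dropout active at test time, sample several forward passes, and read uncertainty off the predictive entropy. The toy head below is an illustrative assumption; the Laplace approximation the paper also uses is omitted here.

```python
import torch
from torch import nn

head = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                     nn.Dropout2d(0.2), nn.Conv2d(16, 2, 1))
head.train()  # keep dropout stochastic at inference time

feats = torch.randn(1, 16, 64, 64)           # stand-in encoder features
with torch.no_grad():
    probs = torch.stack([head(feats).softmax(dim=1) for _ in range(20)])

mean_prob = probs.mean(dim=0)                                    # (1, 2, H, W)
entropy = -(mean_prob * mean_prob.clamp_min(1e-8).log()).sum(1)  # uncertainty map
segmentation = mean_prob.argmax(dim=1)
```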

[905] Dual-domain Modulation Network for Lightweight Image Super-Resolution

Wenjie Li, Heng Guo, Yuefeng Hou, Guangwei Gao, Zhanyu Ma

Main category: eess.IV

TL;DR: The paper proposes a Dual-domain Modulation Network (DMNet) combining wavelet and Fourier information for lightweight image super-resolution (SR), achieving high-quality results with reduced computational costs.

DetailsMotivation: Existing frequency-based SR methods struggle to balance overall structure reconstruction and high-frequency details while being inefficient for lightweight applications.

Method: The method integrates wavelet-domain modulation via a Wavelet-domain Modulation Transformer (WMT) with global Fourier supervision for complementary spectral learning.

Result: The model matches the PSNR of SRFormer and MambaIR with under 50% and 60% of their FLOPs, respectively, and runs 15.4x and 5.4x faster at inference.

Conclusion: The proposed DMNet effectively balances SR quality and computational efficiency, making it suitable for lightweight applications.

Abstract: Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images under limited computational costs. We find that existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show that introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a Dual-domain Modulation Network that integrates both wavelet and Fourier information for enhanced frequency modeling. Unlike existing methods that rely on a single frequency representation, our design combines wavelet-domain modulation via a Wavelet-domain Modulation Transformer (WMT) with global Fourier supervision, enabling complementary spectral learning well-suited for lightweight SR. Experimental results show that our method achieves a comparable PSNR to SRFormer and MambaIR while with less than 50% and 60% of their FLOPs and achieving inference speeds 15.4x and 5.4x faster, respectively, demonstrating the effectiveness of our method on SR quality and lightweight. Code link: https://github.com/24wenjie-li/DMNet
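
The two frequency views can be illustrated with off-the-shelf transforms: a wavelet decomposition for local multi-scale detail and an FFT for global structure. The sketch below (assuming PyWavelets and NumPy) shows only the decompositions the network builds on, not the modulation modules themselves.

```python
import numpy as np
import pywt

img = np.random.rand(64, 64).astype(np.float32)  # stand-in low-res input

# Wavelet domain: approximation + (horizontal, vertical, diagonal) details
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")

# Fourier domain: global amplitude/phase, usable e.g. for a spectral loss
spec = np.fft.fft2(img)
amplitude, phase = np.abs(spec), np.angle(spec)

# A Fourier-domain supervision term would compare amplitude/phase of the
# super-resolved output against the high-resolution target (assumption).
```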

[906] Evaluating structural uncertainty in accelerated MRI: are voxelwise measures useful surrogates?

Luca L. C. Trautmann, Peter A. Wijeratne, Itamar Ronen, Ivor J. A. Simpson

Main category: eess.IV

TL;DR: The paper highlights the inadequacy of voxel-level uncertainty measures in capturing morphological uncertainty in accelerated reconstruction algorithms, using segmentation as a downstream task to demonstrate this limitation.

DetailsMotivation: Current uncertainty quantification methods in clinical reconstruction algorithms focus on voxel intensity variations, which lack structural interpretation and fail to predict morphological uncertainty.

Method: The study employs ensembles of reconstruction models to measure uncertainty, specifically analyzing variability and bias in morphological structures through segmentation.

Result: Findings reveal that voxel-level uncertainty does not correlate well with morphological uncertainty, and within-ensemble variability cannot be predicted by voxel intensity variations alone.

Conclusion: The work underscores the need for uncertainty measures that account for structural and morphological aspects in clinical reconstruction algorithms.

Abstract: Introducing accelerated reconstruction algorithms into clinical settings requires measures of uncertainty quantification that accurately assess the relevant uncertainty introduced by the reconstruction algorithm. Many currently deployed approaches quantify uncertainty by measuring the variability in voxelwise intensity. Although these provide interpretable maps, they lack a structural interpretation and do not show a clear relationship to how the data will be analysed subsequently. In this work we show that voxel-level uncertainty does not provide insight into morphological uncertainty. To do so, we use segmentation as a clinically relevant downstream task and deploy ensembles of reconstruction models to measure uncertainty in the reconstructions. We show that variability and bias in the morphological structures are present, and that within-ensemble variability cannot be predicted well from voxel intensity variations alone.
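
The core comparison can be sketched in a few lines: compute a voxelwise standard-deviation map across an ensemble and, separately, the spread of a downstream morphological quantity. The random volumes and the threshold "segmentation" below are stand-ins for real reconstructions and a real segmentation model.

```python
import numpy as np

rng = np.random.default_rng(0)
# 8 stand-in ensemble reconstructions of a 32^3 volume
reconstructions = rng.normal(loc=1.0, scale=0.1, size=(8, 32, 32, 32))

voxel_uncertainty = reconstructions.std(axis=0)          # voxelwise map
volumes = (reconstructions > 1.0).sum(axis=(1, 2, 3))    # per-member structure volume
morph_uncertainty = volumes.std()                        # structural spread

# The paper's finding, phrased in these terms: summaries of
# voxel_uncertainty do not predict morph_uncertainty well.
print(voxel_uncertainty.mean(), morph_uncertainty)
```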

[907] SCReedSolo: A Secure and Robust LSB Image Steganography Framework with Randomized Symmetric Encryption and Reed-Solomon Coding

Syed Rifat Raiyan, Md. Hasanul Kabir

Main category: eess.IV

TL;DR: The paper introduces SCReedSolo, a framework for hiding binary data in images using encryption, error correction, and LSB steganography, achieving high payload and resilience.

DetailsMotivation: To address vulnerabilities in image steganography by combining security (encryption) and resilience (error correction) while maintaining high payload capacity.

Method: Uses Random Shuffling, Fernet Symmetric Encryption, Reed-Solomon Error Correction, and LSB Steganography to embed data.

Result: Achieves 3 bits per pixel payload, resists steganalysis, and ensures successful transmission with error correction.

Conclusion: SCReedSolo is effective for secure and resilient image steganography.

Abstract: Image steganography is an information-hiding technique that involves the surreptitious concealment of covert informational content within digital images. In this paper, we introduce SCReedSolo, a novel framework for concealing arbitrary binary data within images. Our approach synergistically leverages Random Shuffling, Fernet Symmetric Encryption, and Reed-Solomon Error Correction Codes to encode the secret payload, which is then discretely embedded into the carrier image using LSB (Least Significant Bit) Steganography. The combination of these methods addresses the vulnerability vectors of both security and resilience against bit-level corruption in the resultant stego-images. We show that our framework achieves a data payload of 3 bits per pixel for an RGB image, and mathematically assess the probability of successful transmission for the amalgamated n message bits and k error correction bits. Additionally, we find that SCReedSolo yields good results upon being evaluated with multiple performance metrics, successfully eludes detection by various passive steganalysis tools, and is immune to simple active steganalysis attacks. Our code and data are available at https://github.com/Starscream-11813/SCReedSolo-Steganography.
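
A minimal sketch of the encoding chain (random shuffling omitted for brevity), assuming the cryptography and reedsolo packages. The parity count and the random cover image are illustrative; note that writing one bit into each R, G, and B value does yield the 3 bits per pixel payload the abstract reports.

```python
import numpy as np
from cryptography.fernet import Fernet
from reedsolo import RSCodec

key = Fernet.generate_key()
payload = Fernet(key).encrypt(b"covert message")     # symmetric encryption
coded = RSCodec(16).encode(payload)                  # add 16 parity bytes

bits = np.unpackbits(np.frombuffer(bytes(coded), dtype=np.uint8))
cover = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)
flat = cover.reshape(-1).copy()
assert bits.size <= flat.size, "cover image too small for payload"
flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits  # overwrite LSBs
stego = flat.reshape(cover.shape)

# Extraction reverses the steps: read the LSBs, RS-decode, Fernet-decrypt.
```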

[908] Retuve: Automated Multi-Modality Analysis of Hip Dysplasia with Open Source AI

Adam McArthur, Stephanie Wichuk, Stephen Burnside, Andrew Kirby, Alexander Scammon, Damian Sol, Abhilash Hareendranathan, Jacob L. Jaremko

Main category: eess.IV

TL;DR: Retuve is an open-source framework for multi-modality DDH analysis, addressing reproducibility and standardization issues in DDH screening with AI-driven tools.

DetailsMotivation: Current DDH screening lacks standardization, and AI studies face reproducibility challenges due to limited data and code availability.

Method: Retuve provides a complete workflow with open datasets, pre-trained models, and a Python API for automated segmentation and landmark detection in US and X-ray images.

Result: The framework enables automated measurement of diagnostic parameters like the alpha angle and acetabular index.

Conclusion: Retuve promotes transparency and collaboration in DDH research, potentially improving screening, early diagnosis, and patient outcomes.

Abstract: Developmental dysplasia of the hip (DDH) poses significant diagnostic challenges, hindering timely intervention. Current screening methodologies lack standardization, and AI-driven studies suffer from reproducibility issues due to limited data and code availability. To address these limitations, we introduce Retuve, an open-source framework for multi-modality DDH analysis, encompassing both ultrasound (US) and X-ray imaging. Retuve provides a complete and reproducible workflow, offering open datasets comprising expert-annotated US and X-ray images, pre-trained models with training code and weights, and a user-friendly Python Application Programming Interface (API). The framework integrates segmentation and landmark detection models, enabling automated measurement of key diagnostic parameters such as the alpha angle and acetabular index. By adhering to open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research. This initiative has the potential to democratize DDH screening, facilitate early diagnosis, and ultimately improve patient outcomes by enabling widespread screening and early intervention. The GitHub repository/code can be found here: https://github.com/radoss-org/retuve
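
For intuition about the measurements being automated (this is plain geometry, not the Retuve API): once landmark detection yields the baseline and the bony-roof line on a hip ultrasound, the Graf alpha angle is the angle between those two lines. The landmark coordinates below are invented for illustration.

```python
import numpy as np

def angle_between(p1, p2, q1, q2):
    """Acute angle in degrees between lines p1-p2 and q1-q2."""
    v, w = np.subtract(p2, p1), np.subtract(q2, q1)
    cos = abs(np.dot(v, w)) / (np.linalg.norm(v) * np.linalg.norm(w))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

baseline = ((10.0, 5.0), (10.5, 90.0))    # hypothetical ilium landmarks
bony_roof = ((12.0, 60.0), (55.0, 85.0))  # hypothetical roof landmarks
alpha = angle_between(*baseline, *bony_roof)
print(f"alpha angle: {alpha:.1f} degrees")
```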

[909] A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography

Yui Lo, Yuqian Chen, Dongnan Liu, Leo Zekelman, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Fan Zhang, Weidong Cai, Lauren J. O’Donnell

Main category: eess.IV

TL;DR: Tract2Shape is a deep learning framework for predicting white matter tractography shape measures efficiently, outperforming SOTA models and showing strong generalizability.

DetailsMotivation: Conventional methods for computing shape measures are computationally expensive, limiting large-scale analysis. Tract2Shape aims to provide a faster, scalable solution.

Method: Uses multimodal deep learning with geometric (point cloud) and scalar (tabular) features, plus dimensionality reduction (PCA), to predict shape measures. Evaluated on HCP-YA and PPMI datasets.

Result: Outperforms SOTA models in accuracy (highest Pearson’s r, lowest nMSE) and generalizability (high performance on unseen PPMI dataset). Multimodal input and PCA enhance performance.

Conclusion: Tract2Shape offers fast, accurate, and scalable prediction of white matter shape measures, supporting future large-scale analyses.

Abstract: Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson’s r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson’s r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
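
The dimensionality-reduction step can be sketched with scikit-learn: fit PCA on the ten shape measures, have the network predict the five components, and invert the projection at inference. The array shapes and random stand-in data below are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

shape_measures = np.random.rand(500, 10)      # stand-in: n_tracts x 10 measures

pca = PCA(n_components=5).fit(shape_measures)
targets = pca.transform(shape_measures)       # what the network is trained to predict

# ... train a model mapping point-cloud + tabular input to `targets` ...

predicted_components = targets[:5]            # stand-in network outputs
recovered = pca.inverse_transform(predicted_components)  # back to 10 measures
```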

[910] Efficient RAW Image Deblurring with Adaptive Frequency Modulation

Wenlong Jiao, Binglong Li, Wei Shang, Ping Wang, Dongwei Ren

Main category: eess.IV

TL;DR: FrENet is a frequency-domain framework for RAW-to-RAW image deblurring, outperforming state-of-the-art methods with adaptive frequency modulation and skip connections.

DetailsMotivation: RAW images offer superior restoration potential over sRGB images but are underexplored, and existing methods struggle with frequency-dependent blur and efficiency.

Method: FrENet operates in the frequency domain with an Adaptive Frequency Positional Modulation module and frequency domain skip connections.

Result: FrENet achieves better restoration quality and efficiency (reduced MACs) than state-of-the-art methods and adapts well to sRGB images.

Conclusion: FrENet is a highly effective and adaptable solution for RAW and sRGB image deblurring, with potential for broader applications.

Abstract: Image deblurring plays a crucial role in enhancing visual clarity across various applications. Although most deep learning approaches primarily focus on sRGB images, which inherently lose critical information during the image signal processing pipeline, RAW images, being unprocessed and linear, possess superior restoration potential but remain underexplored. Deblurring RAW images presents unique challenges, particularly in handling frequency-dependent blur while maintaining computational efficiency. To address these issues, we propose Frequency Enhanced Network (FrENet), a framework specifically designed for RAW-to-RAW deblurring that operates directly in the frequency domain. We introduce a novel Adaptive Frequency Positional Modulation module, which dynamically adjusts frequency components according to their spectral positions, thereby enabling precise control over the deblurring process. Additionally, frequency domain skip connections are adopted to further preserve high-frequency details. Experimental results demonstrate that FrENet surpasses state-of-the-art deblurring methods in RAW image deblurring, achieving significantly better restoration quality while maintaining high efficiency in terms of reduced MACs. Furthermore, FrENet’s adaptability enables it to be extended to sRGB images, where it delivers comparable or superior performance compared to methods specifically designed for sRGB data. The code will be available at https://github.com/WenlongJiao/FrENet .
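
A minimal sketch of position-dependent spectral modulation as we read it from the abstract (not the paper's AFPM code): transform features with a real FFT, scale each frequency bin by a learnable per-position gain, and invert.

```python
import torch
from torch import nn

class FreqModulation(nn.Module):
    def __init__(self, channels, h, w):
        super().__init__()
        # one learnable gain per channel and spectral position
        self.weight = nn.Parameter(torch.ones(channels, h, w // 2 + 1))

    def forward(self, x):                      # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.weight              # modulate by spectral position
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

x = torch.randn(2, 8, 64, 64)
y = FreqModulation(8, 64, 64)(x)
```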

[911] Spatio-Temporal Representation Decoupling and Enhancement for Federated Instrument Segmentation in Surgical Videos

Zheng Fang, Xiaoming Qi, Chun-Mei Feng, Jialun Pei, Weixin Si, Yueming Jin

Main category: eess.IV

TL;DR: A novel Personalized Federated Learning (FL) scheme, FedST, is proposed for surgical instrument segmentation, leveraging domain knowledge to address diverse anatomical backgrounds and synthetic data usage.

DetailsMotivation: Existing FL methods for surgical data science lack consideration of domain-specific characteristics like diverse backgrounds and synthetic data potential.

Method: FedST uses a Representation Separation and Cooperation (RSC) mechanism for local training and Synthesis-based Explicit Representation Quantification (SERQ) for global training.

Result: The method enhances segmentation by decoupling background encoding and capturing consistent instrument representations.

Conclusion: FedST effectively integrates surgical domain knowledge into FL, improving segmentation performance and generalization.

Abstract: Surgical instrument segmentation under Federated Learning (FL) is a promising direction, which enables multiple surgical sites to collaboratively train the model without centralizing datasets. However, there exist very limited FL works in surgical data science, and FL methods for other modalities do not consider inherent characteristics in surgical domain: i) different scenarios show diverse anatomical backgrounds while highly similar instrument representation; ii) there exist surgical simulators which promote large-scale synthetic data generation with minimal efforts. In this paper, we propose a novel Personalized FL scheme, Spatio-Temporal Representation Decoupling and Enhancement (FedST), which wisely leverages surgical domain knowledge during both local-site and global-server training to boost segmentation. Concretely, our model embraces a Representation Separation and Cooperation (RSC) mechanism in local-site training, which decouples the query embedding layer to be trained privately, to encode respective backgrounds. Meanwhile, other parameters are optimized globally to capture the consistent representations of instruments, including the temporal layer to capture similar motion patterns. A textual-guided channel selection is further designed to highlight site-specific features, facilitating model adaptation to each site. Moreover, in global-server training, we propose Synthesis-based Explicit Representation Quantification (SERQ), which defines an explicit representation target based on synthetic data to synchronize the model convergence during fusion for improving model generalization.
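
The personalization mechanism can be sketched as federated averaging that skips site-private parameters. The state-dict interface and the name filter for the private query-embedding layer are illustrative assumptions, not the paper's code.

```python
import copy
import torch

def federated_average(site_states, private_key="query_embed"):
    """Average state dicts across sites, keeping private layers local."""
    global_state = copy.deepcopy(site_states[0])
    for name in global_state:
        if private_key in name:
            continue  # background-specific layers stay local to each site
        global_state[name] = torch.stack(
            [s[name].float() for s in site_states]).mean(dim=0)
    return global_state
```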

[912] Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean Data

Xuesong Li, Nassir Navab, Zhongliang Jiang

Main category: eess.IV

TL;DR: Speckle2Self is a self-supervised algorithm for denoising medical ultrasound images by leveraging multi-scale perturbations to isolate speckle noise while preserving anatomical structure.

DetailsMotivation: Existing denoising methods fail for ultrasound speckle noise due to its tissue-dependent nature and high spatial dependency, making traditional approaches like Noise2Noise or blind-spot networks ineffective.

Method: Speckle2Self uses multi-scale perturbation (MSP) to introduce tissue-dependent variations in speckle patterns, modeling the clean image as low-rank and isolating noise as sparse.

Result: The method outperforms conventional filter-based and state-of-the-art learning-based denoising algorithms on simulated and real human carotid ultrasound images.

Conclusion: Speckle2Self effectively addresses the unique challenges of ultrasound speckle noise, demonstrating strong performance and adaptability across different ultrasound machines.

Abstract: Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by pioneering Noise2Noise, is not feasible. Additionally, blind-spot networks also cannot handle US speckle noise due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and SOTA learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Project page: https://noseefood.github.io/us-speckle2self/
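
A sketch of the multi-scale perturbation idea as described: rescaling a single noisy image perturbs the tissue-dependent speckle while preserving anatomy, yielding self-supervised training pairs from one observation. The scales and bilinear resampling below are assumptions.

```python
import torch
import torch.nn.functional as F

def msp(img, scale):                      # img: (B, 1, H, W)
    """Down- then up-sample to perturb speckle at a given scale."""
    h, w = img.shape[-2:]
    down = F.interpolate(img, scale_factor=scale, mode="bilinear",
                         align_corners=False)
    return F.interpolate(down, size=(h, w), mode="bilinear",
                         align_corners=False)

noisy = torch.rand(1, 1, 128, 128)        # a single noisy US observation
pairs = [(msp(noisy, s), noisy) for s in (0.5, 0.7, 0.9)]
# A denoiser trained on such (input, target) pairs learns the shared
# anatomical (low-rank) signal and suppresses the sparse speckle component.
```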

[913] Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration?

Hanxue Gu, Yaqian Chen, Nicholas Konz, Qihang Li, Maciej A. Mazurowski

Main category: eess.IV

TL;DR: Foundation models like SAM show promise for zero-shot breast MRI registration, outperforming traditional methods in global alignment but struggling with fine details. Domain-specific training doesn’t consistently improve results.

DetailsMotivation: To evaluate if foundation models can handle the complexity of breast MRI registration, given challenges like anatomical variation and deformable structures.

Method: Assessed five pre-trained encoders (DINO-v2, SAM, MedSAM, SSLSAM, MedCLIP) across four breast registration tasks, comparing performance against traditional baselines.

Result: SAM outperformed traditional methods in global alignment but struggled with fine fibroglandular tissue details. Domain-specific pre-training didn’t consistently help.

Conclusion: Foundation models show potential for breast MRI registration, but further work is needed to improve fine-detail accuracy and understand domain-specific training effects.

Abstract: Foundation models, pre-trained on large image datasets and capable of capturing rich feature representations, have recently shown potential for zero-shot image registration. However, their performance has mostly been tested in the context of rigid or less complex structures, such as the brain or abdominal organs, and it remains unclear whether these models can handle more challenging, deformable anatomy. Breast MRI registration is particularly difficult due to significant anatomical variation between patients, deformation caused by patient positioning, and the presence of thin and complex internal structure of fibroglandular tissue, where accurate alignment is crucial. Whether foundation model-based registration algorithms can address this level of complexity remains an open question. In this study, we provide a comprehensive evaluation of foundation model-based registration algorithms for breast MRI. We assess five pre-trained encoders, including DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP, across four key breast registration tasks that capture variations in different years and dates, sequences, modalities, and patient disease status (lesion versus no lesion). Our results show that foundation model-based algorithms such as SAM outperform traditional registration baselines for overall breast alignment, especially under large domain shifts, but struggle with capturing fine details of fibroglandular tissue. Interestingly, additional pre-training or fine-tuning on medical or breast-specific images in MedSAM and SSLSAM does not improve registration performance and may even decrease it in some cases. Further work is needed to understand how domain-specific training influences registration and to explore targeted strategies that improve both global alignment and fine structure accuracy. We also publicly release our code at https://github.com/mazurowski-lab/Foundation-based-reg.

[914] A Steel Surface Defect Detection Method Based on Lightweight Convolution Optimization

Cong Chen, Ming Chen, Hoileong Lee, Yan Li, Jiyang Yu

Main category: eess.IV

TL;DR: A deep learning framework (YOLOv9s with C3Ghost, SCConv, and CARAFE) improves multi-scale steel surface defect detection by optimizing feature representation, reducing redundancy, and enhancing upsampling.

DetailsMotivation: Traditional methods struggle with accuracy and miss-detection for small defects in steel surfaces due to varied defect sizes and shapes.

Method: Combines YOLOv9s with SCConv (reduces feature redundancy), C3Ghost (enhances feature extraction), and CARAFE (improves upsampling).

Result: Higher accuracy and robustness in defect detection compared to other methods.

Conclusion: The proposed framework effectively addresses multi-scale steel surface defect detection challenges.

Abstract: Surface defect detection of steel, especially the recognition of multi-scale defects, has always been a major challenge in industrial manufacturing. Steel surfaces exhibit defects of various sizes and shapes, which limit the accuracy of traditional image processing and detection methods in complex environments; in particular, traditional defect detection methods suffer from insufficient accuracy and high miss-detection rates when dealing with small target defects. To address this issue, this study proposes a detection framework based on deep learning, specifically YOLOv9s, combined with the C3Ghost module, SCConv module, and CARAFE upsampling operator, to improve detection accuracy and model performance. First, the SCConv module is used to reduce feature redundancy and optimize feature representation by reconstructing the spatial and channel dimensions. Second, the C3Ghost module is introduced to enhance the model’s feature extraction ability by reducing redundant computations and parameter volume, thereby improving model efficiency. Finally, the CARAFE upsampling operator, which can more finely reorganize feature maps in a content-aware manner, optimizes the upsampling process and ensures detailed restoration of high-resolution defect regions. Experimental results demonstrate that the proposed model achieves higher accuracy and robustness in steel surface defect detection tasks compared to other methods, effectively addressing defect detection problems.

[915] Edge2Prompt: Modality-Agnostic Model for Out-of-Distribution Liver Segmentation

Nathan Hollet, Oumeymah Cherkaoui, Philippe C. Cattin, Sidaty El Hadramy

Main category: eess.IV

TL;DR: Edge2Prompt is a modality-agnostic liver segmentation pipeline combining edge detection and foundation models, outperforming classical methods in data-scarce and OOD scenarios.

DetailsMotivation: Liver segmentation is crucial for clinical workflows but faces challenges due to modality-specific tools and limited data.

Method: Integrates edge detection with U-Net and SAM-2 to generate prompts for 2D segmentation, reconstructing 3D volumes.

Result: Achieves 86.4% mean Dice Score on OOD tasks, outperforming U-Net by 27.4% and other methods by 9.1%.

Conclusion: Edge2Prompt bridges classical and foundation models for adaptable, data-efficient segmentation.

Abstract: Liver segmentation is essential for preoperative planning in interventions like tumor resection or transplantation, but implementation in clinical workflows faces challenges due to modality-specific tools and data scarcity. We propose Edge2Prompt, a novel pipeline for modality-agnostic liver segmentation that generalizes to out-of-distribution (OOD) data. Our method integrates classical edge detection with foundation models. Modality-agnostic edge maps are first extracted from input images, then processed by a U-Net to generate logit-based prompts. These prompts condition the Segment Anything Model 2 (SAM-2) to generate 2D liver segmentations, which can then be reconstructed into 3D volumes. Evaluated on the multi-modal CHAOS dataset, Edge2Prompt achieves competitive results compared to classical segmentation methods when trained and tested in-distribution (ID), and outperforms them in data-scarce scenarios due to the SAM-2 module. Furthermore, it achieves a mean Dice Score of 86.4% on OOD tasks, outperforming U-Net baselines by 27.4% and other self-prompting methods by 9.1%, demonstrating its effectiveness. This work bridges classical and foundation models for clinically adaptable, data-efficient segmentation.
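
The pipeline reads naturally as three stages per 2D slice. In the sketch below, Canny stands in for the unspecified edge detector, and unet and sam2_segment are hypothetical callables for the trained prompt network and the SAM-2 predictor; neither reflects the actual APIs.

```python
import numpy as np
import cv2

def edge2prompt_slice(img_2d, unet, sam2_segment):
    """Modality-agnostic edge map -> U-Net logits -> SAM-2 liver mask."""
    edges = cv2.Canny(img_2d.astype(np.uint8), 50, 150)   # edge extraction
    logits = unet(edges[None, None].astype(np.float32))   # logit-based prompt
    return sam2_segment(img_2d, prompt_logits=logits)     # 2D segmentation

# A 3D volume is segmented slice by slice and the masks restacked into a volume.
```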

[916] LWT-ARTERY-LABEL: A Lightweight Framework for Automated Coronary Artery Identification

Shisheng Zhang, Ramtin Gharleghi, Sonit Singh, Daniel Moses, Dona Adikari, Arcot Sowmya, Susann Beier

Main category: eess.IV

TL;DR: A lightweight method combining anatomical knowledge and rule-based topology constraints for automated coronary artery labelling, achieving state-of-the-art performance.

DetailsMotivation: Coronary artery disease (CAD) is a leading cause of death, and CTCA is a key diagnostic tool. Current labelling methods are inefficient or lack clinical knowledge.

Method: Integrates anatomical knowledge with rule-based topology constraints for coronary artery labelling.

Result: Achieves state-of-the-art performance on benchmark datasets.

Conclusion: Proposes a promising alternative for automated coronary artery labelling, addressing inefficiencies and knowledge gaps in existing methods.

Abstract: Coronary artery disease (CAD) remains the leading cause of death globally, with computed tomography coronary angiography (CTCA) serving as a key diagnostic tool. However, coronary arterial analysis using CTCA, such as identifying artery-specific features from computational modelling, is labour-intensive and time-consuming. Automated anatomical labelling of coronary arteries offers a potential solution, yet the inherent anatomical variability of coronary trees presents a significant challenge. Traditional knowledge-based labelling methods fall short in leveraging data-driven insights, while recent deep-learning approaches often demand substantial computational resources and overlook critical clinical knowledge. To address these limitations, we propose a lightweight method that integrates anatomical knowledge with rule-based topology constraints for effective coronary artery labelling. Our approach achieves state-of-the-art performance on benchmark datasets, providing a promising alternative for automated coronary artery labelling.
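
To illustrate what a rule-based topology constraint can look like (made-up rules and data structure, not the paper's method): walk the left coronary tree from the ostium and assign labels by branching order and direction.

```python
def label_left_tree(root):
    """Toy labelling of a left coronary tree by branch direction."""
    labels = {root["id"]: "LM"}                 # left main starts at the ostium
    children = sorted(root["children"], key=lambda b: b["direction_x"])
    if children:
        labels[children[0]["id"]] = "LCx"       # leftward/posterior branch
        labels[children[-1]["id"]] = "LAD"      # anterior descending branch
    return labels

tree = {"id": 0, "children": [{"id": 1, "direction_x": -0.7, "children": []},
                              {"id": 2, "direction_x": 0.6, "children": []}]}
print(label_left_tree(tree))
```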

[917] Fusion-Based Brain Tumor Classification Using Deep Learning and Explainable AI, and Rule-Based Reasoning

Melika Filvantorkaman, Mohsen Piri, Maral Filvan Torkaman, Ashkan Zabihi, Hamidreza Moradi

Main category: eess.IV

TL;DR: An ensemble deep learning framework combining MobileNetV2 and DenseNet121 CNNs with soft voting achieves high accuracy (91.7%) in classifying brain tumors (glioma, meningioma, pituitary adenoma). It includes Explainable AI (XAI) for transparency and clinical validation.

DetailsMotivation: Accurate and interpretable brain tumor classification from MRI is crucial for diagnosis and treatment planning.

Method: The study uses an ensemble of MobileNetV2 and DenseNet121 CNNs with soft voting, trained on the Figshare dataset via 5-fold cross-validation. It integrates Grad-CAM++ for saliency visualization and Clinical Decision Rule Overlay (CDRO) for radiological heuristics.

Result: The ensemble achieved 91.7% accuracy, with strong spatial alignment (Dice 0.88, IoU 0.78) between model attention and expert annotations. Radiologists rated explanation usefulness highly (mean 4.4).

Conclusion: The framework provides a robust, interpretable solution for brain tumor classification, enhancing clinical trust and integration of deep learning in neurodiagnostics.

Abstract: Accurate and interpretable classification of brain tumors from magnetic resonance imaging (MRI) is critical for effective diagnosis and treatment planning. This study presents an ensemble-based deep learning framework that combines MobileNetV2 and DenseNet121 convolutional neural networks (CNNs) using a soft voting strategy to classify three common brain tumor types: glioma, meningioma, and pituitary adenoma. The models were trained and evaluated on the Figshare dataset using a stratified 5-fold cross-validation protocol. To enhance transparency and clinical trust, the framework integrates an Explainable AI (XAI) module employing Grad-CAM++ for class-specific saliency visualization, alongside a symbolic Clinical Decision Rule Overlay (CDRO) that maps predictions to established radiological heuristics. The ensemble classifier achieved superior performance compared to individual CNNs, with an accuracy of 91.7%, precision of 91.9%, recall of 91.7%, and F1-score of 91.6%. Grad-CAM++ visualizations revealed strong spatial alignment between model attention and expert-annotated tumor regions, supported by Dice coefficients up to 0.88 and IoU scores up to 0.78. Clinical rule activation further validated model predictions in cases with distinct morphological features. A human-centered interpretability assessment involving five board-certified radiologists yielded high Likert-scale scores for both explanation usefulness (mean = 4.4) and heatmap-region correspondence (mean = 4.0), reinforcing the framework’s clinical relevance. Overall, the proposed approach offers a robust, interpretable, and generalizable solution for automated brain tumor classification, advancing the integration of deep learning into clinical neurodiagnostics.
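
Soft voting itself is compact: average the two networks' class probabilities and take the argmax. The sketch assumes torchvision backbones with a three-class head; in the study, both models are first fine-tuned on the MRI data.

```python
import torch
from torchvision import models

num_classes = 3  # glioma, meningioma, pituitary adenoma
m1 = models.mobilenet_v2(num_classes=num_classes)
m2 = models.densenet121(num_classes=num_classes)
m1.eval(); m2.eval()

def soft_vote(x):
    with torch.no_grad():
        p1 = m1(x).softmax(dim=1)
        p2 = m2(x).softmax(dim=1)
    return ((p1 + p2) / 2).argmax(dim=1)  # average probabilities, then decide

pred = soft_vote(torch.randn(4, 3, 224, 224))
```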

[918] Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments

Gian Mario Favero, Ge Ya Luo, Nima Fathi, Justin Szeto, Douglas L. Arnold, Brennan Nichyporuk, Chris Pal, Tal Arbel

Main category: eess.IV

TL;DR: A treatment-aware spatio-temporal diffusion model is introduced to predict future lesion evolution in Multiple Sclerosis (MS) using multi-modal patient data, showing potential for clinical applications.

DetailsMotivation: To advance personalized medicine for heterogeneous diseases like MS by predicting lesion evolution and treatment outcomes.

Method: A voxel-space approach incorporating MRI and treatment data to forecast new and enlarging T2 lesion masks.

Result: Accurate prediction of lesion masks across six treatments, validated on 2131 patient MRIs, with potential for clinical tasks like lesion count and counterfactual analysis.

Conclusion: The model demonstrates the promise of causal, image-based generative models for improving MS prognostics.

Abstract: Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.

[919] Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities

Anindya Bijoy Das, Shahnewaz Karim Sakib, Shibbir Ahmed

Main category: eess.IV

TL;DR: The study investigates hallucinations in LLMs applied to medical imaging, analyzing errors in image-to-text and text-to-image tasks, and identifies common patterns and contributing factors.

DetailsMotivation: To address the issue of hallucinations in LLMs used for medical imaging, which can mislead clinical decisions, by examining errors in both interpretive and generative tasks.

Method: Analyzes hallucinations in two directions: image-to-text (report generation from scans) and text-to-image (image creation from prompts), using expert-informed criteria across imaging modalities.

Result: Reveals common hallucination patterns in both tasks, highlighting implications for clinical reliability and identifying contributing factors like model architecture and training data.

Conclusion: Provides insights for improving the safety and trustworthiness of LLM-driven medical imaging systems by systematically studying errors in understanding and generation.

Abstract: Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image to text, where LLMs generate reports from X-ray, CT, or MRI scans, and text to image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM driven medical imaging systems.

[920] 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression

Yuke Xing, William Gordon, Qi Yang, Kaifa Yang, Jiarui Wang, Yiling Xu

Main category: eess.IV

TL;DR: 3DGS-VBench is introduced as a dataset and benchmark for assessing the quality of compressed 3D Gaussian Splatting (3DGS) models, addressing the lack of systematic quality evaluation in existing compression methods.

DetailsMotivation: The substantial storage requirements of 3DGS hinder practical deployment, and existing compression methods introduce distortions without proper quality assessment.

Method: A large-scale dataset (3DGS-VBench) is created with 660 compressed 3DGS models and videos from 11 scenes, annotated by 50 participants for MOS scores. Six SOTA compression algorithms are benchmarked.

Result: The dataset provides reliable MOS scores and benchmarks storage efficiency and visual quality of 3DGS compression methods. Fifteen quality assessment metrics are evaluated.

Conclusion: 3DGS-VBench facilitates specialized VQA model training and advances research in 3DGS compression and quality assessment.

Abstract: 3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at https://github.com/YukeXing/3DGS-VBench.

[921] SAGCNet: Spatial-Aware Graph Completion Network for Missing Slice Imputation in Population CMR Imaging

Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Stefan K. Piechnik, Joao A C Lima, Steffen E. Petersen, Le Zhang

Main category: eess.IV

TL;DR: SAGCNet improves MRI slice imputation by modeling inter-slice relationships and leveraging 3D spatial context, outperforming existing methods.

DetailsMotivation: MRI slice imputation is challenging due to missing slices and the complexity of 3D data. Existing methods struggle with inter-slice correlations and spatial context.

Method: SAGCNet introduces a volumetric slice graph completion module and a spatial adapter to model inter-slice relationships and capture 3D spatial information.

Result: SAGCNet outperforms state-of-the-art methods in synthesizing missing CMR slices, even with limited data.

Conclusion: SAGCNet effectively addresses MRI slice imputation challenges, offering improved accuracy and robustness.

Abstract: Magnetic resonance imaging (MRI) provides detailed soft-tissue characteristics that assist in disease diagnosis and screening. However, the accuracy of clinical practice is often hindered by missing or unusable slices due to various factors. Volumetric MRI synthesis methods have been developed to address this issue by imputing missing slices from available ones. The inherent 3D nature of volumetric MRI data, such as cardiac magnetic resonance (CMR), poses significant challenges for missing slice imputation approaches, including (1) the difficulty of modeling local inter-slice correlations and dependencies of volumetric slices, and (2) the limited exploration of crucial 3D spatial information and global context. In this study, to mitigate these issues, we present Spatial-Aware Graph Completion Network (SAGCNet) to overcome the dependency on complete volumetric data, featuring two main innovations: (1) a volumetric slice graph completion module that incorporates the inter-slice relationships into a graph structure, and (2) a volumetric spatial adapter component that enables our model to effectively capture and utilize various forms of 3D spatial context. Extensive experiments on cardiac MRI datasets demonstrate that SAGCNet is capable of synthesizing absent CMR slices, outperforming competitive state-of-the-art MRI synthesis methods both quantitatively and qualitatively. Notably, our model maintains superior performance even with limited slice data.
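
The slice-graph view can be sketched with a chain graph over the stack, where a missing slice is filled from its neighbours. The mean aggregation below is a deliberately naive stand-in for the learned graph completion module, and all shapes are illustrative.

```python
import numpy as np

num_slices, dim = 12, 32
feats = np.random.rand(num_slices, dim)   # stand-in per-slice features
missing = 5

adj = np.zeros((num_slices, num_slices))
for i in range(num_slices - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0   # edges between adjacent slices

neighbours = adj[missing].astype(bool)
feats[missing] = feats[neighbours].mean(axis=0)  # naive neighbourhood fill
```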

[922] Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications

Zelin Qiu, Xi Wang, Zhuoyao Xie, Juan Zhou, Yu Wang, Lingjie Yang, Xinrui Jiang, Juyoung Bae, Moo Hyun Son, Qiang Ye, Dexuan Chen, Rui Zhang, Tao Li, Neeraj Ramesh Mahboobani, Varut Vardhanabhuti, Xiaohui Duan, Yinghua Zhao, Hao Chen

Main category: eess.IV

TL;DR: PRISM, a foundation model pre-trained with large-scale multi-sequence MRI, outperforms non-pretrained and existing models across 44 downstream tasks, demonstrating robust generalization across diverse MRI protocols.

DetailsMotivation: The heterogeneity among MRI sequences challenges deep learning models' generalization, limiting clinical utility. PRISM aims to address this by learning robust representations.

Method: PRISM uses a novel pretraining paradigm to disentangle anatomically invariant features from sequence-specific variations, leveraging 336,476 volumetric MRI scans from 34 datasets.

Result: PRISM achieved top results in 39/44 downstream tasks, showing significant improvements over baselines.

Conclusion: PRISM enhances AI’s translational potential in radiology by delivering consistent performance across diverse MRI protocols.

Abstract: Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistical significance improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.

[923] HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation

Xuepeng Liu, Zheng Jiang, Pinan Zhu, Hanyu Liu, Chao Li

Main category: eess.IV

TL;DR: HaDM-ST improves spatial transcriptomics resolution by leveraging H&E images and low-resolution ST, addressing key challenges in feature extraction, alignment, and gene-specific modeling.

Motivation: Current spatial transcriptomics (ST) methods lack resolution and face challenges in integrating H&E images, precise alignment, and gene-specific variation modeling.

Method: HaDM-ST uses a semantic distillation network for H&E feature extraction, a spatial alignment module for pixel-wise correspondence, and a channel-aware adversarial learner for gene-level modeling.

Result: HaDM-ST outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions across diverse tissues and species.

Conclusion: HaDM-ST effectively addresses resolution limitations in ST by integrating H&E images and low-resolution ST, offering improved accuracy and coherence.

Abstract: Spatial transcriptomics (ST) reveals spatial heterogeneity of gene expression, yet its resolution is limited by current platforms. Recent methods enhance resolution via H&E-stained histology, but three major challenges persist: (1) isolating expression-relevant features from visually complex H&E images; (2) achieving spatially precise multimodal alignment in diffusion-based frameworks; and (3) modeling gene-specific variation across expression channels. We propose HaDM-ST (Histology-assisted Differential Modeling for ST Generation), a high-resolution ST generation framework conditioned on H&E images and low-resolution ST. HaDM-ST includes: (i) a semantic distillation network to extract predictive cues from H&E; (ii) a spatial alignment module enforcing pixel-wise correspondence with low-resolution ST; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show HaDM-ST consistently outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions.
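
One way to read "channel-aware" gene-level modeling is a critic with one branch per gene channel, e.g. via grouped convolutions, so adversarial gradients stay gene-specific. The sketch below is that hypothetical reading; `ChannelAwareCritic` and its layer widths are assumptions, not taken from the paper.

```python
# Hypothetical sketch: a per-gene critic via grouped convolutions.
import torch
import torch.nn as nn

class ChannelAwareCritic(nn.Module):
    def __init__(self, n_genes: int = 200, width: int = 8):
        super().__init__()
        # groups=n_genes keeps every gene channel in its own critic branch
        self.net = nn.Sequential(
            nn.Conv2d(n_genes, n_genes * width, 3, padding=1, groups=n_genes),
            nn.LeakyReLU(0.2),
            nn.Conv2d(n_genes * width, n_genes, 3, padding=1, groups=n_genes),
        )

    def forward(self, st_maps: torch.Tensor) -> torch.Tensor:
        # st_maps: (batch, n_genes, H, W); returns one realism score per gene
        return self.net(st_maps).mean(dim=(2, 3))  # (batch, n_genes)

# Toy usage on a batch of 200-gene expression maps.
scores = ChannelAwareCritic()(torch.randn(2, 200, 32, 32))
```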

[924] Towards Human-AI Collaboration System for the Detection of Invasive Ductal Carcinoma in Histopathology Images

Shuo Han, Ahmed Karam Eldaly, Solomon Sunday Oyelere

Main category: eess.IV

TL;DR: A human-in-the-loop deep learning system combines AI (EfficientNetV2S) and medical expertise to improve invasive ductal carcinoma detection in histopathology images, achieving 93.65% accuracy and further enhancing performance through iterative feedback.

Motivation: Early and accurate diagnosis of invasive ductal carcinoma (IDC) is crucial for patient survival, and combining AI with human expertise can enhance precision and efficiency.

Method: A human-in-the-loop system uses EfficientNetV2S for initial diagnosis, followed by expert review and correction of misclassified images, iteratively refining the model.

Result: The system achieves 93.65% accuracy, with further improvements from human feedback, outperforming existing methods.

Conclusion: The collaborative human-AI approach advances IDC detection, offering a promising future for AI-assisted medical diagnostics.

Abstract: Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer, and early, accurate diagnosis is critical to improving patient survival rates by guiding treatment decisions. Combining medical expertise with artificial intelligence (AI) holds significant promise for enhancing the precision and efficiency of IDC detection. In this work, we propose a human-in-the-loop (HITL) deep learning system designed to detect IDC in histopathology images. The system begins with an initial diagnosis provided by a high-performance EfficientNetV2S model, offering feedback from AI to the human expert. Medical professionals then review the AI-generated results, correct any misclassified images, and integrate the revised labels into the training dataset, forming a feedback loop from the human back to the AI. This iterative process refines the model’s performance over time. The EfficientNetV2S model itself achieves state-of-the-art performance compared to existing methods in the literature, with an overall accuracy of 93.65%. Incorporating the human-in-the-loop system further improves the model’s accuracy, as demonstrated across four experimental groups containing misclassified images. These results demonstrate the potential of this collaborative approach to enhance AI performance in diagnostic systems. This work contributes to advancing automated, efficient, and highly accurate methods for IDC detection through human-AI collaboration, offering a promising direction for future AI-assisted medical diagnostics.
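
The feedback loop itself is simple to express. The sketch below captures the cycle the abstract describes — the AI proposes labels, an expert corrects them, and corrections are folded back into the training pool — with `model`, `expert_review`, and `finetune` as hypothetical stand-ins for the EfficientNetV2S classifier, the clinician review step, and the retraining routine.

```python
# Hypothetical sketch of one human-in-the-loop round.
def hitl_round(model, unlabeled_images, train_set, expert_review, finetune):
    predictions = [(img, model(img)) for img in unlabeled_images]
    # The expert inspects AI output and returns corrected (image, label) pairs;
    # in the paper this is where misclassified slides get relabelled.
    corrected = expert_review(predictions)
    train_set.extend(corrected)   # feedback loop: human -> AI
    finetune(model, train_set)    # refine the model on the enriched dataset
    return model

# Toy demo with trivial stand-ins.
model = lambda img: int(sum(img) > 0)                          # fake classifier
expert = lambda preds: [(img, 1 - p) for img, p in preds[:2]]  # expert flips two
pool = [[0.1, -0.2], [0.5, 0.4], [-0.3, -0.1]]
train = []
hitl_round(model, pool, train, expert, lambda m, d: None)
```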

[925] Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models

Johanna P. Müller, Anika Knupfer, Pedro Blöss, Edoardo Berardi Vittur, Bernhard Kainz, Jana Hutter

Main category: eess.IV

TL;DR: A novel diffusion-based framework for generating high-fidelity synthetic uterine MRI images, addressing data scarcity and privacy in gynaecological imaging.

Motivation: Existing diffusion models struggle with anatomical precision in female pelvic images, hindering applications in gynaecology due to data scarcity and privacy concerns.

Method: Combines unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D to generate synthetic images.

Result: Produces anatomically coherent, high-fidelity synthetic images validated by expert evaluation and improves diagnostic accuracy.

Conclusion: The framework advances equitable AI in gynaecology by providing a synthetic dataset with privacy safeguards for reproducible research.

Abstract: Despite significant progress in generative modelling, existing diffusion models often struggle to produce anatomically precise female pelvic images, limiting their application in gynaecological imaging, where data scarcity and patient privacy concerns are critical. To overcome these barriers, we introduce a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Our approach generates anatomically coherent, high-fidelity synthetic images that closely mimic real scans and provide valuable resources for training robust diagnostic models. We evaluate generative quality using advanced perceptual and distributional metrics, benchmarking against standard reconstruction methods, and demonstrate substantial gains in diagnostic accuracy on a key classification task. A blinded expert evaluation further validates the clinical realism of our synthetic images. We release our models with privacy safeguards and a comprehensive synthetic uterine MRI dataset to support reproducible research and advance equitable AI in gynaecology.
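
For readers unfamiliar with the DDPM component, the following is a generic single training step of an unconditional DDPM (the standard epsilon-prediction objective), not the authors' 2D/3D models; the convolutional stand-in omits timestep conditioning for brevity, and the noise schedule is the common linear default.

```python
# Generic DDPM training step (epsilon prediction), as a minimal sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 1, 3, padding=1))  # stand-in for a U-Net

def ddpm_step(x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.shape[0],))
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # q(x_t | x_0) in closed form
    return F.mse_loss(denoiser(x_t), eps)         # predict the added noise

loss = ddpm_step(torch.randn(4, 1, 64, 64))       # toy single-channel MRI batch
```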

[926] DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework

Wenzhuo Ma, Zhenzhong Chen

Main category: eess.IV

TL;DR: DiffVC-OSD is a one-step diffusion-based neural video compression framework that improves perceptual quality and speed over multi-step methods.

Motivation: To enhance perceptual quality and reduce computational complexity in video compression by simplifying the diffusion process to a single step.

Method: Uses a One-Step Diffusion Model with a Temporal Context Adapter for fine-grained guidance and End-to-End Finetuning for better compression.

Result: Achieves state-of-the-art perceptual performance, 20× faster decoding, and 86.92% bitrate reduction compared to multi-step variants.

Conclusion: DiffVC-OSD is a highly efficient and effective solution for perceptual neural video compression.

Abstract: In this work, we propose DiffVC-OSD, a One-Step Diffusion-based Perceptual Neural Video Compression framework. Unlike conventional multi-step diffusion-based methods, DiffVC-OSD feeds the reconstructed latent representation directly into a One-Step Diffusion Model, enhancing perceptual quality through a single diffusion step guided by both temporal context and the latent itself. To better leverage temporal dependencies, we design a Temporal Context Adapter that encodes conditional inputs into multi-level features, offering more fine-grained guidance for the denoising U-Net. Additionally, we employ an End-to-End Finetuning strategy to improve overall compression performance. Extensive experiments demonstrate that DiffVC-OSD achieves state-of-the-art perceptual compression performance, offers about 20$\times$ faster decoding and an 86.92% bitrate reduction compared to the corresponding multi-step diffusion-based variant.
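
The core contrast with multi-step methods can be sketched in a few lines: the decoded latent plus temporal context passes through the denoiser exactly once. The networks below are hypothetical stand-ins for the One-Step Diffusion Model and the Temporal Context Adapter, with illustrative dimensions.

```python
# Hypothetical sketch: one-step refinement of a decoded video latent.
import torch
import torch.nn as nn

latent_dim, ctx_dim = 64, 64

one_step_denoiser = nn.Sequential(
    nn.Linear(latent_dim + ctx_dim, 256), nn.ReLU(),
    nn.Linear(256, latent_dim),
)

def decode_one_step(recon_latent: torch.Tensor, temporal_ctx: torch.Tensor):
    # One forward pass refines the decoded latent, guided by temporal context;
    # a multi-step variant would loop this dozens of times over noise levels.
    return one_step_denoiser(torch.cat([recon_latent, temporal_ctx], dim=-1))

frame_latent = decode_one_step(torch.randn(1, latent_dim), torch.randn(1, ctx_dim))
```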

[927] Anatomy-Aware Low-Dose CT Denoising via Pretrained Vision Models and Semantic-Guided Contrastive Learning

Runze Wang, Zeli Chen, Zhiyun Song, Wei Fang, Jiajin Zhang, Danyang Tu, Yuxing Tang, Minfeng Xu, Xianghua Ye, Le Lu, Dakai Jin

Main category: eess.IV

TL;DR: ALDEN is an anatomy-aware LDCT denoising method that integrates semantic features from pretrained vision models, using adversarial and contrastive learning to improve noise reduction while preserving anatomical details.

Motivation: Current deep learning-based LDCT denoising methods often ignore anatomical semantics, leading to suboptimal results. ALDEN aims to address this by incorporating tissue-specific features.

Method: ALDEN combines adversarial learning with a novel anatomy-aware discriminator and semantic-guided contrastive learning, leveraging cross-attention and PVM-derived features for tissue-specific realism and consistency.

Result: ALDEN outperforms existing methods on LDCT denoising datasets, reducing over-smoothing and better preserving anatomical details. It also excels in a downstream multi-organ segmentation task.

Conclusion: ALDEN effectively integrates anatomical semantics into LDCT denoising, achieving state-of-the-art performance and demonstrating practical utility in medical imaging tasks.

Abstract: To reduce radiation exposure and improve the diagnostic efficacy of low-dose computed tomography (LDCT), numerous deep learning-based denoising methods have been developed to mitigate noise and artifacts. However, most of these approaches ignore the anatomical semantics of human tissues, which may potentially result in suboptimal denoising outcomes. To address this problem, we propose ALDEN, an anatomy-aware LDCT denoising method that integrates semantic features of pretrained vision models (PVMs) with adversarial and contrastive learning. Specifically, we introduce an anatomy-aware discriminator that dynamically fuses hierarchical semantic features from reference normal-dose CT (NDCT) via cross-attention mechanisms, enabling tissue-specific realism evaluation in the discriminator. In addition, we propose a semantic-guided contrastive learning module that enforces anatomical consistency by contrasting PVM-derived features from LDCT, denoised CT and NDCT, preserving tissue-specific patterns through positive pairs and suppressing artifacts via dual negative pairs. Extensive experiments conducted on two LDCT denoising datasets reveal that ALDEN achieves state-of-the-art performance, offering superior anatomy preservation and substantially reducing the over-smoothing issue of previous work. Further validation on a downstream multi-organ segmentation task (encompassing 117 anatomical structures) affirms the model’s ability to maintain anatomical awareness.
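
A minimal sketch of the semantic-guided contrastive term: PVM features of the denoised CT act as the anchor, pulled toward normal-dose CT features and pushed away from low-dose CT features. For brevity this uses a single negative rather than the paper's dual negative pairs; the InfoNCE form, temperature, and feature dimension are assumptions.

```python
# Hypothetical sketch of the contrastive objective over PVM features.
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(f_denoised, f_ndct, f_ldct, tau: float = 0.1):
    # each input: (batch, dim) PVM features for the same anatomical location
    anchor = F.normalize(f_denoised, dim=-1)
    pos = (anchor * F.normalize(f_ndct, dim=-1)).sum(-1) / tau  # pull to NDCT
    neg = (anchor * F.normalize(f_ldct, dim=-1)).sum(-1) / tau  # push from LDCT
    # InfoNCE with one positive and one negative per anchor
    return -torch.log(pos.exp() / (pos.exp() + neg.exp())).mean()

loss = semantic_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768),
                                 torch.randn(8, 768))
```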

[928] RedDino: A foundation model for red blood cell analysis

Luca Zedda, Andrea Loddo, Cecilia Di Ruberto, Carsten Marr

Main category: eess.IV

TL;DR: RedDino is a self-supervised foundation model for RBC image analysis, outperforming state-of-the-art models in classification and generalization.

Motivation: Precise RBC morphological analysis is crucial for diagnosing hematological disorders, but comprehensive AI solutions are lacking.

Method: RedDino adapts DINOv2 for RBC analysis, trained on 1.25 million diverse RBC images. Evaluations include linear probing and nearest neighbor classification.

Result: RedDino excels in RBC shape classification and demonstrates strong feature representation and generalization.

Conclusion: RedDino advances computational hematology by capturing nuanced RBC features, offering reliable diagnostic tools. Code and models are publicly available.

Abstract: Red blood cells (RBCs) are essential to human health, and their precise morphological analysis is important for diagnosing hematological disorders. Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for RBC analysis remain scarce. We present RedDino, a self-supervised foundation model designed for RBC image analysis. RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Extensive evaluations show that RedDino outperforms existing state-of-the-art models on RBC shape classification. Through assessments including linear probing and nearest neighbor classification, we confirm its strong feature representations and generalization ability. Our main contributions are: (1) a foundation model tailored for RBC analysis, (2) ablation studies exploring DINOv2 configurations for RBC modeling, and (3) a detailed evaluation of generalization performance. RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. The source code is available at https://github.com/Snarci/RedDino, and the pretrained models can be downloaded from our Hugging Face collection at https://huggingface.co/collections/Snarcy/reddino-689a13e29241d2e5690202fc.
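
The two evaluation protocols named above, linear probing and nearest-neighbour classification on frozen embeddings, are standard and easy to sketch. Random features below stand in for RedDino embeddings (the real weights are on the linked Hugging Face collection), and the embedding size and class count are illustrative.

```python
# Standard frozen-feature evaluation: linear probe and k-NN classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train_feats, train_y = rng.normal(size=(500, 384)), rng.integers(0, 9, 500)
test_feats, test_y = rng.normal(size=(100, 384)), rng.integers(0, 9, 100)

# Linear probe: a logistic-regression head on frozen embeddings
probe = LogisticRegression(max_iter=1000).fit(train_feats, train_y)
print("linear probe acc:", probe.score(test_feats, test_y))

# k-NN: classify each test embedding by its nearest training embeddings
knn = KNeighborsClassifier(n_neighbors=20).fit(train_feats, train_y)
print("20-NN acc:", knn.score(test_feats, test_y))
```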

[929] A Physics-Driven Neural Network with Parameter Embedding for Generating Quantitative MR Maps from Weighted Images

Lingjing Chen, Chengxiu Zhang, Yinqiao Yi, Yida Wang, Yang Song, Xu Yan, Shengfang Xu, Dalin Zhu, Mengqiu Cao, Yan Zhou, Chenglong Wang, Guang Yang

Main category: eess.IV

TL;DR: A deep learning approach integrates MRI sequence parameters to enhance quantitative image synthesis, outperforming conventional models in accuracy and generalization.

Motivation: To improve the accuracy and generalizability of quantitative MRI synthesis by embedding sequence parameters (TR, TE, TI) into the model.

Method: A physics-driven neural network uses parameter embedding to learn MRI signal formation, synthesizing T1, T2, and PD maps from conventional weighted images.

Result: Achieved high performance (PSNR >34 dB, SSIM >0.92) and accurately synthesized maps for unseen pathological regions.

Conclusion: The method enhances qMRI reliability and clinical utility by leveraging physical principles of MRI signals.

Abstract: We propose a deep learning-based approach that integrates MRI sequence parameters to improve the accuracy and generalizability of quantitative image synthesis from clinical weighted MRI. Our physics-driven neural network embeds MRI sequence parameters – repetition time (TR), echo time (TE), and inversion time (TI) – directly into the model via parameter embedding, enabling the network to learn the underlying physical principles of MRI signal formation. The model takes conventional T1-weighted, T2-weighted, and T2-FLAIR images as input and synthesizes T1, T2, and proton density (PD) quantitative maps. Trained on healthy brain MR images, it was evaluated on both internal and external test datasets. The proposed method achieved high performance with PSNR values exceeding 34 dB and SSIM values above 0.92 for all synthesized parameter maps. It outperformed conventional deep learning models in accuracy and robustness, including data with previously unseen brain structures and lesions. Notably, our model accurately synthesized quantitative maps for these unseen pathological regions, highlighting its superior generalization capability. Incorporating MRI sequence parameters via parameter embedding allows the neural network to better learn the physical characteristics of MR signals, significantly enhancing the performance and reliability of quantitative MRI synthesis. This method shows great potential for accelerating qMRI and improving its clinical utility.
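
A minimal sketch of the parameter-embedding idea: TR, TE, and TI pass through a small MLP whose output modulates the image features, here as a FiLM-style scale-and-shift. The modulation mechanism, class name, and layer sizes are assumptions; the abstract does not specify how the embedding is injected.

```python
# Hypothetical sketch: conditioning a synthesis network on TR/TE/TI.
import torch
import torch.nn as nn

class ParamEmbeddedSynth(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.param_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, 2 * channels))
        self.features = nn.Conv2d(3, channels, 3, padding=1)  # T1w, T2w, FLAIR in
        self.head = nn.Conv2d(channels, 3, 3, padding=1)      # T1, T2, PD maps out

    def forward(self, weighted_imgs, tr_te_ti):
        h = torch.relu(self.features(weighted_imgs))
        scale, shift = self.param_mlp(tr_te_ti).chunk(2, dim=-1)
        h = h * scale[..., None, None] + shift[..., None, None]  # FiLM-style
        return self.head(h)                                      # conditioning

# Toy usage: two subjects with (TR, TE, TI) in seconds.
maps = ParamEmbeddedSynth()(torch.randn(2, 3, 64, 64),
                            torch.tensor([[2.0, 0.03, 0.9], [1.8, 0.1, 0.0]]))
```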
