Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 80]
cs.CV [Total: 155]
cs.AI [Total: 41]
cs.SD [Total: 8]
cs.LG [Total: 104]
cs.MA [Total: 4]
cs.MM [Total: 1]
eess.AS [Total: 8]
eess.IV [Total: 26]

cs.CL

[1] Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang

Main category: cs.CL

TL;DR: Shop-R1 is a reinforcement learning framework that enhances LLMs’ reasoning for simulating human behavior in online shopping by decomposing tasks into rationale generation and action prediction with distinct rewards.

Details

Motivation: To overcome the limitations of LLM-generated rationales in bounded reasoning capabilities for simulating human behavior in web environments.

Method: Decomposes the task into rationale generation (guided by internal model signals) and action prediction (using a hierarchical reward structure with difficulty-aware scaling).

Result: Achieves a 65% relative improvement over the baseline in simulating human behavior.

Conclusion: Shop-R1 effectively enhances LLM reasoning for realistic human behavior simulation in online shopping.

Abstract: Large Language Models (LLMs) have recently demonstrated strong potential in generating ‘believable human-like’ behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
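
As a rough illustration of the hierarchical reward idea described in this abstract, the sketch below gives no credit for a wrong action type, partial credit for matching sub-action details, and scales the result by a per-type difficulty factor. The action names, the 0.25/0.75 split, and the difficulty multipliers are all hypothetical choices made for this example; the paper's actual reward design may differ.

```python
# Assumed difficulty multipliers per action type (illustrative, not from the paper).
DIFFICULTY = {"terminate": 0.5, "click": 1.0, "type_and_submit": 2.0}


def action_reward(pred, gold):
    """Hierarchical reward: check the action type first, then its details."""
    # Level 1: a wrong high-level action type earns nothing.
    if pred.get("type") != gold["type"]:
        return 0.0
    reward = 0.25  # assumed base credit for the correct action type

    # Level 2: partial credit for each matching attribute/value pair.
    gold_args = gold.get("args", {})
    pred_args = pred.get("args", {})
    if gold_args:
        matched = sum(1 for k, v in gold_args.items() if pred_args.get(k) == v)
        reward += 0.75 * matched / len(gold_args)
    else:
        reward += 0.75  # nothing further to get right

    # Scale by the assumed difficulty of the ground-truth action type.
    return reward * DIFFICULTY.get(gold["type"], 1.0)
```

Gating the detail credit on the action type is one simple way to keep a policy from collecting reward for plausible-looking arguments attached to the wrong action.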

[2] Dynamic and Generalizable Process Reward Modeling

Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang

Main category: cs.CL

TL;DR: DG-PRM introduces a dynamic and generalizable reward modeling approach for LLMs, using a reward tree and Pareto dominance to improve cross-domain performance and adaptability.

Details

Motivation: Existing PRMs rely on heuristics and lack cross-domain generalization, while LLM-as-judge overlooks textual guidance. Static criteria also fail in complex process supervision.

Method: DG-PRM uses a reward tree for fine-grained criteria and dynamically selects step-wise rewards. Pareto dominance identifies discriminative positive/negative pairs.

Result: DG-PRM achieves strong performance on benchmarks, boosting model performance and adapting well to out-of-distribution scenarios.

Conclusion: DG-PRM offers a robust solution for dynamic and generalizable process reward modeling, enhancing LLM performance in complex tasks.

Abstract: Process Reward Models (PRMs) are crucial for guiding Large Language Models (LLMs) in complex scenarios by providing dense reward signals. However, existing PRMs primarily rely on heuristic approaches, which struggle with cross-domain generalization. While LLM-as-judge has been proposed to provide generalized rewards, current research has focused mainly on feedback results, overlooking the meaningful guidance embedded within the text. Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. DG-PRM dynamically selects reward signals for step-wise reward scoring. To handle multifaceted reward signals, we pioneeringly adopt Pareto dominance estimation to identify discriminative positive and negative pairs. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards. Further analysis reveals that DG-PRM adapts well to out-of-distribution scenarios, demonstrating exceptional generalizability.
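
For readers unfamiliar with Pareto dominance, the following minimal sketch shows the textbook definition applied to multi-dimensional step scores. It is not DG-PRM's estimation procedure; the score tuples and the function names `dominates` and `pareto_pairs` are assumptions introduced for illustration.

```python
def dominates(a, b):
    """True if score vector a Pareto-dominates b (higher is better)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_pairs(candidates):
    """Return (positive, negative) index pairs where one candidate dominates another."""
    return [
        (i, j)
        for i, a in enumerate(candidates)
        for j, b in enumerate(candidates)
        if i != j and dominates(a, b)
    ]


# Hypothetical scores for three candidate steps on two reward criteria.
scores = [(0.9, 0.8), (0.5, 0.4), (0.95, 0.3)]
print(pareto_pairs(scores))
```

Candidates that trade off against each other (better on one criterion, worse on another) produce no pair, so only clearly separable positives and negatives are kept.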

[3] VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL

Shubham Mohole, Sainyam Galhotra

Main category: cs.CL

TL;DR: VeriMinder is an interactive system for detecting and mitigating cognitive biases in analytical questions posed via NLIDBs, improving analysis quality.

Details

Motivation: The democratization of data analysis through NLIDBs has highlighted the need to address cognitive biases in user queries, an underexplored area despite focus on text-to-SQL accuracy.

Method: VeriMinder uses a contextual semantic mapping framework, an analytical framework based on the Hard-to-Vary principle, and an optimized LLM-powered system for prompt generation.

Result: User testing showed 82.5% positive impact on analysis quality, with VeriMinder outperforming alternatives by at least 20% in concreteness, comprehensiveness, and accuracy.

Conclusion: VeriMinder effectively mitigates ‘wrong question’ vulnerabilities in data analysis and is available as open-source software to foster further research and adoption.

Abstract: Application systems using natural language interfaces to databases (NLIDBs) have democratized data analysis. This positive development has also brought forth an urgent challenge to help users who might use these systems without a background in statistical analysis to formulate bias-free analytical questions. Although significant research has focused on text-to-SQL generation accuracy, addressing cognitive biases in analytical questions remains underexplored. We present VeriMinder, https://veriminder.ai, an interactive system for detecting and mitigating such analytical vulnerabilities. Our approach introduces three key innovations: (1) a contextual semantic mapping framework for biases relevant to specific analysis contexts; (2) an analytical framework that operationalizes the Hard-to-Vary principle and guides users in systematic data analysis; and (3) an optimized LLM-powered system that generates high-quality, task-specific prompts using a structured process involving multiple candidates, critic feedback, and self-reflection. User testing confirms the merits of our approach. In direct user experience evaluation, 82.5% of participants reported a positive impact on the quality of the analysis. In comparative evaluation, VeriMinder scored significantly higher than alternative approaches, at least 20% better on metrics of the analysis’s concreteness, comprehensiveness, and accuracy. Our system, implemented as a web application, is set to help users avoid “wrong question” vulnerability during data analysis. The VeriMinder code base with prompts, https://reproducibility.link/veriminder, is available as MIT-licensed open-source software to facilitate further research and adoption within the community.

[4] One Whisper to Grade Them All

Nhan Phan, Anusha Porwal, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo

Main category: cs.CL

TL;DR: An efficient end-to-end system for holistic Automatic Speaking Assessment (ASA) using a single Whisper-small encoder and lightweight aggregator, outperforming text-based baselines with reduced parameters and improved data efficiency.

Details

Motivation: To simplify and enhance ASA for multi-part second-language tests by eliminating transcription needs and per-part models, making it practical for large-scale applications.

Method: Uses a single Whisper-small encoder to process all spoken responses, combines information via a lightweight aggregator, and employs a data sampling strategy for training efficiency.

Result: Achieved an RMSE of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters. With the data sampling strategy, training on only 44.8% of the speakers still reached an RMSE of 0.383.

Conclusion: The system is efficient, scalable, and data-efficient, demonstrating strong performance for ASA in language learning systems.

Abstract: We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system’s main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
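
The scores above are reported as Root Mean Squared Error. As a small reminder of how that metric is computed, here is a minimal sketch; the example scores are made up and are not taken from the challenge data.

```python
import math


def rmse(predicted, reference):
    """Root Mean Squared Error between two equal-length score lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical holistic scores for three speakers.
print(rmse([3.5, 4.0, 2.5], [3.0, 4.5, 2.5]))
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.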

[5] Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text

Hulayyil Alshammari, Praveen Rao

Main category: cs.CL

TL;DR: The study evaluates six AI detection tools’ ability to identify text generated by DeepSeek, a new LLM, under adversarial attacks like paraphrasing and humanization. Few-shot and CoT prompting showed high accuracy.

Details

Motivation: The rise of LLMs has raised concerns about writing integrity, but existing studies overlook DeepSeek. This work fills that gap by testing detection tools against DeepSeek-generated text.

Method: The study used 49 human-authored and 49 AI-generated samples, plus 196 adversarial variants, to test six detectors. Few-shot and CoT prompting were also evaluated.

Result: QuillBot and Copyleaks performed well on original/paraphrased text, while others were inconsistent. Humanization attacks reduced accuracy significantly. Few-shot and CoT prompting achieved high accuracy.

Conclusion: Detection tools vary in robustness against DeepSeek text, with humanization being the most effective attack. Few-shot and CoT methods show promise for accurate classification.

Abstract: Large language models (LLMs) have rapidly transformed the creation of written materials. LLMs have led to questions about writing integrity, thereby driving the creation of artificial intelligence (AI) detection technologies. Adversarial attacks, such as standard and humanized paraphrasing, inhibit detectors’ ability to detect machine-generated text. Previous studies have mainly focused on ChatGPT and other well-known LLMs and have shown varying accuracy across detectors. However, there is a clear gap in the literature about DeepSeek, a recently published LLM. Therefore, in this work, we investigate whether six generally accessible AI detection tools – AI Text Classifier, Content Detector AI, Copyleaks, QuillBot, GPT-2, and GPTZero – can consistently recognize text generated by DeepSeek. The detectors were exposed to the aforementioned adversarial attacks. We also considered DeepSeek as a detector by performing few-shot prompting and chain-of-thought reasoning (CoT) for classifying AI and human-written text. We collected 49 human-authored question-answer pairs from before the LLM era and generated matching responses using DeepSeek-v3, producing 49 AI-generated samples. Then, we applied adversarial techniques such as paraphrasing and humanizing to add 196 more samples. These were used to challenge detector robustness and assess accuracy impact. While QuillBot and Copyleaks showed near-perfect performance on original and paraphrased DeepSeek text, others – particularly AI Text Classifier and GPT-2 – showed inconsistent results. The most effective attack was humanization, reducing accuracy to 71% for Copyleaks, 58% for QuillBot, and 52% for GPTZero. Few-shot and CoT prompting showed high accuracy, with the best five-shot result misclassifying only one of 49 samples (AI recall 96%, human recall 100%).

[6] Are LLM Belief Updates Consistent with Bayes’ Theorem?

Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella, Rob Lamb, Peter M. Atkinson

Main category: cs.CL

TL;DR: Larger, more capable language models align better with Bayes’ theorem when updating beliefs, as measured by the Bayesian Coherence Coefficient (BCC).

Details

Motivation: To determine if larger language models update beliefs more consistently with Bayes’ theorem when presented with evidence.

Method: Formulated the BCC metric, generated a dataset, and measured BCC across five model families, comparing parameters, training data, and benchmark scores.

Result: Larger and more capable models show credences more coherent with Bayes’ theorem.

Conclusion: Findings impact understanding and governance of LLMs, suggesting improved Bayesian reasoning with scale.

Abstract: Do larger and more capable language models learn to update their “beliefs” about propositions more consistently with Bayes’ theorem when presented with evidence in-context? To test this, we formulate a Bayesian Coherence Coefficient (BCC) metric and generate a dataset with which to measure the BCC. We measure BCC for multiple pre-trained-only language models across five model families, comparing against the number of model parameters, the amount of training data, and model scores on common benchmarks. Our results provide evidence for our hypothesis that larger and more capable pre-trained language models assign credences that are more coherent with Bayes’ theorem. These results have important implications for our understanding and governance of LLMs.
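
To make the reference point concrete, the sketch below computes the posterior that Bayes’ theorem prescribes for a binary proposition and compares it with a model-reported credence. This is not the paper's BCC metric, whose definition is not given in the abstract; the function names and the numbers are assumptions for illustration.

```python
def bayes_posterior(prior, lik_if_true, lik_if_false):
    """P(H | E) from P(H), P(E | H), and P(E | not H)."""
    evidence = lik_if_true * prior + lik_if_false * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return lik_if_true * prior / evidence


def update_gap(model_posterior, prior, lik_if_true, lik_if_false):
    """Absolute gap between a reported credence and the Bayesian posterior."""
    return abs(model_posterior - bayes_posterior(prior, lik_if_true, lik_if_false))


# Hypothetical credences: prior 0.5, evidence four times likelier if H holds.
print(bayes_posterior(0.5, 0.8, 0.2))
print(update_gap(0.7, 0.5, 0.8, 0.2))
```

A smaller gap means the reported update sits closer to what the prior and likelihoods jointly imply.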

[7] Natural Language Processing for Tigrinya: Current State and Future Directions

Fitsum Gaim, Jong C. Park

Main category: cs.CL

TL;DR: A survey of NLP research for Tigrinya, analyzing 40+ studies from 2011-2025, highlighting progress, challenges, and future directions.

Details

Motivation: Tigrinya is underrepresented in NLP despite being widely spoken, necessitating a review of its research landscape.

Method: Systematic review of 40+ studies, focusing on computational resources, models, and applications across ten NLP tasks.

Result: Progress from rule-based to neural systems, driven by resource creation; challenges include morphological complexity and scarcity.

Conclusion: Provides a reference and roadmap for advancing Tigrinya NLP, with publicly available metadata.

Abstract: Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 40 studies spanning more than a decade of work from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across ten distinct downstream tasks, including morphological processing, machine translation, speech recognition, and question-answering. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently unlocked by resource creation milestones. We identify key challenges rooted in Tigrinya’s morphological complexity and resource scarcity, while highlighting promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves as both a comprehensive reference for researchers and a roadmap for advancing Tigrinya NLP. A curated metadata of the surveyed studies and resources is made publicly available (Tigrinya NLP Anthology: https://github.com/fgaim/tigrinya-nlp-anthology).

[8] Technical Report of TeleChat2, TeleChat2.5 and T1

Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li

Main category: cs.CL

TL;DR: The TeleChat series (TeleChat2, TeleChat2.5, T1) improves performance through enhanced training strategies, with T1 excelling in reasoning and TeleChat2.5 in speed. Both improve on the original TeleChat, and T1-115B outperforms proprietary models such as OpenAI’s o1-mini and GPT-4o.

Details

Motivation: To advance language model performance with minimal architectural changes, focusing on training strategies and domain-specific enhancements.

Method: Enhanced pre-training (10T tokens), SFT, DPO, continual pretraining, and RL for domain-specific tasks. Models include 115B-parameter dense Transformers.

Result: TeleChat2.5 and T1 show significant gains in reasoning and speed, with T1 outperforming proprietary models like GPT-4o.

Conclusion: The TeleChat series offers state-of-the-art models for diverse applications, publicly released to support developers and researchers.

Abstract: We introduce the latest series of TeleChat models: TeleChat2, TeleChat2.5, and T1, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with TeleChat2, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. TeleChat2.5 and T1 expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The T1 variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, TeleChat2.5 prioritizes speed, delivering rapid inference. Both flagship models of T1 and TeleChat2.5 are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, T1-115B outperforms proprietary models such as OpenAI’s o1-mini and GPT-4o. We publicly release TeleChat2, TeleChat2.5 and T1, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.

[9] NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

Weizhi Fei, Hao Shi, Jing Xu, Jingchen Peng, Jiazheng Li, Jingzhao Zhang, Bo Bai, Wei Han, Zhenyuan Chen, Xueyan Niu

Main category: cs.CL

TL;DR: NeuralDB is a framework for efficiently editing facts in large language models (LLMs) without compromising their general abilities, outperforming existing methods in scalability and performance.

Details

Motivation: To address the limitations of current Locate-and-Edit (L&E) methods, which may degrade LLM performance or forget edited facts when scaled.

Method: Models L&E as a Key-Value (KV) database and introduces NeuralDB with a non-linear gated retrieval module to preserve general abilities.

Result: NeuralDB excels in editing efficacy, generalization, specificity, fluency, and consistency, and scales effectively to 100,000 facts.

Conclusion: NeuralDB offers a scalable and effective solution for editing LLMs while maintaining their general capabilities.

Abstract: Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. In particular, our gated module only operates when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (50x more than in prior work).
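
The key-value view with a gate that fires only for edited facts can be pictured with the toy sketch below. It assumes cosine similarity and a hard threshold as the gate, which are illustrative choices and not necessarily NeuralDB's mechanism; `gated_retrieve`, the vectors, and the threshold value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gated_retrieve(query, keys, values, threshold=0.9):
    """Return the value of the best-matching key, or zeros if the gate stays closed."""
    sims = [cosine(query, k) for k in keys]
    best = max(range(len(keys)), key=sims.__getitem__)
    if sims[best] < threshold:
        # Query is unrelated to any edited fact: contribute nothing.
        return [0.0] * len(values[0])
    return list(values[best])


# Two hypothetical edited facts stored as (key, value) vectors.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[5.0, 5.0], [7.0, 7.0]]
print(gated_retrieve([1.0, 0.01], keys, values))  # close to the first key
print(gated_retrieve([1.0, 1.0], keys, values))   # matches neither key well
```

Returning zeros for unmatched queries is what leaves the base model's behavior untouched on inputs that do not involve an edited fact.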

[10] TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li

Main category: cs.CL

TL;DR: TELEVAL is a dynamic benchmark for evaluating spoken language models (SLMs) in realistic Chinese conversational settings, focusing on implicit cues and user-centered interactions.

Details

Motivation: Existing SLM benchmarks often misalign with natural conversational scenarios, emphasizing complex tasks over real-world usability.

Method: TELEVAL evaluates SLMs across three dimensions (Explicit Semantics, Paralinguistic and Implicit Semantics, System Abilities) using dialogue formats and separate text/audio outputs.

Result: Experiments show current SLMs still struggle with natural conversational tasks, highlighting room for improvement.

Conclusion: TELEVAL aims to enhance SLM development by prioritizing user experience and realistic interaction evaluation.

Abstract: Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. However, most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs), often failing to align with how users naturally interact in real-world conversational scenarios. In this paper, we propose TELEVAL, a dynamic benchmark specifically designed to evaluate SLMs’ effectiveness as conversational agents in realistic Chinese interactive settings. TELEVAL defines three evaluation dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. It adopts a dialogue format consistent with real-world usage and evaluates text and audio outputs separately. TELEVAL particularly focuses on the model’s ability to extract implicit cues from user speech and respond appropriately without additional instructions. Our experiments demonstrate that despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. We hope that TELEVAL can serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs.

[11] GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Main category: cs.CL

TL;DR: GrAInS introduces a gradient-based, token-level steering method for LLMs and VLMs, outperforming fine-tuning and baselines without weight updates.

Details

Motivation: Existing inference-time steering methods lack token-level causal influence and gradient utilization, especially in multimodal settings.

Method: Uses contrastive gradient-based attribution (Integrated Gradients) to identify influential tokens, constructs steering vectors, and adjusts activations during inference.

Result: Achieves 13.22% accuracy gain on TruthfulQA, reduces hallucination rates, and improves alignment win rates while preserving fluency.

Conclusion: GrAInS provides fine-grained, interpretable control over model behavior without retraining, outperforming existing methods.

Abstract: Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.
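
Integrated Gradients, the attribution method named in this abstract, can be sketched on a toy function as follows. This shows only the generic technique with a midpoint Riemann sum; it omits GrAInS's contrastive objective and steering-vector construction, and the toy function, `grad_fn`, and `top_k` helper are assumptions for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG along the straight path from baseline to x."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        total = [t + g for t, g in zip(total, grads)]
    # Average gradient along the path, scaled by the input difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def top_k(attributions, k):
    """Indices of the k inputs with the largest absolute attribution."""
    order = sorted(range(len(attributions)), key=lambda i: -abs(attributions[i]))
    return order[:k]


# Toy model f(x) = x0^2 + 3*x1, whose gradient is known analytically.
def grad_fn(p):
    return [2.0 * p[0], 3.0]


attr = integrated_gradients(grad_fn, x=[2.0, 1.0], baseline=[0.0, 0.0])
print(attr, top_k(attr, 1))
```

The attributions sum to f(x) minus f(baseline), a property known as completeness, which is a convenient sanity check for any IG implementation.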

[12] GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He

Main category: cs.CL

TL;DR: GOAT-SLM is a spoken language model that incorporates paralinguistic and speaker characteristics, improving natural spoken interactions beyond text semantics.

Details

Motivation: Existing SLMs overlook paralinguistic cues like emotion, dialect, and age. GOAT-SLM aims to address this gap.

Method: Uses a dual-modality head architecture and modular training to align linguistic, paralinguistic, and speaker characteristics.

Result: Outperforms existing models in emotion, dialect, and age-sensitive tasks on TELEVAL benchmark.

Conclusion: GOAT-SLM advances more natural and socially aware spoken language systems by modeling beyond linguistic content.

Abstract: Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.

[13] Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim

Main category: cs.CL

TL;DR: The paper explores using large language models (LLMs) to generate synthetic phrase break annotations, reducing reliance on costly manual annotations in text-to-speech systems.

Details

Motivation: Current methods for phrase break prediction require extensive human annotations, which are costly and inconsistent due to speech variability. LLMs offer a promising solution by generating synthetic data.

Method: The study leverages LLMs to create synthetic phrase break annotations, comparing them with traditional annotations and evaluating their effectiveness across multiple languages.

Result: LLM-based synthetic data generation effectively addresses data challenges in phrase break prediction, demonstrating its potential for speech-related tasks.

Conclusion: LLMs provide a viable solution for reducing manual annotation efforts in the speech domain, offering consistent and scalable synthetic data.

Abstract: Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLM to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.

[14] Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng

Main category: cs.CL

TL;DR: The paper evaluates the diversity and privacy risks of synthetic text data from LLMs, proposes metrics for assessment, and introduces a prompt-based method to improve diversity while preserving privacy.

Details

Motivation: To address the underexplored challenges of diversity and privacy in synthetic data generated by LLMs, which is increasingly used in data-driven applications.

Method: Proposes metrics to assess diversity (linguistic expression, sentiment, user perspective) and privacy (re-identification risk, stylistic outliers) of synthetic datasets. Introduces a prompt-based approach to enhance diversity and privacy.

Result: Experiments show LLMs have limitations in generating diverse and privacy-preserving synthetic data. The prompt-based method improves diversity while maintaining privacy.

Conclusion: The study highlights the need for better methods to ensure diversity and privacy in synthetic data, offering a practical solution through prompt-based enhancements.

Abstract: The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experimental results reveal significant limitations in LLMs’ capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.

[15] Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints

Haomin Qi, Zihan Dai, Chengbo Huang

Main category: cs.CL

TL;DR: A hybrid PEFT method combining BOFT and LoRA-GA outperforms individual techniques, achieving near full fine-tuning accuracy with reduced resource usage.

Details

Motivation: Address the computational bottleneck in fine-tuning large language models (LLMs) due to their scale and memory demands.

Method: Introduces a hybrid PEFT strategy integrating BOFT’s stability and LoRA-GA’s rapid convergence, with adaptive per-layer updates. Also adapts uRNN principles to transformers.

Result: Outperforms individual PEFT baselines on four benchmarks (GLUE, GSM8K, MT-Bench, HumanEval) and approaches full fine-tuning accuracy, cutting resource consumption by up to 2.1x in training time and 50% in memory usage.

Conclusion: The hybrid approach is a scalable solution for fine-tuning LLMs under resource constraints.

Abstract: Fine-tuning large language models (LLMs) remains a computational bottleneck due to their scale and memory demands. This paper presents a comprehensive evaluation of parameter-efficient fine-tuning (PEFT) techniques, including LoRA, BOFT, LoRA-GA, and uRNN, and introduces a novel hybrid strategy that dynamically integrates BOFT’s orthogonal stability with LoRA-GA’s gradient-aligned rapid convergence. By computing per-layer adaptive updates guided by gradient norms, the hybrid method achieves superior convergence efficiency and generalization across diverse tasks. We also explore, for the first time, the adaptation of unitary RNN (uRNN) principles to transformer-based LLMs, enhancing gradient stability through structured unitary constraints. Empirical evaluations on four benchmarks – GLUE, GSM8K, MT-Bench, and HumanEval – using models ranging from 7B to 405B parameters demonstrate that our hybrid method consistently outperforms individual PEFT baselines, approaching full fine-tuning accuracy while reducing resource consumption by up to 2.1 times in training time and 50 percent in memory usage. These findings establish the hybrid approach as a practical and scalable fine-tuning solution for real-world deployment of LLMs under resource constraints.

[16] Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Main category: cs.CL

TL;DR: Step-Audio 2 is an advanced multi-modal LLM for audio understanding and speech conversation, integrating latent audio encoding, RL, and RAG for superior ASR and responsiveness to paralinguistic cues.

Details

Motivation: To enhance audio understanding and speech conversation by combining textual and acoustic knowledge, leveraging real-world data for improved performance.

Method: Uses latent audio encoder, RL, and RAG, incorporating discrete audio tokens and external tools like web and audio search.

Result: Achieves state-of-the-art performance in ASR and audio understanding, excelling in responsiveness and expressiveness.

Conclusion: Step-Audio 2 sets a new benchmark for audio understanding and conversational AI, validated by superior performance on diverse benchmarks.

Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

[17] Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Ting Han, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu

Main category: cs.CL

TL;DR: Seed-LiveInterpret 2.0 is an advanced end-to-end SI model addressing challenges like transcription quality, latency, and multi-speaker confusion, achieving high accuracy and reduced latency.

Details

Motivation: To overcome the persistent issues in automatic Simultaneous Interpretation (SI) systems, such as poor transcription, high latency, and multi-speaker confusion.

Method: The study employs a novel duplex speech-to-speech understanding-generating framework, leveraging large-scale pretraining and reinforcement learning.

Result: The model exceeds 70% correctness in complex scenarios (as validated by human interpreters), cuts the average latency of cloned speech by nearly 70% (from about 10 s to 3 s), and outperforms commercial SI solutions in translation quality.

Conclusion: Seed-LiveInterpret 2.0 significantly improves SI performance, offering a practical, high-quality, low-latency solution for real-world applications.

Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of nearly 70% that drastically enhances practical usability.

[18] A New Pair of GloVes

Riley Carlson, John Bauer, Christopher D. Manning

Main category: cs.CL

TL;DR: The paper introduces updated 2024 English GloVe models, addressing gaps in documentation and incorporating modern linguistic and cultural relevance, while maintaining performance on traditional tasks and improving on newer datasets.

Details

Motivation: The original 2014 GloVe models lacked detailed documentation and needed updates to reflect evolving language and cultural contexts.

Method: Two sets of word embeddings were trained using Wikipedia, Gigaword, and a subset of Dolma, with evaluations through vocabulary comparison, direct testing, and NER tasks.

Result: The 2024 models include new culturally and linguistically relevant words, perform comparably on structural tasks such as analogy and similarity, and show improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.

Conclusion: The updated GloVe models provide better documentation, modern relevance, and improved performance, making them a valuable resource for current NLP applications.

Abstract: This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.

[19] MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin

Main category: cs.CL

TL;DR: The paper evaluates Multi-modal Large Language Models (MLLMs) on their ability to perform accurate visual operations via code in multi-modal mathematical reasoning, highlighting gaps in current evaluations.

Details

Motivation: Existing evaluations of MLLMs focus on text-only reasoning, neglecting their ability to perform precise visual operations via code, which is crucial for multi-modal mathematical reasoning.

Method: The study introduces a framework evaluating two aspects: Multi-modal Code Generation (MCG) for constructing visualizations and Multi-modal Code Editing (MCE) for fine-grained operations (Deletion, Modification, Annotation). A dataset with five types of mathematical figures is used.

Result: Nine mainstream MLLMs were tested, revealing significant performance gaps compared to humans in fine-grained visual operations.

Conclusion: The work highlights the need for improved MLLM capabilities in visual operations via code, providing a foundation for future advancements in multi-modal reasoning.

Abstract: Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM’s ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLM’s code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.

[20] HIVMedQA: Benchmarking large language models for HIV medical decision support

Gonzalo Cardenal Antolin, Jacques Fellay, Bashkim Jaha, Roger Kouyos, Niko Beerenwinkel, Diane Duroux

Main category: cs.CL

TL;DR: The study evaluates LLMs for HIV management, introducing HIVMedQA as a benchmark. Gemini 2.5 Pro performed best, but challenges like complexity and biases persist.

Details

Motivation: LLMs show promise in clinical decision-making, but their integration in HIV care lacks exploration and benchmarking.

Method: Developed HIVMedQA benchmark, evaluated 10 LLMs using prompt engineering, and assessed performance across clinical dimensions.

Result: Gemini 2.5 Pro excelled, but performance dropped with complexity. Medically fine-tuned models didn’t always outperform general ones.

Conclusion: Targeted development and evaluation are needed for safe, effective LLM use in clinical HIV care.

Abstract: Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.

[21] Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang

Main category: cs.CL

TL;DR: The paper identifies ‘sticky tokens’ in Transformer-based text embedding models, which disrupt embedding reliability. It introduces STD for detection and analyzes their impact on downstream tasks.

Details

Motivation: To investigate and address the issue of 'sticky tokens' that degrade embedding reliability and downstream performance.

Method: Introduces Sticky Token Detector (STD) for detecting sticky tokens, analyzes their origins, and evaluates their impact on tasks like clustering and retrieval.

Result: Found 868 sticky tokens across 40 checkpoints from 14 model families; sticky tokens cause performance drops of up to 50% on downstream tasks such as clustering and retrieval.

Conclusion: Highlights the need for improved tokenization strategies and model design to mitigate sticky token effects.

Abstract: Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising ‘sticky tokens’ can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model’s internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.
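Illustration (not from the paper): the effect described above, where inserting a token repeatedly pulls sentence similarities toward a common value, can be probed with a simple proxy. The sketch below measures how much the spread of pairwise cosine similarities shrinks after insertion. The names `stickiness` and `toy_embed`, the use of standard deviation as the spread measure, and the toy embedder itself are assumptions for this sketch; the abstract does not describe the paper's formal definition or STD's filtering steps.

```python
import math
from itertools import combinations
from statistics import pstdev


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def similarity_spread(sentences, embed):
    """Population standard deviation of all pairwise sentence similarities."""
    sims = [cosine(embed(a), embed(b)) for a, b in combinations(sentences, 2)]
    return pstdev(sims)


def stickiness(sentences, token, embed, repeats=3):
    """Drop in similarity spread after appending `token` `repeats` times.

    A clearly positive value means the similarities collapsed toward a
    common value, which is the symptom the abstract describes.
    """
    before = similarity_spread(sentences, embed)
    modified = [s + (" " + token) * repeats for s in sentences]
    return before - similarity_spread(modified, embed)


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model (assumption)."""
    words = text.split()
    return [
        float(words.count("cat")),
        float(words.count("dog")),
        10.0 * words.count("<sticky>"),  # one token dominates the vector
    ]
```

In practice `embed` would wrap a real text embedding model; a token that the toy embedder ignores (for example "the") yields a stickiness of zero here, while "<sticky>" yields a clearly positive value.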

[22] SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

Wonjun Jeong, Dongseok Kim, Taegkeun Whangbo

Main category: cs.CL

TL;DR: SCOPE is a framework to mitigate selection bias in LLM evaluations by redistributing answer slots and blocking near-miss guesses, improving fairness and reliability.

Details

Motivation: LLMs exploit biases in option positions or labels, inflating scores without genuine understanding, necessitating a debiasing method.

Method: SCOPE uses a null prompt to estimate position-bias, redistributes answer slots inversely, and prevents semantically similar distractors near answers.

Result: SCOPE outperforms existing debiasing methods, showing stable performance improvements and clearer confidence distributions.

Conclusion: SCOPE enhances fairness and reliability in LLM evaluations, setting a new standard.

Abstract: Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model’s unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
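Illustration (not from the paper): the inverse-bias placement described above can be sketched in a few lines of Python. The function names, the add-one smoothing, and the reciprocal weighting are assumptions made for this sketch; the abstract does not specify how SCOPE computes its inverse-bias distribution.

```python
import random
from collections import Counter


def estimate_position_bias(null_choices, num_options):
    """Empirical probability that the model picks each slot on content-free prompts.

    Add-one smoothing (an assumption) keeps every probability strictly positive.
    """
    counts = Counter(null_choices)
    total = len(null_choices)
    return [(counts.get(i, 0) + 1) / (total + num_options) for i in range(num_options)]


def inverse_bias_distribution(bias):
    """Weight each slot by 1/bias and normalise, so favoured slots hold the answer less often."""
    inverse = [1.0 / p for p in bias]
    norm = sum(inverse)
    return [w / norm for w in inverse]


def lucky_rate(bias, placement):
    """Chance that a content-blind model lands on the slot holding the answer."""
    return sum(b * q for b, q in zip(bias, placement))


def sample_answer_slot(placement, rng):
    """Draw the slot index for the correct answer from the placement distribution."""
    return rng.choices(range(len(placement)), weights=placement, k=1)[0]
```

Under this reciprocal weighting every slot contributes equally to the chance of a blind hit, and the overall lucky-rate cannot exceed 1/n for n options, the value obtained with uniform placement.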

[23] TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

Keyu Wu, Qianjin Yu, Manlin Mei, Ruiting Liu, Jun Wang, Kailai Zhang, Yelun Bao

Main category: cs.CL

TL;DR: The paper addresses the challenges of using AI for Root Cause Analysis (RCA) in telecommunication networks due to complex graph-based reasoning and lack of benchmarks.

Details

Motivation: The motivation is to highlight the difficulties AI faces in RCA for telecom networks, emphasizing the need for better benchmarks and methods.

Method: The available abstract does not describe the method; the title indicates benchmark construction and an agentic, self-improving framework for alarm-based RCA.

Result: The available abstract reports no results; it states only that RCA is challenging for AI because of complex graph-based reasoning requirements and the scarcity of realistic benchmarks.

Conclusion: The conclusion underscores the need for improved benchmarks and AI methods to tackle RCA in telecom networks effectively.

Abstract: Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.

[24] Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization

Aline Belloni, Patrick Prieur

Main category: cs.CL

TL;DR: The paper explores integrating ISO30401-compliant Knowledge Management Systems (KMS) with existing ISO9001-aligned business processes using the SECI model and PDCA cycles.

Details

Motivation: Organizations struggle to align ISO30401 KMS with operational processes, necessitating a clear integration method.

Method: The study uses process modeling principles from ISO9001 and applies the SECI model through PDCA cycles for KMS implementation.

Result: Demonstrates how KMS can seamlessly integrate with other processes in an Integrated Management System.

Conclusion: Effective KMS implementation requires aligning knowledge activities with operational workflows using structured models like SECI and PDCA.

Abstract: Business process modeling is used by most organizations as an essential framework for ensuring efficiency and effectiveness of the work and workflow performed by their employees and for ensuring the alignment of such work with their strategic goals. For organizations that are compliant or near-compliant with ISO 9001, this approach involves the detailed mapping of processes, sub-processes, activities, and tasks. ISO30401 is a Management System Standard, introduced in 2018, establishing universal requirements for the setup of a Knowledge Management System in an organization. As “ISO30401 implementers”, we regularly face the challenge of explaining to our clients how the knowledge development, transformation and conveyance activities depicted in ISO30401 integrate with existing operational processes. This article recaps process modelling principles in the context of ISO9001 and explores, based on our experience, how an ISO30401-compliant Knowledge Management System (KMS) entwines with all other processes of an Integrated Management System and in particular how it can be implemented by deploying the mechanisms of the SECI model through the steps of PDCA cycles.

[25] Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

San Kim, Jonghwi Kim, Yejin Jeon, Gary Geunbae Lee

Main category: cs.CL

TL;DR: GMTP is a defense method for RAG systems that detects poisoned documents by analyzing token gradients and masking high-impact tokens, achieving over 90% filtering accuracy.

Details

Motivation: RAG systems are vulnerable to poisoned documents, which can lead to harmful outputs. A robust defense mechanism is needed to ensure secure and accurate generation.

Method: GMTP identifies high-impact tokens using gradients of the retriever’s similarity function, masks them, and checks their probabilities via an MLM to detect malicious documents.

Result: GMTP filters out over 90% of poisoned content while preserving relevant documents, maintaining robust performance in adversarial settings.

Conclusion: GMTP effectively mitigates security risks in RAG systems by detecting and filtering poisoned documents, ensuring reliable and accurate generation.

Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk: attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.
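Illustration (not from the paper): the decision rule described above (pick the tokens with the largest gradients, mask them, and flag the document if an MLM finds them improbable) can be sketched as follows. The sketch takes per-token gradient norms and masked-token probabilities as precomputed lists, since producing them requires a real retriever and MLM. The function names, `top_k`, the averaging step, and the threshold value are all assumptions, not details given in the abstract.

```python
def select_high_impact(grad_norms, top_k):
    """Indices of the top_k tokens with the largest gradient norm."""
    ranked = sorted(range(len(grad_norms)), key=lambda i: grad_norms[i], reverse=True)
    return ranked[:top_k]


def masked_probability_score(grad_norms, masked_probs, top_k=2):
    """Average masked-token probability over the high-impact tokens.

    grad_norms[i]   : gradient magnitude of the retriever similarity w.r.t. token i
    masked_probs[i] : probability an MLM assigns to token i when it is masked
    Both lists are assumed to be computed elsewhere.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if not grad_norms or len(grad_norms) != len(masked_probs):
        raise ValueError("need one gradient norm and one probability per token")
    picked = select_high_impact(grad_norms, top_k)
    return sum(masked_probs[i] for i in picked) / len(picked)


def looks_poisoned(grad_norms, masked_probs, top_k=2, threshold=0.1):
    """Flag a document whose high-impact tokens are improbable under the MLM.

    The threshold is a hypothetical value chosen for illustration.
    """
    return masked_probability_score(grad_norms, masked_probs, top_k) < threshold
```

A document whose most influential tokens fit their context scores high and passes; one whose influence is concentrated in tokens the MLM finds very unlikely is flagged.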

[26] Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation

Kyubeen Han, Junseo Jang, Hongjin Kim, Geunyeong Jeong, Harksoo Kim

Main category: cs.CL

TL;DR: Instruction-tuning improves LLM usability but increases susceptibility to misinformation, shifting reliance from the assistant to the user role.

Details

Motivation: To understand how instruction-tuning affects LLMs' vulnerability to misinformation and explore mitigating factors.

Method: Investigates the impact of instruction-tuning on misinformation acceptance, comparing base and instruction-tuned models, and analyzing factors like prompt structure and warnings.

Result: Instruction-tuned LLMs are more likely to accept user-provided misinformation, with factors like user role and prompt warnings influencing susceptibility.

Conclusion: Systematic approaches are needed to mitigate instruction-tuning’s unintended effects and improve LLM reliability.

Abstract: Instruction-tuning enhances the ability of large language models (LLMs) to follow user instructions more accurately, improving usability while reducing harmful outputs. However, this process may increase the model’s dependence on user input, potentially leading to the unfiltered acceptance of misinformation and the generation of hallucinations. Existing studies primarily highlight that LLMs are receptive to external information that contradicts their parametric knowledge, but little research has been conducted on the direct impact of instruction-tuning on this phenomenon. In our study, we investigate the impact of instruction-tuning on LLMs’ susceptibility to misinformation. Our analysis reveals that instruction-tuned LLMs are significantly more likely to accept misinformation when it is presented by the user. A comparison with base models shows that instruction-tuning increases reliance on user-provided information, shifting susceptibility from the assistant role to the user role. Furthermore, we explore additional factors influencing misinformation susceptibility, such as the role of the user in prompt structure, misinformation length, and the presence of warnings in the system prompt. Our findings underscore the need for systematic approaches to mitigate unintended consequences of instruction-tuning and enhance the reliability of LLMs in real-world applications.

[27] Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan

Main category: cs.CL

TL;DR: Prune&Comp is a training-free layer pruning method for LLMs that compensates for performance degradation by rescaling weights offline, improving existing pruning metrics.

Details

Motivation: Removing layers in LLMs causes performance degradation due to hidden state magnitude gaps, which Prune&Comp aims to mitigate.

Method: Estimates magnitude gaps from layer removal and compensates by rescaling remaining weights offline, with zero runtime overhead.

Result: When pruning 5 layers of LLaMA-3-8B with the block influence metric, Prune&Comp nearly halves perplexity and retains 93.19% of the original model’s QA performance, outperforming the baseline by 4.01%.

Conclusion: Prune&Comp effectively enhances layer pruning by compensating for magnitude gaps, achieving better performance without training.

Abstract: Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the original model’s question-answering performance, outperforming the baseline by 4.01%.
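Illustration (not from the paper): the two steps described above, estimating the magnitude gap left by a removed layer and folding a compensating scale into the remaining weights, can be sketched with plain Python lists. Measuring the gap as a ratio of mean L2 norms on calibration hidden states, and applying it to a single hypothetical upstream weight matrix, are assumptions for this sketch; the abstract does not say how the gap is estimated or which weights are rescaled.

```python
import math


def mean_norm(hidden_states):
    """Average L2 norm over a batch of hidden-state vectors."""
    return sum(math.sqrt(sum(x * x for x in h)) for h in hidden_states) / len(hidden_states)


def magnitude_gap(layer_inputs, layer_outputs):
    """Ratio of average hidden-state norm after vs. before the layer to be removed.

    Once the layer is dropped, downstream layers see `layer_inputs` where they
    used to see `layer_outputs`, so this ratio is the scale that goes missing.
    """
    return mean_norm(layer_outputs) / mean_norm(layer_inputs)


def rescale_weights(weight, factor):
    """Fold the compensation factor into a weight matrix once, offline.

    Because the scale lives in the stored weights, inference needs no extra op.
    """
    return [[w * factor for w in row] for row in weight]
```

For example, if calibration states entering the removed layer have an average norm of 5 and those leaving it have an average norm of 10, the gap is 2.0, and a hypothetical upstream projection would be multiplied by 2.0 before the pruned model is saved.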

[28] Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

Suhang Wu, Jialong Tang, Chengyi Yang, Pei Zhang, Baosong Yang, Junhui Li, Junfeng Yao, Min Zhang, Jinsong Su

Main category: cs.CL

TL;DR: A novel Locate-and-Focus method improves terminology translation in direct speech translation by minimizing noise and better utilizing translation knowledge.

Details

Motivation: Accurate terminology translation in speech translation is challenging due to noise and underutilized translation knowledge.

Method: The proposed Locate-and-Focus method identifies terminology-containing speech clips, minimizes irrelevant noise, and associates translation knowledge with audio and text modalities.

Result: The method enhances terminology translation success and maintains general translation performance across datasets.

Conclusion: The Locate-and-Focus method effectively addresses terminology translation challenges in speech translation.

Abstract: Direct speech translation (ST) has garnered increasing attention, yet the accurate translation of terminology within utterances remains a great challenge. In this regard, current studies mainly concentrate on incorporating various translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and cannot fully utilize the translation knowledge. To address these issues, in this paper, we propose a novel Locate-and-Focus method for terminology translation. It first effectively locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.

[29] Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Nevidu Jayatilleke, Nisansa de Silva

Main category: cs.CL

TL;DR: The paper compares six OCR engines for low-resourced languages (Sinhala and Tamil), finding Surya best for Sinhala and Document AI for Tamil, and introduces a new Tamil OCR dataset.

Details

Motivation: OCR for high-resourced languages is well-solved, but low-resourced languages with unique scripts remain challenging.

Method: Comparative analysis of six OCR engines (commercial and open-source) using five measurement techniques for accuracy at character and word levels.

Result: Surya performed best for Sinhala (WER 2.61%), while Document AI excelled for Tamil (CER 0.78%).

Conclusion: The study highlights the need for tailored OCR solutions for low-resourced languages and introduces a new Tamil dataset for benchmarking.

Abstract: Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
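The two headline metrics, character error rate (CER) and word error rate (WER), are both edit distance normalized by reference length. The sketch below uses their standard definitions. The paper's exact tooling and its other three measures are not given in this summary, so the helper names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

reference = "the cat sat"
ocr_output = "the bat sat"
example_cer = cer(reference, ocr_output)  # one wrong character out of 11
example_wer = wer(reference, ocr_output)  # one wrong word out of 3
```

One caveat for Sinhala and Tamil: Python strings iterate over Unicode code points, so a base letter plus its combining vowel sign counts as two characters here. A script-aware evaluation may prefer grapheme clusters.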

[30] StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

Pritika Ramu, Apoorv Saxena, Meghanath M Y, Varsha Sankar, Debraj Basu

Main category: cs.CL

TL;DR: StyleAdaptedLM uses LoRA to adapt LLMs to specific styles without paired data, maintaining task performance.

Details

Motivation: Enterprise communication requires LLMs to adopt brand or authorial tones, but current methods struggle without instruction-response data.

Method: Train LoRA adapters on unstructured stylistic corpora, then merge with instruction-following models.

Result: Improved stylistic consistency and instruction adherence, validated by human evaluations.

Conclusion: StyleAdaptedLM efficiently enables stylistic personalization in LLMs.

Abstract: Adapting LLMs to specific stylistic characteristics, like brand voice or authorial tones, is crucial for enterprise communication but is challenging to achieve from corpora that lack instruction-response formatting without compromising instruction adherence. We introduce StyleAdaptedLM, a framework that efficiently transfers stylistic traits to instruction-following models using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base model with diverse unstructured stylistic corpora, then merged with a separate instruction-following model. This enables robust stylistic customization without paired data or sacrificing task performance. Experiments across multiple datasets and models demonstrate improved stylistic consistency while preserving instruction adherence, with human evaluations confirming brand-specific convention uptake. StyleAdaptedLM offers an efficient path for stylistic personalization in LLMs.
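The merge step can be illustrated with the standard LoRA update, W' = W + (alpha / r) * B A, applied to the instruction-following model's weight rather than to the base model the adapter was trained on. This is a minimal sketch with toy matrices and hypothetical names. It assumes both models share the same architecture and says nothing about which layers the authors adapt.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(w_instruct, lora_b, lora_a, alpha, rank):
    """Add the scaled low-rank update B @ A to an instruction model's weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_instruct, delta)]

w_instruct = [[1.0, 0.0], [0.0, 1.0]]  # toy weight from the instruction-following model
lora_b = [[1.0], [2.0]]                # 2 x r factor, learned on the style corpus
lora_a = [[0.5, -0.5]]                 # r x 2 factor, learned on the style corpus
merged = merge_lora(w_instruct, lora_b, lora_a, alpha=2.0, rank=1)
```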

[31] BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, Yiming Li

Main category: cs.CL

TL;DR: The paper introduces ‘overthinking backdoors’ as a novel attack on large reasoning models (LRMs), enabling precise control over reasoning verbosity via data poisoning while maintaining output correctness.

Details

Motivation: To uncover and exploit vulnerabilities in LRMs, specifically their chain-of-thought reasoning, by introducing stealthy, tunable backdoors.

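Method: A data poisoning approach pairs a tunable trigger, whose repetition count signals the desired intensity, with correspondingly verbose CoT responses. A teacher LLM generates these responses by injecting a controlled number of redundant refinement steps into a correct reasoning process.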

Result: Empirical results show controllable multi-fold increases in reasoning length without degrading answer accuracy.

Conclusion: The study demonstrates a new attack vector on LRMs, highlighting their vulnerability to resource-consumption attacks while preserving correctness.

Abstract: Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term “overthinking backdoors”. We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model’s reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger, where the number of repetitions signals the desired intensity, with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer’s correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.

[32] Uncertainty Quantification for Evaluating Machine Translation Bias

Ieva Raminta Staliūnaitė, Julius Cheng, Andreas Vlachos

Main category: cs.CL

TL;DR: MT models struggle with gender inference in ambiguous contexts, often relying on stereotypes. The study argues that models should maintain uncertainty in ambiguous cases and finds that debiasing has independent effects on ambiguous and unambiguous translation instances.

Details

Motivation: To address gender bias in machine translation, especially when gender is ambiguous, and ensure models handle uncertainty appropriately.

Method: Analyzed MT models using semantic uncertainty metrics to evaluate gender accuracy and uncertainty in ambiguous vs. unambiguous contexts.

Result: Models with high translation and gender accuracy on unambiguous instances do not necessarily show the expected uncertainty on ambiguous ones. Debiasing has independent effects on ambiguous and unambiguous translation instances.

Conclusion: MT models need better handling of gender ambiguity and uncertainty, with debiasing strategies tailored to context-specific challenges.

Abstract: In machine translation (MT), when the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and/or external knowledge. Studies have shown that MT models exhibit biased behaviour, relying on stereotypes even when they clash with contextual information. We posit that apart from confidently translating using the correct gender when it is evident from the input, models should also maintain uncertainty about the gender when it is ambiguous. Using recently proposed metrics of semantic uncertainty, we find that models with high translation and gender accuracy on unambiguous instances do not necessarily exhibit the expected level of uncertainty in ambiguous ones. Similarly, debiasing has independent effects on ambiguous and unambiguous translation instances.
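One way to picture uncertainty about gender is to sample several translations of the same source, record the gender each one realizes, and take the entropy of that distribution. The sketch below is a deliberate simplification. It is not the semantic uncertainty metric used in the paper, and the label values are hypothetical.

```python
import math
from collections import Counter

def gender_entropy(gender_labels):
    """Shannon entropy (in bits) of the genders realized across sampled translations."""
    counts = Counter(gender_labels)
    total = len(gender_labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Unambiguous source: every sampled translation should agree, so entropy is zero.
confident = gender_entropy(["f", "f", "f", "f"])

# Ambiguous source: an evenly split sample reaches the two-class maximum of 1 bit.
uncertain = gender_entropy(["f", "m", "f", "m"])
```

Under this reading, the desired behavior is low entropy when the context fixes the gender and high entropy when it does not.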

[33] TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

Yifu Chen, Bingchen Huang, Zhiling Wang, Yuanchao Du, Junfeng Luo, Lei Shen, Zhineng Chen

Main category: cs.CL

TL;DR: TDR is a novel framework that improves in-context learning (ICL) by decoupling examples from different tasks and modeling fine-grained feedback from LLMs to enhance retrieval quality.

Details

Motivation: Existing ICL methods struggle with distinguishing cross-task data distributions and connecting retriever output to LLM feedback.

Method: TDR decouples task-specific examples and uses LLM feedback to supervise retrieval training.

Result: TDR achieves state-of-the-art performance on 30 NLP tasks and is compatible with various LLMs.

Conclusion: TDR effectively addresses retrieval challenges in ICL and is a versatile, plug-and-play solution.

Abstract: In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL heavily relies on the quality of these examples, and previous works that focused on enhancing example retrieval capabilities have achieved impressive performance. However, two challenges remain in retrieving high-quality examples: (1) difficulty in distinguishing cross-task data distributions, and (2) difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks; the results demonstrate that TDR consistently improves results across all datasets and achieves state-of-the-art performance. Meanwhile, our approach is a plug-and-play method, which can be easily combined with various LLMs to improve example retrieval abilities for ICL. The code is available at https://github.com/Nnn-s/TDR.
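The goal of retrieving only target-task examples from a multi-task pool can be sketched as a task filter followed by similarity ranking. This is an illustration of the objective, not TDR itself: the paper learns the decoupling and trains the retriever with LLM feedback, whereas the hard filter, the toy embeddings, and the field names below are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, target_task, pool, k=2):
    """Return the k most similar demonstrations that belong to the target task."""
    candidates = [ex for ex in pool if ex["task"] == target_task]
    ranked = sorted(candidates,
                    key=lambda ex: cosine(query_vec, ex["vec"]),
                    reverse=True)
    return [ex["text"] for ex in ranked[:k]]

pool = [
    {"task": "sentiment", "text": "great movie -> positive", "vec": [1.0, 0.0]},
    {"task": "sentiment", "text": "dull plot -> negative", "vec": [0.6, 0.8]},
    {"task": "nli", "text": "A entails B -> entailment", "vec": [1.0, 0.1]},
]
top = retrieve([1.0, 0.0], "sentiment", pool, k=2)
```

Without the task filter, the NLI example would outrank the second sentiment example purely on embedding similarity, which is the cross-task confusion the abstract describes.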

[34] Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence

Ariana Sahitaj, Premtim Sahitaj, Veronika Solopova, Jiaao Li, Sebastian Möller, Vera Schmitt

Main category: cs.CL

TL;DR: A novel framework combining human expertise and LLM assistance improves propaganda detection on social media by enhancing annotation consistency and scalability.

Details

Motivation: The challenge of detecting propaganda on social media due to task complexity and limited labeled data drives the need for scalable and robust solutions.

Method: Proposes a hierarchical taxonomy, conducts human annotation, implements an LLM-assisted pre-annotation pipeline, and fine-tunes smaller models using LLM-generated data.

Result: Significant improvements in inter-annotator agreement and time-efficiency, with smaller models trained on LLM-generated data performing structured annotation.

Conclusion: The framework supports scalable and transparent propaganda detection, aligning with SDG 16 goals for accountable media ecosystems.

Abstract: Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.

[35] CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer

Main category: cs.CL

TL;DR: CLEAR is an interactive, open-source tool for analyzing LLM errors by providing detailed feedback, system-level issues, and visualizations.

Details

Motivation: Current LLM evaluations lack actionable insights, focusing only on scores or rankings.

Method: CLEAR generates per-instance feedback, identifies system-level issues, and quantifies their prevalence, supported by an interactive dashboard.

Result: Demonstrated utility in RAG and Math benchmarks through a user case study.

Conclusion: CLEAR bridges the gap in LLM evaluation by offering detailed, actionable error analysis.

Abstract: The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model’s performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then it creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.

[36] Factual Inconsistencies in Multilingual Wikipedia Tables

Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

Main category: cs.CL

TL;DR: The study examines cross-lingual inconsistencies in Wikipedia’s structured content, particularly tabular data, proposing a methodology to analyze and categorize these inconsistencies.

Details

Motivation: Wikipedia's independent updates across languages lead to factual inconsistencies, affecting its reliability and AI systems relying on it.

Method: Developed a methodology to collect, align, and analyze multilingual Wikipedia tables, using quantitative and qualitative metrics.

Result: Identified categories of inconsistency and assessed multilingual alignment with a sample dataset.

Conclusion: Findings impact factual verification, multilingual knowledge interaction, and reliable AI system design.

Abstract: Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia’s structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.

[37] FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

Giorgos Iacovides, Wuyang Zhou, Danilo Mandic

Main category: cs.CL

TL;DR: FinDPO, a finance-specific LLM framework using Direct Preference Optimization (DPO), outperforms SFT models by 11% in sentiment analysis and achieves 67% annual returns with a Sharpe ratio of 2.0.

Details

Motivation: Online financial opinions increasingly influence trading decisions, but SFT LLMs can memorize training data and fail to generalize to unseen samples. FinDPO addresses this with post-training human preference alignment.

Method: FinDPO uses DPO for post-training alignment and introduces a ‘logit-to-score’ conversion for continuous sentiment scoring.

Result: FinDPO outperforms SFT models by 11% and achieves 67% annual returns with a Sharpe ratio of 2.0.

Conclusion: FinDPO is a robust, high-performing framework for financial sentiment analysis and portfolio strategies.

Abstract: Opinions expressed in online finance-related textual data are having an increasingly profound impact on trading decisions and market movements. This trend highlights the vital role of sentiment analysis as a tool for quantifying the nature and strength of such opinions. With the rapid development of Generative AI (GenAI), supervised fine-tuned (SFT) large language models (LLMs) have become the de facto standard for financial sentiment analysis. However, the SFT paradigm can lead to memorization of the training data and often fails to generalize to unseen samples. This is a critical limitation in financial domains, where models must adapt to previously unobserved events and the nuanced, domain-specific language of finance. To this end, we introduce FinDPO, the first finance-specific LLM framework based on post-training human preference alignment via Direct Preference Optimization (DPO). The proposed FinDPO achieves state-of-the-art performance on standard sentiment classification benchmarks, outperforming existing supervised fine-tuned models by 11% on average. Uniquely, the FinDPO framework enables the integration of a fine-tuned causal LLM into realistic portfolio strategies through a novel ‘logit-to-score’ conversion, which transforms discrete sentiment predictions into continuous, rankable sentiment scores (probabilities). In this way, simulations demonstrate that FinDPO is the first sentiment-based approach to maintain substantial positive returns of 67% annually and strong risk-adjusted performance, as indicated by a Sharpe ratio of 2.0, even under realistic transaction costs of 5 basis points (bps).
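A conversion from label logits to a continuous score might look like the sketch below. It assumes three sentiment classes and a softmax over their logits. Collapsing the probabilities into a single signed score as P(positive) minus P(negative) is an illustrative choice, not something the abstract specifies.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentiment_score(logit_neg, logit_neu, logit_pos):
    """Map class logits to a rankable score in [-1, 1] (hypothetical scoring rule)."""
    p_neg, _p_neu, p_pos = softmax([logit_neg, logit_neu, logit_pos])
    return p_pos - p_neg

bullish = sentiment_score(-2.0, 0.0, 3.0)   # strongly positive logits
neutral = sentiment_score(0.0, 0.0, 0.0)    # no preference between classes
bearish = sentiment_score(3.0, 0.0, -2.0)   # strongly negative logits
```

A continuous score like this lets a portfolio rule rank assets by sentiment strength instead of grouping them into three buckets.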

[38] AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data

Rana Alshaikh, Israa Alghanmi, Shelan Jeawak

Main category: cs.CL

TL;DR: AraTable is a new benchmark for evaluating LLMs on Arabic tabular data, showing their limitations in complex reasoning tasks and proposing an automated evaluation framework.

Details

Motivation: Address the underrepresentation of Arabic in tabular data benchmarks and the limited performance of LLMs in interpreting structured Arabic data.

Method: A hybrid pipeline where LLMs generate initial content, which is then filtered and verified by human experts to create a high-quality dataset.

Result: LLMs perform well on simple tasks like direct question answering but struggle with deeper reasoning and fact verification.

Conclusion: AraTable provides a valuable resource and framework for improving LLMs’ performance on Arabic tabular data, with potential for future enhancements.

Abstract: The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.

[39] Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language

Md Obyedullahil Mamun, Md Adyelullahil Mamun, Arif Ahmad, Md. Imran Hossain Emu

Main category: cs.CL

TL;DR: Transformer-based model (XLM-RoBERTa-large) achieves high accuracy in restoring punctuation in Bangla text, addressing low-resource challenges with data augmentation.

Details

Motivation: Enhancing readability and post-processing in ASR for low-resource languages like Bangla by automating punctuation restoration.

Method: Uses XLM-RoBERTa-large, focuses on four punctuation marks, employs data augmentation, and constructs a diverse training corpus.

Result: Achieves 97.1% accuracy on News, 91.2% on Reference, and 90.2% on ASR sets, showing strong generalization.

Conclusion: Establishes a baseline for Bangla punctuation restoration and provides public datasets/code for low-resource NLP research.

Abstract: Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model’s effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
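Punctuation restoration is commonly framed as token classification: the model assigns each word a label naming the mark that should follow it. The sketch below shows only the final step of turning such labels back into text. The label names and this framing are assumptions, since the summary does not describe the model's output format.

```python
# Hypothetical label set for the four marks named in the abstract, plus "no mark".
PUNCT = {
    "O": "",
    "PERIOD": "।",        # the Bangla full stop (dari)
    "COMMA": ",",
    "QUESTION": "?",
    "EXCLAMATION": "!",
}

def restore(tokens, labels):
    """Append the predicted mark to each token and rejoin the sentence."""
    return " ".join(tok + PUNCT[lab] for tok, lab in zip(tokens, labels))

tokens = ["আমি", "ভালো", "আছি"]
labels = ["O", "O", "PERIOD"]   # stand-in for per-token model predictions
restored = restore(tokens, labels)
```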

[40] Generation of Synthetic Clinical Text: A Systematic Review

Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, Venkata Satagopam

Main category: cs.CL

TL;DR: A systematic review on generating synthetic medical free-text, analyzing purpose, techniques, and evaluation methods, with findings on utility, privacy, and transformer architectures like GPTs.

Details

Motivation: Addressing clinical NLP issues like sparsity and privacy by generating synthetic medical text.

Method: Systematic review of 94 relevant articles from 1,398 collected, focusing on purpose, techniques, and evaluation methods.

Result: Transformer architectures (e.g., GPTs) dominate; utility is the main evaluation method; synthetic text aids NLP tasks but privacy concerns remain.

Conclusion: Synthetic medical text improves NLP tasks but requires more human assessment for privacy; advances will streamline workflows.

Abstract: Generating clinical synthetic text represents an effective solution for common clinical NLP issues like sparsity and privacy. This paper aims to conduct a systematic review on generating synthetic medical free-text by applying quantitative analysis to three research questions concerning (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases for publications associated with generating synthetic medical unstructured free-text. We have identified 94 relevant articles out of 1,398 collected ones. A great deal of attention has been given to the generation of synthetic medical text from 2018 onwards, where the main purpose of such a generation is towards text augmentation, assistive writing, corpus building, privacy-preserving, annotation, and usefulness. Transformer architectures were the predominant technique used to generate the text, especially the GPTs. On the other hand, there were four main aspects of evaluation, including similarity, privacy, structure, and utility, where utility was the most frequent method used to assess the generated synthetic medical text. Although the generated synthetic medical text demonstrated a moderate possibility to act as real medical documents in different downstream NLP tasks, it has proven to be a great asset as augmented, complementary to the real documents, towards improving the accuracy and overcoming sparsity/undersampling issues. Yet, privacy is still a major issue behind generating synthetic medical text, where more human assessments are needed to check for the existence of any sensitive information. Despite that, advances in generating synthetic medical text will considerably accelerate the adoption of workflows and pipeline development, discarding the time-consuming legalities of data transfer.

[41] Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: GraDe improves LLM-based tabular data generation by integrating sparse dependency graphs into attention mechanisms, outperforming existing methods by up to 12%.

Details

Motivation: LLMs struggle with sparse feature-level dependencies in tabular data, diluting attention on critical relationships.

Method: GraDe uses a lightweight dynamic graph learning module guided by functional dependencies to prioritize key feature interactions.

Result: GraDe outperforms existing LLM-based methods by up to 12% on complex datasets and matches state-of-the-art in synthetic data quality.

Conclusion: GraDe offers a practical, minimally intrusive solution for structure-aware tabular data modeling with LLMs.

Abstract: Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs’ self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs’ attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.
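The idea of steering attention with a sparse dependency graph can be sketched as a mask applied before the softmax, so that a feature attends only to itself and to the features it depends on. GraDe learns its graph dynamically and integrates it into the LLM, so the fixed hard mask and the names below are simplifying assumptions.

```python
import math

def masked_attention_weights(scores, adjacency):
    """Row-wise softmax over attention scores, restricted to graph neighbors and self."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if (adjacency[i][j] or i == j) else float("-inf")
                  for j, s in enumerate(row)]
        m = max(masked)  # finite, because the diagonal is always kept
        exps = [math.exp(x - m) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three features: 0 and 1 depend on each other, 2 has no dependencies.
adjacency = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform raw attention scores
weights = masked_attention_weights(scores, adjacency)
```

With uniform scores, unmasked attention would spread one third of the weight to every pair. The mask removes the structurally irrelevant pairs and concentrates weight on the dependent ones.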

[42] The Moral Gap of Large Language Models

Maciej Skorski, Alina Landowska

Main category: cs.CL

TL;DR: LLMs underperform in moral foundation detection compared to fine-tuned transformers, showing high false negatives and systematic under-detection. Fine-tuning is superior to prompting for this task.

Details

Motivation: To evaluate the effectiveness of LLMs versus fine-tuned transformers in detecting moral foundations, crucial for ethical AI and social discourse analysis.

Method: Comprehensive comparison using ROC, PR, and DET curve analysis on Twitter and Reddit datasets.

Result: LLMs exhibit high false negative rates and under-detection of moral content, despite prompt engineering.

Conclusion: Task-specific fine-tuning outperforms prompting for moral reasoning applications.

Abstract: Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
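The under-detection finding is expressed through the false negative rate, the share of truly moral texts that a model labels as non-moral. The sketch below uses the standard definition on made-up labels and is unrelated to the paper's data.

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP), computed over the positive (moral content) class."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (fn + tp)

# Toy example: four texts contain moral content, and the model finds only one.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
fnr = false_negative_rate(y_true, y_pred)
```

A DET curve plots this quantity against the false positive rate across decision thresholds, which is why it suits an analysis centered on missed detections.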

[43] Effective Multi-Task Learning for Biomedical Named Entity Recognition

João Ruano, Gonçalo M. Correia, Leonor Barreiros, Afonso Mendes

Main category: cs.CL

TL;DR: SRU-NER is a novel method for biomedical NER, handling nested entities and multi-dataset integration via dynamic loss adjustment, achieving strong performance and cross-domain generalization.

Details

Motivation: Addressing challenges in biomedical NER due to complex terminology and inconsistent annotations across datasets.

Method: Introduces SRU-NER, using slot-based recurrent units and multi-task learning with dynamic loss computation to handle nested entities and dataset gaps.

Result: Achieves competitive performance in biomedical and general-domain NER, with improved cross-domain generalization.

Conclusion: SRU-NER effectively tackles nested entities and dataset inconsistencies, demonstrating robust performance and generalization.

Abstract: Biomedical Named Entity Recognition presents significant challenges due to the complexity of biomedical terminology and inconsistencies in annotation across datasets. This paper introduces SRU-NER (Slot-based Recurrent Unit NER), a novel approach designed to handle nested named entities while integrating multiple datasets through an effective multi-task learning strategy. SRU-NER mitigates annotation gaps by dynamically adjusting loss computation to avoid penalizing predictions of entity types absent in a given dataset. Through extensive experiments, including a cross-corpus evaluation and human assessment of the model’s predictions, SRU-NER achieves competitive performance in biomedical and general-domain NER tasks, while improving cross-domain generalization.
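
The loss adjustment described above can be sketched as a masked loss: entity types that a dataset does not annotate are left out of the loss, so a correct prediction of such a type is not penalized. This is a simplified illustration under assumed names (`masked_bce`, per-type probabilities), not the SRU-NER implementation.

```python
import math

def masked_bce(probs, labels, annotated_types):
    """Mean binary cross-entropy over the entity types a dataset annotates.

    probs:           {entity_type: predicted probability} for one span
    labels:          {entity_type: 0 or 1}; missing types count as 0
    annotated_types: entity types labeled in the span's source dataset
    """
    total, count = 0.0, 0
    for etype, p in probs.items():
        if etype not in annotated_types:
            continue  # absent from this dataset: contributes no loss
        y = labels.get(etype, 0)
        p = min(max(p, 1e-9), 1 - 1e-9)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0

# A span from a gene-only corpus that the model also tags as a disease.
probs = {"Gene": 0.9, "Disease": 0.8}
labels = {"Gene": 1}
masked = masked_bce(probs, labels, annotated_types={"Gene"})
unmasked = masked_bce(probs, labels, annotated_types={"Gene", "Disease"})
```

Without the mask, the confident "Disease" prediction is treated as an error only because the gene-only corpus never labels diseases, so `unmasked` is larger than `masked`.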

[44] GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis

Main category: cs.CL

TL;DR: GLiNER2 is a unified framework for information extraction tasks, offering efficiency and versatility without relying on large language models.

Details

Motivation: Existing IE solutions are either task-specific or computationally expensive, limiting accessibility and practicality.

Method: GLiNER2 enhances the original GLiNER architecture with a pretrained transformer encoder, supporting multi-task composition via a schema-based interface.

Result: Competitive performance in extraction and classification tasks, with improved deployment accessibility.

Conclusion: GLiNER2 provides an efficient, open-source alternative to LLM-based IE solutions, released with pre-trained models and documentation.

Abstract: Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at https://github.com/fastino-ai/GLiNER2.

[45] GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

Jiafeng Xiong, Yuting Zhao

Main category: cs.CL

TL;DR: GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework, outperforms existing methods by using multimodal scene graphs and cross-modal Graph Attention Networks, achieving state-of-the-art results without images during inference.

Details

Motivation: Existing MMT methods struggle with rigid visual-linguistic alignment and limited inference to trained multimodal domains.

Method: Constructs multimodal scene graphs and introduces GIIFT, a framework using cross-modal Graph Attention Networks to learn and generalize multimodal knowledge.

Result: Achieves state-of-the-art performance on Multi30K and WMT benchmarks, even without images during inference.

Conclusion: GIIFT effectively bridges the modality gap and generalizes to image-free translation domains, outperforming existing approaches.

Abstract: Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.

[46] Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

Ganesh Sapkota, Md Hasibur Rahman

Main category: cs.CL

TL;DR: A hybrid tokenization strategy combining 6-mer and BPE-600 enhances DNA Language Models (DLMs), outperforming state-of-the-art models in next-k-mer prediction tasks.

Details

Motivation: Traditional k-mer tokenization struggles with uneven token distribution and limited global context understanding, prompting the need for a hybrid approach.

Method: The proposed method merges 6-mer tokens with BPE-600 tokens to create a balanced, context-aware vocabulary for DLMs.

Result: The model achieved accuracies of 10.78% (3-mers), 10.1% (4-mers), and 4.12% (5-mers), surpassing models like NT, DNABERT2, and GROVER.

Conclusion: The hybrid strategy effectively captures local and global DNA sequence patterns, advancing genomic language modeling for future biological research.

Abstract: This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.
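
A rough sketch of how such a hybrid vocabulary could be assembled: collect the unique 6-mers of a sequence, learn merge tokens with a toy BPE loop, and take the union. The abstract does not say how BPE tokens are "optimally selected", so this example keeps every learned merge; the function names and the overlapping-window choice for k-mers are assumptions.

```python
from collections import Counter

def kmers(seq, k=6):
    """All overlapping k-mers of a DNA string (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokens(seq, cycles):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    symbols = list(seq)
    learned = []
    for _ in range(cycles):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats any more, so stop early
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
        learned.append(a + b)
    return learned

def hybrid_vocab(seq, k=6, cycles=600):
    """Union of unique k-mers and learned BPE tokens."""
    return sorted(set(kmers(seq, k)) | set(bpe_tokens(seq, cycles)))

sequence = "ACGTACGTACGT"
vocab = hybrid_vocab(sequence)
```

The resulting vocabulary mixes fixed-length 6-mers, which capture local structure, with variable-length BPE tokens that can be shorter or longer than six bases.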

[47] Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao

Main category: cs.CL

TL;DR: WINO is a training-free decoding algorithm for DLLMs that improves quality-speed trade-off by allowing revokable decoding through parallel draft-and-verify.

Details

Motivation: Existing DLLMs suffer from a quality-speed trade-off due to irreversible decoding, leading to performance degradation.

Method: WINO uses a parallel draft-and-verify mechanism to draft multiple tokens and verify/re-mask suspicious ones using bidirectional context.

Result: WINO achieves significant speedups (6× on GSM8K, 10× on Flickr30K) while improving accuracy (2.58% on GSM8K).

Conclusion: WINO effectively resolves the quality-speed trade-off in DLLMs, demonstrating superior performance in various benchmarks.

Abstract: Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model’s bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6× while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10× speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
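
The draft-and-verify loop can be illustrated with a toy decoder. This is only a sketch of the general pattern (draft with a loose threshold, then re-mask tokens that fail a stricter check in the updated context); the thresholds, the `predict` interface, and the toy model are assumptions and do not reproduce WINO's actual verification procedure.

```python
MASK = None

def revokable_decode(predict, length, draft_tau=0.5, verify_tau=0.8, max_steps=20):
    """Fill `length` positions; filled tokens may be re-masked and redrafted.

    predict(tokens) is a hypothetical model call returning one
    (token, confidence) pair per position given the current sequence.
    """
    tokens = [MASK] * length
    for _ in range(max_steps):
        if MASK not in tokens:
            break
        preds = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Wide-in: draft every masked position that clears the loose threshold.
        drafted = [i for i in masked if preds[i][1] >= draft_tau]
        if not drafted:  # always make progress on at least one position
            drafted = [max(masked, key=lambda i: preds[i][1])]
        for i in drafted:
            tokens[i] = preds[i][0]
        # Narrow-out: re-score with the new context and revoke weak tokens.
        checks = predict(tokens)
        for i, t in enumerate(tokens):
            if t is not MASK and (checks[i][0] != t or checks[i][1] < verify_tau):
                tokens[i] = MASK
    return tokens

def toy_predict(tokens, target="abcd"):
    """Toy model: confident only when the left neighbor is already filled."""
    out = []
    for i in range(len(tokens)):
        if i == 0 or tokens[i - 1] is not MASK:
            out.append((target[i], 0.9))
        else:
            out.append(("?", 0.6))
    return out

decoded = revokable_decode(toy_predict, 4)
```

In the toy run, early low-confidence drafts ("?") are revoked once the surrounding context improves, which is the behavior an irreversible decoder cannot provide.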

[48] System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

Jiahao Wang, Ramen Liu, Longhui Zhang, Jing Li

Main category: cs.CL

TL;DR: The paper introduces the SRAG-MAV framework for Fine-Grained Chinese Hate Speech Recognition, combining task reformulation, self-retrieval-augmented generation, and multi-round voting to outperform baselines like GPT-4o.

Details

Motivation: To improve performance in fine-grained hate speech recognition by addressing the limitations of existing methods.

Method: Proposes SRAG-MAV: task reformulation, dynamic retrieval for prompts, and multi-round voting for stable outputs using Qwen2.5-7B.

Result: Achieves Hard Score 26.66, Soft Score 48.35, and Average Score 37.505, surpassing GPT-4o and fine-tuned Qwen2.5-7B.

Conclusion: The SRAG-MAV framework effectively enhances hate speech recognition, demonstrating significant improvements over baseline models.

Abstract: This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates Task Reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at https://github.com/king-wang123/CCL25-SRAG-MAV.
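
One simple way to picture multi-round voting is to sample the model repeatedly and accumulate votes until one output reaches a threshold. The sketch below assumes that reading; the stopping rule, the `threshold` value, and the `generate` callable are hypothetical and may differ from the authors' MAV procedure. The last lines check that the reported Average Score is consistent with the mean of the Hard and Soft scores.

```python
from collections import Counter

def accumulative_vote(generate, max_rounds=10, threshold=3):
    """Call `generate` repeatedly and return the first output to reach
    `threshold` votes, or the most frequent output after `max_rounds`."""
    votes = Counter()
    for _ in range(max_rounds):
        output = generate()
        votes[output] += 1
        if votes[output] >= threshold:
            return output
    return votes.most_common(1)[0][0]

# Stand-in for repeated model inference on one input (made-up outputs).
samples = iter(["A", "B", "A", "C", "A", "B"])
winner = accumulative_vote(lambda: next(samples))

# Reported scores from the abstract: the average matches (hard + soft) / 2.
hard_score, soft_score = 26.66, 48.35
average_score = (hard_score + soft_score) / 2
```

Here the vote stops at the fifth sample, when "A" collects its third vote, so unstable one-off outputs such as "C" never win.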

[49] AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs

Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang

Main category: cs.CL

TL;DR: AQuilt is a framework for creating instruction-tuning data from unlabeled data in specialized domains, improving LLM performance with logic and inspection, while reducing costs.

Details

Motivation: Large language models underperform in specialized domains, and existing data synthesis methods are costly or limited in generalization.

Method: AQuilt constructs instruction-tuning data from unlabeled data, incorporating logic, inspection, and customizable task instructions.

Result: AQuilt achieves performance comparable to DeepSeek-V3 with 17% of the cost and higher task relevance.

Conclusion: AQuilt offers an efficient, cost-effective solution for enhancing LLM performance in specialized domains.

Abstract: Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domains from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.

[50] TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

Andreea Nica, Ivan Zakazov, Nicolas Mario Baldwin, Saibo Geng, Robert West

Main category: cs.CL

TL;DR: TRPrompt unifies textual feedback and numerical rewards for prompt optimization, improving LLM reasoning without parameter updates.

Details

Motivation: To bridge the gap between heuristic-based and reward-trained prompt optimization methods by incorporating textual feedback directly into training.

Method: Introduces TRPrompt, a framework that trains a prompt model using textual rewards, iteratively improving prompts without prior dataset collection.

Result: Achieves state-of-the-art performance on GSMHard and MATH datasets by generating query-specific prompts.

Conclusion: TRPrompt effectively combines textual feedback and training, enhancing LLM reasoning capabilities.

Abstract: Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following heuristic-based “Think step by step” approaches, the field has evolved in two main directions: while one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, a concurrent line of research relies on numerical rewards to train a special prompt model, tailored for providing optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by directly incorporating textual feedback into training of the prompt model. Our framework does not require prior dataset collection and is being iteratively improved with the feedback on the generated prompts. When coupled with the capacity of an LLM to internalize the notion of what a “good” prompt is, the high-resolution signal provided by the textual rewards allows us to train a prompt model yielding state-of-the-art query-specific prompts for the problems from the challenging math datasets GSMHard and MATH.

[51] Checklists Are Better Than Reward Models For Aligning Language Models

Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu

Main category: cs.CL

TL;DR: The paper introduces Reinforcement Learning from Checklist Feedback (RLCF), a method using flexible, instruction-specific criteria to improve language models’ instruction-following capabilities, outperforming other alignment methods across multiple benchmarks.

Details

Motivation: Current reinforcement learning methods for adapting language models rely on fixed criteria, limiting their effectiveness. The goal is to broaden the impact of reinforcement learning by using instruction-specific feedback.

Method: Proposes RLCF, where checklists are extracted from instructions and responses are evaluated against these checklists using AI judges and verifier programs. The scores are combined to compute rewards for reinforcement learning.

Result: RLCF improves performance on all five benchmarks tested, including notable boosts in hard satisfaction rate, information fidelity, and win rate compared to other methods.

Conclusion: Checklist feedback is a powerful tool for enhancing language models’ ability to support diverse user queries, demonstrating significant improvements over existing alignment techniques.

Abstract: Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmfulness”. In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose “Reinforcement Learning from Checklist Feedback” (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item - using both AI judges and specialized verifier programs - then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models’ support of queries that express a multitude of needs.
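
To show how per-item checklist scores might be folded into a single reward, here is a small sketch. The abstract says only that judge and verifier scores are combined; the equal-weight averaging, the [0, 1] score range, and all names below are assumptions made for illustration.

```python
def item_score(judge_score, verifier_passed=None):
    """Score one checklist item in [0, 1].

    judge_score:     AI judge's rating of how well the item is satisfied
    verifier_passed: True/False from a verifier program, or None if the
                     item has no programmatic check
    """
    if verifier_passed is None:
        return judge_score
    # Assumed combination: average the judge rating with the pass/fail check.
    return (judge_score + (1.0 if verifier_passed else 0.0)) / 2

def checklist_reward(items):
    """Mean item score over a checklist of (judge_score, verifier_passed)."""
    scores = [item_score(judge, verified) for judge, verified in items]
    return sum(scores) / len(scores)

# Hypothetical checklist for one instruction-response pair.
checklist = [
    (0.8, True),   # e.g. "answer is under 100 words", checked by a program
    (0.5, None),   # e.g. "tone is friendly", judged only
    (0.9, False),  # judge was satisfied but the verifier failed the item
]
reward = checklist_reward(checklist)
```

Because the criteria come from each instruction rather than a fixed notion such as "helpfulness", the same response-scoring code yields a different reward signal for every prompt.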

[52] Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias

Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin

Main category: cs.CL

TL;DR: The paper introduces a causal framework for measuring bias in LLMs, proposes the OccuGender benchmark to assess occupational gender bias, and tests state-of-the-art models, revealing significant bias. It also discusses mitigation strategies and framework generalizability.

Details

Motivation: To understand and measure harmful, human-like biases in LLM-generated texts, particularly focusing on occupational gender bias.

Method: Introduces a causal formulation for bias measurement, designs the OccuGender benchmark, and tests models like Llama and Mistral.

Result: State-of-the-art LLMs exhibit substantial occupational gender bias.

Conclusion: The paper provides a robust framework for bias measurement, highlights bias in LLMs, and suggests mitigation strategies, with potential for broader application.

Abstract: Generated texts from large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics. These findings motivate research efforts aiming to understand and measure such effects. This paper introduces a causal formulation for bias measurement in generative language models. Based on this theoretical foundation, we outline a list of desiderata for designing robust bias benchmarks. We then propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. We test several state-of-the-art open-source LLMs on OccuGender, including Llama, Mistral, and their instruction-tuned versions. The results show that these models exhibit substantial occupational gender bias. Lastly, we discuss prompting strategies for bias mitigation and an extension of our causal formulation to illustrate the generalizability of our framework. Our code and data are available at https://github.com/chenyuen0103/gender-bias.

[53] DocTER: Evaluating Document-based Knowledge Editing

Suhang Wu, Ante Wang, Minlong Peng, Yujie Lin, Wenbo Li, Mingming Sun, Jinsong Su

Main category: cs.CL

TL;DR: The paper introduces DocTER, a benchmark for knowledge editing using documents instead of manually labeled triples, and evaluates it across four perspectives. It proposes an Extract-then-Edit pipeline and highlights challenges in document-based editing.

Details

Motivation: To address the limitations of using manually labeled factual triples for knowledge editing by leveraging easily accessible documents.

Method: Develops the DocTER benchmark and an Extract-then-Edit pipeline to adapt triplet-based methods for document-based editing.

Result: Document-based editing is more challenging, with even the best method lagging by 10 points in success compared to gold triples.

Conclusion: The study identifies key performance factors and provides insights for future research in document-based knowledge editing.

Abstract: Knowledge editing aims to correct outdated or inaccurate knowledge in neural networks. In this paper, we explore knowledge editing using easily accessible documents instead of manually labeled factual triples employed in earlier research. To advance this field, we establish the first evaluation benchmark, DocTER, featuring Documents containing counterfactual knowledge for editing. A comprehensive four-perspective evaluation is introduced: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. To adapt conventional triplet-based knowledge editing methods for this task, we develop an Extract-then-Edit pipeline that extracts triples from documents before applying existing methods. Experiments on popular knowledge editing methods demonstrate that editing with documents presents significantly greater challenges than using triples. In document-based scenarios, even the best-performing in-context editing approach still lags behind by 10 points in editing success when compared to using gold triples. This observation also holds for both reasoning and cross-lingual test sets. We further analyze key factors influencing task performance, including the quality of extracted triples, the frequency and position of edited knowledge in documents, various methods for enhancing reasoning, and performance differences across various directions in cross-lingual knowledge editing, which provide valuable insights for future research.

[54] Quantifying the Uniqueness and Divisiveness of Presidential Discourse

Karen Zhou, Alexander A. Meitus, Milo Chase, Grace Wang, Anne Mykland, William Howell, Chenhao Tan

Main category: cs.CL

TL;DR: The paper analyzes presidential speech uniqueness using language models and a divisive speech lexicon, finding Donald Trump’s speech notably distinct and more antagonistic than other presidents.

Details

Motivation: To determine if U.S. presidents speak differently from each other, especially in divisive language, and to assess these differences across communication mediums.

Method: Developed a uniqueness metric using large language models, created a lexicon for divisive speech, and analyzed presidential speech corpora.

Result: Trump’s speech is significantly more unique and divisive, especially toward opponents, compared to other recent presidents. Differences persist across contexts and are not due to general trends.

Conclusion: Trump’s speech patterns are distinctively more antagonistic and unique, highlighting a divergence in presidential communication styles.

Abstract: Do American presidents speak discernibly differently from each other? If so, in what ways? And are these differences confined to any single medium of communication? To investigate these questions, this paper introduces a novel metric of uniqueness based on large language models, develops a new lexicon for divisive speech, and presents a framework for assessing the distinctive ways in which presidents speak about their political opponents. Applying these tools to a variety of corpora of presidential speeches, we find considerable evidence that Donald Trump’s speech patterns diverge from those of all major party nominees for the presidency in recent history. Trump is significantly more distinctive than his fellow Republicans, whose uniqueness values appear closer to those of the Democrats. Contributing to these differences is Trump’s employment of divisive and antagonistic language, particularly when targeting his political opponents. These differences hold across a variety of measurement strategies, arise on both the campaign trail and in official presidential addresses, and do not appear to be an artifact of secular changes in presidential communications.

[55] Weak-to-Strong Jailbreaking on Large Language Models

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang

Main category: cs.CL

TL;DR: The paper introduces a computationally efficient weak-to-strong jailbreaking attack for LLMs, achieving high misalignment rates with minimal computational cost, and proposes a preliminary defense strategy.

Details

Motivation: Existing jailbreaking methods for LLMs are computationally expensive, and the paper aims to address this by proposing a more efficient attack method.

Method: The attack uses two smaller models (safe and unsafe) to adversarially modify a larger safe model’s decoding probabilities, requiring only one forward pass per example.

Result: The method increases misalignment rates to over 99% on two datasets, demonstrating its effectiveness.

Conclusion: The study highlights a critical safety issue in LLM alignment and proposes a preliminary defense, though advanced defenses remain a challenge.

Abstract: Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong

[56] P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

Yuhao Dan, Jie Zhou, Qin Chen, Junfeng Tian, Liang He

Main category: cs.CL

TL;DR: The paper introduces P-React, a personalized LLM incorporating Big Five personality traits, using a MoE-based approach and a Personality Specialization Loss (PSL) for nuanced personality modeling. It also presents OCEAN-Chat, a dataset for training LLMs in personality expression.

Details

Motivation: Existing LLMs focus on explicit character profiles, neglecting underlying personality traits, limiting anthropomorphic and psychologically-grounded AI development.

Method: Proposes P-React, a MoE-based LLM with PSL to model Big Five traits, and introduces OCEAN-Chat dataset for training.

Result: P-React effectively maintains consistent and realistic personality traits in LLMs.

Conclusion: The work advances personalized LLMs by integrating psychological traits, offering a more nuanced and grounded approach.

Abstract: Personalized large language models (LLMs) have attracted great attention in many applications, such as emotional support and role-playing. However, existing works primarily focus on modeling explicit character profiles, while ignoring the underlying personality traits that truly shape behaviors and decision-making, hampering the development of more anthropomorphic and psychologically-grounded AI systems. In this paper, we explore the modeling of Big Five personality traits, which is the most widely used trait theory in psychology, and propose P-React, a mixture of experts (MoE)-based personalized LLM. Particularly, we integrate a Personality Specialization Loss (PSL) to better capture individual trait expressions, providing a more nuanced and psychologically grounded personality simulacrum. To facilitate research in this field, we curate OCEAN-Chat, a high-quality, human-verified dataset designed to train LLMs in expressing personality traits across diverse topics. Extensive experiments demonstrate the effectiveness of P-React in maintaining consistent and real personality.

[57] VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

Juhwan Choi, Junehyoung Kwon, JungMin Yun, Seunguk Yu, YoungBin Kim

Main category: cs.CL

TL;DR: The paper introduces VolDoGer, a dataset for domain generalization in vision-language tasks, addressing data scarcity and evaluating model performance.

Details

Motivation: The lack of datasets for domain generalization in vision-language tasks limits research. VolDoGer aims to fill this gap.

Method: Extended LLM-based data annotation to vision-language tasks, creating VolDoGer for image captioning, VQA, and visual entailment.

Result: Evaluated domain generalizability of models, including fine-tuned and multimodal LLMs, using VolDoGer.

Conclusion: VolDoGer provides a valuable resource for advancing domain generalization research in vision-language tasks.

Abstract: Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address these challenges, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. We evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model, through VolDoGer.

Grace Proebsting, Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Danaé Metaxa, Sorelle A. Friedler

Main category: cs.CL

TL;DR: The paper examines how automated content moderation and generative AI systems disproportionately suppress identity-related speech, introducing measures and benchmarks to highlight biases.

Details

Motivation: To address the lack of attention on ensuring generative AI systems allow appropriate identity-related content, focusing on speech suppression of marginalized groups.

Method: Defines measures of speech suppression, uses short-form and generative AI datasets, and benchmarks nine identity groups across five content moderation services.

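Result: Identity-related speech is more likely to be incorrectly suppressed than other speech, and the reasons vary by identity along stereotyped lines (e.g., disability-related content is flagged for self-harm or health-related reasons, non-Christian content as violent or hateful).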

Conclusion: Urges further attention to how generative AI’s biases may impact identity-related creative content.

Abstract: Automated content moderation has long been used to help identify and filter undesired user-generated content online. But such systems have a history of incorrectly flagging content by and about marginalized identities for removal. Generative AI systems now use such filters to keep undesired generated content from being created by or shown to users. While a lot of focus has been given to making sure such systems do not produce undesired outcomes, considerably less attention has been paid to making sure appropriate text can be generated. From classrooms to Hollywood, as generative AI is increasingly used for creative or expressive text generation, whose stories will these technologies allow to be told, and whose will they suppress? In this paper, we define and introduce measures of speech suppression, focusing on speech related to different identity groups incorrectly filtered by a range of content moderation APIs. Using both short-form, user-generated datasets traditional in content moderation and longer generative AI-focused data, including two datasets we introduce in this work, we create a benchmark for measurement of speech suppression for nine identity groups. Across one traditional and four generative AI-focused automated content moderation services tested, we find that identity-related speech is more likely to be incorrectly suppressed than other speech. We find that reasons for incorrect flagging behavior vary by identity based on stereotypes and text associations, with, e.g., disability-related content more likely to be flagged for self-harm or health-related reasons while non-Christian content is more likely to be flagged as violent or hateful. As generative AI systems are increasingly used for creative work, we urge further attention to how this may impact the creation of identity-related content.
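
One way to make "incorrectly suppressed" concrete is a per-group false positive rate: the share of benign texts that a moderation service flags. The sketch below is only an illustration of that idea; the record fields `group`, `is_benign`, and `flagged` are hypothetical names, and the measures defined in the paper may differ.

```python
from collections import defaultdict

def suppression_rates(records):
    """Return {group: fraction of benign texts that the filter flagged}."""
    flagged = defaultdict(int)
    benign = defaultdict(int)
    for r in records:
        if r["is_benign"]:  # only benign texts can be *incorrectly* suppressed
            benign[r["group"]] += 1
            if r["flagged"]:
                flagged[r["group"]] += 1
    return {g: flagged[g] / benign[g] for g in benign}

# Made-up toy data, for illustration only.
records = [
    {"group": "disability", "is_benign": True, "flagged": True},
    {"group": "disability", "is_benign": True, "flagged": False},
    {"group": "none", "is_benign": True, "flagged": False},
    {"group": "none", "is_benign": True, "flagged": False},
    {"group": "none", "is_benign": False, "flagged": True},
]
rates = suppression_rates(records)
```

Comparing the rate for an identity group against the rate for text with no identity reference gives a simple relative indication of disproportionate suppression.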

[59] LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang

Main category: cs.CL

TL;DR: LIFBench and LIFEval are introduced to evaluate LLMs’ instruction-following and stability in long-context scenarios, addressing gaps in existing benchmarks.

Details

Motivation: Existing benchmarks lack focus on instruction-following in long-context scenarios and stability across diverse inputs.

Method: LIFBench includes 2,766 instructions across three long-context scenarios and eleven tasks, evaluated using LIFEval, a rubric-based automated scoring method.

Result: Experiments on 20 LLMs across six length intervals provide comprehensive performance and stability analysis.

Conclusion: LIFBench and LIFEval offer robust tools for assessing LLMs in complex, long-context settings, guiding future advancements.

Abstract: As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBench, a scalable dataset designed to evaluate LLMs’ instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This method allows for a comprehensive analysis of model performance and stability from multiple perspectives. We conduct detailed experiments on 20 prominent LLMs across six length intervals. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex and long-context settings, offering valuable insights to guide future advancements in LLM development.

[60] A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects

Qing Cheng, Zefan Zeng, Xingchen Hu, Yuehang Si, Zhong Liu

Main category: cs.CL

TL;DR: A survey on Event Causality Identification (ECI) in NLP, covering concepts, taxonomy, models (SECI/DECI), and evaluations, with future research directions.

Details

Motivation: ECI is crucial for NLP to detect causal relationships between events in texts, requiring systematic understanding and evaluation of existing models.

Method: Classifies ECI models into SECI (feature patterns, ML classifiers, deep encoding, prompt-tuning, data augmentation) and DECI (deep encoding, graph reasoning, prompt-tuning). Evaluates on benchmarks.

Result: Quantitative evaluations on four datasets assess model performance, highlighting strengths, limitations, and challenges.

Conclusion: Identifies future research directions, emphasizing advancements in multi-lingual, cross-lingual, and zero-shot ECI using LLMs.

Abstract: Event Causality Identification (ECI) has become an essential task in Natural Language Processing (NLP), focused on automatically detecting causal relationships between events within texts. This comprehensive survey systematically investigates fundamental concepts and models, developing a systematic taxonomy and critically evaluating diverse models. We begin by defining core concepts, formalizing the ECI problem, and outlining standard evaluation protocols. Our classification framework divides ECI models into two primary tasks: Sentence-level Event Causality Identification (SECI) and Document-level Event Causality Identification (DECI). For SECI, we review models employing feature pattern-based matching, machine learning classifiers, deep semantic encoding, prompt-based fine-tuning, and causal knowledge pre-training, alongside data augmentation strategies. For DECI, we focus on approaches utilizing deep semantic encoding, event graph reasoning, and prompt-based fine-tuning. Special attention is given to recent advancements in multi-lingual and cross-lingual ECI, as well as zero-shot ECI leveraging Large Language Models (LLMs). We analyze the strengths, limitations, and unresolved challenges associated with each approach. Extensive quantitative evaluations are conducted on four benchmark datasets to rigorously assess the performance of various ECI models. We conclude by discussing future research directions and highlighting opportunities to advance the field further.

[61] BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

Wonsuk Jang, Thierry Tambe

Main category: cs.CL

TL;DR: BlockDialect improves LLM efficiency by using block-wise fine-grained mixed formats and DialectFP4, achieving better accuracy and energy efficiency than existing methods.

Details

Motivation: Addressing memory and computational challenges in large language models (LLMs) by improving quantization techniques for weights and activations.

Method: Proposes BlockDialect, assigning optimal per-block number formats from DialectFP4, and a two-stage approach for online activation quantization.

Result: Achieves a 10.78% (7.48%) accuracy gain over the MXFP4 format on LLaMA3-8B (LLaMA2-7B) with lower bit usage per data, while staying within 5.45% (2.69%) of full precision.

Conclusion: BlockDialect offers an energy-efficient solution for LLM inference by focusing on data representation over scaling.

Abstract: The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
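
The per-block format selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two "dialects" in `FORMATBOOK` are invented magnitude sets (not the DialectFP4 formats), the selection criterion is plain squared error, and the names `quantize_block` and `pick_format` are hypothetical.

```python
# Hypothetical formatbook: each "dialect" lists its representable magnitudes.
# These values are illustrative and are NOT the DialectFP4 formats from the paper.
FORMATBOOK = {
    "wide":  [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
    "dense": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0],
}

def quantize_block(block, levels):
    """Scale the block so its largest magnitude maps to the largest level,
    then round every value to the nearest representable magnitude."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return list(block)
    scale = peak / levels[-1]
    out = []
    for x in block:
        mag = min(levels, key=lambda q: abs(abs(x) / scale - q))
        out.append((mag if x >= 0 else -mag) * scale)
    return out

def pick_format(block):
    """Return (format name, quantized block) with the lowest squared error."""
    best = None
    for name, levels in FORMATBOOK.items():
        q = quantize_block(block, levels)
        err = sum((a - b) ** 2 for a, b in zip(block, q))
        if best is None or err < best[0]:
            best = (err, name, q)
    return best[1], best[2]

name, quantized = pick_format([2.5, -1.0, 4.0, 0.5])
```

Here the block contains the value 2.5, which only the "dense" set can represent exactly after scaling, so that format is selected for this block.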

[62] LLM Alignment as Retriever Optimization: An Information Retrieval Perspective

Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik

Main category: cs.CL

TL;DR: The paper introduces LarPO, a direct optimization method for aligning Large Language Models (LLMs) using Information Retrieval (IR) principles, achieving significant improvements in alignment quality.

Details

Motivation: LLMs need effective alignment to ensure correctness, trustworthiness, and ethical behavior, but existing methods are complex.

Method: Proposes LarPO, a novel alignment method mapping LLM generation and reward models to IR’s retriever-reranker paradigm.

Result: LarPO achieves averaged improvements of 38.9% and 13.7% on AlpacaEval2 and MixEval-Hard, respectively.

Conclusion: Integrating IR principles into LLM alignment offers a promising direction for future research.

Abstract: Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR’s retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO’s effectiveness with 38.9% and 13.7% averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.

[63] Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen

Main category: cs.CL

TL;DR: CharacterBot simulates personas by capturing linguistic and thought patterns, using Lu Xun’s essays for training, and outperforms baselines in accuracy and comprehension.

Details

Motivation: Existing methods for persona simulation in LLMs focus on surface-level facts or dialogues, lacking deeper thought representation.

Method: CharacterBot uses four training tasks (pre-training, multiple-choice QA, generative QA, style transfer) and a CharLoRA mechanism to align with Lu Xun’s style and thoughts.

Result: CharacterBot significantly outperforms baselines in linguistic accuracy and opinion comprehension.

Conclusion: This work advances deep persona simulation in LLMs and highlights the need for ethical considerations.

Abstract: Previous approaches to persona simulation in large language models (LLMs) have typically relied on learning basic biographical information or on using limited role-play dialogue datasets to capture a character’s responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought patterns as manifested in the textual works of a character. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun’s internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, where a general linguistic style expert collaborates with other task-specific experts to better study both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope this work inspires future research on deep character persona simulation in LLMs while considering the importance of ethical standards.

[64] ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci

Main category: cs.CL

TL;DR: ExpliCa is a new dataset for evaluating LLMs in explicit causal reasoning, revealing their struggles with accuracy and confusion between temporal and causal relations.

Details

Motivation: To assess LLMs' interpretive and inferential accuracy in explicit causal reasoning tasks.

Method: Introduced ExpliCa dataset with causal and temporal relations, tested LLMs using prompting and perplexity-based metrics on seven models.

Result: Top LLMs struggle to reach 0.80 accuracy, often confusing temporal with causal relations; performance varies with linguistic order.

Conclusion: LLMs face challenges in explicit causal reasoning, with performance influenced by model size and linguistic factors.

Abstract: Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
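
A perplexity-based evaluation of this kind can be sketched as: score the sentence pair once per candidate connective and pick the connective whose full sentence has the lowest perplexity. The snippet below is a minimal sketch under assumptions: the per-token log-probabilities are made-up numbers standing in for a language model's output, and the candidate connectives and function names are illustrative rather than taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_connective(candidates):
    """candidates maps a connective to the token log-probs of the full sentence."""
    return min(candidates, key=lambda c: perplexity(candidates[c]))

# Hypothetical log-probs for one event pair joined by three different connectives.
candidates = {
    "so":      [-1.2, -0.8, -0.5, -0.9],
    "because": [-1.2, -0.8, -2.6, -1.4],
    "then":    [-1.2, -0.8, -1.1, -1.0],
}
choice = pick_connective(candidates)
```

The selected connective can then be compared with the human-rated one to compute accuracy, alongside the accuracy obtained by prompting.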

[65] Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data

Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang

Main category: cs.CL

TL;DR: Discriminative Fine-Tuning (DFT) improves supervised fine-tuning (SFT) for LLMs by adopting a discriminative approach, avoiding the need for preference optimization (PO) and outperforming SFT.

Details

Motivation: SFT's generative training objective is limited, and existing solutions like PO require costly human-labeled data or reward models. DFT aims to address these limitations.

Method: DFT introduces a discriminative probabilistic framework, optimizing the likelihood of positive answers while suppressing negative ones, and provides efficient algorithms for this.

Result: DFT outperforms SFT and performs comparably to, if not better than, SFT followed by PO, as shown in extensive experiments.

Conclusion: DFT is an effective alternative to SFT and PO, reducing reliance on costly data and improving model performance.

Abstract: Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative training objective. To address its limitations, the existing common strategy is to follow SFT with a separate phase of preference optimization (PO), which relies on either human-labeled preference data or a strong reward model to guide the learning process. In this paper, we address the limitations of SFT by exploring one of the most successful techniques in conventional supervised learning: discriminative learning. We introduce Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data or training strong reward models. Unlike SFT that employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that increases the probability of positive answers while suppressing potentially negative ones, aiming for data prediction instead of token prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s effectiveness, achieving performance better than SFT and comparable to if not better than SFT$\rightarrow$PO. The code can be found at https://github.com/Optimization-AI/DFT.
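
To illustrate the contrast between a generative and a discriminative objective, the sketch below computes a softmax-style loss that raises the score of the positive answer relative to a set of negative ones. It is a simplified illustration, not the paper's algorithm: it assumes each answer has a scalar score (for example a sequence log-likelihood), it approximates "all possible outputs" with a handful of sampled negatives, and the function name and `temperature` parameter are hypothetical.

```python
import math

def discriminative_loss(pos_score, neg_scores, temperature=1.0):
    """Negative log of the softmax probability assigned to the positive answer
    among the positive and the sampled negative answers."""
    scaled = [s / temperature for s in [pos_score] + list(neg_scores)]
    m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_norm - scaled[0]

# Hypothetical scores: the positive answer versus two negatives.
loss = discriminative_loss(-2.0, [-5.0, -6.0])
```

A purely generative loss would depend only on the positive score; here the loss also drops when the negative scores fall, which is the "suppressing negative answers" effect the summary describes.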

[66] How do language models learn facts? Dynamics, curricula and hallucinations

Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De

Main category: cs.CL

TL;DR: The paper explores how large language models learn factual knowledge, identifying three phases of learning, the impact of data distribution, and challenges in fine-tuning.

Details

Motivation: To understand the dynamics of knowledge acquisition in language models, particularly how they learn and recall facts.

Method: Investigates learning dynamics using a synthetic factual recall task, analyzing performance phases, attention mechanisms, and data distribution effects.

Result: Found three learning phases, a performance plateau, data distribution impacts, and challenges in fine-tuning due to knowledge corruption.

Conclusion: Highlights the role of data distribution in learning and suggests new strategies for efficient training.

Abstract: Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.

[67] Exploiting individual differences to bootstrap communication

Richard A. Blythe, Casimir Fisch

Main category: cs.CL

TL;DR: A model demonstrates how communication systems can emerge from non-communicative behaviors in large populations, relying on predictability and shared intentionality.

Details

Motivation: To explain how communication systems can bootstrap without pre-existing feedback mechanisms.

Method: A model showing how individual behavioral differences and shared intentionality enable the emergence of unbounded communication.

Result: Communication systems can arise without prior feedback, driven by predictability and shared psychological states.

Conclusion: Supports theories linking flexible communication systems to general social cognition capabilities.

Abstract: Establishing a communication system is hard because the intended meaning of a signal is unknown to its receiver when first produced, and the signaller also has no idea how that signal will be interpreted. Most theoretical accounts of the emergence of communication systems rely on feedback to reinforce behaviours that have led to successful communication in the past. However, providing such feedback requires already being able to communicate the meaning that was intended or interpreted. Therefore these accounts cannot explain how communication can be bootstrapped from non-communicative behaviours. Here we present a model that shows how a communication system, capable of expressing an unbounded number of meanings, can emerge as a result of individual behavioural differences in a large population without any pre-existing means to determine communicative success. The two key cognitive capabilities responsible for this outcome are behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, our results are compatible with theories in which large flexible socially-learned communication systems like language are the product of a general but well-developed capacity for social cognition.

[68] Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan

Main category: cs.CL

TL;DR: The paper introduces STL-DivEn, a diverse NL-STL dataset, and KGST, a knowledge-guided framework for NL-to-STL transformation, showing improved accuracy and diversity over existing methods.

Details

Motivation: Manual NL-to-STL transformation is time-consuming and error-prone, and the lack of datasets hinders automation.

Method: Created STL-DivEn dataset via clustering and LLM generation, then ensured diversity/accuracy with filters and validation. Proposed KGST framework for NL-to-STL transformation.

Result: STL-DivEn is more diverse than existing datasets; KGST outperforms baselines in accuracy on STL-DivEn and DeepSTL.

Conclusion: STL-DivEn and KGST address dataset scarcity and improve NL-to-STL transformation, advancing automation in cyber-physical systems.

Abstract: Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming natural language (NL) into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.
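
For readers unfamiliar with STL, an NL-STL pair might look like the following. This is an illustrative example written for this summary, not a sample from STL-DivEn; the signal names and thresholds are made up. The sentence "whenever the speed exceeds 30, the brake signal must become 1 within 2 time units" corresponds to:

```latex
\mathbf{G}\,\bigl(\mathit{speed} > 30 \;\rightarrow\; \mathbf{F}_{[0,2]}\,(\mathit{brake} = 1)\bigr)
```

Here $\mathbf{G}$ is the "always" operator and $\mathbf{F}_{[0,2]}$ is "eventually within the interval from 0 to 2 time units".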

[69] OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Main category: cs.CL

TL;DR: The paper introduces OPeRA, a dataset for evaluating LLMs’ ability to simulate user web actions by capturing personas, observations, actions, and rationales from real users.

Details

Motivation: The lack of high-quality datasets to evaluate LLMs' simulation of real user behaviors inspired the creation of OPeRA.

Method: OPeRA was collected using an online questionnaire and a custom browser plugin to gather detailed user data during online shopping.

Result: OPeRA provides the first benchmark for assessing LLMs’ prediction of user actions and rationales based on personas and history.

Conclusion: OPeRA enables future research into LLM agents as personalized digital twins for humans.

Abstract: Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating “believable” human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPeRA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

[70] Large Language Models in Argument Mining: A Survey

Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic

Main category: cs.CL

TL;DR: This survey explores how Large Language Models (LLMs) have revolutionized Argument Mining (AM), covering foundational theories, datasets, subtasks, and challenges, while proposing future research directions.

Details

Motivation: To systematically review and synthesize recent advancements in LLM-driven AM, addressing gaps in understanding and application.

Method: The survey reviews foundational theories, datasets, and LLM techniques (e.g., prompting, chain-of-thought reasoning), while analyzing architectures, evaluation practices, and challenges.

Result: A comprehensive taxonomy of AM subtasks and insights into LLM-driven advancements, alongside identified challenges like interpretability and annotation bottlenecks.

Conclusion: The paper highlights emerging trends and proposes a research agenda to guide future work in LLM-based computational argumentation.

Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.

[71] Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit, Jason Benn, Daniel Kang

Main category: cs.CL

TL;DR: RPT improves LLM reasoning but lacks consistent generalization to new domains.

Details

Motivation: To assess how well RPT's improvements generalize beyond the fine-tuning domains.

Method: Two studies: (1) Observational comparison of RPT and base models across seen/unseen domains, (2) Interventional fine-tuning on single domains and evaluation across multiple domains.

Result: RPT shows gains on similar tasks but inconsistently generalizes to domains with different reasoning patterns.

Conclusion: RPT’s benefits are domain-specific and may not reliably extend to new reasoning contexts.

Abstract: Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: We compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: We fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion: although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.

[72] Mechanistic Indicators of Understanding in Large Language Models

Pierre Beckmann, Matthieu Queloz

Main category: cs.CL

TL;DR: The paper synthesizes findings in mechanistic interpretability (MI) to argue that LLMs develop internal structures functionally analogous to understanding, proposing a three-tiered conception of understanding while highlighting differences from human cognition.

Details

Motivation: To challenge the view that LLMs rely solely on superficial statistics and to integrate MI findings into a theoretical framework for machine understanding.

Method: Proposes a three-tiered conception of understanding (conceptual, state-of-the-world, principled) based on LLM internal structures.

Result: LLMs exhibit forms of understanding but differ fundamentally from human cognition, as shown by ‘parallel mechanisms.’

Conclusion: The debate should shift from whether LLMs understand to investigating their unique cognitive processes and developing fitting conceptions.

Abstract: Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. We offer an accessible synthesis of these findings that doubles as an introduction to MI while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of understanding. First, conceptual understanding emerges when a model forms “features” as directions in latent space, learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a “circuit” connecting these facts. However, these forms of understanding remain radically different from human understanding, as the phenomenon of “parallel mechanisms” shows. We conclude that the debate should move beyond the yes-or-no question of whether LLMs understand to investigate how their strange minds work and forge conceptions that fit them.

[73] A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski

Main category: cs.CL

TL;DR: The paper analyzes the performance of large language models (LLMs) like GPT-4o, Llama, and DeepSeek in argument mining tasks, highlighting ChatGPT-4o and Deepseek-R1 as top performers but noting their errors. It also critiques existing argument datasets and suggests improvements for prompt algorithms.

Details

Motivation: The study aims to address the lack of research on LLMs' performance in publicly available argument classification databases, leveraging diverse datasets to evaluate and compare models.

Method: The study tests versions of GPT, Llama, and DeepSeek, including reasoning-enhanced variants with the Chain-of-Thoughts algorithm, on datasets like Args.me and UKP.

Result: ChatGPT-4o outperforms the other models on the argument classification benchmarks, while Deepseek-R1 performs best among the reasoning-enhanced models. Both still make errors, and the study identifies shortcomings in the datasets.

Conclusion: The work provides the first broad analysis of LLMs in argument mining, reveals weaknesses in prompt algorithms, and suggests directions for improvement, adding value through dataset critique.

Abstract: Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. Among models with built-in reasoning capabilities, Deepseek-R1 performs best. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

[74] A Survey of Deep Learning for Geometry Problem Solving

Jianzhe Ma, Wenxuan Wang, Qin Jin

Main category: cs.CL

TL;DR: A survey on deep learning applications in geometry problem solving, covering tasks, methods, evaluation metrics, challenges, and future directions.

Details

Motivation: Geometry problem solving is crucial in education and AI assessment, and recent advances in deep learning, especially multimodal models, have spurred research in this area.

Method: The paper summarizes tasks, reviews deep learning methods, analyzes evaluation metrics, and discusses challenges and future directions.

Result: Provides a comprehensive reference for deep learning in geometry problem solving, with a GitHub repository for ongoing updates.

Conclusion: Aims to advance the field by offering a practical guide and fostering further research.

Abstract: Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.

[75] FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar

Main category: cs.CL

TL;DR: FLEXITOKENS improves LM adaptability by using learnable tokenizers, reducing token over-fragmentation and enhancing downstream task performance.

Details

Motivation: Traditional LMs struggle with adapting to new data due to rigid subword tokenizers, leading to inefficient tokenization.

Method: Develops byte-level LMs with learnable tokenizers (FLEXITOKENS) to predict input byte boundaries, avoiding fixed compression rates.

Result: FLEXITOKENS reduces token over-fragmentation and improves downstream task performance by up to 10%.

Conclusion: FLEXITOKENS offers a flexible and effective solution for adaptive tokenization in LMs.

Abstract: Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
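
The segmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: in the paper the boundary predictor is a learned submodule, whereas here the per-byte boundary probabilities are hard-coded, and the names `segment_bytes`, `compression_rate`, and the 0.5 threshold are assumptions made for illustration only.

```python
from typing import List


def segment_bytes(data: bytes, boundary_probs: List[float],
                  threshold: float = 0.5) -> List[bytes]:
    """Split `data` after every byte whose boundary probability >= threshold.

    `boundary_probs` stands in for the output of a learned boundary
    predictor (hypothetical values here, one probability per byte).
    """
    if len(data) != len(boundary_probs):
        raise ValueError("need exactly one boundary probability per byte")
    segments: List[bytes] = []
    start = 0
    for i, p in enumerate(boundary_probs):
        if p >= threshold:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        # Bytes after the last predicted boundary form a final segment.
        segments.append(data[start:])
    return segments


def compression_rate(data: bytes, segments: List[bytes]) -> float:
    """Average number of bytes per segment (0.0 if there are no segments)."""
    return len(data) / len(segments) if segments else 0.0


text = "tokenizers".encode("utf-8")
# Made-up probabilities: high after the 5th and 10th byte.
probs = [0.1, 0.2, 0.1, 0.2, 0.9, 0.1, 0.2, 0.1, 0.3, 0.8]
segments = segment_bytes(text, probs)
rate = compression_rate(text, segments)
```

Because the segment lengths follow the predicted boundaries rather than a fixed vocabulary, the bytes-per-segment rate can differ from one input to another, which is the kind of flexibility the abstract contrasts with a fixed compression rate.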

[76] Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation

Ashray Gupta, Rohan Joseph, Sunny Rai

Main category: cs.CL

TL;DR: The paper introduces HATS, a Hindi Analogy Test Set, to evaluate multilingual LLMs’ reasoning in Hindi, finding English prompts yield the best performance.

Details

Motivation: To assess LLMs' reasoning in Indic languages, particularly Hindi, which remains understudied compared to English.

Method: Created HATS with 405 Hindi analogy questions, benchmarked multilingual LLMs using various prompts, and introduced a grounded Chain of Thought approach.

Result: Models performed best with English prompts, and the Chain of Thought method improved Hindi analogy performance.

Conclusion: HATS fills a gap in evaluating LLM reasoning in Hindi, highlighting the need for more diverse language benchmarks.

Abstract: Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

[77] What Makes You CLIC: Detection of Croatian Clickbait Headlines

Marija Anđelić, Dominik Šipek, Laura Majer, Jan Šnajder

Main category: cs.CL

TL;DR: The paper introduces CLIC, a Croatian dataset for clickbait detection, compares fine-tuned BERTić with LLM-based ICL methods, and finds fine-tuned models outperform general LLMs.

Details

Motivation: To address the need for clickbait detection in less-resourced languages like Croatian and compare the effectiveness of fine-tuned models versus in-context learning.

Method: Compiled CLIC dataset, fine-tuned BERTić, and compared it with LLM-based ICL methods using Croatian and English prompts.

Result: Nearly half of headlines contained clickbait; fine-tuned models performed better than general LLMs.

Conclusion: Fine-tuned models are more effective for clickbait detection in less-resourced languages like Croatian.

Abstract: Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative – commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing mainstream and fringe outlets. We fine-tune the BERTić model on this task and compare its performance to LLM-based ICL methods with prompts both in Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that fine-tuned models deliver better results than general LLMs.

[78] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Jingze Song, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang

Main category: cs.CL

TL;DR: Agentar-Fin-R1, a series of financial LLMs (8B and 32B parameters), enhances reasoning, reliability, and domain specialization for finance. It achieves state-of-the-art results on financial benchmarks such as Fineva and FinEval while also performing strongly on general reasoning tasks.

Details

Motivation: Existing LLMs lack advanced reasoning, trustworthiness, and domain adaptability for financial applications.

Method: Optimization integrates a financial task label system and trustworthiness framework, using label-guided optimization, two-stage training, and dynamic attribution.

Result: Agentar-Fin-R1 achieves state-of-the-art performance on financial and general reasoning benchmarks.

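Conclusion: The results support the model as a trustworthy solution for high-stakes financial applications; the authors also propose the Finova benchmark to assess agent-level financial reasoning and compliance verification.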

Abstract: Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.

[79] LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Zhen-Yu Lin, Pin-Cheng Chen, Shu-Kai Hsieh

Main category: cs.CL

TL;DR: LingBench++ is a benchmark for evaluating LLMs on complex linguistic tasks, featuring structured reasoning, stepwise evaluation, and multi-agent architecture for improved accuracy and interpretability.

Details

Motivation: To address the lack of benchmarks evaluating LLMs on complex linguistic tasks with structured reasoning and cultural diversity.

Method: Developed LingBench++ with reasoning traces, stepwise protocols, and metadata for 90+ languages. Introduced a multi-agent architecture for knowledge retrieval and iterative reasoning.

Result: Models with external knowledge and iterative reasoning outperformed single-pass approaches in accuracy and interpretability.

Conclusion: LingBench++ provides a foundation for linguistically grounded, culturally informed reasoning in LLMs.

Abstract: We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.

[80] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Main category: cs.CL

TL;DR: The paper introduces Efficiency Leverage (EL) to predict the computational advantage of Mixture-of-Experts (MoE) models over dense ones, revealing power-law relationships between EL, expert activation ratio, and compute budget. A scaling law is derived and validated with Ling-mini-beta, showing 7x efficiency gains.

Details

Motivation: To address the unresolved challenge of predicting MoE model capacity and computational efficiency, given the decoupling of parameters and compute cost.

Method: Conducted a large-scale empirical study with over 300 models (up to 28B parameters), analyzing MoE configurations (expert activation ratio, granularity) and deriving a scaling law for EL.

Result: EL follows predictable power laws driven by expert activation ratio and compute budget, with granularity as a non-linear modulator. Ling-mini-beta (0.85B active params) matched a 6.1B dense model’s performance while consuming over 7x fewer computational resources.

Conclusion: The work provides a principled, empirical foundation for scaling efficient MoE models, validated by practical results.

Abstract: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for the Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
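
To make the power-law claim concrete, the following Python sketch fits a law of the form y = a * x^b by linear regression in log-log space. It is only an illustration under stated assumptions: the abstract does not give the exact formula for EL, so `efficiency_leverage` assumes it is the ratio of the compute a dense model needs to the compute an MoE model needs for the same quality, and the data points are synthetic (generated with an arbitrary exponent of -0.5), not results from the paper.

```python
import math
from typing import Sequence, Tuple


def efficiency_leverage(dense_compute: float, moe_compute: float) -> float:
    """Assumed definition: dense compute / MoE compute at matched quality."""
    return dense_compute / moe_compute


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x**b via regression on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


# Synthetic activation ratios and EL values, made up for illustration only.
activation_ratios = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 32]
el_values = [r ** -0.5 for r in activation_ratios]

a, b = fit_power_law(activation_ratios, el_values)
```

On real measurements the points would not lie exactly on a line in log-log space, so the fitted exponent would come with residual error that is worth inspecting before extrapolating.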

cs.CV

[81] WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection

Haodong Zhu, Wenhao Dong, Linlin Yang, Hong Li, Yuguang Yang, Yangyang Ren, Qingcheng Zhu, Zichao Feng, Changbai Li, Shaohui Lin, Runqi Wang, Xiaoyan Luo, Baochang Zhang

Main category: cs.CV

TL;DR: WaveMamba is a cross-modality fusion method for RGB and IR imagery, using DWT and IDWT for feature integration and improved detection. It introduces WMFB for comprehensive fusion, achieving a 4.5% mAP improvement.

Details

Motivation: To leverage the complementary features of RGB and IR imagery for enhanced object detection by addressing information loss and improving fusion.

Method: Uses DWT for feature decomposition, WMFB for fusion (LMFB for low-frequency and a strategy for high-frequency features), and IDWT in the detection head.

Result: Achieves a 4.5% average mAP improvement on four benchmarks, outperforming state-of-the-art methods.

Conclusion: WaveMamba effectively integrates RGB and IR features, demonstrating superior performance in object detection.

Abstract: Leveraging the complementary characteristics of visible (RGB) and infrared (IR) imagery offers significant potential for improving object detection. In this paper, we propose WaveMamba, a cross-modality fusion method that efficiently integrates the unique and complementary frequency features of RGB and IR decomposed by Discrete Wavelet Transform (DWT). An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) is also proposed to reduce information loss and produce the final detection results. The core of our approach is the introduction of WaveMamba Fusion Block (WMFB), which facilitates comprehensive fusion across low-/high-frequency sub-bands. Within WMFB, the Low-frequency Mamba Fusion Block (LMFB), built upon the Mamba framework, first performs initial low-frequency feature fusion with channel swapping, followed by deep fusion with an advanced gated attention mechanism for enhanced integration. High-frequency features are enhanced using a strategy that applies an “absolute maximum” fusion approach. These advancements lead to significant performance gains, with our method surpassing state-of-the-art approaches and achieving average mAP improvements of 4.5% on four benchmarks.
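
The wavelet decomposition and the “absolute maximum” rule for high-frequency sub-bands can be sketched in a few lines of Python. This is a simplified illustration, not the paper's architecture: it uses a single-level 1D Haar transform on toy signals rather than 2D feature maps, and it replaces the Mamba-based low-frequency fusion (LMFB) with a plain average as a stand-in. All function names are hypothetical.

```python
import math
from typing import List, Tuple

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Single-level 1D Haar DWT: returns (low-frequency, high-frequency)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt(low: List[float], high: List[float]) -> List[float]:
    """Inverse of haar_dwt: rebuilds the signal from its two sub-bands."""
    out: List[float] = []
    for l, h in zip(low, high):
        out.append((l + h) / SQRT2)
        out.append((l - h) / SQRT2)
    return out


def abs_max_fuse(a: List[float], b: List[float]) -> List[float]:
    """Element-wise: keep the coefficient with the larger absolute value."""
    return [x if abs(x) >= abs(y) else y for x, y in zip(a, b)]


def fuse_signals(rgb: List[float], ir: List[float]) -> List[float]:
    """Fuse two signals: average the low bands, abs-max the high bands."""
    rgb_low, rgb_high = haar_dwt(rgb)
    ir_low, ir_high = haar_dwt(ir)
    # Stand-in for the learned low-frequency fusion described in the paper.
    low = [(x + y) / 2.0 for x, y in zip(rgb_low, ir_low)]
    high = abs_max_fuse(rgb_high, ir_high)
    return haar_idwt(low, high)


rgb = [4.0, 2.0, 5.0, 5.0]  # toy "RGB" signal with an edge in the first pair
ir = [1.0, 1.0, 0.0, 6.0]   # toy "IR" signal with an edge in the second pair
fused = fuse_signals(rgb, ir)
```

In this toy case each modality contributes the edge it sees most strongly: the first detail coefficient comes from the RGB signal and the second from the IR signal, so both edges survive in the reconstructed output.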

[82] Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling

Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yu Qiao, Peng Gao

Main category: cs.CV

TL;DR: Lumina-mGPT 2.0 is a stand-alone autoregressive model for high-quality image generation that matches, and in some cases surpasses, diffusion models such as DALL-E 3 and SANA while offering flexibility and multi-task capabilities.

Details

Motivation: To revitalize autoregressive models for image generation without relying on pretrained components or hybrid architectures, ensuring unrestricted design and licensing freedom.

Method: Trained from scratch with a unified tokenization scheme, enabling diverse tasks like image editing and controllable synthesis. Decoding strategies improve quality (inference-time scaling) and speed (speculative Jacobi sampling).

Result: Matches or surpasses diffusion models on benchmarks (GenEval, DPG) and excels in multi-task performance (Graph200K).

Conclusion: Lumina-mGPT 2.0 is a strong, flexible foundation model for unified multimodal generation, with released training details and code.

Abstract: We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks, including subject-driven generation, image editing, controllable synthesis, and dense prediction, within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.

[83] SV3.3B: A Sports Video Understanding Model for Action Recognition

Sai Varun Kodathala, Yashwanth Reddy Vutukoori, Rakesh Vunnam

Main category: cs.CV

TL;DR: SV3.3B is a lightweight 3.3B parameter model for automated sports video analysis, combining temporal motion sampling and self-supervised learning for on-device use. It outperforms larger models like GPT-4o in sports-specific metrics.

Details

Motivation: Current sports video analysis lacks fine-grained understanding of athletic movements and struggles with biomechanical transitions. Server-side processing is computationally intensive.

Method: Uses DWT-VGG16-LDA for keyframe extraction, V-DWT-JEPA2 encoder pretrained via mask-denoising, and an LLM decoder for action descriptions. Evaluated on NSVA basketball dataset.

Result: Achieves 29.2% improvement over GPT-4o in validation metrics, with better information density, action complexity, and precision.

Conclusion: SV3.3B offers efficient, detailed sports analysis with lower computational demands, suitable for on-device deployment.

Abstract: This paper addresses the challenge of automated sports video analysis, which has traditionally been limited by computationally intensive models requiring server-side processing and lacking fine-grained understanding of athletic movements. Current approaches struggle to capture the nuanced biomechanical transitions essential for meaningful sports analysis, often missing critical phases like preparation, execution, and follow-through that occur within seconds. To address these limitations, we introduce SV3.3B, a lightweight 3.3B parameter video understanding model that combines novel temporal motion difference sampling with self-supervised learning for efficient on-device deployment. Our approach employs a DWT-VGG16-LDA-based keyframe extraction mechanism that intelligently identifies the 16 most representative frames from sports sequences, followed by a V-DWT-JEPA2 encoder pretrained through mask-denoising objectives and an LLM decoder fine-tuned for sports action description generation. Evaluated on a subset of the NSVA basketball dataset, SV3.3B achieves superior performance across both traditional text generation metrics and sports-specific evaluation criteria, outperforming larger closed-source models including GPT-4o variants while maintaining significantly lower computational requirements. Our model demonstrates exceptional capability in generating technically detailed and analytically rich sports descriptions, achieving 29.2% improvement over GPT-4o in ground truth validation metrics, with substantial improvements in information density, action complexity, and measurement precision metrics essential for comprehensive athletic analysis. The model is available at https://huggingface.co/sportsvision/SV3.3B.

[84] 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation

Shuqing Li, Anson Y. Lam, Yun Peng, Wenxuan Wang, Michael R. Lyu

Main category: cs.CV

TL;DR: Scenethesis introduces a requirement-sensitive 3D software synthesis approach, addressing gaps in 3D UI generation by enabling fine-grained control and constraint handling.

Details

Motivation: Existing 3D software generation lacks granular control and struggles with spatial/semantic constraints, limiting practical usability.

Method: Scenethesis uses ScenethesisLang, a domain-specific language, to bridge natural language requirements and executable 3D software, enabling staged synthesis and constraint satisfaction.

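Result: Scenethesis captures over 80% of user requirements, satisfies more than 90% of hard constraints while handling over 100 constraints simultaneously, and improves BLIP-2 visual evaluation scores by 42.8% over the state-of-the-art method.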

Conclusion: Scenethesis advances 3D software generation by offering traceability, granular control, and robust constraint handling.

Abstract: Graphical user interface (UI) software has undergone a fundamental transformation from traditional two-dimensional (2D) desktop/web/mobile interfaces to spatial three-dimensional (3D) environments. While existing work has achieved remarkable success in automated 2D software generation, such as HTML/CSS and mobile app interface code synthesis, the generation of 3D software remains under-explored. Current methods for 3D software generation usually generate the 3D environments as a whole and cannot modify or control specific elements in the software. Furthermore, these methods struggle to handle the complex spatial and semantic constraints inherent in the real world. To address these challenges, we present Scenethesis, a novel requirement-sensitive 3D software synthesis approach that maintains formal traceability between user specifications and generated 3D software. Scenethesis is built upon ScenethesisLang, a domain-specific language that serves as a granular constraint-aware intermediate representation (IR) to bridge natural language requirements and executable 3D software. It serves both as a comprehensive scene description language enabling fine-grained modification of 3D software elements and as a formal constraint-expressive specification language capable of expressing complex spatial constraints. By decomposing 3D software synthesis into stages operating on ScenethesisLang, Scenethesis enables independent verification, targeted modification, and systematic constraint satisfaction. Our evaluation demonstrates that Scenethesis accurately captures over 80% of user requirements and satisfies more than 90% of hard constraints while handling over 100 constraints simultaneously. Furthermore, Scenethesis achieves a 42.8% improvement in BLIP-2 visual evaluation scores compared to the state-of-the-art method.

[85] Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

Lifeng Chen, Jiner Wang, Zihao Pan, Beier Zhu, Xiaofeng Yang, Chi Zhang

Main category: cs.CV

TL;DR: Detail++ is a training-free framework that uses Progressive Detail Injection (PDI) to improve text-to-image generation for complex prompts by decomposing them into simpler sub-prompts and generating the image in stages.

Details

Motivation: Existing text-to-image models struggle with complex prompts involving multiple subjects and distinct attributes.

Method: Decomposes prompts into sub-prompts, uses self-attention for layout control, and introduces Centroid Alignment Loss for attribute binding.

Result: Outperforms existing methods on benchmarks, especially for multiple objects and complex styles.

Conclusion: Detail++ effectively addresses limitations in complex prompt handling through staged generation and precise refinement.

Abstract: Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
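
The abstract does not give the form of the Centroid Alignment Loss, so the following is only a rough sketch of one plausible reading: treat each cross-attention map as a mass distribution, compute its centre of mass, and penalise the squared distance between the subject's centroid and the attribute's centroid. The function names and the squared-distance form are assumptions for illustration, not the authors' definition.

```python
def attention_centroid(attn):
    """Return the (row, col) centre of mass of a 2D non-negative attention map."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map has no mass")
    r = sum(i * v for i, row in enumerate(attn) for v in row) / total
    c = sum(j * v for row in attn for j, v in enumerate(row)) / total
    return r, c


def centroid_alignment_loss(subject_attn, attribute_attn):
    """Squared distance between the two centroids (hypothetical loss form)."""
    sr, sc = attention_centroid(subject_attn)
    ar, ac = attention_centroid(attribute_attn)
    return (sr - ar) ** 2 + (sc - ac) ** 2
```

Under this reading, the loss is zero when an attribute token attends to the same region as its subject and grows as the two attention maps drift apart.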

[86] SIDA: Synthetic Image Driven Zero-shot Domain Adaptation

Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim

Main category: cs.CV

TL;DR: SIDA introduces a zero-shot domain adaptation method using synthetic images to capture diverse style cues, improving efficiency and performance over text-driven approaches.

Details

Motivation: Existing text-driven zero-shot domain adaptation methods struggle with complex real-world variations and slow adaptation times.

Method: SIDA generates synthetic images via source-like images and image translation, then uses Domain Mix and Patch Style Transfer modules for style modeling.

Result: SIDA achieves state-of-the-art performance in diverse scenarios and reduces adaptation time significantly.

Conclusion: SIDA offers an efficient and effective solution for zero-shot domain adaptation by leveraging synthetic images and advanced style modeling.

Abstract: Zero-shot domain adaptation is a method for adapting a model to a target domain without utilizing target domain image data. To enable adaptation without target images, existing studies utilize CLIP’s embedding space and text description to simulate target-like style features. Despite the previous achievements in zero-shot domain adaptation, we observe that these text-driven methods struggle to capture complex real-world variations and significantly increase adaptation time due to their alignment process. Instead of relying on text descriptions, we explore solutions leveraging image data, which provides diverse and more fine-grained style cues. In this work, we propose SIDA, a novel and efficient zero-shot domain adaptation method leveraging synthetic images. To generate synthetic images, we first create detailed, source-like images and apply image translation to reflect the style of the target domain. We then utilize the style features of these synthetic images as a proxy for the target domain. Based on these features, we introduce Domain Mix and Patch Style Transfer modules, which enable effective modeling of real-world variations. In particular, Domain Mix blends multiple styles to expand the intra-domain representations, and Patch Style Transfer assigns different styles to individual patches. We demonstrate the effectiveness of our method by showing state-of-the-art performance in diverse zero-shot adaptation scenarios, particularly in challenging domains. Moreover, our approach achieves high efficiency by significantly reducing the overall adaptation time.
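
As a minimal sketch of the two modules, assume a style is summarised by the mean and standard deviation of a feature vector (an AdaIN-like convention that the abstract does not confirm). Domain Mix then becomes a weighted blend of several such styles, and Patch Style Transfer re-normalises each patch with a randomly chosen style. All function names here are hypothetical.

```python
import random
import statistics


def style_of(features):
    """Summarise a feature vector by its (mean, std); an assumed style statistic."""
    return statistics.fmean(features), statistics.pstdev(features)


def domain_mix(styles, weights):
    """Blend several (mean, std) styles with a weighted average."""
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, styles)) / total
    std = sum(w * s for w, (_, s) in zip(weights, styles)) / total
    return mean, std


def apply_style(features, style, eps=1e-6):
    """Normalise the features, then re-scale them to the target style."""
    mean, std = style_of(features)
    new_mean, new_std = style
    return [(x - mean) / (std + eps) * new_std + new_mean for x in features]


def patch_style_transfer(patches, styles, rng):
    """Give every patch its own randomly chosen style."""
    return [apply_style(patch, rng.choice(styles)) for patch in patches]
```

For example, `patch_style_transfer(patches, [style_a, style_b], random.Random(0))` would restyle each patch independently, which is one way to read "assigns different styles to individual patches".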

[87] FishDet-M: A Unified Large-Scale Benchmark for Robust Fish Detection and CLIP-Guided Model Selection in Diverse Aquatic Visual Domains

Muayad Abujabal, Lyes Saad Saoud, Irfan Hussain

Main category: cs.CV

TL;DR: FishDet-M is a unified benchmark for fish detection, harmonizing 13 datasets with COCO-style annotations. It evaluates 28 models, introduces a CLIP-based selection framework, and provides tools for underwater computer vision research.

Details

Motivation: Addressing fragmented datasets, heterogeneous imaging conditions, and inconsistent evaluation protocols in fish detection for ecological monitoring and aquaculture automation.

Method: Harmonizing 13 datasets with COCO-style annotations, benchmarking 28 models (YOLO, R-CNN, DETR), and introducing a CLIP-based model selection framework for adaptive deployment.

Result: Varying detection performance across models, trade-offs between accuracy and efficiency, and high performance of the CLIP-based selection strategy.

Conclusion: FishDet-M provides a standardized platform for evaluating fish detection in aquatic scenes, with publicly available datasets, models, and tools to advance underwater computer vision.

Abstract: Accurate fish detection in underwater imagery is essential for ecological monitoring, aquaculture automation, and robotic perception. However, practical deployment remains limited by fragmented datasets, heterogeneous imaging conditions, and inconsistent evaluation protocols. To address these gaps, we present FishDet-M, the largest unified benchmark for fish detection, comprising 13 publicly available datasets spanning diverse aquatic environments including marine, brackish, occluded, and aquarium scenes. All data are harmonized using COCO-style annotations with both bounding boxes and segmentation masks, enabling consistent and scalable cross-domain evaluation. We systematically benchmark 28 contemporary object detection models, covering the YOLOv8 to YOLOv12 series, R-CNN based detectors, and DETR based models. Evaluations are conducted using standard metrics including mAP, mAP@50, and mAP@75, along with scale-specific analyses (AP$_S$, AP$_M$, AP$_L$) and inference profiling in terms of latency and parameter count. The results highlight the varying detection performance across models trained on FishDet-M, as well as the trade-off between accuracy and efficiency across models of different architectures. To support adaptive deployment, we introduce a CLIP-based model selection framework that leverages vision-language alignment to dynamically identify the most semantically appropriate detector for each input image. This zero-shot selection strategy achieves high performance without requiring ensemble computation, offering a scalable solution for real-time applications. FishDet-M establishes a standardized and reproducible platform for evaluating object detection in complex aquatic scenes. All datasets, pretrained models, and evaluation tools are publicly available to facilitate future research in underwater computer vision and intelligent marine systems.
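
The selection step can be pictured as a nearest-neighbour lookup in a shared embedding space. The sketch below assumes each candidate detector is described by one embedding vector and picks the detector whose vector has the highest cosine similarity to the image embedding. In a real system both sides would come from CLIP encoders; here they are plain lists, and the detector names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_u * norm_v)


def select_detector(image_embedding, detector_embeddings):
    """Return the name of the detector whose embedding best matches the image.

    detector_embeddings maps a (hypothetical) detector name to its embedding.
    """
    return max(
        detector_embeddings,
        key=lambda name: cosine(image_embedding, detector_embeddings[name]),
    )
```

Only the selected detector then runs on the image, which is consistent with the abstract's point that no ensemble computation is required.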

[88] Towards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based Image Synthesis

Ko Watanabe, Stanislav Frolov, Adriano Lucieri, Andreas Dengel

Main category: cs.CV

TL;DR: The paper explores using Generative AI (LightningDiT) to assess fairness in melanoma classifiers, highlighting challenges when training and evaluation datasets differ.

Details

Motivation: To address biases in deep learning for skin cancer screening by ensuring fairness across diverse groups (sex, age, race).

Method: Leverages the GenAI LightningDiT model to evaluate fairness using synthetic data.

Result: Fairness assessment with synthetic data is promising but challenging when training and evaluation datasets mismatch.

Conclusion: Synthetic data offers a new way to improve fairness in medical-imaging AI systems.

Abstract: Recent advancements in Deep Learning and its application on the edge hold great potential for the revolution of routine screenings for skin cancers like Melanoma. Along with the anticipated benefits of this technology, potential dangers arise from unforeseen and inherent biases. Thus, assessing and improving the fairness of such systems is of utmost importance. A key challenge in fairness assessment is to ensure that the evaluation dataset is sufficiently representative of different Personally Identifiable Information (PII) (sex, age, and race) and other minority groups. Against the backdrop of this challenge, this study leverages the state-of-the-art Generative AI (GenAI) LightningDiT model to assess the fairness of publicly available melanoma classifiers. The results suggest that fairness assessment using highly realistic synthetic data is a promising direction. Yet, our findings indicate that verifying fairness becomes difficult when the melanoma-detection model used for evaluation is trained on data that differ from the dataset underpinning the synthetic images. Nonetheless, we propose that our approach offers a valuable new avenue for employing synthetic data to gauge and enhance fairness in medical-imaging GenAI systems.

[89] DiNAT-IR: Exploring Dilated Neighborhood Attention for High-Quality Image Restoration

Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu

Main category: cs.CV

TL;DR: DiNAT-IR pairs Dilated Neighborhood Attention (DiNA) with a channel-aware module for image restoration. It targets two gaps: Restormer’s channel-wise self-attention may miss localized artifacts, and DiNA alone lacks the global context needed for deblurring.

Details

Motivation: Address the limitations of existing Transformer-based methods in image restoration, particularly the trade-off between computational efficiency and quality, and the need for better global-local context integration.

Method: Proposes DiNAT-IR, combining Dilated Neighborhood Attention (DiNA) with a channel-aware module to enhance global context understanding while maintaining local precision.

Result: DiNAT-IR achieves competitive performance across multiple benchmarks, offering high-quality solutions for low-level vision tasks.

Conclusion: DiNAT-IR effectively bridges the gap between global and local context in image restoration, providing a scalable and high-quality solution.

Abstract: Transformers, with their self-attention mechanisms for modeling long-range dependencies, have become a dominant paradigm in image restoration tasks. However, the high computational cost of self-attention limits scalability to high-resolution images, making efficiency-quality trade-offs a key research focus. To address this, Restormer employs channel-wise self-attention, which computes attention across channels instead of spatial dimensions. While effective, this approach may overlook localized artifacts that are crucial for high-quality image restoration. To bridge this gap, we explore Dilated Neighborhood Attention (DiNA) as a promising alternative, inspired by its success in high-level vision tasks. DiNA balances global context and local precision by integrating sliding-window attention with mixed dilation factors, effectively expanding the receptive field without excessive overhead. However, our preliminary experiments indicate that directly applying this global-local design to the classic deblurring task hinders accurate visual restoration, primarily due to the constrained global context understanding within local attention. To address this, we introduce a channel-aware module that complements local attention, effectively integrating global context without sacrificing pixel-level precision. The proposed DiNAT-IR, a Transformer-based architecture specifically designed for image restoration, achieves competitive results across multiple benchmarks, offering a high-quality solution for diverse low-level computer vision problems.
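
To make the "sliding-window attention with mixed dilation factors" idea concrete, here is a small 1D sketch of which positions a token attends to for a given kernel size and dilation. Border handling is simplified (out-of-range neighbours are dropped), which may differ from how actual neighborhood-attention implementations treat edges. The function names are illustrative.

```python
def dilated_neighbors(i, length, kernel_size=3, dilation=1):
    """Indices attended to by position i in a 1D sequence of the given length."""
    half = kernel_size // 2
    candidates = [i + dilation * offset for offset in range(-half, half + 1)]
    # Simplification: drop neighbours that fall outside the sequence.
    return [j for j in candidates if 0 <= j < length]


def receptive_span(kernel_size, dilation):
    """Number of positions covered from the first to the last neighbour."""
    return (kernel_size - 1) * dilation + 1
```

With `kernel_size=7`, moving from `dilation=1` to `dilation=4` widens the span from 7 to 25 positions while the number of attended tokens stays at 7, which is how dilation expands the receptive field without extra attention cost.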

[90] Improving Bird Classification with Primary Color Additives

Ezhini Rasendiran R, Chandresh Kumar Maurya

Main category: cs.CV

TL;DR: The paper proposes a method to classify bird species using their songs by embedding frequency information into spectrograms with color additives, improving accuracy over existing models.

Details

Motivation: Existing models struggle with bird song classification due to noise, overlapping vocalizations, and missing labels, especially in low-SNR or multi-species recordings.

Method: The approach visualizes bird songs using pitch, speed, and repetition (motifs) and enhances spectrograms with primary color additives to embed frequency information, aiding species distinction.

Result: The method outperforms the BirdCLEF 2024 winner, improving F1 by 7.3%, ROC-AUC by 6.2%, and CMAP by 6.6%.

Conclusion: Incorporating frequency information via colorization effectively enhances bird species classification accuracy.

Abstract: We address the problem of classifying bird species using their song recordings, a challenging task due to environmental noise, overlapping vocalizations, and missing labels. Existing models struggle with low-SNR or multi-species recordings. We hypothesize that birds can be classified by visualizing their pitch pattern, speed, and repetition, collectively called motifs. Deep learning models applied to spectrogram images help, but similar motifs across species cause confusion. To mitigate this, we embed frequency information into spectrograms using primary color additives. This enhances species distinction and improves classification accuracy. Our experiments show that the proposed approach achieves statistically significant gains over models without colorization and surpasses the BirdCLEF 2024 winner, improving F1 by 7.3%, ROC-AUC by 6.2%, and CMAP by 6.6%. These results demonstrate the effectiveness of incorporating frequency information via colorization.
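
The abstract does not say how the colour additives are assigned, so the sketch below shows just one possible scheme: split the frequency axis into three bands and write each bin's magnitude into the red, green, or blue channel depending on its band. The band-to-colour mapping and the function name are assumptions.

```python
def colorize_spectrogram(spec):
    """Turn a grayscale spectrogram into RGB with band-dependent colour.

    spec is a list of rows, one per frequency bin (low to high), with
    magnitudes in [0, 1]. Low bins go to red, middle bins to green, and
    high bins to blue (an assumed mapping).
    """
    n_bins = len(spec)
    rgb = []
    for b, row in enumerate(spec):
        band = min(3 * b // n_bins, 2)  # 0 = low, 1 = mid, 2 = high
        rgb.append([tuple(v if ch == band else 0.0 for ch in range(3)) for v in row])
    return rgb
```

Two motifs with the same shape but different pitch ranges then differ in colour as well as vertical position, giving an image classifier an extra cue.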

[91] AFRDA: Attentive Feature Refinement for Domain Adaptive Semantic Segmentation

Md. Al-Masrur Khan, Durgakant Pushp, Lantao Liu

Main category: cs.CV

TL;DR: The paper introduces the Adaptive Feature Refinement (AFR) module to improve UDA-SS by balancing local and global information, enhancing segmentation accuracy, and integrating high-frequency components for better object delineation.

Details

Motivation: Existing UDA-SS methods struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions.

Method: The AFR module refines high-resolution features using semantic priors from low-resolution logits, integrates high-frequency components, and adaptively balances local and global information through uncertainty-driven attention.

Result: AFR improves UDA-SS methods by 1.05% mIoU on GTA V → Cityscapes and 1.04% mIoU on Synthia → Cityscapes.

Conclusion: The lightweight AFR module enhances segmentation performance and can be seamlessly integrated into HRDA-based UDA methods.

Abstract: In Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real-world images) without access to target annotations. Existing UDA-SS methods often struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining high-resolution features using semantic priors from low-resolution logits. AFR also integrates high-frequency components, which capture fine-grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertainty-driven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA-based UDA methods, leading to state-of-the-art segmentation performance. Our approach improves existing UDA-SS methods by 1.05% mIoU on GTA V → Cityscapes and 1.04% mIoU on Synthia → Cityscapes. The implementation of our framework is available at: https://github.com/Masrur02/AFRDA
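
One simple way to picture "uncertainty-driven" balancing is to measure uncertainty as the normalised entropy of the softmax over the low-resolution logits, then use it as a per-pixel mixing weight. The sketch below does this for a single pixel. The entropy measure, the blend direction (more local detail where the coarse prediction is uncertain), and the names are assumptions; the actual AFR module is in the linked repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(logits):
    """Entropy of the softmax, scaled to [0, 1]; needs at least two classes."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(logits))


def blend(local_feat, global_feat, logits):
    """Mix local and global features using uncertainty as the local weight."""
    u = normalized_entropy(logits)
    return [u * l + (1 - u) * g for l, g in zip(local_feat, global_feat)]
```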

[92] Residual Prior-driven Frequency-aware Network for Image Fusion

Guan Zheng, Xue Wang, Wenhua Qian, Peng Liu, Runzhuo Ma

Main category: cs.CV

TL;DR: RPFNet is a Residual Prior-driven Frequency-aware Network for image fusion, addressing computational costs and lack of ground-truth by using residual priors and frequency-domain modeling.

Details

Motivation: Image fusion lacks efficient global feature modeling and struggles with complementary feature capture due to computational costs and absence of ground-truth.

Method: RPFNet uses a dual-branch framework: Residual Prior Module (RPM) for complementary priors and Frequency Domain Fusion Module (FDFM) for efficient global modeling. Cross Promotion Module (CPM) enhances local-global synergy. Training includes auxiliary decoder, saliency structure loss, and frequency contrastive loss.

Result: RPFNet effectively integrates features, enhances textures and salient objects, and improves high-level vision task performance.

Conclusion: RPFNet successfully addresses fusion challenges, offering a robust solution for integrating complementary information across modalities.

Abstract: Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground-truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed RPFNet. Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model’s sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and can facilitate the deployment of high-level vision tasks.
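
Two of the ingredients above have simple textbook forms that can be sketched in one dimension. A residual map is the element-wise difference between two modalities, and a frequency-domain filter multiplies a signal's spectrum by per-frequency weights so that every output sample depends on every input sample. The naive DFT below is for illustration only and is not RPFNet's actual RPM or FDFM; the function names are hypothetical.

```python
import cmath


def residual_map(a, b):
    """Element-wise difference between two modalities (e.g. infrared minus visible)."""
    return [x - y for x, y in zip(a, b)]


def dft(x):
    """Naive discrete Fourier transform, O(n^2)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def frequency_filter(signal, weights):
    """Scale each frequency component by a weight and transform back.

    Only the real part is returned, so weights should keep the result real.
    """
    spectrum = dft(signal)
    filtered = [s * w for s, w in zip(spectrum, weights)]
    return [v.real for v in idft(filtered)]
```

Keeping only the zero-frequency weight, as in `frequency_filter(signal, [1, 0, 0, 0])`, replaces every sample with the global mean, a small demonstration that a single pointwise operation in the frequency domain has a global footprint in the spatial domain.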

[93] OPEN: A Benchmark Dataset and Baseline for Older Adult Patient Engagement Recognition in Virtual Rehabilitation Learning Environments

Ali Abedi, Sadaf Safa, Tracey J. F. Colella, Shehroz S. Khan

Main category: cs.CV

TL;DR: The paper introduces OPEN, a dataset for AI-driven engagement recognition in older adults during virtual group learning; baseline machine learning and deep learning models reach up to 81% accuracy.

Details

Motivation: Accurate engagement measurement in virtual group settings, especially for older adults, is challenging but crucial for satisfaction and performance in online education and rehabilitation.

Method: The OPEN dataset was collected from 11 older adults in weekly virtual cardiac rehabilitation sessions over six weeks, featuring facial, hand, and body joint landmarks, along with affective and behavioral features.

Result: Machine learning models trained on OPEN achieved up to 81% accuracy in engagement recognition.

Conclusion: OPEN provides a scalable foundation for personalized engagement modeling in aging populations and advances engagement recognition research.

Abstract: Engagement in virtual learning is essential for participant satisfaction, performance, and adherence, particularly in online education and virtual rehabilitation, where interactive communication plays a key role. Yet, accurately measuring engagement in virtual group settings remains a challenge. There is increasing interest in using artificial intelligence (AI) for large-scale, real-world, automated engagement recognition. While engagement has been widely studied in younger academic populations, research and datasets focused on older adults in virtual and telehealth learning settings remain limited. Existing methods often neglect contextual relevance and the longitudinal nature of engagement across sessions. This paper introduces OPEN (Older adult Patient ENgagement), a novel dataset supporting AI-driven engagement recognition. It was collected from eleven older adults participating in weekly virtual group learning sessions over six weeks as part of cardiac rehabilitation, producing over 35 hours of data, making it the largest dataset of its kind. To protect privacy, raw video is withheld; instead, the released data include facial, hand, and body joint landmarks, along with affective and behavioral features extracted from video. Annotations include binary engagement states, affective and behavioral labels, and context-type indicators, such as whether the instructor addressed the group or an individual. The dataset offers versions with 5-, 10-, 30-second, and variable-length samples. To demonstrate utility, multiple machine learning and deep learning models were trained, achieving engagement recognition accuracy of up to 81 percent. OPEN provides a scalable foundation for personalized engagement modeling in aging populations and contributes to broader engagement recognition research.

[94] Bearded Dragon Activity Recognition Pipeline: An AI-Based Approach to Behavioural Monitoring

Arsen Yermukan, Pedro Machado, Feliciano Domingos, Isibor Kennedy Ihianle, Jordan J. Bird, Stefano S. K. Kaburu, Samantha J. Ward

Main category: cs.CV

TL;DR: An automated system using YOLO models for real-time video analysis of bearded dragon behaviors (basking and hunting) was developed, with YOLOv8s performing best. Basking detection was reliable, but hunting detection was less accurate due to poor cricket detection. Future work aims to improve cricket detection.

Details

Motivation: Traditional monitoring of bearded dragon behavior is time-consuming and error-prone, necessitating an automated solution.

Method: Five YOLO variants were trained on a custom dataset (1200 images). YOLOv8s was selected for its accuracy-speed balance. The system processes video by extracting object coordinates, applying temporal interpolation, and using rule-based logic for behavior classification.

Result: YOLOv8s offered the best accuracy-speed balance (mAP@0.5:0.95 = 0.855). Basking detection was reliable, but hunting detection was less accurate due to weak cricket detection (mAP@0.5 = 0.392).

Conclusion: The system provides a scalable solution for reptile behavior monitoring, improving research efficiency. Future work will focus on enhancing cricket detection.

Abstract: Traditional monitoring of bearded dragon (Pogona vitticeps) behaviour is time-consuming and prone to errors. This project introduces an automated system for real-time video analysis, using You Only Look Once (YOLO) object detection models to identify two key behaviours: basking and hunting. We trained five YOLO variants (v5, v7, v8, v11, v12) on a custom, publicly available dataset of 1200 images, encompassing bearded dragons (600), heating lamps (500), and crickets (100). YOLOv8s was selected as the optimal model due to its superior balance of accuracy (mAP@0.5:0.95 = 0.855) and speed. The system processes video footage by extracting per-frame object coordinates, applying temporal interpolation for continuity, and using rule-based logic to classify specific behaviours. Basking detection proved reliable. However, hunting detection was less accurate, primarily due to weak cricket detection (mAP@0.5 = 0.392). Future improvements will focus on enhancing cricket detection through expanded datasets or specialised small-object detectors. This automated system offers a scalable solution for monitoring reptile behaviour in controlled environments, significantly improving research efficiency and data quality.
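
The post-detection stage described above (per-frame coordinates, temporal interpolation, rule-based logic) might look roughly like the following. The sketch fills frames with missed detections by linear interpolation between the nearest known positions, then applies a distance rule for basking. The rule and its radius are hypothetical; the abstract does not give the actual rules or thresholds.

```python
import math


def interpolate_track(centers):
    """Fill None gaps between detections by linear interpolation.

    centers holds one (x, y) tuple per frame, or None where the detector
    missed the object. Leading and trailing gaps are left as None.
    """
    filled = list(centers)
    known = [i for i, c in enumerate(centers) if c is not None]
    for a, b in zip(known, known[1:]):
        (xa, ya), (xb, yb) = centers[a], centers[b]
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = (xa + t * (xb - xa), ya + t * (yb - ya))
    return filled


def is_basking(dragon_center, lamp_center, radius):
    """Hypothetical rule: the dragon is basking if it is within radius of the lamp."""
    if dragon_center is None or lamp_center is None:
        return False
    return math.dist(dragon_center, lamp_center) <= radius
```

A hunting rule could be built the same way from dragon and cricket positions, which makes clear why weak cricket detection directly limits hunting accuracy.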

[95] AG-VPReID.VIR: Bridging Aerial and Ground Platforms for Video-based Visible-Infrared Person Re-ID

Huy Nguyen, Kien Nguyen, Akila Pemasiri, Akmal Jahan, Clinton Fookes, Sridha Sridharan

Main category: cs.CV

TL;DR: The paper introduces AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset, and proposes TCC-VPReID, a three-stream architecture to tackle cross-platform and cross-modality challenges.

Details

Motivation: Existing datasets focus on ground-level perspectives, which suffer from occlusions and limited coverage. Aerial perspectives can solve these issues, but no dataset exists for this context.

Method: The authors introduce AG-VPReID.VIR dataset with 1,837 identities and propose TCC-VPReID, a three-stream architecture using style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling.

Result: AG-VPReID.VIR presents unique challenges, and TCC-VPReID achieves significant performance gains across evaluation protocols.

Conclusion: The dataset and framework address limitations of ground-based systems, offering a robust solution for aerial-ground cross-modality person Re-ID.

Abstract: Person re-identification (Re-ID) across visible and infrared modalities is crucial for 24-hour surveillance systems, but existing datasets primarily focus on ground-level perspectives. While ground-based IR systems offer nighttime capabilities, they suffer from occlusions, limited coverage, and vulnerability to obstructions, problems that aerial perspectives uniquely solve. To address these limitations, we introduce AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset. This dataset captures 1,837 identities across 4,861 tracklets (124,855 frames) using both UAV-mounted and fixed CCTV cameras in RGB and infrared modalities. AG-VPReID.VIR presents unique challenges including cross-viewpoint variations, modality discrepancies, and temporal dynamics. Additionally, we propose TCC-VPReID, a novel three-stream architecture designed to address the joint challenges of cross-platform and cross-modality person Re-ID. Our approach bridges the domain gaps between aerial-ground perspectives and RGB-IR modalities, through style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling. Experiments show that AG-VPReID.VIR presents distinctive challenges compared to existing datasets, with our TCC-VPReID framework achieving significant performance gains across multiple evaluation protocols. Dataset and code are available at https://github.com/agvpreid25/AG-VPReID.VIR.

[96] Exploring the interplay of label bias with subgroup size and separability: A case study in mammographic density classification

Emma A. M. Stanley, Raghav Mehta, Mélanie Roschewitz, Nils D. Forkert, Ben Glocker

Main category: cs.CV

TL;DR: The study explores how label bias in medical imaging datasets impacts deep learning models, focusing on subgroup size and separability. Findings show feature representation shifts and performance differences based on validation set labels.

Details

Motivation: To address the understudied issue of label bias in medical AI systems and its impact on fairness, particularly for specific subgroups in medical imaging datasets.

Method: Trained deep learning models for binary tissue density classification using the EMBED dataset, simulating label bias in separable and non-separable subgroups. Analyzed feature shifts and performance metrics.

Result: Label bias caused shifts in learned features, influenced by subgroup size and separability. Performance varied significantly based on validation set label quality (e.g., true positive rate dropped from 0.898 to 0.518 for biased labels).

Conclusion: The study highlights the critical impact of label bias on subgroup fairness in medical AI, emphasizing the need for clean validation labels to mitigate performance disparities.

Abstract: Systematic mislabelling affecting specific subgroups (i.e., label bias) in medical imaging datasets represents an understudied issue concerning the fairness of medical AI systems. In this work, we investigated how size and separability of subgroups affected by label bias influence the learned features and performance of a deep learning model. Therefore, we trained deep learning models for binary tissue density classification using the EMory BrEast imaging Dataset (EMBED), where label bias affected separable subgroups (based on imaging manufacturer) or non-separable “pseudo-subgroups”. We found that simulated subgroup label bias led to prominent shifts in the learned feature representations of the models. Importantly, these shifts within the feature space were dependent on both the relative size and the separability of the subgroup affected by label bias. We also observed notable differences in subgroup performance depending on whether a validation set with clean labels was used to define the classification threshold for the model. For instance, with label bias affecting the majority separable subgroup, the true positive rate for that subgroup fell from 0.898, when the validation set had clean labels, to 0.518, when the validation set had biased labels. Our work represents a key contribution toward understanding the consequences of label bias on subgroup fairness in medical imaging AI.
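
The threshold effect can be illustrated with a toy example. The sketch below picks a classification threshold on a validation set by maximising TPR + TNR (an assumed criterion, since the abstract does not state how the threshold was defined) and then measures the true positive rate against clean labels. The function names are illustrative.

```python
def tpr(scores, labels, threshold):
    """Fraction of positives (label 1) scored at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("no positive labels")
    return sum(s >= threshold for s in positives) / len(positives)


def tnr(scores, labels, threshold):
    """Fraction of negatives (label 0) scored below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("no negative labels")
    return sum(s < threshold for s in negatives) / len(negatives)


def pick_threshold(scores, labels):
    """Choose the score that maximises TPR + TNR on the given labels."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: tpr(scores, labels, t) + tnr(scores, labels, t))
```

With scores `[0.1, 0.2, 0.6, 0.7, 0.9]` and clean labels `[0, 0, 1, 1, 1]`, the chosen threshold is 0.6 and the TPR is 1.0. If two positives are mislabelled as negative in the validation set (`[0, 0, 0, 0, 1]`), the threshold moves to 0.9 and the TPR against the clean labels drops to one third, the same kind of drop the paper reports at scale.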

[97] Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold

Jaeho Shin, Hyeonjae Gil, Junwoo Jang, Maani Ghaffari, Ayoung Kim

Main category: cs.CV

TL;DR: The paper introduces an optimizable cost function for measuring distances between Grassmannian features, enabling global optimization in registration problems.

Details

Motivation: Existing methods lack an explicit, optimizable distance function for Grassmannian features, limiting their use in registration tasks.

Method: Derives an optimizable cost function using high-dimensional subspace bases and integrates it with a BnB solver for registration.

Result: The proposed method improves convergence and outperforms existing solutions in computer vision tasks.

Conclusion: The work advances Grassmannian-based registration by providing a globally optimal solution and representation-agnostic distance minimization.

Abstract: The affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing branch-and-bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code is available at https://github.com/joomeok/GrassmannRegistration.
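
For intuition about "agnostic to representation ambiguity", consider the simplest case: one-dimensional linear subspaces (lines through the origin). The geodesic distance on the Grassmann manifold is then the principal angle between the lines, which does not change when a direction vector is scaled or flipped. The sketch below covers only this special case; the general case uses principal angles from a singular value decomposition, and the paper's affine embedding and transformation-dependent cost are not reproduced here.

```python
import math


def line_geodesic_distance(u, v):
    """Principal angle between the lines spanned by u and v (both non-zero).

    The absolute value of the dot product makes the result independent of
    the sign and scale of either direction vector.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, abs(dot) / (norm_u * norm_v))  # clamp rounding error
    return math.acos(cosine)
```

A vector-parameter cost such as the Euclidean distance between `u` and `v` would report a large error for `u = (1, 0, 0)` and `v = (-1, 0, 0)` even though both describe the same line, whereas the geodesic distance is zero.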

[98] GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

Jake R. Patock, Nicole Catherine Lewis, Kevin McCoy, Christina Gomez, Canling Chen, Lorenzo Luzi

Main category: cs.CV

TL;DR: GRR-CoCa enhances the CoCa model with LLM-inspired architectural improvements, outperforming the baseline in pretraining and fine-tuning tasks.

Details

Motivation: Current multimodal models lag behind LLMs in architectural sophistication, despite strong performance.

Method: Incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into CoCa’s textual decoders and ViT encoder.

Result: GRR-CoCa outperformed Baseline CoCa by significant margins in pretraining (e.g., 27.25% in contrastive loss) and fine-tuning (e.g., 13.66% in contrastive loss).

Conclusion: GRR-CoCa’s architecture improves performance and generalization in vision-language tasks.

Abstract: State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performances, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa’s original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models on contrastive and generative tasks. Our GRR-CoCa significantly outperformed Baseline CoCa on the pretraining dataset and three diverse fine-tuning datasets. Pretraining improvements were 27.25% in contrastive loss, 3.71% in perplexity, and 7.15% in CoCa loss. The average fine-tuning improvements were 13.66% in contrastive loss, 5.18% in perplexity, and 5.55% in CoCa loss. We show that GRR-CoCa’s modified architecture improves performance and generalization across vision-language domains.
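For readers unfamiliar with two of the three components named above, the following is a minimal sketch of root mean squared normalization and a Gaussian error gated linear unit on plain Python lists. It follows the commonly used definitions, not the GRR-CoCa code. The function names and the `eps` default are assumptions, and the learned linear projections that normally produce the `gate` and `value` inputs are left out.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x so its root mean square is about 1, then apply a per-element gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gain, x)]


def gelu(v):
    """Exact GELU: v times the standard normal CDF evaluated at v."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def geglu(gate, value):
    """Gated unit: GELU(gate) multiplied element-wise with value.

    Assumption: gate and value are the outputs of two separate linear
    projections of the same input, which are omitted from this sketch.
    """
    return [gelu(g) * v for g, v in zip(gate, value)]
```

Unlike layer normalization, `rms_norm` does not subtract the mean, so it needs only one statistic per vector.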

[99] Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics

Yuezun Li, Delong Zhu, Xinjie Cui, Siwei Lyu

Main category: cs.CV

TL;DR: The paper introduces Celeb-DF++, a large-scale DeepFake dataset with diverse forgery types to address the challenge of generalizable forensics. It evaluates 24 detection methods, revealing their limitations.

Details

Motivation: The need for datasets with diverse DeepFake types to develop generalizable detection methods, as existing datasets lack variety.

Method: Creation of Celeb-DF++, covering three forgery scenarios (Face-swap, Face-reenactment, Talking-face) using 22 DeepFake methods. Evaluation of 24 detection methods.

Result: Celeb-DF++ highlights the limitations of current detection methods and the challenge of generalizable forensics.

Conclusion: Celeb-DF++ serves as a benchmark for advancing generalizable DeepFake detection, emphasizing the need for diverse datasets.

Abstract: The rapid advancement of AI technologies has significantly increased the diversity of DeepFake videos circulating online, posing a pressing challenge for generalizable forensics, i.e., detecting a wide range of unseen DeepFake types using a single model. Addressing this challenge requires datasets that are not only large-scale but also rich in forgery diversity. However, most existing datasets, despite their scale, include only a limited variety of forgery types, making them insufficient for developing generalizable detection methods. Therefore, we build upon our earlier Celeb-DF dataset and introduce Celeb-DF++, a new large-scale and challenging video DeepFake benchmark dedicated to the generalizable forensics challenge. Celeb-DF++ covers three commonly encountered forgery scenarios: Face-swap (FS), Face-reenactment (FR), and Talking-face (TF). Each scenario contains a substantial number of high-quality forged videos, generated using a total of 22 recent DeepFake methods. These methods differ in terms of architectures, generation pipelines, and targeted facial regions, covering the most prevalent DeepFake cases witnessed in the wild. We also introduce evaluation protocols for measuring the generalizability of 24 recent detection methods, highlighting the limitations of existing detection methods and the difficulty of our new dataset.

[100] Degradation-Consistent Learning via Bidirectional Diffusion for Low-Light Image Enhancement

Jinhong He, Minglong Xue, Zhipu Liu, Mingliang Zhou, Aoxiang Ning, Palaiahnakote Shivakumara

Main category: cs.CV

TL;DR: Proposes a bidirectional diffusion optimization mechanism for low-light image enhancement, improving degradation modeling and generation quality.

Details

Motivation: Addresses limitations of unidirectional diffusion models in capturing real-world degradation patterns, leading to structural inconsistencies and pixel misalignments.

Method: Introduces bidirectional diffusion (low-to-normal and normal-to-low light) and an adaptive feature interaction block (AFI) for refined feature representation. Includes a reflection-aware correction module (RACM) for color restoration.

Result: Outperforms state-of-the-art methods in quantitative and qualitative evaluations, generalizing well to diverse degradation scenarios.

Conclusion: The bidirectional approach enhances degradation learning and generation quality, aligning better with human visual perception.

Abstract: Low-light image enhancement aims to improve the visibility of degraded images to better align with human visual perception. Diffusion-based methods have shown promising performance due to their strong generative capabilities. However, their unidirectional modelling of degradation often struggles to capture the complexity of real-world degradation patterns, leading to structural inconsistencies and pixel misalignments. To address these challenges, we propose a bidirectional diffusion optimization mechanism that jointly models the degradation processes of both low-light and normal-light images, enabling more precise degradation parameter matching and enhancing generation quality. Specifically, we perform bidirectional diffusion (from low-to-normal light and from normal-to-low light) during training and introduce an adaptive feature interaction block (AFI) to refine feature representation. By leveraging the complementarity between these two paths, our approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, facilitating consistent degradation learning and improving the model’s ability to perceive illumination and detail degradation. Additionally, we design a reflection-aware correction module (RACM) to guide color restoration post-denoising and suppress overexposed regions, ensuring content consistency and generating high-quality images that align with human visual perception. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art methods in both quantitative and qualitative evaluations while generalizing effectively to diverse degradation scenarios. Code at https://github.com/hejh8/BidDiff

[101] High-fidelity 3D Gaussian Inpainting: preserving multi-view consistency and photorealistic details

Jun Zhou, Dinghao Li, Nannan Li, Mingjie Wang

Main category: cs.CV

TL;DR: A novel 3D Gaussian inpainting framework improves 3D scene reconstruction by refining masks and optimizing uncertainty-guided regions, outperforming existing methods in quality and consistency.

Details

Motivation: Inpainting 3D scenes is challenging due to irregular structures and multi-view consistency needs, which current methods struggle to address.

Method: The framework uses a Mask Refinement Process and Uncertainty-guided Optimization to refine inpainted views and enhance detail fidelity.

Result: Experiments show superior visual quality and view consistency compared to state-of-the-art methods.

Conclusion: The proposed framework effectively addresses 3D inpainting challenges, offering improved accuracy and consistency.

Abstract: Recent advancements in multi-view 3D reconstruction and novel-view synthesis, particularly through Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have greatly enhanced the fidelity and efficiency of 3D content creation. However, inpainting 3D scenes remains a challenging task due to the inherent irregularity of 3D structures and the critical need for maintaining multi-view consistency. In this work, we propose a novel 3D Gaussian inpainting framework that reconstructs complete 3D scenes by leveraging sparse inpainted views. Our framework incorporates an automatic Mask Refinement Process and region-wise Uncertainty-guided Optimization. Specifically, we refine the inpainting mask using a series of operations, including Gaussian scene filtering and back-projection, enabling more accurate localization of occluded regions and realistic boundary restoration. Furthermore, our Uncertainty-guided Fine-grained Optimization strategy, which estimates the importance of each region across multi-view images during training, alleviates multi-view inconsistencies and enhances the fidelity of fine details in the inpainted results. Comprehensive experiments conducted on diverse datasets demonstrate that our approach outperforms existing state-of-the-art methods in both visual quality and view consistency.

[102] Synthetic Data Augmentation for Enhanced Chicken Carcass Instance Segmentation

Yihong Feng, Chaitanya Pallerla, Xiaomin Lin, Pouya Sohrabipour Sr, Philip Crandall, Wan Shou, Yu She, Dongyi Wang

Main category: cs.CV

TL;DR: The paper introduces a pipeline for generating synthetic, labeled images of chicken carcasses and a benchmark dataset to improve instance segmentation in poultry processing. Synthetic data significantly enhances segmentation performance.

Details

Motivation: Automated detection of chicken carcasses is crucial for quality control and efficiency, but real-world data collection and annotation are labor-intensive.

Method: Developed a pipeline for synthetic image generation and introduced a real-world benchmark dataset. Evaluated synthetic data’s impact on instance segmentation models.

Result: Synthetic data significantly improved segmentation performance across all tested models.

Conclusion: Synthetic data augmentation is an effective strategy to address data scarcity and reduce manual annotation in poultry processing automation.

Abstract: The poultry industry has been driven by broiler chicken production and has grown into the world’s largest animal protein sector. Automated detection of chicken carcasses on processing lines is vital for quality control, food safety, and operational efficiency in slaughterhouses and poultry processing plants. However, developing robust deep learning models for tasks like instance segmentation in these fast-paced industrial environments is often hampered by the need for laborious acquisition and annotation of large-scale real-world image datasets. We present the first pipeline generating photo-realistic, automatically labeled synthetic images of chicken carcasses. We also introduce a new benchmark dataset containing 300 annotated real-world images, curated specifically for poultry segmentation research. Using these datasets, this study investigates the efficacy of synthetic data and automatic data annotation to enhance the instance segmentation of chicken carcasses, particularly when real annotated data from the processing line is scarce. A small real dataset with varying proportions of synthetic images was evaluated in prominent instance segmentation models. Results show that synthetic data significantly boosts segmentation performance for chicken carcasses across all models. This research underscores the value of synthetic data augmentation as a viable and effective strategy to mitigate data scarcity, reduce manual annotation efforts, and advance the development of robust AI-driven automated detection systems for chicken carcasses in the poultry processing industry.

[103] Emotion Recognition from Skeleton Data: A Comprehensive Survey

Haifeng Lu, Jiuyi Chen, Zhen Zhang, Ruida Liu, Runhao Zeng, Xiping Hu

Main category: cs.CV

TL;DR: A survey on skeleton-based emotion recognition, covering psychological models, datasets, methods (posture/gait-based), technical paradigms, applications in mental health, and future challenges.

Details

Motivation: To explore privacy-preserving emotion recognition via body movements, leveraging advancements in 3D skeleton and pose estimation technologies.

Method: Categorizes methods into posture/gait-based, reviews four technical paradigms (Traditional, Feat2Net, FeatFusionNet, End2EndNet), and benchmarks results.

Result: Provides a unified taxonomy, compares methods, and highlights applications in mental health (e.g., depression, autism detection).

Conclusion: Identifies open challenges and future directions for skeleton-based emotion recognition.

Abstract: Emotion recognition through body movements has emerged as a compelling and privacy-preserving alternative to traditional methods that rely on facial expressions or physiological signals. Recent advancements in 3D skeleton acquisition technologies and pose estimation algorithms have significantly enhanced the feasibility of emotion recognition based on full-body motion. This survey provides a comprehensive and systematic review of skeleton-based emotion recognition techniques. First, we introduce psychological models of emotion and examine the relationship between bodily movements and emotional expression. Next, we summarize publicly available datasets, highlighting the differences in data acquisition methods and emotion labeling strategies. We then categorize existing methods into posture-based and gait-based approaches, analyzing them from both data-driven and technical perspectives. In particular, we propose a unified taxonomy that encompasses four primary technical paradigms: Traditional approaches, Feat2Net, FeatFusionNet, and End2EndNet. Representative works within each category are reviewed and compared, with benchmarking results across commonly used datasets. Finally, we explore the extended applications of emotion recognition in mental health assessment, such as detecting depression and autism, and discuss the open challenges and future research directions in this rapidly evolving field.

[104] ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks

Ahmad ALBarqawi, Mahmoud Nazzal, Issa Khalil, Abdallah Khreishah, NhatHai Phan

Main category: cs.CV

TL;DR: ViGText integrates images with Vision Large Language Model (VLLM) text explanations in a Graph-based framework to improve deepfake detection, outperforming traditional methods in generalization and robustness.

Details

Motivation: The rise of sophisticated deepfakes challenges traditional detection methods, which lack generalization and robustness against attacks. ViGText addresses this by combining visual and textual analysis for context-aware detection.

Method: ViGText divides images into patches, constructs image and text graphs, and integrates them using Graph Neural Networks (GNNs). It employs multi-level feature extraction across spatial and frequency domains.

Result: ViGText boosts F1 scores from 72.45% to 98.32% in generalization and improves recall by 11.1%. It limits performance degradation to <4% under targeted attacks.

Conclusion: ViGText sets a new standard for deepfake detection by leveraging detailed visual and textual analysis, ensuring media authenticity and information integrity.

Abstract: The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) Text explanations within a Graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data, as it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through the use of multi-level feature extraction across spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy in detecting sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when it detects user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, reflecting the model’s superior ability to generalize to unseen, fine-tuned variations of Stable Diffusion models. As for robustness, ViGText achieves an increase of 11.1% in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.

[105] Enhancing Scene Transition Awareness in Video Generation via Post-Training

Hanwen Shen, Jiajie Lu, Yupeng Cao, Xiaonan Yang

Main category: cs.CV

TL;DR: The paper introduces the Transition-Aware Video (TAV) dataset to improve AI-generated video models’ ability to handle multi-scene prompts by learning coherent scene transitions.

Details

Motivation: Current AI-generated video models struggle with longer videos requiring multiple scenes due to lack of scene transition awareness, as they are trained on single-scene datasets.

Method: The authors propose the TAV dataset, consisting of video clips with multiple scene transitions, and post-train models on it to enhance transition understanding.

Result: Post-training on TAV improves scene transition understanding, reduces the gap between required and generated scenes, and maintains image quality.

Conclusion: The TAV dataset effectively addresses the challenge of multi-scene video generation by enhancing models’ transition awareness.

Abstract: Recent advances in AI-generated video have shown strong performance on text-to-video tasks, particularly for short clips depicting a single scene. However, current models struggle to generate longer videos with coherent scene transitions, primarily because they cannot infer when a transition is needed from the prompt. Most open-source models are trained on datasets consisting of single-scene video clips, which limits their capacity to learn and respond to prompts requiring multiple scenes. Developing scene transition awareness is essential for multi-scene generation, as it allows models to identify and segment videos into distinct clips by accurately detecting transitions. To address this, we propose the Transition-Aware Video (TAV) dataset, which consists of preprocessed video clips with multiple scene transitions. Our experiment shows that post-training on the TAV dataset improves prompt-based scene transition understanding, narrows the gap between required and generated scenes, and maintains image quality.

[106] BokehDiff: Neural Lens Blur with One-Step Diffusion

Chengxuan Zhu, Qingnan Fan, Qi Zhang, Jinwei Chen, Huaqi Zhang, Chao Xu, Boxin Shi

Main category: cs.CV

TL;DR: BokehDiff introduces a lens blur rendering method using generative diffusion prior for accurate and visually appealing results, overcoming depth estimation limitations.

Details

Motivation: Previous methods suffer from artifacts due to inaccurate depth estimation, prompting the need for a more robust solution.

Method: Uses a physics-inspired self-attention module with depth-dependent constraints and adapts diffusion models for one-step inference.

Result: Achieves high-quality, artifact-free blur rendering with improved fidelity.

Conclusion: BokehDiff effectively addresses depth estimation challenges and enhances blur rendering realism.

Abstract: We introduce BokehDiff, a novel lens blur rendering method that achieves physically accurate and visually appealing outcomes with the help of a generative diffusion prior. Previous methods are bounded by the accuracy of depth estimation, generating artifacts at depth discontinuities. Our method employs a physics-inspired self-attention module that aligns with the image formation process, incorporating a depth-dependent circle-of-confusion constraint and self-occlusion effects. We adapt the diffusion model to the one-step inference scheme without introducing additional noise, and achieve results of high quality and fidelity. To address the lack of scalable paired data, we propose to synthesize photorealistic foregrounds with transparency using diffusion models, balancing authenticity and scene diversity.
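As background on why the circle of confusion depends on depth, the sketch below evaluates the textbook thin-lens formula for the blur-circle diameter. The abstract does not say which exact constraint BokehDiff uses, so this is only the standard optics model. The function and parameter names are assumptions, and all lengths must share one unit (millimetres in the usage shown).

```python
def circle_of_confusion(focal_length, f_number, focus_distance, object_distance):
    """Blur-circle diameter on the sensor under the thin-lens model.

    Standard formula: c = A * |S2 - S1| / S2 * f / (S1 - f), where A is the
    aperture diameter, S1 the focus distance, S2 the object distance, and f
    the focal length.
    """
    if focus_distance <= focal_length:
        raise ValueError("focus distance must exceed the focal length")
    if object_distance <= 0:
        raise ValueError("object distance must be positive")
    aperture = focal_length / f_number
    magnification = focal_length / (focus_distance - focal_length)
    defocus = abs(object_distance - focus_distance) / object_distance
    return aperture * magnification * defocus
```

For a 50 mm lens at f/2 focused at 2 m, `circle_of_confusion(50.0, 2.0, 2000.0, 4000.0)` gives the blur diameter in millimetres for an object at 4 m. An object exactly at the focus distance yields zero blur. This is why an error in estimated depth translates directly into the wrong blur size.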

[107] Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement

Xiaoran Sun, Liyan Wang, Cong Wang, Yeying Jin, Kin-man Lam, Zhixun Su, Yang Yang, Jinshan Pan

Main category: cs.CV

TL;DR: VLM-IMI is a new framework for low-light image enhancement (LLIE) that uses vision-language models (VLMs) and iterative manual instructions (IMIs) to improve semantic guidance and output quality.

Details

Motivation: Existing LLIE methods lack semantic guidance from normal-light images, limiting their effectiveness in complex lighting conditions.

Method: VLM-IMI integrates textual descriptions of normal-light content and uses an instruction prior fusion module to align image and text features. It employs iterative manual instructions for refinement.

Result: VLM-IMI outperforms state-of-the-art methods in quantitative metrics and perceptual quality, especially in extreme low-light conditions.

Conclusion: The proposed framework effectively enhances low-light images by leveraging semantic guidance and iterative refinement, achieving superior results.

Abstract: Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a novel framework that leverages large vision-language models (VLMs) with iterative and manual instructions (IMIs) for LLIE. VLM-IMI incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. To effectively integrate cross-modal priors, we introduce an instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, we adopt an iterative and manual instruction strategy to refine textual instructions, progressively improving visual quality. This refinement enhances structural fidelity, semantic alignment, and the recovery of fine details under extremely low-light conditions. Extensive experiments across diverse scenarios demonstrate that VLM-IMI outperforms state-of-the-art methods in both quantitative metrics and perceptual quality. The source code is available at https://github.com/sunxiaoran01/VLM-IMI.

[108] Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization

Feng-Qi Cui, Anyang Tong, Jinyang Huang, Jie Zhang, Dan Guo, Zhi Liu, Meng Wang

Main category: cs.CV

TL;DR: The paper introduces HDF, a framework for dynamic facial expression recognition, addressing sample heterogeneity with two modules: DAM for time-frequency modeling and DSM for optimization balance, achieving improved accuracy and robustness.

Details

Motivation: Existing methods for DFER suffer from performance degradation due to sample heterogeneity from multi-source data and individual variability.

Method: Proposes HDF with two modules: DAM (Time-Frequency Distributional Attention Module) for temporal and frequency modeling, and DSM (Distribution-aware Scaling Module) for adaptive loss balancing.

Result: HDF improves recognition accuracy and robustness on DFEW and FERV39k datasets, achieving superior WAR and UAR.

Conclusion: HDF effectively addresses heterogeneity in DFER, enhancing performance and generalization, with code publicly available.

Abstract: Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module, the Distribution-aware Scaling Module (DSM), is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
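The two reported metrics differ in how they treat class imbalance. The sketch below computes both from label lists using the usual definitions: UAR is the plain mean of per-class recalls, and WAR weights each class recall by its number of samples, which makes it equal to overall accuracy. The function name `war_uar` is hypothetical and the code is not taken from the HDF repository.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equally long sequences of class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: correct[c] / n for c, n in support.items()}
    # UAR: every class counts equally, regardless of its size.
    uar = sum(recalls.values()) / len(recalls)
    # WAR: each class recall is weighted by its sample count.
    war = sum(recalls[c] * n for c, n in support.items()) / len(y_true)
    return war, uar
```

With four samples of one class and one of another, a model that always predicts the majority class gets a WAR of 0.8 but a UAR of only 0.5. This is why both are reported on imbalanced datasets.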

[109] TextSAM-EUS: Text Prompt Learning for SAM to Accurately Segment Pancreatic Tumor in Endoscopic Ultrasound

Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S. Miller, Hassan Rivaz, Marta Kersten-Oertel, Yiming Xiao

Main category: cs.CV

TL;DR: TextSAM-EUS is a lightweight, text-driven adaptation of SAM for pancreatic tumor segmentation in EUS, outperforming SOTA models with minimal parameter tuning.

Details

Motivation: EUS segmentation is challenging due to speckle noise, low contrast, and reliance on expert annotations. TextSAM-EUS aims to automate this without manual prompts.

Method: Uses text prompt learning via BiomedCLIP and LoRA-based adaptation of SAM, tuning only 0.86% of parameters.

Result: Achieves 82.69% Dice and 85.28% NSD with automatic prompts, surpassing SOTA models.

Conclusion: TextSAM-EUS is efficient and robust for EUS segmentation, pioneering prompt learning in SAM-based medical image analysis.

Abstract: Pancreatic cancer carries a poor prognosis and relies on endoscopic ultrasound (EUS) for targeted biopsy and radiotherapy. However, the speckle noise, low contrast, and unintuitive appearance of EUS make segmentation of pancreatic tumors with fully supervised deep learning (DL) models both error-prone and dependent on large, expert-curated annotation datasets. To address these challenges, we present TextSAM-EUS, a novel, lightweight, text-driven adaptation of the Segment Anything Model (SAM) that requires no manual geometric prompts at inference. Our approach leverages text prompt learning (context optimization) through the BiomedCLIP text encoder in conjunction with a LoRA-based adaptation of SAM’s architecture to enable automatic pancreatic tumor segmentation in EUS, tuning only 0.86% of the total parameters. On the public Endoscopic Ultrasound Database of the Pancreas, TextSAM-EUS with automatic prompts attains 82.69% Dice and 85.28% normalized surface distance (NSD), and with manual geometric prompts reaches 83.10% Dice and 85.70% NSD, outperforming both existing state-of-the-art (SOTA) supervised DL models and foundation models (e.g., SAM and its variants). As the first attempt to incorporate prompt learning in SAM-based medical image segmentation, TextSAM-EUS offers a practical option for efficient and robust automatic EUS segmentation. Our code will be publicly available upon acceptance.
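For reference, the Dice score reported above measures overlap between a predicted mask and the ground-truth mask. The sketch below applies the standard definition to flattened binary masks. The function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something the paper states.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    pred and target are flat sequences of 0/1 (or False/True) values.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

A prediction covering two pixels, of which one overlaps a one-pixel target, scores 2 * 1 / (2 + 1), about 0.667.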

[110] Comparison of Segmentation Methods in Remote Sensing for Land Use Land Cover

Naman Srivastava, Joel D Joy, Yash Dixit, Swarup E, Rakshit Ramesh

Main category: cs.CV

TL;DR: The paper evaluates advanced LULC mapping techniques using Cartosat MX images, combining atmospheric correction with supervised/semi-supervised learning models (DeeplabV3+ and CPS) for urban planning applications. A case study of Hyderabad shows land use changes due to urbanization.

Details

Motivation: LULC mapping is crucial for smart and sustainable city development, requiring accurate techniques to monitor urban changes.

Method: Uses LUT-based atmospheric correction on Cartosat MX images, followed by DeeplabV3+ and CPS models with dynamic weighting for LULC prediction.

Result: Demonstrates significant land use changes in Hyderabad, such as urban sprawl and shrinking green spaces, validating the techniques’ utility.

Conclusion: The proposed methods are practical for urban planners and policymakers to monitor and manage land use changes effectively.

Abstract: Land Use Land Cover (LULC) mapping is essential for urban and resource planning, and is one of the key elements in developing smart and sustainable cities. This study evaluates advanced LULC mapping techniques, focusing on Look-Up Table (LUT)-based Atmospheric Correction applied to Cartosat Multispectral (MX) sensor images, followed by supervised and semi-supervised learning models for LULC prediction. We explore DeeplabV3+ and Cross-Pseudo Supervision (CPS). The CPS model is further refined with dynamic weighting, enhancing pseudo-label reliability during training. This comprehensive approach analyses the accuracy and utility of LULC mapping techniques for various urban planning applications. A case study of Hyderabad, India, illustrates significant land use changes due to rapid urbanization. By analyzing Cartosat MX images over time, we highlight shifts such as urban sprawl, shrinking green spaces, and expanding industrial areas. This demonstrates the practical utility of these techniques for urban planners and policymakers.

[111] Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning

Ruizhe Chen, Zhiting Fan, Tianze Luo, Heqing Zou, Zhaopeng Feng, Guiyang Xie, Hansheng Zhang, Zhuochen Wang, Zuozhu Liu, Huaijian Zhang

Main category: cs.CV

TL;DR: A two-stage training framework combining supervised fine-tuning (SFT) and reinforcement learning (RL) improves video temporal grounding (VTG) accuracy and robustness, outperforming existing models on benchmarks.

Details

Motivation: Existing VTG approaches suffer from limited temporal awareness and poor generalization, prompting the need for a more effective training framework.

Method: The framework uses high-quality curated data for SFT initialization, followed by difficulty-controlled RL to enhance temporal localization and reasoning.

Result: The method outperforms existing models on multiple VTG benchmarks, especially in challenging and open-domain scenarios.

Conclusion: High-quality cold start data and difficulty-controlled RL are crucial for VTG performance; datasets, models, and code are released for community use.

Abstract: Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.

[112] A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli

Qianyi He, Yuan Chang Leong

Main category: cs.CV

TL;DR: A Transformer-based model predicts fMRI responses to multimodal movies using visual, auditory, and language inputs, integrating prior brain states and narrative summaries for improved performance.

Details

Motivation: To develop encoding models for predicting whole-brain fMRI responses to naturalistic multimodal movies, addressing the need for temporally-aware and multimodal approaches.

Method: A sequence-to-sequence Transformer autoregressively predicts fMRI activity, using pretrained models (VideoMAE, HuBERT, Qwen, BridgeTower) for feature extraction. It employs dual cross-attention for perceptual and narrative information and combines shared encoders with subject-specific decoders.

Result: The model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating effectiveness in capturing long-range temporal structure and individual variability.

Conclusion: The approach highlights the potential of temporally-aware, multimodal sequence modeling for brain activity prediction, with code publicly available.

Abstract: The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states, current stimuli, and episode-level summaries via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus and narrative information provided by high-level summaries of narrative content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with a partial subject-specific decoder, which leverages common structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.

[113] Distributional Uncertainty for Out-of-Distribution Detection

JinYoung Kim, DaeUng Jo, Kimin Yun, Jeonghyo Song, Youngjoon Yoo

Main category: cs.CV

TL;DR: The paper proposes the Free-Energy Posterior Network for joint modeling of distributional uncertainty in deep neural networks, improving OoD detection by leveraging free energy and Beta distribution-based density estimation.

Details

Motivation: Conventional methods like MC Dropout fail to align with the semantic objective of OoD detection by focusing only on model or data uncertainty.

Method: Introduces a free-energy-based density estimator (Beta distribution) and integrates it into a posterior network for direct uncertainty estimation without stochastic sampling. Combines with RPL framework for learning OoD regions.

Result: Validated on benchmarks (Fishyscapes, RoadAnomaly, Segment-Me-If-You-Can), showing effectiveness in uncertainty-aware segmentation.

Conclusion: The proposed method offers a semantically meaningful and computationally efficient solution for OoD detection and uncertainty estimation.

Abstract: Estimating uncertainty from deep neural networks is a widely used approach for detecting out-of-distribution (OoD) samples, which typically exhibit high predictive uncertainty. However, conventional methods such as Monte Carlo (MC) Dropout often focus solely on either model or data uncertainty, failing to align with the semantic objective of OoD detection. To address this, we propose the Free-Energy Posterior Network, a novel framework that jointly models distributional uncertainty and identifies OoD and misclassified regions using free energy. Our method introduces two key contributions: (1) a free-energy-based density estimator parameterized by a Beta distribution, which enables fine-grained uncertainty estimation near ambiguous or unseen regions; and (2) a loss integrated within a posterior network, allowing direct uncertainty estimation from learned parameters without requiring stochastic sampling. By integrating our approach with the residual prediction branch (RPL) framework, the proposed method goes beyond post-hoc energy thresholding and enables the network to learn OoD regions by leveraging the variance of the Beta distribution, resulting in a semantically meaningful and computationally efficient solution for uncertainty-aware segmentation. We validate the effectiveness of our method on challenging real-world benchmarks, including Fishyscapes, RoadAnomaly, and Segment-Me-If-You-Can.
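
For context, the post-hoc energy thresholding that the abstract contrasts itself with can be sketched as follows. This is only the commonly used baseline score, E(x) = -T log sum_k exp(z_k / T) over the class logits z, and not the paper's Beta-parameterized estimator; the function names, example logits, and threshold value are illustrative assumptions.

```python
import math


def free_energy(logits, temperature=1.0):
    """Free energy E(x) = -T * log(sum_k exp(z_k / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp


def is_ood(logits, threshold, temperature=1.0):
    """Post-hoc rule: flag a sample (or pixel) whose energy exceeds the threshold."""
    return free_energy(logits, temperature) > threshold


# Made-up logits: one confident prediction and one ambiguous prediction.
confident = [9.0, 0.5, -1.0]
ambiguous = [0.3, 0.2, 0.1]

print(free_energy(confident), free_energy(ambiguous))
print(is_ood(confident, threshold=-5.0), is_ood(ambiguous, threshold=-5.0))
```

A confident prediction yields a strongly negative energy, while flat logits yield an energy closer to zero, so a single threshold separates the two. The threshold here is hand-picked; in practice it would be tuned on held-out data.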

[114] T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation

Yubin Chen, Xuyang Guo, Zhenmei Shi, Zhao Song, Jiahao Zhang

Main category: cs.CV

TL;DR: T2VWorldBench is introduced as the first framework to evaluate text-to-video models’ world knowledge abilities, revealing significant gaps in semantic consistency and factual accuracy.

Details

Motivation: Current text-to-video models lack understanding of world knowledge, leading to semantic inconsistencies and factual inaccuracies in generated videos.

Method: T2VWorldBench evaluates 10 advanced models using 1,200 prompts across 6 categories, combining human and automated (VLM-based) evaluation.

Result: Most models fail to generate factually accurate videos, highlighting a critical gap in leveraging world knowledge.

Conclusion: The findings underscore the need for improved commonsense reasoning in text-to-video models, offering research opportunities for robust factual generation.

Abstract: Text-to-video (T2V) models have shown remarkable performance in generating visually reasonable scenes, while their capability to leverage world knowledge for ensuring semantic consistency and factual accuracy remains largely understudied. In response to this challenge, we propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models, covering 6 major categories, 60 subcategories, and 1,200 prompts across a wide range of domains, including physics, nature, activity, culture, causality, and object. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos. These findings point out a critical gap in the capability of current text-to-video models to leverage world knowledge, providing valuable research opportunities and entry points for constructing models with robust capabilities for commonsense reasoning and factual generation.

[115] Information Entropy-Based Framework for Quantifying Tortuosity in Meibomian Gland Uneven Atrophy

Kesheng Wang, Xiaoyu Chen, Chunlei He, Fenfen Li, Xinxin Yu, Dexing Kong, Shoujun Huang, Qi Dai

Main category: cs.CV

TL;DR: A novel entropy-based framework for quantifying curve tortuosity in medical images, validated through meibomian gland atrophy analysis, shows high clinical utility.

Details

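Motivation: Precise tortuosity quantification is crucial for medical diagnostics, but traditional measures such as curvature or arc-chord ratio rely on idealized straight-line comparisons and do not exploit biologically plausible reference curves when these are available.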

Method: The framework integrates probability modeling and entropy theory, comparing target curves to biologically plausible references instead of idealized lines.

Result: Significant differences in tortuosity were found between patient groups (AUC: 0.8768, sensitivity: 0.75, specificity: 0.93).

Conclusion: The framework is effective for medical curve analysis and has potential for broader diagnostic applications.

Abstract: In the medical image analysis field, precise quantification of curve tortuosity plays a critical role in the auxiliary diagnosis and pathological assessment of various diseases. In this study, we propose a novel framework for tortuosity quantification and demonstrate its effectiveness through the evaluation of meibomian gland atrophy uniformity, serving as a representative application scenario. We introduce an information entropy-based tortuosity quantification framework that integrates probability modeling with entropy theory and incorporates domain transformation of curve data. Unlike traditional methods such as curvature or arc-chord ratio, this approach evaluates the tortuosity of a target curve by comparing it to a designated reference curve. Consequently, it is more suitable for tortuosity assessment tasks in medical data where biologically plausible reference curves are available, providing a more robust and objective evaluation metric without relying on idealized straight-line comparisons. First, we conducted numerical simulation experiments to preliminarily assess the stability and validity of the method. Subsequently, the framework was applied to quantify the spatial uniformity of meibomian gland atrophy and to analyze the difference in this uniformity between Demodex-negative and Demodex-positive patient groups. The results demonstrated a significant difference in tortuosity-based uniformity between the two groups, with an area under the curve of 0.8768, sensitivity of 0.75, and specificity of 0.93. These findings highlight the clinical utility of the proposed framework in curve tortuosity analysis and its potential as a generalizable tool for quantitative morphological evaluation in medical diagnostics.
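
To make the contrast with "idealized straight-line comparisons" concrete, here is a minimal sketch of the traditional arc-chord ratio the abstract mentions. It is the baseline measure only, not the entropy-based framework, whose details are not given in this summary. The function name and sample curves are illustrative assumptions.

```python
import math


def arc_chord_ratio(points):
    """Arc length of a polyline divided by the straight-line distance
    between its endpoints. A straight line scores 1.0; larger is more tortuous.
    Assumes an open curve, since the ratio is undefined when the endpoints coincide."""
    arc = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    chord = math.dist(points[0], points[-1])
    return arc / chord


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
zigzag = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]

print(arc_chord_ratio(straight))
print(arc_chord_ratio(zigzag))
```

Because the reference is always the straight chord, a naturally curved but healthy structure is scored as tortuous; this is the limitation that comparison against a designated reference curve is meant to address.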

[116] Real-Time Object Detection and Classification using YOLO for Edge FPGAs

Rashed Al Amin, Roman Obermaisser

Main category: cs.CV

TL;DR: A resource-efficient YOLOv5-based system for real-time object detection and classification on FPGAs, achieving 99% accuracy, 3.5W power, and 9 FPS.

Details

Motivation: Existing YOLO-based systems lack resource efficiency for edge FPGA platforms, limiting their deployment in applications like ADAS.

Method: Optimized YOLOv5 for FPGA deployment, trained on COCO and GTSRD datasets, and implemented on Xilinx Kria KV260 FPGA board.

Result: Achieved 99% classification accuracy, 3.5W power consumption, and 9 FPS processing speed.

Conclusion: The proposed system effectively enables real-time, resource-efficient object detection and classification for edge computing.

Abstract: Object detection and classification are crucial tasks across various application domains, particularly in the development of safe and reliable Advanced Driver Assistance Systems (ADAS). Existing deep learning-based methods such as Convolutional Neural Networks (CNNs), Single Shot Detectors (SSDs), and You Only Look Once (YOLO) have demonstrated high performance in terms of accuracy and computational speed when deployed on Field-Programmable Gate Arrays (FPGAs). However, despite these advances, state-of-the-art YOLO-based object detection and classification systems continue to face challenges in achieving resource efficiency suitable for edge FPGA platforms. To address this limitation, this paper presents a resource-efficient real-time object detection and classification system based on YOLOv5 optimized for FPGA deployment. The proposed system is trained on the COCO and GTSRD datasets and implemented on the Xilinx Kria KV260 FPGA board. Experimental results demonstrate a classification accuracy of 99%, with a power consumption of 3.5W and a processing speed of 9 frames per second (FPS). These findings highlight the effectiveness of the proposed approach in enabling real-time, resource-efficient object detection and classification for edge computing applications.

[117] Unsupervised Domain Adaptation for 3D LiDAR Semantic Segmentation Using Contrastive Learning and Multi-Model Pseudo Labeling

Abhishek Kaushik, Norbert Haala, Uwe Soergel

Main category: cs.CV

TL;DR: A novel two-stage UDA framework improves 3D LiDAR semantic segmentation by combining contrastive pre-training and ensemble pseudo-labeling, achieving better accuracy without target domain annotations.

Details

Motivation: Performance degradation in 3D LiDAR semantic segmentation due to domain shifts (e.g., sensor type, location) is a critical issue for autonomous systems, and manual annotation is impractical.

Method: 1. Unsupervised contrastive learning pre-trains a backbone network for domain-invariant features. 2. Multi-model pseudo-labeling uses an ensemble of architectures to generate refined pseudo-labels for fine-tuning.

Result: Experiments show significant accuracy improvements over direct transfer and single-model UDA methods when adapting from SemanticKITTI to SemanticPOSS and SemanticSlamantic.

Conclusion: The framework effectively bridges domain gaps by leveraging contrastive pre-training and ensemble pseudo-labeling, eliminating the need for target domain annotations.

Abstract: Addressing performance degradation in 3D LiDAR semantic segmentation due to domain shifts (e.g., sensor type, geographical location) is crucial for autonomous systems, yet manual annotation of target data is prohibitive. This study addresses the challenge using Unsupervised Domain Adaptation (UDA) and introduces a novel two-stage framework to tackle it. Initially, unsupervised contrastive learning at the segment level is used to pre-train a backbone network, enabling it to learn robust, domain-invariant features without labels. Subsequently, a multi-model pseudo-labeling strategy is introduced, utilizing an ensemble of diverse state-of-the-art architectures (including projection, voxel, hybrid, and cylinder-based methods). Predictions from these models are aggregated via hard voting to generate high-quality, refined pseudo-labels for the unlabeled target domain, mitigating single-model biases. The contrastively pre-trained network is then fine-tuned using these robust pseudo-labels. Experiments adapting from SemanticKITTI to unlabeled target datasets (SemanticPOSS, SemanticSlamantic) demonstrate significant improvements in segmentation accuracy compared to direct transfer and single-model UDA approaches. These results highlight the effectiveness of combining contrastive pre-training with refined ensemble pseudo-labeling for bridging complex domain gaps without requiring target domain annotations.
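
The hard-voting aggregation step can be sketched as below. This is a minimal illustration rather than the authors' implementation: the per-point label lists are made up, and the `IGNORE` label together with the strict-majority rule (`min_agreement`) is an added assumption about how low-agreement points might be handled, which the abstract does not specify.

```python
from collections import Counter

IGNORE = -1  # hypothetical label for points without a clear majority


def hard_vote(predictions, min_agreement=0.5):
    """Fuse per-point class labels from several models by majority vote.

    predictions: list of label lists, one per model, all of equal length.
    A point keeps the winning label only if more than `min_agreement`
    of the models agree; otherwise it is marked IGNORE.
    """
    n_models = len(predictions)
    fused = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        fused.append(label if count / n_models > min_agreement else IGNORE)
    return fused


# Made-up predictions from four models for five LiDAR points.
model_outputs = [
    [0, 1, 2, 2, 3],
    [0, 1, 2, 1, 0],
    [0, 2, 2, 1, 1],
    [1, 1, 2, 0, 2],
]
pseudo_labels = hard_vote(model_outputs)
print(pseudo_labels)  # [0, 1, 2, -1, -1]
```

Points marked `IGNORE` would simply be excluded from the fine-tuning loss, so that only labels backed by a majority of architectures supervise the target-domain training.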

[118] Cloud gap-filling with deep learning for improved grassland monitoring

Iason Tsardanidis, Alkiviadis Koukos, Vasileios Sitokonstantinou, Thanassis Drivas, Charalampos Kontoes

Main category: cs.CV

TL;DR: A deep learning method combining CNNs and RNNs integrates Sentinel-2 and Sentinel-1 data to generate continuous NDVI time series, improving grassland mowing detection in cloudy regions like Lithuania.

Details

Motivation: Clouds disrupt optical image time series, hindering agricultural monitoring. Integrating cloud-free optical and SAR data addresses this issue.

Method: Hybrid CNN-RNN architecture fuses Sentinel-2 and Sentinel-1 data to produce continuous NDVI time series, tested against interpolation techniques.

Result: Achieved MAE of 0.024 and R^2 of 0.92, with mowing detection F1-score up to 84%, outperforming interpolation methods.

Conclusion: The method enhances NDVI continuity and mowing detection accuracy, mitigating noise from cloudy observations.

Abstract: Uninterrupted optical image time series are crucial for the timely monitoring of agricultural land changes, particularly in grasslands. However, the continuity of such time series is often disrupted by clouds. In response to this challenge, we propose an innovative deep learning method that integrates cloud-free optical (Sentinel-2) observations and weather-independent (Sentinel-1) Synthetic Aperture Radar (SAR) data. Our approach employs a hybrid architecture combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to generate continuous Normalized Difference Vegetation Index (NDVI) time series, highlighting the role of NDVI in the synergy between SAR and optical data. We demonstrate the significance of observation continuity by assessing the impact of the generated NDVI time series on the downstream task of grassland mowing event detection. We conducted our study in Lithuania, a country characterized by extensive cloud coverage, and compared our approach with alternative interpolation techniques (i.e., linear, Akima, quadratic). Our method outperformed these techniques, achieving an average Mean Absolute Error (MAE) of 0.024 and a coefficient of determination R^2 of 0.92. Additionally, our analysis revealed improvement in the performance of the mowing event detection, with F1-score up to 84% using two widely applied mowing detection methodologies. Our method also effectively mitigated sudden shifts and noise originating from cloudy observations, which are often missed by conventional cloud masks and adversely affect mowing detection precision.
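
As a point of reference for the interpolation baselines, the sketch below computes NDVI with its standard definition, (NIR - Red) / (NIR + Red), fills cloud gaps by linear interpolation, and scores the result with MAE. The time series values are invented for illustration and are not taken from the paper; the function names are likewise assumptions.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def linear_gap_fill(days, values):
    """Fill None entries (cloudy dates) by linear interpolation in time.
    Assumes `days` is sorted and at least one value is known; gaps at the
    edges take the nearest known value."""
    known = [(d, v) for d, v in zip(days, values) if v is not None]
    filled = []
    for d, v in zip(days, values):
        if v is not None:
            filled.append(v)
            continue
        before = [kv for kv in known if kv[0] < d]
        after = [kv for kv in known if kv[0] > d]
        if before and after:
            (d0, v0), (d1, v1) = before[-1], after[0]
            filled.append(v0 + (v1 - v0) * (d - d0) / (d1 - d0))
        elif before:
            filled.append(before[-1][1])
        else:
            filled.append(after[0][1])
    return filled


def mean_absolute_error(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


# Invented example: a mowing-like NDVI dip on day 5 is hidden by cloud.
days = [0, 5, 10, 15, 20]
cloudy = [0.8, None, 0.4, None, 0.6]
truth = [0.8, 0.3, 0.4, 0.45, 0.6]

filled = linear_gap_fill(days, cloudy)
gap_idx = [i for i, v in enumerate(cloudy) if v is None]
gap_mae = mean_absolute_error([filled[i] for i in gap_idx],
                              [truth[i] for i in gap_idx])
print(filled, gap_mae)
```

In this toy series the interpolated value on day 5 overshoots the true post-mowing NDVI, which illustrates why purely temporal interpolation can miss mowing events that a SAR-informed model has a chance of recovering.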

[119] Differential-UMamba: Rethinking Tumor Segmentation Under Limited Data Scenarios

Dhruv Jain, Romain Modzelewski, Romain Hérault, Clement Chatelain, Eva Torfeh, Sebastien Thureau

Main category: cs.CV

TL;DR: Diff-UMamba, a UNet-mamba hybrid with a Noise Reduction Module, improves medical image segmentation in low-data settings by filtering noise and enhancing relevant features.

Details

Motivation: Deep learning models overfit to noise in data-scarce scenarios, limiting generalization. This is critical in medical image segmentation where accuracy is vital.

Method: Combines UNet with mamba mechanism for long-range dependencies and introduces a Noise Reduction Module (NRM) to suppress irrelevant activations.

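Result: Achieves consistent 1-3% gains over baselines on public datasets (MSD lung and pancreas, AIIB23) and a 4-5% improvement on a small internal NSCLC CBCT dataset; limited-data behavior is further examined on BraTS-21.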

Conclusion: Diff-UMamba enhances segmentation accuracy and robustness, especially in low-data conditions, making it suitable for medical imaging tasks.

Abstract: In data-scarce scenarios, deep learning models often overfit to noise and irrelevant patterns, which limits their ability to generalize to unseen samples. To address these challenges in medical image segmentation, we introduce Diff-UMamba, a novel architecture that combines the UNet framework with the mamba mechanism for modeling long-range dependencies. At the heart of Diff-UMamba is a Noise Reduction Module (NRM), which employs a signal differencing strategy to suppress noisy or irrelevant activations within the encoder. This encourages the model to filter out spurious features and enhance task-relevant representations, thereby improving its focus on clinically meaningful regions. As a result, the architecture achieves improved segmentation accuracy and robustness, particularly in low-data settings. Diff-UMamba is evaluated on multiple public datasets, including MSD (lung and pancreas) and AIIB23, demonstrating consistent performance gains of 1-3% over baseline methods across diverse segmentation tasks. To further assess performance under limited-data conditions, additional experiments are conducted on the BraTS-21 dataset by varying the proportion of available training samples. The approach is also validated on a small internal non-small cell lung cancer (NSCLC) dataset for gross tumor volume (GTV) segmentation in cone beam CT (CBCT), where it achieves a 4-5% improvement over the baseline.

[120] MatSSL: Robust Self-Supervised Representation Learning for Metallographic Image Segmentation

Hoang Hai Nam Nguyen, Phan Nguyen Duc Hieu, Ho Won Lee

Main category: cs.CV

TL;DR: MatSSL is a self-supervised learning architecture for micrograph analysis, outperforming supervised methods with limited labeled data and achieving better results than large-scale pretrained models.

Details

Motivation: Current supervised methods for micrograph analysis require retraining for new datasets and perform poorly with few labels. SSL alternatives often need large datasets. MatSSL aims to work effectively with small unlabeled data.

Method: MatSSL uses Gated Feature Fusion in its backbone for multi-level representation integration. It is pretrained on small unlabeled data and fine-tuned on benchmarks.

Result: Achieves 69.13% mIoU on MetalDAM (vs. 66.73% with ImageNet) and up to 40% improvement on EBC compared to MicroNet-pretrained models.

Conclusion: MatSSL adapts well to metallographic domains with minimal unlabeled data, retaining transferable features from large-scale pretraining.

Abstract: MatSSL is a streamlined self-supervised learning (SSL) architecture that employs Gated Feature Fusion at each stage of the backbone to integrate multi-level representations effectively. Current micrograph analysis of metallic materials relies on supervised methods, which require retraining for each new dataset and often perform inconsistently with only a few labeled samples. While SSL offers a promising alternative by leveraging unlabeled data, most existing methods still depend on large-scale datasets to be effective. MatSSL is designed to overcome this limitation. We first perform self-supervised pretraining on a small-scale, unlabeled dataset and then fine-tune the model on multiple benchmark datasets. The resulting segmentation models achieve 69.13% mIoU on MetalDAM, outperforming the 66.73% achieved by an ImageNet-pretrained encoder, and consistently deliver improvements of up to nearly 40% in average mIoU on the Environmental Barrier Coating benchmark dataset (EBC) compared to models pretrained with MicroNet. This suggests that MatSSL enables effective adaptation to the metallographic domain using only a small amount of unlabeled data, while preserving the rich and transferable features learned from large-scale pretraining on natural images.

[121] MAD-AD: Masked Diffusion for Unsupervised Brain Anomaly Detection

Farzad Beizaee, Gregory Lodygensky, Christian Desrosiers, Jose Dolz

Main category: cs.CV

TL;DR: A novel unsupervised anomaly detection method for brain MRI scans using masking in diffusion models to localize and identify anomalies by learning normal brain anatomy.

Details

Motivation: Accurate anomaly localization in brain images is challenging due to structural complexity and lack of labeled abnormal data.

Method: Uses diffusion models with masking to learn normal brain anatomy by adding noise to patches and recovering original features. Anomalies are identified as noisy patches.

Result: Outperforms existing unsupervised techniques in anomaly localization and generating normal counterparts.

Conclusion: The proposed method effectively localizes anomalies in brain MRI scans without labeled data, improving accuracy and performance.

Abstract: Unsupervised anomaly detection in brain images is crucial for identifying injuries and pathologies without access to labels. However, the accurate localization of anomalies in medical images remains challenging due to the inherent complexity and variability of brain structures and the scarcity of annotated abnormal data. To address this challenge, we propose a novel approach that incorporates masking within diffusion models, leveraging their generative capabilities to learn robust representations of normal brain anatomy. During training, our model processes only normal brain MRI scans and performs a forward diffusion process in the latent space that adds noise to the features of randomly-selected patches. Following a dual objective, the model learns to identify which patches are noisy and recover their original features. This strategy ensures that the model captures intricate patterns of normal brain structures while isolating potential anomalies as noise in the latent space. At inference, the model identifies noisy patches corresponding to anomalies and generates a normal counterpart for these patches by applying a reverse diffusion process. Our method surpasses existing unsupervised anomaly detection techniques, demonstrating superior performance in generating accurate normal counterparts and localizing anomalies. The code is available at https://github.com/farzad-bz/MAD-AD.

[122] TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance

Minghao Fu, Guo-Hua Wang, Xiaohao Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

Main category: cs.CV

TL;DR: TeEFusion is a distillation method that improves text-to-image synthesis efficiency by fusing text embeddings and simplifying sampling, achieving 6x faster inference without quality loss.

Details

Motivation: High inference costs due to complex sampling and classifier-free guidance (CFG) in text-to-image synthesis.

Method: TeEFusion fuses conditional and unconditional text embeddings linearly, distilling the teacher model’s sampling strategy.

Result: Student model achieves 6x faster inference while maintaining image quality comparable to the teacher.

Conclusion: TeEFusion offers a simpler, efficient alternative to complex sampling without sacrificing performance.

Abstract: Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG’s reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (Text Embeddings Fusion), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model’s complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher’s output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher’s performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6x faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher’s complex sampling approach. The code is publicly available at https://github.com/AIDC-AI/TeEFusion.
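
The cost argument can be illustrated with a toy sketch. Standard CFG combines two noise predictions as eps_uncond + w * (eps_cond - eps_uncond), which needs two forward passes per step. The fused-embedding variant below applies the same linear extrapolation to the text embeddings and runs a single pass. The exact fusion used by TeEFusion is not given in this summary, so that form is an assumption, and the linear `denoiser` is a hypothetical stand-in for a real diffusion model.

```python
def denoiser(text_emb, weights):
    """Toy linear stand-in for a diffusion model's noise prediction (hypothetical)."""
    return [sum(w * e for w, e in zip(row, text_emb)) for row in weights]


def cfg_two_pass(cond, uncond, scale, weights):
    """Standard classifier-free guidance: two forward passes per step."""
    eps_c = denoiser(cond, weights)
    eps_u = denoiser(uncond, weights)
    return [u + scale * (c - u) for c, u in zip(eps_c, eps_u)]


def fuse_embeddings(cond, uncond, scale):
    """Assumed fusion: extrapolate in embedding space with the guidance scale."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def single_pass(cond, uncond, scale, weights):
    """One forward pass on the fused embedding."""
    return denoiser(fuse_embeddings(cond, uncond, scale), weights)


weights = [[0.5, -1.0], [2.0, 0.25]]
cond, uncond, scale = [1.0, 2.0], [0.0, 0.5], 5.0

two_pass = cfg_two_pass(cond, uncond, scale, weights)
one_pass = single_pass(cond, uncond, scale, weights)
print(two_pass, one_pass)
```

For a linear model the two routes agree exactly. A real denoiser is nonlinear, so the single-pass output only approximates guided sampling; this is presumably why the method trains the student by distillation from the teacher's outputs rather than relying on the fusion alone.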

[123] LEAF: Latent Diffusion with Efficient Encoder Distillation for Aligned Features in Medical Image Segmentation

Qilin Huang, Tianyu Lin, Zhiguang Chen, Fudan Zheng

Main category: cs.CV

TL;DR: LEAF improves medical image segmentation by fine-tuning latent diffusion models, replacing noise prediction with direct segmentation map prediction and using feature distillation.

Details

Motivation: Existing diffusion models lack task-specific adjustments for segmentation and have feature extraction deficiencies.

Method: Fine-tuning latent diffusion models to predict segmentation maps directly and aligning features via distillation.

Result: Enhanced performance across multiple datasets without altering model architecture or increasing computational cost.

Conclusion: LEAF is an efficient and effective method for medical image segmentation using diffusion models.

Abstract: Leveraging the powerful capabilities of diffusion models has yielded quite effective results in medical image segmentation tasks. However, existing methods typically transfer the original training process directly without specific adjustments for segmentation tasks. Furthermore, the commonly used pre-trained diffusion models still have deficiencies in feature extraction. Based on these considerations, we propose LEAF, a medical image segmentation model grounded in latent diffusion models. During the fine-tuning process, we replace the original noise prediction pattern with a direct prediction of the segmentation map, thereby reducing the variance of segmentation results. We also employ a feature distillation method to align the hidden states of the convolutional layers with the features from a transformer-based vision encoder. Experimental results demonstrate that our method enhances the performance of the original diffusion model across multiple segmentation datasets for different disease types. Notably, our approach does not alter the model architecture, nor does it increase the number of parameters or computation during the inference phase, making it highly efficient.

[124] DAA*: Deep Angular A Star for Image-based Path Planning

Zhiwei Xu

Main category: cs.CV

TL;DR: The paper introduces Deep Angular A* (DAA*), a method incorporating Path Angular Freedom (PAF) into A* to improve path smoothness and similarity in imitation learning, outperforming existing methods in path optimality and similarity metrics.

Details

Motivation: Path smoothness is often overlooked in path imitation learning, limiting the quality of learned paths. This paper addresses this gap by introducing adaptive path smoothness through PAF.

Method: DAA* integrates PAF into A* to balance path shortening and smoothing, optimizing heuristic distance and angular freedom. It is evaluated on 7 datasets, including mazes, video games, and real-world drone scenarios.

Result: DAA* improves path similarity over neural A* by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM, and outperforms TransPath by 6.3% SPR, 6.0% PSIM, and 3.7% ASIM in joint learning tasks.

Conclusion: DAA* effectively enhances path imitation learning by balancing smoothness and optimality, with minor trade-offs in search efficiency.

Abstract: Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Through comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.3% SPR, 6.0% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable. Our code and model weights are available at https://github.com/zwxu064/DAAStar.git.

[125] 3D Test-time Adaptation via Graph Spectral Driven Point Shift

Xin Wei, Qin Yang, Yijie Fang, Mingrui Zhu, Nannan Wang

Main category: cs.CV

TL;DR: GSDTTA introduces a graph spectral domain approach for efficient 3D point cloud test-time adaptation, outperforming existing methods.

Details

Motivation: Addressing the inefficiency and computational cost of current 3D TTA methods for irregular point clouds.

Method: Uses Graph Fourier Transform (GFT) to represent point clouds in the spectral domain, optimizing only key frequency components and refining via eigenmap-guided self-training.

Result: Demonstrates superior performance on benchmark datasets compared to existing TTA methods.

Conclusion: GSDTTA offers a more efficient and effective solution for 3D point cloud adaptation.

Abstract: While test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference, their application to 3D point clouds is hindered by their irregular and unordered structure. Current 3D TTA methods often rely on computationally expensive spatial-domain optimizations and may require additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in target domain are represented as outlier-aware graphs and transformed into graph spectral domain by Graph Fourier Transform (GFT). For efficiency, adaptation is performed by optimizing only the lowest 10% of frequency components, which capture the majority of the point cloud’s energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. This process is enhanced by an eigenmap-guided self-training strategy that iteratively refines both the spectral adjustments and the model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, outperforming existing TTA methods for 3D point cloud classification.
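The spectral pipeline described above (graph construction, GFT, adjustment of the lowest frequencies, inverse GFT) can be sketched with NumPy as follows. This is an illustrative sketch only: it uses a plain symmetrized k-nearest-neighbour graph rather than the paper's outlier-aware graph, the adjustment `delta` is passed in by hand instead of being optimized, and every function name here is hypothetical.

```python
import numpy as np


def knn_graph_laplacian(points, k=4):
    """Combinatorial Laplacian L = D - W of a symmetrized kNN graph."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        # Entry 0 of the argsort is the point itself (distance 0), so skip it.
        neighbors = np.argsort(dists[i])[1:k + 1]
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)  # make the graph undirected
    return np.diag(W.sum(axis=1)) - W


def num_low_frequencies(n, keep_ratio=0.1):
    """Number of lowest-frequency components that are allowed to change."""
    return max(1, int(round(keep_ratio * n)))


def spectral_point_shift(points, delta, keep_ratio=0.1, k=4):
    """Shift a point cloud by editing only its lowest graph frequencies.

    points: (n, 3) array; delta: (m, 3) adjustment for the m lowest
    frequencies, where m = num_low_frequencies(n, keep_ratio).
    """
    L = knn_graph_laplacian(points, k)
    # eigh returns eigenvalues in ascending order, so the first columns of U
    # correspond to the lowest graph frequencies.
    _, U = np.linalg.eigh(L)
    coeffs = U.T @ points                  # GFT, one spectrum per axis
    m = num_low_frequencies(len(points), keep_ratio)
    adjusted = coeffs.copy()
    adjusted[:m] += delta                  # touch only the low frequencies
    return U @ adjusted                    # inverse GFT
```

With `delta` set to zeros the function returns the input cloud, since `U` is orthogonal. The appeal of the design is the parameter count: for a 1,024-point cloud, a 10% budget means about 102 adjustable coefficient rows instead of 1,024 point positions.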

[126] DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception

Chengchang Tian, Jianwei Ma, Yan Huang, Zhanye Chen, Honghao Wei, Hui Zhang, Wei Hong

Main category: cs.CV

TL;DR: The paper introduces the Domain-And-Time Alignment (DATA) network to address domain gaps and temporal misalignment in feature-level fusion for collaborative perception, achieving state-of-the-art performance.

Details

Motivation: Feature-level fusion in collaborative perception is hindered by domain gaps (hardware diversity, deployment conditions) and temporal misalignment (transmission delays), degrading feature quality.

Method: Proposes DATA network with Consistency-preserving Domain Alignment Module (CDAM), Progressive Temporal Alignment Module (PTAM), and Instance-focused Feature Aggregation Module (IFAM) to align features and enhance semantics.

Result: DATA achieves state-of-the-art performance on three datasets, robust to severe communication delays and pose errors.

Conclusion: The DATA network effectively aligns features and maximizes semantic representations, improving collaborative perception performance.

Abstract: Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors. The code will be released at https://github.com/ChengchangTian/DATA.

[127] DepthDark: Robust Monocular Depth Estimation for Low-Light Environments

Longjian Zeng, Zunjie Zhu, Rongfeng Lu, Ming Lu, Bolun Zheng, Chenggang Yan, Anke Xue

Main category: cs.CV

TL;DR: DepthDark is a foundation model for low-light monocular depth estimation that combines flare/noise simulation with PEFT, achieving state-of-the-art results on the nuScenes-Night and RobotCar-Night datasets.

Details

Motivation: Existing models fail in low-light due to lack of datasets and efficient fine-tuning strategies.

Method: Uses flare/noise simulation for dataset creation and PEFT with illumination guidance and multiscale fusion.

Result: Achieves state-of-the-art performance on nuScenes-Night and RobotCar-Night.

Conclusion: DepthDark effectively addresses low-light depth estimation with limited data/resources.

Abstract: In recent years, foundation models for monocular depth estimation have received increasing attention. Current methods mainly address typical daylight conditions, but their effectiveness notably decreases in low-light environments. There is a lack of robust foundation models for monocular depth estimation specifically designed for low-light scenarios. This largely stems from the absence of large-scale, high-quality paired depth datasets for low-light conditions and of an effective parameter-efficient fine-tuning (PEFT) strategy. To address these challenges, we propose DepthDark, a robust foundation model for low-light monocular depth estimation. We first introduce a flare-simulation module and a noise-simulation module to accurately simulate the imaging process under nighttime conditions, producing high-quality paired depth datasets for low-light conditions. Additionally, we present an effective low-light PEFT strategy that utilizes illumination guidance and multiscale feature fusion to enhance the model’s capability in low-light environments. Our method achieves state-of-the-art depth estimation performance on the challenging nuScenes-Night and RobotCar-Night datasets, validating its effectiveness using limited training data and computing resources.

[128] LONG3R: Long Sequence Streaming 3D Reconstruction

Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, Hang Zhao

Main category: cs.CV

TL;DR: LONG3R is a novel model for real-time, long-sequence multi-view 3D scene reconstruction, outperforming existing methods with its memory gating and dual-source decoder.

Details

Motivation: Existing methods for multi-view scene reconstruction struggle with real-time processing of long image sequences, limiting practical applications.

Method: LONG3R uses a memory gating mechanism, a dual-source refined decoder, and a 3D spatio-temporal memory for dynamic pruning and resolution adjustment. It employs a two-stage curriculum training strategy.

Result: LONG3R outperforms state-of-the-art streaming methods, especially for long sequences, while maintaining real-time speed.

Conclusion: LONG3R advances real-time 3D scene reconstruction for long sequences, offering improved performance and efficiency.

Abstract: Recent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model’s performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed. Project page: https://zgchen33.github.io/LONG3R/.

[129] Exploiting Gaussian Agnostic Representation Learning with Diffusion Priors for Enhanced Infrared Small Target Detection

Junyao Li, Yahao Lu, Xingyuan Guo, Xiaoyu Xian, Tiantian Wang, Yukai Shi

Main category: cs.CV

TL;DR: The paper introduces Gaussian Agnostic Representation Learning and Gaussian Group Squeezer to improve infrared small target detection (ISTD) under data scarcity, enhancing model resilience and sample quality.

Details

Motivation: Current ISTD methods rely on costly manual-labeling data, making them fragile in real-world scenarios. The study aims to address performance variations under data scarcity.

Method: Proposes Gaussian Group Squeezer for non-uniform quantization and two-stage diffusion models for real-world reconstruction, leveraging diverse training samples.

Result: The approach outperforms state-of-the-art methods in scarcity scenarios, improving detection resilience and synthetic sample quality.

Conclusion: The proposed method effectively addresses ISTD challenges under data scarcity, offering a robust and practical solution.

Abstract: Infrared small target detection (ISTD) plays a vital role in numerous practical applications. In pursuit of determining the performance boundaries, researchers employ large and expensive manual-labeling data for representation learning. Nevertheless, this approach renders the state-of-the-art ISTD methods highly fragile in real-world challenges. In this paper, we first study the variation in detection performance across several mainstream methods under various scarcity – namely, the absence of high-quality infrared data – that challenge the prevailing theories about practical ISTD. To address this concern, we introduce the Gaussian Agnostic Representation Learning. Specifically, we propose the Gaussian Group Squeezer, leveraging Gaussian sampling and compression for non-uniform quantization. By exploiting a diverse array of training samples, we enhance the resilience of ISTD models against various challenges. Then, we introduce two-stage diffusion models for real-world reconstruction. By aligning quantized signals closely with real-world distributions, we significantly elevate the quality and fidelity of the synthetic samples. Comparative evaluations against state-of-the-art detection methods in various scarcity scenarios demonstrate the efficacy of the proposed approach.

[130] Dissecting the Dental Lung Cancer Axis via Mendelian Randomization and Mediation Analysis

Wenran Zhang, Huihuan Luo, Linda Wei, Ping Nie, Yiqun Wu, Dedong Yu

Main category: cs.CV

TL;DR: The study used Mendelian randomization to explore causal links between dental traits (caries, periodontitis) and lung cancer, finding caries significantly increases lung cancer risk, mediated by lung function decline, while periodontitis showed no effect.

Details

Motivation: To clarify causal relationships between oral diseases (periodontitis, dental caries) and lung cancer, given observational links but uncertain causality.

Method: Two-sample Mendelian randomization (MR) with genetic instruments from large GWAS datasets, analyzing via inverse variance weighting and assessing lung function mediation with the delta method.

Result: Dental caries had a significant positive causal effect on lung cancer (especially squamous cell carcinoma), mediated by declines in FVC and FEV1. Periodontitis showed no causal effect.

Conclusion: Dental caries causally increases lung cancer risk, suggesting dental care and lung function monitoring should be part of cancer prevention.

Abstract: Periodontitis and dental caries are common oral diseases affecting billions globally. While observational studies suggest links between these conditions and lung cancer, causality remains uncertain. This study used two-sample Mendelian randomization (MR) to explore causal relationships between dental traits (periodontitis, dental caries) and lung cancer subtypes, and to assess mediation by pulmonary function. Genetic instruments were derived from the largest available genome-wide association studies, including data from 487,823 dental caries and 506,594 periodontitis cases, as well as lung cancer data from the Transdisciplinary Research of Cancer in Lung consortium. Inverse variance weighting was the main analytical method; lung function mediation was assessed using the delta method. The results showed a significant positive causal effect of dental caries on overall lung cancer and its subtypes. Specifically, a one standard deviation increase in dental caries incidence was associated with a 188.0% higher risk of squamous cell lung carcinoma (OR = 2.880, 95% CI = 1.236–6.713, p = 0.014), partially mediated by declines in forced vital capacity (FVC) and forced expiratory volume in one second (FEV1), accounting for 5.124% and 5.890% of the total effect. No causal effect was found for periodontitis. These findings highlight a causal role of dental caries in lung cancer risk and support integrating dental care and pulmonary function monitoring into cancer prevention strategies.
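Two pieces of arithmetic in this summary are easy to check by hand. The sketch below shows the standard fixed-effect inverse variance weighted (IVW) estimator, which pools per-variant ratio estimates, and the conversion from an odds ratio to a percentage change in odds, which is how OR = 2.880 becomes "188.0% higher". The function names are hypothetical, and the study's actual per-variant statistics are not reproduced here.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate from per-variant summary statistics.

    Each variant gives a ratio estimate by / bx with weight bx^2 / se^2,
    where se is the standard error of the variant-outcome association.
    Returns (pooled estimate, standard error of the pooled estimate).
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    return estimate, math.sqrt(1.0 / total)


def percent_change_in_odds(odds_ratio):
    """Percentage change in odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# OR = 2.880 corresponds to (2.880 - 1) * 100 = 188.0% higher odds.
print(round(percent_change_in_odds(2.880), 1))
```

The pooled IVW estimate is on the log-odds scale for a binary outcome, so the reported odds ratio is its exponential.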

[131] LMM-Det: Make Large Multimodal Models Excel in Object Detection

Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin

Main category: cs.CV

TL;DR: LMM-Det enhances object detection in large multimodal models without specialized modules, improving recall via data distribution adjustment and inference optimization.

Details

Motivation: Address the performance gap in object detection between large multimodal models (LMMs) and specialist detectors.

Method: Proposes LMM-Det, leveraging LMMs for object detection without specialized modules, using data distribution adjustment and inference optimization.

Result: LMM-Det improves recall and demonstrates effective object detection capabilities in LMMs.

Conclusion: LMMs can perform object detection effectively without extra modules, as validated by LMM-Det.

Abstract: Large multimodal models (LMMs) have garnered wide-spread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis when a large multimodal model meets with object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and codes are available at https://github.com/360CVGroup/LMM-Det.

[132] Improving Large Vision-Language Models’ Understanding for Field Data

Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: FieldLVLM improves LVLMs’ understanding of scientific field data using field-aware language generation and data-compressed multimodal tuning, outperforming existing methods.

Details

Motivation: LVLMs excel in general tasks but lack exploration in scientific domains, especially for complex field data.

Method: FieldLVLM combines field-aware language generation (extracting key features) and data-compressed multimodal tuning (reducing input complexity).

Result: FieldLVLM outperforms existing methods on benchmark datasets for scientific field data tasks.

Conclusion: FieldLVLM bridges the gap between LVLMs and scientific research, enabling domain-specific applications.

Abstract: Large Vision-Language Models (LVLMs) have shown impressive capabilities across a range of tasks that integrate visual and textual understanding, such as image captioning and visual question answering. These models are trained on large-scale image and video datasets paired with text, enabling them to bridge visual perception and natural language processing. However, their application to scientific domains, especially in interpreting complex field data commonly used in the natural sciences, remains underexplored. In this work, we introduce FieldLVLM, a novel framework designed to improve large vision-language models’ understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning. The field-aware language generation strategy leverages a special-purpose machine learning pipeline to extract key physical features from field data, such as flow classification, Reynolds number, and vortex patterns. This information is then converted into structured textual descriptions that serve as a dataset. The data-compressed multimodal model tuning focuses on LVLMs with these generated datasets, using a data compression strategy to reduce the complexity of field inputs and retain only the most informative values. This ensures compatibility with the model’s language decoder and guides its learning more effectively. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data. Our findings suggest that this approach opens up new possibilities for applying large vision-language models to scientific research, helping bridge the gap between large models and domain-specific discovery.

[133] A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation

Minje Park, Jeonghwa Lim, Taehyung Yu, Sunghoon Joo

Main category: cs.CV

TL;DR: The paper benchmarks semi-supervised learning for ECG delineation, comparing transformer and convolutional networks, and introduces a standardized evaluation framework.

Details

Motivation: ECG delineation is crucial for diagnosis, but annotated datasets are scarce. Semi-supervised learning can leverage unlabeled data to address this.

Method: Curated multiple datasets, implemented five SemiSeg algorithms on convolutional and transformer architectures, and evaluated in-domain and cross-domain settings. Proposed ECG-specific configurations and augmentations.

Result: Transformers outperformed convolutional networks in semi-supervised ECG delineation.

Conclusion: The benchmark provides a foundation for advancing semi-supervised ECG delineation and encourages further research.

Abstract: Electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is critical for clinical diagnosis. Despite recent advances using deep learning, progress has been limited by the scarcity of publicly available annotated datasets. Semi-supervised learning presents a promising solution by leveraging abundant unlabeled ECG data. In this study, we present the first systematic benchmark for semi-supervised semantic segmentation (SemiSeg) in ECG delineation. We curated and unified multiple public datasets, including previously underused sources, to support robust and diverse evaluation. We adopted five representative SemiSeg algorithms from computer vision, implemented them on two different architectures: the convolutional network and the transformer, and evaluated them in two different settings: in-domain and cross-domain. Additionally, we propose ECG-specific training configurations and augmentation strategies and introduce a standardized evaluation framework. Our results show that the transformer outperforms the convolutional network in semi-supervised ECG delineation. We anticipate that our benchmark will serve as a foundation for advancing semi-supervised ECG delineation methods and will facilitate further research in this domain.

[134] Beyond Low-rankness: Guaranteed Matrix Recovery via Modified Nuclear Norm

Jiangjun Peng, Yisi Luo, Xiangyong Cao, Shuang Xu, Deyu Meng

Main category: cs.CV

TL;DR: The paper introduces a modified nuclear norm (MNN) framework for matrix recovery, capturing both local and global low-rank structures without parameter tuning, with theoretical guarantees and experimental validation.

Details

Motivation: Existing methods struggle to jointly capture local and global low-rank structures in matrix recovery problems like Robust PCA and matrix completion. The MNN framework addresses this gap.

Method: The MNN framework applies suitable transformations to the matrix and performs the nuclear norm on the transformed matrix, enabling joint capture of local and global structures.

Result: The MNN framework provides exact theoretical recovery guarantees for Robust PCA and matrix completion, outperforming existing methods. Experiments validate its effectiveness.

Conclusion: The MNN framework offers a flexible, unified approach for structured low-rank recovery, with proven transformations and strong empirical results.

Abstract: The nuclear norm (NN) has been widely explored in matrix recovery problems, such as Robust PCA and matrix completion (MC), leveraging the inherent global low-rank structure of the data. In this study, we introduce a new modified nuclear norm (MNN) framework, where the MNN family norms are defined by adopting suitable transformations and performing the NN on the transformed matrix. The MNN framework offers two main advantages: (1) it jointly captures both local information and global low-rankness without requiring trade-off parameter tuning; (2) under mild assumptions on the transformation, we provide exact theoretical recovery guarantees for both Robust PCA and MC tasks, an achievement not shared by existing methods that combine local and global information. Thanks to its general and flexible design, MNN can accommodate various proven transformations, enabling a unified and effective approach to structured low-rank recovery. Extensive experiments demonstrate the effectiveness of our method. Code and supplementary material are available at https://github.com/andrew-pengjj/modified_nuclear_norm.
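As described, an MNN is the nuclear norm evaluated on a transformed matrix, that is, the sum of singular values of D(X) for some transformation D. A minimal NumPy sketch follows. The circular first-order row difference used as D is only an assumed example of a transformation that encodes local smoothness; the abstract does not say which transformations the authors use, and the function names are hypothetical.

```python
import numpy as np


def nuclear_norm(X):
    """Sum of singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()


def row_difference(X):
    """Assumed example transformation: circular forward difference over rows."""
    return np.roll(X, -1, axis=0) - X


def modified_nuclear_norm(X, transform=row_difference):
    """Nuclear norm of the transformed matrix, ||transform(X)||_*."""
    return nuclear_norm(transform(X))
```

Because the transformation is applied before the norm, a single term reflects both local structure (through D) and global low-rankness (through the singular values), which is consistent with the abstract's point that no trade-off parameter between two separate regularizers is needed. A constant matrix, for example, has a modified nuclear norm of zero under the difference transformation above.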

[135] GVCCS: A Dataset for Contrail Identification and Tracking on Visible Whole Sky Camera Sequences

Gabriel Jarry, Ramon Dalmau, Philippe Very, Franck Ballerini, Stephania-Denisa Bocu

Main category: cs.CV

TL;DR: The paper introduces GVCCS, a ground-based dataset for contrail tracking, and a deep learning framework to improve contrail analysis and climate impact modeling.

Details

Motivation: Contrails significantly impact aviation's climate effects, but existing models lack accurate data for validation. Observational datasets are limited in tracking and attributing contrails to flights.

Method: The authors present GVCCS, a dataset of contrails recorded with a ground camera, and propose a deep learning framework for panoptic segmentation and temporal tracking.

Result: GVCCS includes 122 video sequences (24,228 frames) with labeled and tracked contrails, and a unified model for contrail analysis.

Conclusion: This work enhances contrail monitoring and model calibration, improving climate impact assessments.

Abstract: Aviation’s climate impact includes not only CO2 emissions but also significant non-CO2 effects, especially from contrails. These ice clouds can alter Earth’s radiative balance, potentially rivaling the warming effect of aviation CO2. Physics-based models provide useful estimates of contrail formation and climate impact, but their accuracy depends heavily on the quality of atmospheric input data and on assumptions used to represent complex processes like ice particle formation and humidity-driven persistence. Observational data from remote sensors, such as satellites and ground cameras, could be used to validate and calibrate these models. However, existing datasets don’t explore all aspects of contrail dynamics and formation: they typically lack temporal tracking and do not attribute contrails to their source flights. To address these limitations, we present the Ground Visible Camera Contrail Sequences (GVCCS), a new open dataset of contrails recorded with a ground-based all-sky camera in the visible range. Each contrail is individually labeled and tracked over time, allowing a detailed analysis of its lifecycle. The dataset contains 122 video sequences (24,228 frames) and includes flight identifiers for contrails that form above the camera. As a reference, we also propose a unified deep learning framework for contrail analysis using a panoptic segmentation model that performs semantic segmentation (contrail pixel identification), instance segmentation (individual contrail separation), and temporal tracking in a single architecture. By providing high-quality, temporally resolved annotations and a benchmark for model evaluation, our work supports improved contrail monitoring and will facilitate better calibration of physical models. This sets the groundwork for more accurate climate impact understanding and assessments.

[136] Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction

Runmin Zhang, Zhu Yu, Si-Yuan Cao, Lingyu Zhu, Guangyi Zhang, Xiaokai Bai, Hui-Liang Shen

Main category: cs.CV

TL;DR: SGCDet is a multi-view indoor 3D object detection framework using adaptive 3D volume construction, improving voxel feature representation and efficiency.

Details

Motivation: Previous methods limit voxel receptive fields to fixed image locations, lacking adaptability and efficiency.

Method: Introduces a geometry and context aware aggregation module and sparse volume construction for adaptive feature refinement.

Result: Achieves state-of-the-art performance on ScanNet, ScanNet200, and ARKitScenes datasets.

Conclusion: SGCDet offers adaptive, efficient 3D detection without requiring ground-truth scene geometry.

Abstract: This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image and dynamically adjust the contributions from different views, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with high occupancy probabilities for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets. The source code is available at https://github.com/RM-Zhang/SGCDet.
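The sparse volume construction idea, refining only voxels that are likely to be occupied, can be illustrated with a short NumPy sketch. It assumes per-voxel occupancy probabilities are already available and keeps a fixed fraction of the highest-scoring voxels; the `keep_ratio` parameter, the function names, and the placeholder `refine_fn` are assumptions for illustration, not SGCDet's actual selection rule or refinement network.

```python
import numpy as np


def select_occupied_voxels(occupancy_prob, keep_ratio=0.25):
    """Boolean mask marking the top keep_ratio fraction of voxels."""
    flat = occupancy_prob.ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    # argpartition places the k largest values in the last k positions
    # without fully sorting the array.
    top = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[top] = True
    return mask.reshape(occupancy_prob.shape)


def refine_selected(features, mask, refine_fn):
    """Apply refine_fn to the features of selected voxels only.

    features: (X, Y, Z, C) voxel features; mask: (X, Y, Z) boolean mask.
    Voxels outside the mask are copied through unchanged.
    """
    out = features.copy()
    out[mask] = refine_fn(features[mask])  # (num_selected, C) in and out
    return out
```

In an indoor scene most of the volume is free space, so restricting the refinement step to the selected voxels is where the computational saving comes from.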

[137] EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, Jiangmiao Pang

Main category: cs.CV

TL;DR: EgoExoBench is introduced as the first benchmark for cross-view reasoning between egocentric and exocentric videos, evaluating MLLMs’ performance on semantic alignment, viewpoint association, and temporal reasoning.

Details

Motivation: To explore and improve multimodal large language models' (MLLMs) ability to transfer and integrate knowledge across first-person and third-person viewpoints, a capability intrinsic to human intelligence.

Method: Built from public datasets, EgoExoBench includes over 7,300 question-answer pairs across 11 sub-tasks, organized into three challenges: semantic alignment, viewpoint association, and temporal reasoning. 13 state-of-the-art MLLMs were evaluated.

Result: MLLMs perform well on single-view tasks but struggle with cross-view reasoning, particularly in aligning semantics, associating viewpoints, and inferring temporal dynamics.

Conclusion: EgoExoBench aims to advance research on embodied agents and intelligent assistants by providing a benchmark for human-like cross-view intelligence.

Abstract: Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.

[138] VB-Mitigator: An Open-source Framework for Evaluating and Advancing Visual Bias Mitigation

Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou

Main category: cs.CV

TL;DR: VB-Mitigator is an open-source framework for standardized development and evaluation of visual bias mitigation techniques in computer vision.

Details

Motivation: Address fragmented implementations and inconsistent evaluation practices in bias mitigation research.

Method: Introduces VB-Mitigator, a unified framework with 12 mitigation methods, 7 datasets, and extensible features.

Result: Provides a standardized platform for fair comparison and reproducibility of bias mitigation techniques.

Conclusion: VB-Mitigator accelerates fairness-aware computer vision research by offering a foundational codebase and best practices.

Abstract: Bias in computer vision models remains a significant challenge, often resulting in unfair, unreliable, and non-generalizable AI systems. Although research into bias mitigation has intensified, progress continues to be hindered by fragmented implementations and inconsistent evaluation practices. Disparate datasets and metrics used across studies complicate reproducibility, making it difficult to fairly assess and compare the effectiveness of various approaches. To overcome these limitations, we introduce the Visual Bias Mitigator (VB-Mitigator), an open-source framework designed to streamline the development, evaluation, and comparative analysis of visual bias mitigation techniques. VB-Mitigator offers a unified research environment encompassing 12 established mitigation methods and 7 diverse benchmark datasets. A key strength of VB-Mitigator is its extensibility, allowing for seamless integration of additional methods, datasets, metrics, and models. VB-Mitigator aims to accelerate research toward fairness-aware computer vision models by serving as a foundational codebase for the research community to develop and assess their approaches. To this end, we also recommend best evaluation practices and provide a comprehensive performance comparison among state-of-the-art methodologies.

[139] Deformable Convolution Module with Globally Learned Relative Offsets for Fundus Vessel Segmentation

Lexuan Zhu, Yuxuan Li, Yuning Ren

Main category: cs.CV

TL;DR: A novel deformable convolutional module uses attention and feedforward networks to learn offsets, enabling global feature deformation and decoupling kernel size from the learning network. Applied to fundus blood vessel segmentation (GDCUnet), it achieves state-of-the-art performance.

Details

Motivation: Existing deformable convolutions struggle with long-distance global features and complex shapes, such as fundus blood vessels with globally self-similar edges.

Method: The proposed module learns sub-pixel displacement fields and warps feature maps across channels, achieving global deformation. It integrates into GDCUnet for fundus vessel segmentation.

Result: GDCUnet outperforms existing methods on public datasets, with ablation studies confirming the module’s effectiveness in learning complex vessel features.

Conclusion: The deformable module enhances model representation and generalization, suggesting broader application in tasks with complex global self-similar features.

Abstract: Deformable convolution can adaptively change the shape of the convolution kernel by learning offsets to deal with complex shape features. We propose a novel plug-and-play deformable convolutional module that uses attention and feedforward networks to learn offsets, so that the deformable patterns can capture long-distance global features. Compared with existing deformable convolutions, the proposed module learns a sub-pixel displacement field and adaptively warps the feature maps across all channels rather than directly deforming the convolution kernel, which is equivalent to a relative deformation of the kernel sampling grids, achieving global feature deformation and decoupling the kernel size from the learning network. Considering that fundus blood vessels have globally self-similar complex edges, we design a deep learning model for fundus blood vessel segmentation, GDCUnet, based on the proposed convolutional module. Empirical evaluations under the same configuration and a unified framework show that GDCUnet achieves state-of-the-art performance on public datasets. Further ablation experiments demonstrate that the proposed deformable convolutional module learns the complex features of fundus blood vessels more effectively, enhancing the model’s representation and generalization capabilities. Because the proposed module has an interface similar to that of conventional convolution, we suggest applying it to more machine vision tasks with complex global self-similar features.
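
The idea of warping a feature map with a sub-pixel displacement field, instead of deforming the kernel, can be illustrated with plain bilinear resampling. This is a minimal sketch under stated assumptions, not the authors’ module: in the paper the offsets are learned by attention and feedforward networks, whereas here the displacement field is fixed by hand, the map has a single channel, and the names `bilinear_sample` and `warp` are hypothetical.

```python
def bilinear_sample(fmap, y, x):
    """Read a 2D feature map at a fractional (y, x) location."""
    h, w = len(fmap), len(fmap[0])
    # Clamp to the border so out-of-range samples reuse edge values.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def warp(fmap, flow):
    """Warp fmap, where flow[i][j] = (dy, dx) is the displacement at (i, j)."""
    return [
        [
            bilinear_sample(fmap, i + flow[i][j][0], j + flow[i][j][1])
            for j in range(len(fmap[0]))
        ]
        for i in range(len(fmap))
    ]


fmap = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]
# Hand-picked field (assumption): shift every sample half a pixel to the right.
flow = [[(0.0, 0.5)] * 3 for _ in range(3)]
warped = warp(fmap, flow)
```

A standard convolution applied to `warped` then sees relocated features, which is why the abstract describes the warp as equivalent to a relative deformation of the kernel sampling grid, independent of kernel size.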

[140] MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image

Xiaotian Chen, DongFu Yin, Fei Richard Yu, Xuanchen Li, Xinhao Zhang

Main category: cs.CV

TL;DR: MVG4D is a novel framework for generating high-fidelity, temporally consistent 4D content from a single image using multi-view synthesis and 4D Gaussian Splatting.

Details

Motivation: Producing high-fidelity and temporally consistent dynamic 4D content remains a challenge despite advances in generative modeling.

Method: MVG4D combines multi-view synthesis with 4D Gaussian Splatting, using an image matrix module for coherent multi-view images and a deformation network for temporal extension.

Result: MVG4D outperforms state-of-the-art baselines in metrics like CLIP-I, PSNR, and FVD, reducing flickering and enhancing visual realism.

Conclusion: MVG4D advances efficient and controllable 4D generation, improving AR/VR experiences.

Abstract: Advances in generative modeling have significantly enhanced digital content creation, extending from 2D images to complex 3D and 4D scenes. Despite substantial progress, producing high-fidelity and temporally consistent dynamic 4D content remains a challenge. In this paper, we propose MVG4D, a novel framework that generates dynamic 4D content from a single still image by combining multi-view synthesis with 4D Gaussian Splatting (4D GS). At its core, MVG4D employs an image matrix module that synthesizes temporally coherent and spatially diverse multi-view images, providing rich supervisory signals for downstream 3D and 4D reconstruction. These multi-view images are used to optimize a 3D Gaussian point cloud, which is further extended into the temporal domain via a lightweight deformation network. Our method effectively enhances temporal consistency, geometric fidelity, and visual realism, addressing key challenges in motion discontinuity and background degradation that affect prior 4D GS-based methods. Extensive experiments on the Objaverse dataset demonstrate that MVG4D outperforms state-of-the-art baselines in CLIP-I, PSNR, FVD, and time efficiency. Notably, it reduces flickering artifacts and sharpens structural details across views and time, enabling more immersive AR/VR experiences. MVG4D sets a new direction for efficient and controllable 4D generation from minimal inputs.

Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim

Main category: cs.CV

TL;DR: SynC is a framework for refining synthetic image-caption datasets in Zero-shot Image Captioning (ZIC) by reassigning captions to semantically aligned images, outperforming existing methods.

Details

Motivation: T2I models generate noisy synthetic image-caption pairs with semantic misalignments, hindering ZIC model training. Existing pruning techniques are unsuitable for synthetic data.

Method: SynC uses a one-to-many mapping strategy and cycle-consistency-inspired alignment scorer to reassign captions to the best-matched images in the synthetic pool.

Result: SynC improves ZIC model performance on benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results.

Conclusion: SynC effectively refines synthetic data for ZIC, offering a robust solution to semantic misalignment issues.

Abstract: Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
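
The reassignment step described above can be sketched as follows. This is a minimal illustration rather than the SynC implementation: it assumes captions and images are already embedded in a shared vector space, the function names `cosine`, `top_k`, and `reassign` and the toy vectors are hypothetical, and the scoring rule (the rank of the original caption when retrieving text from a candidate image) is one plausible reading of “cycle-consistency-inspired”.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, pool, k):
    """Indices of the k pool vectors most similar to the query."""
    order = sorted(range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True)
    return order[:k]


def reassign(caption_vecs, image_vecs, k=2):
    """For each caption, pick the candidate image that best retrieves it back."""
    assignment = []
    for c_idx, c_vec in enumerate(caption_vecs):
        # Text-to-image retrieval: one caption, several candidate images.
        candidates = top_k(c_vec, image_vecs, k)

        def caption_rank(i_idx):
            # Image-to-text retrieval: position of the original caption.
            ranked = top_k(image_vecs[i_idx], caption_vecs, len(caption_vecs))
            return ranked.index(c_idx)

        # Lower rank is better; ties are broken by caption-image similarity.
        best = min(candidates, key=lambda i: (caption_rank(i), -cosine(c_vec, image_vecs[i])))
        assignment.append(best)
    return assignment


# Toy embeddings (assumption): three captions and three synthetic images.
captions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
images = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
assignment = reassign(captions, images, k=2)
```

Because the images are only reassigned, never regenerated or dropped, the size of the synthetic pool stays the same; only the caption-to-image pairing changes.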

[142] Revisiting Physically Realizable Adversarial Object Attack against LiDAR-based Detection: Clarifying Problem Formulation and Experimental Protocols

Luo Cheng, Hanwei Zhang, Lijun Zhang, Holger Hermanns

Main category: cs.CV

TL;DR: A standardized framework for physical adversarial object attacks in LiDAR-based 3D object detection, enabling fair comparison and real-world transferability.

Details

Motivation: Addressing the lack of physical realizability and reproducibility in adversarial attacks on LiDAR systems.

Method: Proposes a device-agnostic framework supporting diverse attack methods, with open-source code and benchmarking protocols.

Result: Validated by successfully transferring simulated attacks to a physical LiDAR system.

Conclusion: The framework accelerates research and improves understanding of adversarial robustness in real-world LiDAR perception.

Abstract: Adversarial robustness in LiDAR-based 3D object detection is a critical research area due to its widespread application in real-world scenarios. While many digital attacks manipulate point clouds or meshes, they often lack physical realizability, limiting their practical impact. Physical adversarial object attacks remain underexplored and suffer from poor reproducibility due to inconsistent setups and hardware differences. To address this, we propose a device-agnostic, standardized framework that abstracts key elements of physical adversarial object attacks, supports diverse methods, and provides open-source code with benchmarking protocols in simulation and real-world settings. Our framework enables fair comparison, accelerates research, and is validated by successfully transferring simulated attacks to a physical LiDAR system. Beyond the framework, we offer insights into factors influencing attack success and advance understanding of adversarial robustness in real-world LiDAR perception.

[143] Towards Effective Human-in-the-Loop Assistive AI Agents

Filippos Bellos, Yayuan Li, Cary Shu, Ruey Day, Jeffrey M. Siskind, Jason J. Corso

Main category: cs.CV

TL;DR: The paper introduces an evaluation framework and dataset for human-AI collaboration in physical tasks, along with an AR-equipped AI agent, showing improved task performance.

Details

Motivation: To enhance human performance in physical tasks through AI guidance and address the challenge of evaluating such collaboration.

Method: Developed an evaluation framework, a multimodal dataset, and an AR-equipped AI agent for real-world tasks. Conducted human studies.

Result: AI-assisted collaboration improves task completion, reduces errors, and enhances learning outcomes.

Conclusion: The framework and AI agent effectively support human-AI collaboration, demonstrating practical benefits in task performance.

Abstract: Effective human-AI collaboration for physical task completion has significant potential in both everyday activities and professional domains. AI agents equipped with informative guidance can enhance human performance, but evaluating such collaboration remains challenging due to the complexity of human-in-the-loop interactions. In this work, we introduce an evaluation framework and a multimodal dataset of human-AI interactions designed to assess how AI guidance affects procedural task performance, error reduction, and learning outcomes. In addition, we develop an augmented reality (AR)-equipped AI agent that provides interactive guidance in real-world tasks, from cooking to battlefield medicine. Through human studies, we share empirical insights into AI-assisted human performance and demonstrate that AI-assisted collaboration improves task completion.

[144] Towards Consistent Long-Term Pose Generation

Yayuan Li, Filippos Bellos, Jason Corso

Main category: cs.CV

TL;DR: A one-stage architecture for direct pose generation from minimal context (RGB image and text) outperforms existing methods by eliminating intermediate representations and ensuring consistent training-inference behavior.

Details

Motivation: Current methods rely on intermediate representations or autoregressive models, leading to degraded performance in long-term pose generation due to error accumulation and lack of temporal coherence.

Method: Proposes a one-stage architecture that directly generates poses in continuous coordinate space using a relative movement prediction mechanism and unified placeholder tokens for single-forward generation.

Result: Outperforms quantization-based and autoregressive methods on Penn Action and F-PHAB datasets, especially in long-term generation.

Conclusion: The proposed approach eliminates intermediate steps, improves performance, and maintains consistency between training and inference.

Abstract: Current approaches to pose generation rely heavily on intermediate representations, either through two-stage pipelines with quantization or autoregressive models that accumulate errors during inference. This fundamental limitation leads to degraded performance, particularly in long-term pose generation where maintaining temporal coherence is crucial. We propose a novel one-stage architecture that directly generates poses in continuous coordinate space from minimal context - a single RGB image and text description - while maintaining consistent distributions between training and inference. Our key innovation is eliminating the need for intermediate representations or token-based generation by operating directly on pose coordinates through a relative movement prediction mechanism that preserves spatial relationships, and a unified placeholder token approach that enables single-forward generation with identical behavior during training and inference. Through extensive experiments on Penn Action and First-Person Hand Action Benchmark (F-PHAB) datasets, we demonstrate that our approach significantly outperforms existing quantization-based and autoregressive methods, especially in long-term generation scenarios.

[145] Reinforced Embodied Active Defense: Exploiting Adaptive Interaction for Robust Visual Perception in Adversarial 3D Environments

Xiao Yang, Lingxuan Wu, Lizhong Wang, Chengyang Ying, Hang Su, Jun Zhu

Main category: cs.CV

TL;DR: Rein-EAD is a proactive defense framework for 3D adversarial attacks, using adaptive exploration and interaction to enhance robustness in dynamic environments.

Details

Motivation: Current defenses are passive and rely on pre-defined assumptions, limiting adaptability in dynamic 3D adversarial settings.

Method: Rein-EAD employs a multi-step objective balancing prediction accuracy and entropy minimization, with an uncertainty-oriented reward mechanism for efficient policy updates.

Result: Rein-EAD reduces attack success rates significantly while maintaining standard accuracy, showing robust generalization to unseen attacks.

Conclusion: Rein-EAD is effective for real-world complex tasks like 3D object classification and autonomous driving, offering adaptability and robustness.

Abstract: Adversarial attacks in 3D environments have emerged as a critical threat to the reliability of visual perception systems, particularly in safety-sensitive applications such as identity verification and autonomous driving. These attacks employ adversarial patches and 3D objects to manipulate deep neural network (DNN) predictions by exploiting vulnerabilities within complex scenes. Existing defense mechanisms, such as adversarial training and purification, primarily employ passive strategies to enhance robustness. However, these approaches often rely on pre-defined assumptions about adversarial tactics, limiting their adaptability in dynamic 3D settings. To address these challenges, we introduce Reinforced Embodied Active Defense (Rein-EAD), a proactive defense framework that leverages adaptive exploration and interaction with the environment to improve perception robustness in 3D adversarial contexts. By implementing a multi-step objective that balances immediate prediction accuracy with predictive entropy minimization, Rein-EAD optimizes defense strategies over a multi-step horizon. Additionally, Rein-EAD involves an uncertainty-oriented reward-shaping mechanism that facilitates efficient policy updates, thereby reducing computational overhead and supporting real-world applicability without the need for differentiable environments. Comprehensive experiments validate the effectiveness of Rein-EAD, demonstrating a substantial reduction in attack success rates while preserving standard accuracy across diverse tasks. Notably, Rein-EAD exhibits robust generalization to unseen and adaptive attacks, making it suitable for real-world complex tasks, including 3D object classification, face recognition and autonomous driving.
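
To make the entropy term concrete, the sketch below computes the predictive entropy of a class distribution and a toy reward equal to the entropy reduction between two consecutive observations. This is an assumption for illustration only: the abstract does not give the exact form of the Rein-EAD objective or reward, and the names `predictive_entropy` and `uncertainty_reward` are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_reward(probs_before, probs_after):
    """Toy reward (assumption): positive when an action makes the prediction more certain."""
    return predictive_entropy(probs_before) - predictive_entropy(probs_after)


# Class probabilities before and after the agent moves to a new viewpoint.
before = [0.4, 0.3, 0.3]
after = [0.9, 0.05, 0.05]
reward = uncertainty_reward(before, after)
```

A reward of this kind depends only on the model’s output distribution, so it can be computed without differentiating through the environment, which is consistent with the abstract’s claim that no differentiable environment is needed.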

[146] HumanMaterial: Human Material Estimation from a Single Image via Progressive Training

Yu Jiang, Jiahao Xia, Jiongming Qin, Yusen Wang, Tuo Cao, Chunxia Xiao

Main category: cs.CV

TL;DR: The paper introduces OpenHumanBRDF, a high-quality dataset for full-body human inverse rendering, and proposes HumanMaterial, a model with progressive training to improve material estimation.

Details

Motivation: The task of inverse rendering is ill-posed due to lack of constraints on material maps, and existing methods produce limited realism, especially for skin.

Method: Constructed OpenHumanBRDF dataset with detailed materials (e.g., displacement, subsurface scattering) and designed HumanMaterial with progressive training and Controlled PBR Rendering (CPR) loss.

Result: Achieves state-of-the-art performance on OpenHumanBRDF and real data.

Conclusion: The proposed method enhances realism in rendering, particularly for skin, by leveraging high-quality data and a refined training strategy.

Abstract: Full-body human inverse rendering based on physically-based rendering aims to acquire high-quality materials, which helps achieve photo-realistic rendering under arbitrary illumination. This task requires estimating multiple material maps and usually relies on the constraint of the rendering result. The absence of constraints on the material maps makes inverse rendering an ill-posed task. Previous works alleviated this problem by building material datasets for training, but their simplified material data and rendering equation lead to rendering results with limited realism, especially for skin. To further alleviate this problem, we construct a higher-quality dataset (OpenHumanBRDF) based on scanned real data and statistical material data. In addition to the normal, diffuse albedo, roughness, and specular albedo, we produce displacement and subsurface scattering to enhance the realism of rendering results, especially for the skin. As the number of material prediction tasks increases, an end-to-end model as used in previous work struggles to balance the importance of the various material maps, which leads to underfitting. Therefore, we design a model (HumanMaterial) with a progressive training strategy to make full use of the supervision information of the material maps and improve the performance of material estimation. HumanMaterial first obtains the initial material results via three prior models, and then refines the results with a finetuning model. The prior models estimate different material maps, and each map has different significance for rendering results. Thus, we design a Controlled PBR Rendering (CPR) loss, which enhances the importance of the materials to be optimized during the training of prior models. Extensive experiments on the OpenHumanBRDF dataset and real data demonstrate that our method achieves state-of-the-art performance.

Clément Cornet, Romaric Besançon, Hervé Le Borgne

Main category: cs.CV

TL;DR: The paper introduces tools for comparing SAE-derived features across different model modalities and studies 21 encoders, revealing shared representations and the impact of text pretraining.

Details

Motivation: To enable quantitative comparison of SAE-derived features across visual, textual, and multimodal encoders, addressing previous limitations of same-modality comparisons.

Method: Proposes a novel indicator for cross-model SAE feature comparison and a metric for feature sharedness. Studies 21 encoders of varying sizes and datasets.

Result: Finds shared representations across modalities, with visual features in VLMs shared with text encoders, influenced by text pretraining.

Conclusion: Provides insights into shared features across encoders and highlights the role of text pretraining in shaping representations.

Abstract: Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural network activations. Previous works compared different models based on SAE-derived features, but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual, and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain-specific datasets. The results allow us to revisit previous studies in light of encoders trained in a multimodal context and to quantify to what extent all these models share some representations or features. They also suggest that visual features that are specific to VLMs among vision encoders are shared with text encoders, highlighting the impact of text pretraining. The code is available at https://github.com/CEA-LIST/SAEshareConcepts

[148] Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows

Simin Huo, Ning Li

Main category: cs.CV

TL;DR: Iwin Transformer is a hierarchical vision transformer that eliminates position embeddings and combines interleaved window attention with depthwise separable convolution for efficient global information exchange. It outperforms Swin Transformer and achieves strong results in image classification, segmentation, and video recognition.

Details

Motivation: To address Swin Transformer's limitation of requiring two blocks for global attention and to enable direct fine-tuning across resolutions.

Method: Uses interleaved window attention and depthwise separable convolution to connect distant and neighboring tokens, respectively, within a single module.

Result: Achieves 87.4 top-1 accuracy on ImageNet-1K, excels in semantic segmentation and video action recognition, and works as a standalone module for image generation.

Conclusion: Iwin Transformer is a competitive and versatile model with potential to inspire future research in vision tasks and beyond.

Abstract: We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer’s limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.

Baoyao Yang, Wanyun Li, Dixin Chen, Junxiang Chen, Wenbin Yao, Haifeng Lin

Main category: cs.CV

TL;DR: VideoMind is a video-centric omni-modal dataset with 103K samples, featuring hierarchical textual descriptions (factual, abstract, intent) and audio. It uses Chain-of-Thought (COT) for intent expressions and supports deep-cognitive video understanding tasks.

Details

Motivation: To address the lack of datasets providing intent expressions and deep-cognitive video understanding, VideoMind aims to enhance multi-modal feature representation and fine-grained cross-modal alignment.

Method: The dataset includes 103K video samples with audio and hierarchical textual descriptions (factual, abstract, intent) generated via COT. It features annotations for subject, place, time, event, action, and intent. A gold-standard benchmark of 3K manually validated samples is established for evaluation.

Result: Evaluation results for models like InternVideo, VAST, and UMT-L are released, demonstrating VideoMind’s utility for deep video comprehension and cross-modal alignment tasks.

Conclusion: VideoMind serves as a powerful benchmark for in-depth video understanding, advancing fields like emotion and intent recognition. The dataset is publicly available.

Abstract: This paper introduces VideoMind, a video-centric omni-modal dataset designed for deep video content cognition and enhanced multi-modal feature representation. The dataset comprises 103K video samples (3K reserved for testing), each paired with audio and systematically detailed textual descriptions. Specifically, every video and its audio is described across three hierarchical layers (factual, abstract, and intent), progressing from surface to depth. It contains over 22 million words, averaging ~225 words per sample. VideoMind’s key distinction from existing datasets is its provision of intent expressions, which require contextual integration across the entire video and are not directly observable. These deep-cognitive expressions are generated using a Chain-of-Thought (COT) approach, prompting the mLLM through step-by-step reasoning. Each description includes annotations for subject, place, time, event, action, and intent, supporting downstream recognition tasks. Crucially, we establish a gold-standard benchmark with 3,000 manually validated samples for evaluating deep-cognitive video understanding. We design hybrid-cognitive retrieval experiments, scored by multi-level retrieval metrics, to appropriately assess deep video comprehension. Evaluation results for models (e.g., InternVideo, VAST, UMT-L) are released. VideoMind serves as a powerful benchmark for fine-grained cross-modal alignment and advances fields requiring in-depth video understanding, such as emotion and intent recognition. The data is publicly available on GitHub, HuggingFace, and OpenDataLab, https://github.com/cdx-cindy/VideoMind.

[150] DCFFSNet: Deep Connectivity Feature Fusion Separation Network for Medical Image Segmentation

Xun Ye, Ruixiang Tang, Mingda Zhang, Jianglong Qin

Main category: cs.CV

TL;DR: DCFFSNet introduces a feature space decoupling strategy to balance connectivity and other features, improving medical image segmentation performance.

Details

Motivation: Existing methods forcibly integrate connectivity features without quantifying their strength, leading to coupled feature spaces and suboptimal performance.

Method: DCFFSNet uses a dual-connectivity feature fusion-separation architecture to dynamically balance multi-scale feature expression and quantify feature strengths.

Result: DCFFSNet outperforms other models on ISIC2018, DSB2018, and MoNuSeg datasets, improving Dice and IoU scores by 0.7-1.3%.

Conclusion: DCFFSNet resolves segmentation fragmentation and enhances edge precision, improving clinical usability.

Abstract: Medical image segmentation leverages topological connectivity theory to enhance edge precision and regional consistency. However, existing deep networks integrating connectivity often forcibly inject it as an additional feature module, resulting in coupled feature spaces with no standardized mechanism to quantify different feature strengths. To address these issues, we propose DCFFSNet (Dual-Connectivity Feature Fusion-Separation Network). It introduces an innovative feature space decoupling strategy. This strategy quantifies the relative strength between connectivity features and other features. It then builds a deep connectivity feature fusion-separation architecture. This architecture dynamically balances multi-scale feature expression. Experiments were conducted on the ISIC2018, DSB2018, and MoNuSeg datasets. On ISIC2018, DCFFSNet outperformed the next best model (CMUNet) by 1.3% (Dice) and 1.2% (IoU). On DSB2018, it surpassed TransUNet by 0.7% (Dice) and 0.9% (IoU). On MoNuSeg, it exceeded CSCAUNet by 0.8% (Dice) and 0.9% (IoU). The results demonstrate that DCFFSNet exceeds existing mainstream methods across all metrics. It effectively resolves segmentation fragmentation and achieves smooth edge transitions. This significantly enhances clinical usability.
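
For readers comparing the reported gains, the two overlap metrics can be computed as below. This sketch uses the standard definitions (Dice = 2|A∩B| / (|A| + |B|), IoU = |A∩B| / |A∪B|) on flattened binary masks; the function name `dice_and_iou` and the toy masks are hypothetical, and the paper’s exact evaluation protocol (averaging, thresholds) is not stated in the abstract.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flattened binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat as a perfect match (convention, an assumption).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy masks: the prediction overlaps the ground truth on two of three pixels.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

Dice is never lower than IoU for the same pair of masks, so a given change in overlap usually shows up as different point gains on the two metrics.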

[151] Self-Supervised Ultrasound-Video Segmentation with Feature Prediction and 3D Localised Loss

Edward Ellis, Robert Mendel, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali

Main category: cs.CV

TL;DR: The paper proposes using V-JEPA, a self-supervised learning framework, for ultrasound video segmentation, enhancing it with a 3D localization task to improve ViT performance.

Details

Motivation: Challenges in acquiring and annotating large ultrasound datasets due to noise and artifacts motivate the use of SSL to leverage unlabelled data.

Method: Adopts V-JEPA for ultrasound video, introduces a 3D localization auxiliary task to enhance ViT locality, and evaluates segmentation performance.

Result: V-JEPA with the auxiliary task improves segmentation across frozen encoder configurations, with gains of up to 3.4% using 100% of the training data and up to 8.35% using only 10%.

Conclusion: V-JEPA with a 3D localization task is effective for ultrasound video segmentation, especially with limited annotated data.

Abstract: Acquiring and annotating large datasets in ultrasound imaging is challenging due to low contrast, high noise, and susceptibility to artefacts. This process requires significant time and clinical expertise. Self-supervised learning (SSL) offers a promising solution by leveraging unlabelled data to learn useful representations, enabling improved segmentation performance when annotated data is limited. Recent state-of-the-art developments in SSL for video data include V-JEPA, a framework solely based on feature prediction, avoiding pixel level reconstruction or negative samples. We hypothesise that V-JEPA is well-suited to ultrasound imaging, as it is less sensitive to noisy pixel-level detail while effectively leveraging temporal information. To the best of our knowledge, this is the first study to adopt V-JEPA for ultrasound video data. Similar to other patch-based masking SSL techniques such as VideoMAE, V-JEPA is well-suited to ViT-based models. However, ViTs can underperform on small medical datasets due to lack of inductive biases, limited spatial locality and absence of hierarchical feature learning. To improve locality understanding, we propose a novel 3D localisation auxiliary task to improve locality in ViT representations during V-JEPA pre-training. Our results show V-JEPA with our auxiliary task improves segmentation performance significantly across various frozen encoder configurations, with gains up to 3.4% using 100% and up to 8.35% using only 10% of the training data.

[152] NLML-HPE: Head Pose Estimation with Limited Data via Manifold Learning

Mahdi Ghafourian, Federico M. Sukno

Main category: cs.CV

TL;DR: A novel deep learning method, NLML-HPE, uses tensor decomposition and neural networks for head pose estimation with limited data, addressing annotation inaccuracies and achieving real-time performance.

Details

Motivation: Head pose estimation is crucial for applications like human-computer interaction and facial recognition, but existing datasets often have inaccurate annotations, and traditional methods lack efficiency with limited data.

Method: Combines tensor decomposition (Tucker) and feed-forward neural networks to model head pose as a regression problem, representing pose angles via cosine curves in separate subspaces. A precise 2D dataset was created by rotating 3D models.

Result: Achieves real-time performance with limited training data by accurately capturing rotation from facial landmarks. The model is fast in predicting unseen data once the manifold is learned.

Conclusion: NLML-HPE offers an efficient and accurate solution for head pose estimation, especially with limited data, and provides publicly available code and models for further use.

Abstract: Head pose estimation (HPE) plays a critical role in various computer vision applications such as human-computer interaction and facial recognition. In this paper, we propose a novel deep learning approach for head pose estimation with limited training data via non-linear manifold learning called NLML-HPE. This method is based on the combination of tensor decomposition (i.e., Tucker decomposition) and feed-forward neural networks. Unlike traditional classification-based approaches, our method formulates head pose estimation as a regression problem, mapping input landmarks into a continuous representation of pose angles. To this end, our method uses tensor decomposition to split each Euler angle (yaw, pitch, roll) into separate subspaces and models each dimension of the underlying manifold as a cosine curve. We address two key challenges: 1. Almost all HPE datasets suffer from incorrect and inaccurate pose annotations. Hence, we generated a precise and consistent 2D head pose dataset for our training set by rotating 3D head models for a fixed set of poses and rendering the corresponding 2D images. 2. We achieved real-time performance with limited training data as our method accurately captures the nature of rotation of an object from facial landmarks. Once the underlying manifold for rotation around each axis is learned, the model is very fast in predicting unseen data. Our training and testing code is available online along with our trained models: https://github.com/MahdiGhafoorian/NLML_HPE.
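
To make the dataset-generation step of [152] concrete, here is a minimal sketch of rotating 3D landmarks by known Euler angles and projecting them to 2D. It is not the authors' pipeline: the rotation order, the orthographic projection, and the names `euler_to_matrix` and `project_landmarks` are assumptions made for illustration.

```python
import numpy as np

def euler_to_matrix(yaw, pitch, roll):
    """Rotation matrix from Euler angles in radians.

    Assumed convention (not taken from the paper): R = Rz(roll) @ Rx(pitch) @ Ry(yaw).
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    Rz = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Rx @ Ry

def project_landmarks(landmarks_3d, yaw, pitch, roll):
    """Rotate (N, 3) landmarks and drop depth (orthographic projection, an assumption)."""
    R = euler_to_matrix(yaw, pitch, roll)
    rotated = landmarks_3d @ R.T  # apply R to every row vector
    return rotated[:, :2]

# Hypothetical landmarks; a real setup would take them from a 3D head model.
landmarks = np.array([[0.0, 0.0, 1.0], [0.5, 0.2, 0.8], [-0.5, 0.2, 0.8]])
frontal = project_landmarks(landmarks, 0.0, 0.0, 0.0)
turned = project_landmarks(landmarks, np.deg2rad(30), 0.0, 0.0)
```

Because the pose is chosen before rendering, each 2D sample carries an exact label, which is the property the abstract relies on to avoid noisy annotations.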

[153] DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition

Haiyang Jiang, Songhao Piao, Chao Gao, Lei Yu, Liguo Chen

Main category: cs.CV

TL;DR: A novel framework combining Dual-Scale-Former (DSFormer) and block clustering improves Visual Place Recognition (VPR) by enhancing feature representation and optimizing data organization, achieving state-of-the-art performance with reduced training data.

Details

Motivation: VPR struggles with reliability under varying conditions and viewpoints, necessitating improved feature representation and data efficiency.

Method: Integrates DSFormer for bidirectional feature transfer and block clustering to repartition the SF-XL dataset, optimizing robustness and reducing training data needs.

Result: Achieves state-of-the-art performance on most benchmarks, outperforms reranking methods such as DELG and Patch-NetVLAD as a global retrieval solution, and reduces training data by ~30%.

Conclusion: The framework enhances VPR robustness and efficiency, making it adaptable to environmental changes and computationally effective.

Abstract: Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing data organization to further bolster robustness against viewpoint variations. Together, these innovations not only yield a robust global embedding adaptable to environmental changes but also reduce the required training data volume by approximately 30% compared to previous partitioning methods. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance across most benchmark datasets, surpassing advanced reranking methods like DELG, Patch-NetVLAD, TransVPR, and R2Former as a global retrieval solution using 512-dim global descriptors, while significantly improving computational efficiency.
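
The cross-scale exchange described in [153] can be illustrated with plain scaled dot-product cross-attention applied in both directions. This is only a sketch under simplifying assumptions: a single head, no learned projections, and random arrays standing in for the two CNN feature maps. It is not DSFormer itself.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Each query token attends over all context tokens (single head, no learned weights)."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (Nq, Nc)
    return weights @ context, weights

rng = np.random.default_rng(0)
coarse = rng.normal(size=(16, 32))  # stand-in for tokens from the last CNN layer
fine = rng.normal(size=(64, 32))    # stand-in for tokens from the second-to-last layer

# Bidirectional transfer: each scale queries the other.
coarse_updated, w_coarse_to_fine = cross_attention(coarse, fine)
fine_updated, w_fine_to_coarse = cross_attention(fine, coarse)
```

Each scale keeps its own token count while mixing in information from the other, which is the sense in which the transfer is bidirectional.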

[154] PDB-Eval: An Evaluation of Large Multimodal Models for Description and Explanation of Personalized Driving Behavior

Junda Wu, Jessica Echterhoff, Kyungtae Han, Amr Abdelraouf, Rohit Gupta, Julian McAuley

Main category: cs.CV

TL;DR: The paper introduces PDB-Eval, a benchmark for understanding personalized driver behavior using multimodal models (MLLMs), improving driving comprehension and reasoning.

Details

Motivation: Existing datasets lack detailed explanations of driver behavior based on external visual evidence, limiting the effectiveness of safety systems.

Method: PDB-Eval includes PDB-X for evaluating MLLMs’ understanding of driving scenes and PDB-QA for fine-tuning MLLMs with visual explanations.

Result: Fine-tuning MLLMs on PDB-Eval improves zero-shot QA performance by up to 73.2%, turn intention prediction on Brain4Cars by up to 12.5%, and AIDE recognition tasks by up to 11.0%.

Conclusion: PDB-Eval effectively bridges the gap between MLLMs and driving tasks, enhancing performance in driver behavior understanding.

Abstract: Understanding a driver’s behavior and intentions is important for potential risk assessment and early accident prevention. Safety and driver assistance systems can be tailored to individual drivers’ behavior, significantly enhancing their effectiveness. However, existing datasets are limited in describing and explaining general vehicle movements based on external visual evidence. This paper introduces a benchmark, PDB-Eval, for a detailed understanding of Personalized Driver Behavior, and aligning Large Multimodal Models (MLLMs) with driving comprehension and reasoning. Our benchmark consists of two main components, PDB-X and PDB-QA. PDB-X can evaluate MLLMs’ understanding of temporal driving scenes. Our dataset is designed to find valid visual evidence from the external view to explain the driver’s behavior from the internal view. To align MLLMs’ reasoning abilities with driving tasks, we propose PDB-QA as a visual explanation question-answering task for MLLM instruction fine-tuning. As a generic learning task for generative models like MLLMs, PDB-QA can bridge the domain gap without harming MLLMs’ generalizability. Our evaluation indicates that fine-tuning MLLMs on fine-grained descriptions and explanations can effectively bridge the gap between MLLMs and the driving domain, which improves zero-shot performance on question-answering tasks by up to 73.2%. We further evaluate the MLLMs fine-tuned on PDB-X in Brain4Cars’ intention prediction and AIDE’s recognition tasks. We observe up to 12.5% performance improvements on the turn intention prediction task in Brain4Cars, and consistent performance improvements up to 11.0% on all tasks in AIDE.

[155] CRUISE: Cooperative Reconstruction and Editing in V2X Scenarios using Gaussian Splatting

Haoran Xu, Saining Zhang, Peishuo Li, Baijun Ye, Xiaoxue Chen, Huan-ang Gao, Jv Zheng, Xiaowei Song, Ziqiao Peng, Run Miao, Jinrang Jia, Yifeng Shi, Guangqi Yi, Hang Zhao, Hao Tang, Hongyang Li, Kaicheng Yu, Hao Zhao

Main category: cs.CV

TL;DR: CRUISE is a framework for reconstructing and augmenting V2X driving scenes using decomposed Gaussian Splatting, improving 3D detection and tracking.

Details

Motivation: The potential of simulation for data generation in V2X scenarios is underexplored, despite its importance for autonomous driving.

Method: CRUISE uses decomposed Gaussian Splatting to reconstruct real-world scenes and allows flexible editing of dynamic traffic participants. It renders images from multiple views for dataset augmentation.

Result: CRUISE achieves high-fidelity scene reconstruction, improves 3D detection and tracking, and generates challenging corner cases.

Conclusion: CRUISE effectively addresses the gap in V2X data generation and augmentation, enhancing autonomous driving tasks.

Abstract: Vehicle-to-everything (V2X) communication plays a crucial role in autonomous driving, enabling cooperation between vehicles and infrastructure. While simulation has significantly contributed to various autonomous driving tasks, its potential for data generation and augmentation in V2X scenarios remains underexplored. In this paper, we introduce CRUISE, a comprehensive reconstruction-and-synthesis framework designed for V2X driving environments. CRUISE employs decomposed Gaussian Splatting to accurately reconstruct real-world scenes while supporting flexible editing. By decomposing dynamic traffic participants into editable Gaussian representations, CRUISE allows for seamless modification and augmentation of driving scenes. Furthermore, the framework renders images from both ego-vehicle and infrastructure views, enabling large-scale V2X dataset augmentation for training and evaluation. Our experimental results demonstrate that: 1) CRUISE reconstructs real-world V2X driving scenes with high fidelity; 2) using CRUISE improves 3D detection across ego-vehicle, infrastructure, and cooperative views, as well as cooperative 3D tracking on the V2X-Seq benchmark; and 3) CRUISE effectively generates challenging corner cases.
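
Editing a decomposed traffic participant, as described in [155], amounts to moving the Gaussians that belong to it. The sketch below applies a rigid transform to Gaussian means and covariances using the standard identities mu' = R mu + t and Sigma' = R Sigma R^T. The function name and the explicit covariance storage are assumptions: Gaussian Splatting implementations typically store a scale vector and a rotation quaternion instead, and CRUISE's actual editing code is not shown in the abstract.

```python
import numpy as np

def transform_gaussians(means, covariances, rotation, translation):
    """Rigidly move a set of 3D Gaussians.

    means: (N, 3), covariances: (N, 3, 3), rotation: (3, 3), translation: (3,)
    """
    new_means = means @ rotation.T + translation
    new_covs = rotation @ covariances @ rotation.T  # broadcasts over the N Gaussians
    return new_means, new_covs

# Hypothetical edit: rotate a vehicle's Gaussians 90 degrees about z and shift it 2 m in x.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
means = np.array([[1.0, 0.0, 0.0]])
covs = np.array([np.diag([4.0, 1.0, 1.0])])
moved_means, moved_covs = transform_gaussians(means, covs, Rz, np.array([2.0, 0.0, 0.0]))
```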

[156] DRWKV: Focusing on Object Edges for Low-Light Image Enhancement

Xuecheng Bai, Yuxiang Wang, Boyu Hu, Qinyuan Jie, Chuanzhi Xu, Hongru Xiao, Kechen Li, Vera Chung

Main category: cs.CV

TL;DR: DRWKV is a novel model for low-light image enhancement, integrating GER theory, Evolving WKV Attention, and Bi-SAB with MS2-Loss, achieving top performance on benchmarks and improving downstream tasks.

Details

Motivation: Addressing the challenge of preserving edge continuity and fine details in low-light images under extreme illumination degradation.

Method: Combines Global Edge Retinex (GER) theory, Evolving WKV Attention for spatial edge continuity, and Bilateral Spectrum Aligner (Bi-SAB) with MS2-Loss for luminance/chrominance alignment.

Result: DRWKV leads in PSNR, SSIM, and NIQE metrics on five benchmarks and enhances low-light multi-object tracking performance.

Conclusion: DRWKV effectively improves low-light image quality with high fidelity and computational efficiency, demonstrating strong generalization.

Abstract: Low-light image enhancement remains a challenging task, particularly in preserving object edge continuity and fine structural details under extreme illumination degradation. In this paper, we propose a novel model, DRWKV (Detailed Receptance Weighted Key Value), which integrates our proposed Global Edge Retinex (GER) theory, enabling effective decoupling of illumination and edge structures for enhanced edge fidelity. Secondly, we introduce Evolving WKV Attention, a spiral-scanning mechanism that captures spatial edge continuity and models irregular structures more effectively. Thirdly, we design the Bilateral Spectrum Aligner (Bi-SAB) and a tailored MS2-Loss to jointly align luminance and chrominance features, improving visual naturalness and mitigating artifacts. Extensive experiments on five LLIE benchmarks demonstrate that DRWKV achieves leading performance in PSNR, SSIM, and NIQE while maintaining low computational complexity. Furthermore, DRWKV enhances downstream performance in low-light multi-object tracking tasks, validating its generalization capabilities.
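
The abstract of [156] mentions a spiral-scanning mechanism but does not specify it. As a rough illustration of what a spiral token order over an image grid can look like, the sketch below enumerates grid cells clockwise from the outer border inwards. The direction, the starting corner, and the function name `spiral_order` are all assumptions and may differ from the DRWKV design.

```python
def spiral_order(height, width):
    """Return (row, col) pairs of a height x width grid in clockwise, outside-in spiral order."""
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]            # top edge, left to right
        order += [(r, right) for r in range(top + 1, bottom + 1)]      # right edge, downwards
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]  # bottom edge, right to left
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]   # left edge, upwards
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

order_3x3 = spiral_order(3, 3)
```

Unlike a raster scan, consecutive tokens in this order are always spatial neighbours, which is one plausible reason a spiral order helps sequence models follow continuous edges.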

[157] Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection

Francesco Dalmonte, Emirhan Bayar, Emre Akbas, Mariana-Iuliana Georgescu

Main category: cs.CV

TL;DR: The paper introduces the Q-Former Autoencoder, an unsupervised framework for medical anomaly detection, leveraging pretrained vision models like DINO and Masked Autoencoder. It achieves state-of-the-art results on multiple benchmarks without domain-specific fine-tuning.

Details

Motivation: Anomaly detection in medical images is challenging due to diverse anomalies and lack of annotated data. The work aims to address this by utilizing pretrained vision models for rich feature extraction.

Method: The Q-Former Autoencoder uses frozen pretrained vision models as feature extractors, incorporates a Q-Former bottleneck for multiscale feature aggregation, and employs perceptual loss from a Masked Autoencoder for semantic reconstruction.

Result: The framework achieves state-of-the-art performance on BraTS2021, RESC, and RSNA benchmarks, demonstrating the generalization of pretrained models to medical tasks.

Conclusion: Pretrained vision models can effectively generalize to medical anomaly detection without fine-tuning, as shown by the Q-Former Autoencoder’s success. Code and models are publicly released.

Abstract: Anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated data sets. In this work, we tackle unsupervised medical anomaly detection proposing a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models, such as DINO, DINOv2 and Masked Autoencoder. Instead of training encoders from scratch, we directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We propose the usage of the Q-Former architecture as the bottleneck, which enables the control of the length of the reconstruction sequence, while efficiently aggregating multiscale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks, achieving state-of-the-art results on BraTS2021, RESC, and RSNA. Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. We release the code and models at https://github.com/emirhanbayar/QFAE.

[158] A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears

Frauke Wilm, Luis Carlos Rivera Monroy, Mathias Öttl, Lukas Mürdter, Leonid Mill, Andreas Maier

Main category: cs.CV

TL;DR: The paper presents an enhanced NIH malaria dataset with detailed COCO-format annotations to improve deep learning-based malaria detection, achieving high F1 scores with Faster R-CNN.

Details

Motivation: Reliable malaria diagnosis in developing countries requires accurate detection of Plasmodium falciparum in blood smears, but limited annotated datasets hinder deep learning adoption.

Method: The authors refined the NIH malaria dataset with detailed bounding box annotations, trained a Faster R-CNN model, and validated it via cross-validation.

Result: The model achieved F1 scores up to 0.88 for infected cell detection, highlighting the importance of annotation volume and consistency.

Conclusion: Automated annotation refinement and manual correction can produce high-quality training data for robust malaria detection, with the updated dataset publicly available.

Abstract: Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotation set is publicly available via GitHub: https://github.com/MIRA-Vision-Microscopy/malaria-thin-smear-coco.
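
For readers unfamiliar with the COCO layout mentioned in [158], the following sketch shows its three main sections and the `[x, y, width, height]` box convention, plus the F1 computation used to report detection quality. The file name, category names, box values, and detection counts are invented for illustration and are not taken from the released dataset.

```python
# Minimal COCO-style detection annotation (all values hypothetical).
coco = {
    "images": [{"id": 1, "file_name": "smear_0001.png", "width": 1600, "height": 1200}],
    "categories": [
        {"id": 1, "name": "infected_rbc"},
        {"id": 2, "name": "uninfected_rbc"},
        {"id": 3, "name": "wbc"},
    ],
    "annotations": [
        # COCO boxes are [x_min, y_min, width, height] in pixels.
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 150.0, 40.0, 42.0], "area": 1680.0, "iscrowd": 0},
    ],
}

def bbox_xywh_to_xyxy(bbox):
    """Convert a COCO box to corner format [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when there are no detections or targets."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

corners = bbox_xywh_to_xyxy(coco["annotations"][0]["bbox"])
example_f1 = f1_score(tp=88, fp=12, fn=12)  # made-up counts
```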

[159] Delving into Mapping Uncertainty for Mapless Trajectory Prediction

Zongzheng Zhang, Xuchong Qiu, Boran Zhang, Guantian Zheng, Xunjiang Gu, Guoxuan Chi, Huan-ang Gao, Leichen Wang, Ziming Liu, Xinrun Li, Igor Gilitschenski, Hongyang Li, Hang Zhao, Hao Zhao

Main category: cs.CV

TL;DR: The paper proposes a novel method to integrate map uncertainty into trajectory prediction in autonomous driving, focusing on the ego vehicle’s kinematic state and improving performance by up to 23.6%.

Details

Motivation: Current mapless autonomous driving approaches lack reliability in online-generated maps, and existing methods for incorporating map uncertainty into trajectory prediction lack scenario-specific insights.

Method: The authors introduce Proprioceptive Scenario Gating to adaptively integrate map uncertainty based on the ego vehicle’s future kinematics and a Covariance-based Map Uncertainty approach for better alignment with map geometry.

Result: The method achieves up to a 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method on the nuScenes dataset.

Conclusion: The proposed approach enhances the synergy between online mapping and trajectory prediction, providing interpretability and outperforming previous methods.

Abstract: Recent advances in autonomous driving are moving towards mapless approaches, where High-Definition (HD) maps are generated online directly from sensor data, reducing the need for expensive labeling and maintenance. However, the reliability of these online-generated maps remains uncertain. While incorporating map uncertainty into downstream trajectory prediction tasks has shown potential for performance improvements, current strategies provide limited insights into the specific scenarios where this uncertainty is beneficial. In this work, we first analyze the driving scenarios in which mapping uncertainty has the greatest positive impact on trajectory prediction and identify a critical, previously overlooked factor: the agent’s kinematic state. Building on these insights, we propose a novel Proprioceptive Scenario Gating that adaptively integrates map uncertainty into trajectory prediction based on forecasts of the ego vehicle’s future kinematics. This lightweight, self-supervised approach enhances the synergy between online mapping and trajectory prediction, providing interpretability around where uncertainty is advantageous and outperforming previous integration methods. Additionally, we introduce a Covariance-based Map Uncertainty approach that better aligns with map geometry, further improving trajectory prediction. Extensive ablation studies confirm the effectiveness of our approach, achieving up to 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method using the real-world nuScenes driving dataset. Our code, data, and models are publicly available at https://github.com/Ethan-Zheng136/Map-Uncertainty-for-Trajectory-Prediction.
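
One way to see why a covariance-based uncertainty in [159] can align better with map geometry: a full 2x2 covariance describes an ellipse that can be oriented along a lane, whereas independent per-axis variances can only describe an axis-aligned one. The sketch below recovers the ellipse axes and orientation from a covariance matrix. It is a generic illustration, and the function name and example values are assumptions rather than the paper's formulation.

```python
import numpy as np

def uncertainty_ellipse(cov):
    """Return (semi-axis lengths, major-axis angle in radians) of the 1-sigma ellipse.

    The angle is only defined up to a half turn because eigenvector signs are arbitrary.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    major = eigvecs[:, -1]                  # direction of largest variance
    angle = np.arctan2(major[1], major[0])
    return np.sqrt(eigvals[::-1]), angle

# Hypothetical map point: 2 m standard deviation along a lane heading 45 degrees, 1 m across it.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
cov = R @ np.diag([4.0, 1.0]) @ R.T
axes, angle = uncertainty_ellipse(cov)
```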

[160] Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention

João Luzio, Alexandre Bernardino, Plinio Moreno

Main category: cs.CV

TL;DR: SemBA-FAST, a semantic-based Bayesian attention model, outperforms baseline and top-down approaches in predicting human visual attention for target-present visual search tasks.

Details

Motivation: To improve human-like attention modeling by integrating deep object detection and probabilistic semantic fusion for dynamic attention maps.

Method: SemBA-FAST combines deep object detection with probabilistic semantic fusion, using pre-trained detectors and artificial foveation to update top-down knowledge.

Result: Achieves human-like scanpaths on COCO-Search18, surpassing baselines and competing with scanpath-informed models.

Conclusion: SemBA-FAST demonstrates the potential of semantic-foveal probabilistic frameworks for real-time cognitive computing and robotics.

Abstract: In goal-directed visual tasks, human perception is guided by both top-down and bottom-up cues. At the same time, foveal vision plays a crucial role in directing attention efficiently. Modern research on bio-inspired computational attention models has taken advantage of advancements in deep learning by utilizing human scanpath data to achieve new state-of-the-art performance. In this work, we assess the performance of SemBA-FAST, i.e. Semantic-based Bayesian Attention for Foveal Active visual Search Tasks, a top-down framework designed for predicting human visual attention in target-present visual search. SemBA-FAST integrates deep object detection with a probabilistic semantic fusion mechanism to generate attention maps dynamically, leveraging pre-trained detectors and artificial foveation to update top-down knowledge and improve fixation prediction sequentially. We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models. Our methodology achieves fixation sequences that closely match human ground-truth scanpaths. Notably, it surpasses baseline and other top-down approaches and competes, in some cases, with scanpath-informed models. These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling, with implications for real-time cognitive computing and robotics.
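
The sequential belief update behind a Bayesian attention model such as the one in [160] can be sketched in a few lines: keep a probability map over candidate target locations, multiply it by the evidence gathered at each fixation, renormalise, and fixate next where the belief peaks. The grid size, the likelihood values, and the function names are assumptions; the actual semantic fusion and foveation in SemBA-FAST are more elaborate.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Posterior over locations, proportional to prior times likelihood."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

def next_fixation(belief):
    """Greedy policy: look at the cell with the highest target probability."""
    row, col = np.unravel_index(np.argmax(belief), belief.shape)
    return int(row), int(col)

belief = np.full((2, 2), 0.25)                        # uniform prior over a toy 2x2 grid
detector_evidence = np.array([[0.1, 0.2], [0.6, 0.1]])  # hypothetical detector scores
belief = bayes_update(belief, detector_evidence)
fixation = next_fixation(belief)
```

Repeating the update after each fixation yields a scanpath, which is the object compared against human data on COCO-Search18.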

[161] Towards Large Scale Geostatistical Methane Monitoring with Part-based Object Detection

Adhemar de Senneville, Xavier Bou, Thibaud Ehret, Rafael Grompone, Jean Louis Bonne, Nicolas Dumelie, Thomas Lauvaux, Gabriele Facciolo

Main category: cs.CV

TL;DR: The paper addresses rare object detection in remote sensing by focusing on bio-digesters in France, proposing a part-based method and geostatistical methane estimates.

Details

Motivation: The challenge of detecting rare objects in vast remote sensing data is critical for applications like assessing environmental impacts.

Method: A part-based method is developed, leveraging bio-digester sub-elements, and applied to new regions for inventory building and methane estimation.

Result: A novel dataset is introduced, and the method successfully detects bio-digesters, enabling methane production estimates.

Conclusion: The approach effectively tackles rare object detection and provides actionable environmental insights.

Abstract: Object detection is one of the main applications of computer vision in remote sensing imagery. Despite its increasing availability, the sheer volume of remote sensing data poses a challenge when detecting rare objects across large geographic areas. Paradoxically, this common challenge is crucial to many applications, such as estimating environmental impact of certain human activities at scale. In this paper, we propose to address the problem by investigating the methane production and emissions of bio-digesters in France. We first introduce a novel dataset containing bio-digesters, with small training and validation sets, and a large test set with a high imbalance towards observations without objects since such sites are rare. We develop a part-based method that considers essential bio-digester sub-elements to boost initial detections. To this end, we apply our method to new, unseen regions to build an inventory of bio-digesters. We then compute geostatistical estimates of the quantity of methane produced that can be attributed to these infrastructures in a given area at a given time.

[162] Object segmentation in the wild with foundation models: application to vision assisted neuro-prostheses for upper limbs

Bolutife Atoki, Jenny Benois-Pineau, Renaud Péteri, Fabien Baldacci, Aymar de Rugy

Main category: cs.CV

TL;DR: The paper explores using foundation models for semantic object segmentation in cluttered scenes, focusing on vision-guided neuroprostheses. It introduces gaze-based prompts for SAM and fine-tunes on egocentric data, improving IoU by up to 0.51 on real-world data.

Details

Motivation: To enable object segmentation in cluttered scenes for neuroprostheses without fine-tuning on specific images, leveraging foundation models.

Method: Proposes gaze fixation-based prompts to guide SAM and fine-tunes it on egocentric visual data.

Result: Achieves up to 0.51 IoU improvement on the Grasping-in-the-Wild corpus.

Conclusion: The approach effectively enhances segmentation quality in challenging real-world scenarios for neuroprosthetic applications.

Abstract: In this work, we address the problem of semantic object segmentation using foundation models. We investigate whether foundation models, trained on a large number and variety of objects, can perform object segmentation without fine-tuning on specific images containing everyday objects, but in highly cluttered visual scenes. The “in the wild” context is driven by the target application of vision-guided upper limb neuroprostheses. We propose a method for generating prompts based on gaze fixations to guide the Segment Anything Model (SAM) in our segmentation scenario, and fine-tune it on egocentric visual data. Evaluation results of our approach show an improvement of the IoU segmentation quality metric by up to 0.51 points on challenging real-world data from the Grasping-in-the-Wild corpus, which is made available on the RoboFlow Platform (https://universe.roboflow.com/iwrist/grasping-in-the-wild).
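
Two pieces of [162] are easy to make concrete: turning gaze fixations into point prompts, and the IoU metric used for evaluation. The sketch below builds prompt arrays in the `(x, y)` coordinate and label layout that point-prompted segmenters such as SAM generally expect, and computes IoU between binary masks. The fixation values and helper names are hypothetical, and the authors' prompt-generation method may select or filter fixations differently.

```python
import numpy as np

def fixations_to_point_prompts(fixations_xy):
    """Treat every gaze fixation as a positive (foreground) point prompt.

    Returns coordinates of shape (N, 2) in (x, y) pixels and labels of shape (N,) set to 1.
    """
    coords = np.asarray(fixations_xy, dtype=np.float32).reshape(-1, 2)
    labels = np.ones(len(coords), dtype=np.int32)
    return coords, labels

def mask_iou(pred, target):
    """Intersection over union of two binary masks; two empty masks count as a perfect match."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

coords, labels = fixations_to_point_prompts([(320, 240), (330, 250)])  # made-up fixations
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Note that IoU lies in [0, 1], so the reported gain of up to 0.51 is an absolute difference on that scale.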

[163] GaussianFusionOcc: A Seamless Sensor Fusion Approach for 3D Occupancy Prediction Using 3D Gaussians

Tomislav Pavković, Mohammad-Ali Nikouei Mahani, Johannes Niedermayer, Johannes Betz

Main category: cs.CV

TL;DR: GaussianFusionOcc uses 3D Gaussians and sensor fusion for efficient and precise 3D semantic occupancy prediction in autonomous driving.

Details

Motivation: To improve precision, memory efficiency, and inference speed in 3D semantic occupancy prediction by leveraging multi-modal sensor fusion.

Method: Uses semantic 3D Gaussians and modality-agnostic deformable attention for sensor fusion (camera, LiDAR, radar).

Result: Outperforms state-of-the-art models in accuracy and efficiency.

Conclusion: GaussianFusionOcc is a scalable and versatile solution for autonomous driving environments.

Abstract: 3D semantic occupancy prediction is one of the crucial tasks of autonomous driving. It enables precise and safe interpretation and navigation in complex environments. Reliable predictions rely on effective sensor fusion, as different modalities can contain complementary information. Unlike conventional methods that depend on dense grid representations, our approach, GaussianFusionOcc, uses semantic 3D Gaussians alongside an innovative sensor fusion mechanism. Seamless integration of data from camera, LiDAR, and radar sensors enables more precise and scalable occupancy prediction, while 3D Gaussian representation significantly improves memory efficiency and inference speed. GaussianFusionOcc employs modality-agnostic deformable attention to extract essential features from each sensor type, which are then used to refine Gaussian properties, resulting in a more accurate representation of the environment. Extensive testing with various sensor combinations demonstrates the versatility of our approach. By leveraging the robustness of multi-modal fusion and the efficiency of Gaussian representation, GaussianFusionOcc outperforms current state-of-the-art models.
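
To illustrate how a set of semantic 3D Gaussians like those in [163] can stand in for a dense grid, the sketch below evaluates, at arbitrary query points, the opacity-weighted sum of each Gaussian's class distribution. This follows the general idea of Gaussian-to-voxel aggregation. The exact aggregation rule, the function name, and all values are assumptions rather than the GaussianFusionOcc implementation.

```python
import numpy as np

def gaussians_to_occupancy(points, means, covs, opacities, class_probs):
    """Accumulate semantic evidence from G Gaussians at P query points.

    points: (P, 3), means: (G, 3), covs: (G, 3, 3), opacities: (G,), class_probs: (G, C)
    Returns per-point class scores of shape (P, C).
    """
    diff = points[:, None, :] - means[None, :, :]             # (P, G, 3)
    inv_covs = np.linalg.inv(covs)                            # (G, 3, 3)
    maha = np.einsum("pgi,gij,pgj->pg", diff, inv_covs, diff)  # squared Mahalanobis distance
    weights = opacities[None, :] * np.exp(-0.5 * maha)        # (P, G)
    return weights @ class_probs

# Toy scene: one Gaussian at the origin labelled as class 1 of two classes.
points = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
scores = gaussians_to_occupancy(
    points,
    means=np.zeros((1, 3)),
    covs=np.eye(3)[None],
    opacities=np.array([1.0]),
    class_probs=np.array([[0.0, 1.0]]),
)
```

Memory scales with the number of Gaussians rather than the number of voxels, which is consistent with the efficiency argument in the abstract.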

[164] IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning

Tianheng Qiu, Jingchun Gao, Jingyu Li, Huiyi Leong, Xuan Huang, Xi Wang, Xiaocheng Zhang, Kele Xu, Lan Zhang

Main category: cs.CV

TL;DR: IntentVCNet bridges the spatio-temporal gap in LVLMs for fine-grained intent-oriented video captioning using prompt combination and a box adapter.

Details

Motivation: Current LVLMs lack fine-grained spatial control in time sequences for intent-oriented video captioning.

Method: Combines a prompt strategy to model intent-video relationships and a box adapter for object semantic augmentation.

Result: Achieves state-of-the-art results on several open-source LVLMs and was the runner-up in the IntentVC challenge.

Conclusion: The method enhances LVLMs’ spatial detail modeling and intent-oriented captioning accuracy.

Abstract: Intent-oriented controlled video captioning aims to generate targeted descriptions for specific targets in a video based on customized user intent. Current Large Visual Language Models (LVLMs) have gained strong instruction-following and visual comprehension capabilities. Although LVLMs have demonstrated proficiency in spatial and temporal understanding separately, they are not able to perform fine-grained spatial control in time sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intention-oriented control in video. To this end, we propose a novel IntentVCNet that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable the LLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter-efficient box adapter that augments the object semantic information in the global visual context so that the visual token has a priori information about the user intent. Experiments show that the combination of the two strategies further enhances the LVLM’s ability to model spatial details in video sequences and helps LVLMs accurately generate controlled intent-oriented captions. Our proposed method achieved state-of-the-art results on several open-source LVLMs and was the runner-up in the IntentVC challenge. Our code is available at https://github.com/thqiu0419/IntentVCNet.

[165] COT-AD: Cotton Analysis Dataset

Akbar Ali, Mahek Vyas, Soumyaratna Debnath, Chanda Grover Kamra, Jaidev Sanjay Khalane, Reuben Shibu Devanesan, Indra Deep Mastan, Subramanian Sankaranarayanan, Pankaj Khanna, Shanmuganathan Raman

Main category: cs.CV

TL;DR: COT-AD is a dataset for cotton crop analysis with 25,000 images, including 5,000 annotated ones, supporting tasks like classification, segmentation, and disease management.

Details

Motivation: Addresses the lack of cotton-specific agricultural datasets for computer vision applications.

Method: Includes aerial and DSLR imagery with annotations for pest/disease recognition, vegetation, and weed analysis.

Result: Provides a comprehensive resource for tasks like classification, segmentation, and early disease management.

Conclusion: COT-AD advances data-driven crop management by filling a critical gap in cotton-specific datasets.

Abstract: This paper presents COT-AD, a comprehensive dataset designed to enhance cotton crop analysis through computer vision. Comprising over 25,000 images captured throughout the cotton growth cycle, with 5,000 annotated images, COT-AD includes aerial imagery for field-scale detection and segmentation and high-resolution DSLR images documenting key diseases. The annotations cover pest and disease recognition, vegetation, and weed analysis, addressing a critical gap in cotton-specific agricultural datasets. COT-AD supports tasks such as classification, segmentation, image restoration, enhancement, deep generative model-based cotton crop synthesis, and early disease management, advancing data-driven crop management.

[166] Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models

Xingyu Qiu, Mengying Yang, Xinghua Ma, Dong Liang, Yuzhen Li, Fanding Li, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li

Main category: cs.CV

TL;DR: EDA expands diffusion models to arbitrary noise patterns, improving image restoration without extra computational cost and outperforming most task-specific methods on MRI, CT, and natural image tasks.

Details

Motivation: Fixed Gaussian noise in EDM limits image restoration by corrupting degraded images and increasing complexity.

Method: EDA extends EDM to arbitrary noise patterns, preserving flexibility and proving no added computational cost.

Result: With just 5 sampling steps, EDA outperforms most task-specific methods and achieves state-of-the-art performance in MRI bias field correction and shadow removal; it is also validated on CT metal artifact reduction.

Conclusion: EDA advances diffusion models by enabling arbitrary noise patterns, achieving state-of-the-art results in diverse restoration tasks.

Abstract: EDM elucidates the unified design space of diffusion models, yet its fixed noise patterns, restricted to pure Gaussian noise, limit advancements in image restoration. Our study indicates that forcibly injecting Gaussian noise corrupts the degraded images, overextends the image transformation distance, and increases restoration complexity. To address this problem, our proposed EDA Elucidates the Design space of Arbitrary-noise-based diffusion models. Theoretically, EDA expands the freedom of noise pattern while preserving the original module flexibility of EDM, with rigorous proof that increased noise complexity incurs no additional computational overhead during restoration. EDA is validated on three typical tasks: MRI bias field correction (global smooth noise), CT metal artifact reduction (global sharp noise), and natural image shadow removal (local boundary-aware noise). With only 5 sampling steps, EDA outperforms most task-specific methods and achieves state-of-the-art performance in bias field correction and shadow removal.

[167] TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu

Main category: cs.CV

TL;DR: TTS-VAR is a test-time scaling framework for visual auto-regressive models, improving efficiency and performance through adaptive batch scheduling, clustering-based diversity search, and resampling-based potential selection.

Details

Motivation: To address the high computational costs of scaling visual generation models, TTS-VAR offers a resource-efficient alternative by optimizing test-time scaling.

Method: The framework models generation as path searching, using adaptive batch scheduling, clustering-based diversity search for coarse scales, and resampling-based potential selection for fine scales.

Result: Experiments show an 8.7% improvement in GenEval score (0.69 to 0.75) on the Infinity VAR model.

Conclusion: TTS-VAR demonstrates that early-stage structural features and resampling efficacy significantly impact final generation quality, providing a scalable solution for visual content creation.

Abstract: Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. In addition, inspired by VAR’s hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard to evaluate, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection on samples with higher potential. (ii) At fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
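
As a rough illustration of the resampling idea in this abstract, the sketch below draws surviving candidates with probability proportional to a softmax over their potential scores, then checks the reported relative GenEval gain. The helper names `resample_by_potential` and `relative_gain`, the `temperature` parameter, and the softmax weighting are assumptions made for illustration; they are not taken from the TTS-VAR code, and the paper's actual potential function is not reproduced here.

```python
import math
import random


def resample_by_potential(candidates, potentials, k, temperature=1.0, seed=0):
    """Draw k candidates with replacement, weighted by softmax(potential / temperature).

    Hypothetical helper: the softmax weighting is an assumed choice for this sketch.
    """
    if len(candidates) != len(potentials):
        raise ValueError("candidates and potentials must have equal length")
    top = max(potentials)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((p - top) / temperature) for p in potentials]
    rng = random.Random(seed)
    return rng.choices(candidates, weights=weights, k=k)


def relative_gain(before, after):
    """Relative improvement of `after` over `before`."""
    return (after - before) / before


survivors = resample_by_potential(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.2], k=2)
gain = relative_gain(0.69, 0.75)
print(survivors, f"{gain:.1%}")
```

Sampling with replacement lets a high-potential candidate occupy several slots, which is one simple way to concentrate a shrinking batch on promising paths.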

[168] Unposed 3DGS Reconstruction with Probabilistic Procrustes Mapping

Chong Cheng, Zijian Wang, Sicheng Yu, Yu Hu, Nanjie Yao, Hao Wang

Main category: cs.CV

TL;DR: A novel unposed 3DGS reconstruction framework integrates MVS priors and probabilistic Procrustes mapping to address memory and accuracy issues in large-scale outdoor image reconstruction.

Details

Motivation: Existing MVS models struggle with memory limits and accuracy in unposed reconstruction tasks involving hundreds of outdoor images.

Method: The method partitions images into subsets, aligns submaps globally using probabilistic Procrustes, and jointly optimizes geometry and poses with 3DGS, leveraging confidence-aware anchor points and differentiable rendering.

Result: Achieves accurate reconstruction and pose estimation on Waymo and KITTI datasets, setting a new state of the art.

Conclusion: The proposed framework effectively addresses limitations of existing MVS models, enabling robust unposed 3DGS reconstruction.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a core technique for 3D representation. Its effectiveness largely depends on precise camera poses and accurate point cloud initialization, which are often derived from pretrained Multi-View Stereo (MVS) models. However, in unposed reconstruction tasks from hundreds of outdoor images, existing MVS models may struggle with memory limits and lose accuracy as the number of input images grows. To address this limitation, we propose a novel unposed 3DGS reconstruction framework that integrates pretrained MVS priors with the probabilistic Procrustes mapping strategy. The method partitions input images into subsets, maps submaps into a global space, and jointly optimizes geometry and poses with 3DGS. Technically, we formulate the mapping of tens of millions of point clouds as a probabilistic Procrustes problem and solve a closed-form alignment. By employing probabilistic coupling along with a soft dustbin mechanism to reject uncertain correspondences, our method globally aligns point clouds and poses within minutes across hundreds of images. Moreover, we propose a joint optimization framework for 3DGS and camera poses. It constructs Gaussians from confidence-aware anchor points and integrates 3DGS differentiable rendering with an analytical Jacobian to jointly refine scene and poses, enabling accurate reconstruction and pose estimation. Experiments on Waymo and KITTI datasets show that our method achieves accurate reconstruction from unposed image sequences, setting a new state of the art for unposed 3DGS reconstruction.
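
To make the notion of a closed-form weighted alignment concrete, here is a minimal sketch of a weighted Procrustes fit, simplified to 2D rigid motion so it needs only the standard library. The per-correspondence weights stand in for confidences, with a weight of zero playing the role of a rejected match. The function name `weighted_procrustes_2d` and the 2D simplification are assumptions for illustration; the paper's 3D formulation, probabilistic coupling, and soft dustbin mechanism are not reproduced.

```python
import math


def weighted_procrustes_2d(src, dst, weights):
    """Return (theta, (tx, ty)) minimizing sum_i w_i * ||R(theta) src_i + t - dst_i||^2.

    Hypothetical 2D illustration of a closed-form weighted rigid alignment.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Weighted centroids of both point sets.
    cx = sum(w * p[0] for w, p in zip(weights, src)) / total
    cy = sum(w * p[1] for w, p in zip(weights, src)) / total
    dx = sum(w * q[0] for w, q in zip(weights, dst)) / total
    dy = sum(w * q[1] for w, q in zip(weights, dst)) / total
    cos_term = 0.0
    sin_term = 0.0
    for w, (px, py), (qx, qy) in zip(weights, src, dst):
        ax, ay = px - cx, py - cy
        bx, by = qx - dx, qy - dy
        cos_term += w * (ax * bx + ay * by)  # weighted dot products
        sin_term += w * (ax * by - ay * bx)  # weighted 2D cross products
    theta = math.atan2(sin_term, cos_term)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the target centroid.
    tx = dx - (c * cx - s * cy)
    ty = dy - (s * cx + c * cy)
    return theta, (tx, ty)


# Three inliers related by a 90-degree rotation plus a shift of (2, 3),
# and one gross outlier whose weight of zero removes it from the fit.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
dst = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0), (100.0, -100.0)]
theta, (tx, ty) = weighted_procrustes_2d(src, dst, [1.0, 1.0, 1.0, 0.0])
print(theta, tx, ty)
```

Because the solution is closed-form, the cost is a single pass over the correspondences, which is the property that makes this family of methods attractive for very large point sets.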

Daniil Morozov, Reuben Dorent, Nazim Haouchine

Main category: cs.CV

TL;DR: A novel 3D cross-modal keypoint descriptor for MRI-iUS registration is proposed, using synthetic iUS volumes and supervised contrastive training, achieving high precision and robustness.

Details

Motivation: Intraoperative registration of real-time ultrasound (iUS) to preoperative MRI is challenging due to modality differences.

Method: Patient-specific matching-by-synthesis, probabilistic keypoint detection, and curriculum-based triplet loss with dynamic hard negative mining.

Result: Outperforms state-of-the-art methods with 69.8% average precision and 2.39 mm mean Target Registration Error.

Conclusion: The approach is interpretable, robust, and requires no manual initialization, advancing MRI-iUS registration.

Abstract: Intraoperative registration of real-time ultrasound (iUS) to preoperative Magnetic Resonance Imaging (MRI) remains an unsolved problem due to severe modality-specific differences in appearance, resolution, and field-of-view. To address this, we propose a novel 3D cross-modal keypoint descriptor for MRI-iUS matching and registration. Our method employs a patient-specific matching-by-synthesis approach, generating synthetic iUS volumes from preoperative MRI. This enables supervised contrastive training to learn a shared descriptor space. A probabilistic keypoint detection strategy is then employed to identify anatomically salient and modality-consistent locations. During training, a curriculum-based triplet loss with dynamic hard negative mining is used to learn descriptors that are i) robust to iUS artifacts such as speckle noise and limited coverage, and ii) rotation-invariant. At inference, the method detects keypoints in MR and real iUS images and identifies sparse matches, which are then used to perform rigid registration. Our approach is evaluated using 3D MRI-iUS pairs from the ReMIND dataset. Experiments show that our approach outperforms state-of-the-art keypoint matching methods across 11 patients, with an average precision of 69.8%. For image registration, our method achieves a competitive mean Target Registration Error of 2.39 mm on the ReMIND2Reg benchmark. Compared to existing iUS-MR registration approaches, our framework is interpretable, requires no manual initialization, and shows robustness to iUS field-of-view variation. Code is available at https://github.com/morozovdd/CrossKEY.
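
The triplet loss with hard negative mining mentioned in this abstract can be sketched in a few lines. The version below picks, for each anchor, the closest negative descriptor and applies a standard margin-based triplet loss. The function names, the Euclidean distance, and the margin of 0.2 are assumptions for illustration; the curriculum schedule and the actual descriptor network from the paper are not modeled.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two descriptors given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hardest_negative(anchor, negatives):
    """Hard negative mining: pick the negative descriptor closest to the anchor."""
    return min(negatives, key=lambda n: euclidean(anchor, n))


def triplet_loss(anchor, positive, negatives, margin=0.2):
    """Margin-based triplet loss against the hardest negative (margin is an assumed value)."""
    negative = hardest_negative(anchor, negatives)
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = (0.0, 0.0)
positive = (0.1, 0.0)
negatives = [(1.0, 0.0), (0.2, 0.0)]
loss_hard = triplet_loss(anchor, positive, negatives)        # uses the negative at distance 0.2
loss_easy = triplet_loss(anchor, positive, [(1.0, 0.0)])     # far negative, margin already satisfied
print(loss_hard, loss_easy)
```

Mining the closest negative keeps the loss non-zero for longer than random negatives would, which is the usual motivation for the technique.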

[170] Deep Learning-Based Age Estimation and Gender Classification for Targeted Advertisement

Muhammad Imran Zaman, Nisar Ahmed

Main category: cs.CV

TL;DR: A deep learning-based CNN model for simultaneous age and gender classification from facial images, improving targeted advertising effectiveness.

Details

Motivation: To enhance targeted advertising by improving age and gender classification accuracy using shared representations in facial features.

Method: Proposes a custom CNN architecture trained on a diverse dataset, leveraging shared representations for age and gender.

Result: Achieves 95% gender classification accuracy and 5.77 years mean absolute error for age estimation.

Conclusion: Identifies challenges in age estimation for younger individuals and suggests targeted data augmentation and model refinement.

Abstract: This paper presents a novel deep learning-based approach for simultaneous age and gender classification from facial images, designed to enhance the effectiveness of targeted advertising campaigns. We propose a custom Convolutional Neural Network (CNN) architecture, optimized for both tasks, which leverages the inherent correlation between age and gender information present in facial features. Unlike existing methods that often treat these tasks independently, our model learns shared representations, leading to improved performance. The network is trained on a large, diverse dataset of facial images, carefully pre-processed to ensure robustness against variations in lighting, pose, and image quality. Our experimental results demonstrate a significant improvement in gender classification accuracy, achieving 95%, and a competitive mean absolute error of 5.77 years for age estimation. Critically, we analyze the performance across different age groups, identifying specific challenges in accurately estimating the age of younger individuals. This analysis reveals the need for targeted data augmentation and model refinement to address these biases. Furthermore, we explore the impact of different CNN architectures and hyperparameter settings on the overall performance, providing valuable insights for future research.

[171] Facial Demorphing from a Single Morph Using a Latent Conditional GAN

Nitish Shukla, Arun Ross

Main category: cs.CV

TL;DR: A method for demorphing face images that overcomes limitations of existing techniques by decomposing morphs in latent space, enabling high-fidelity demorphed images even for unseen morph techniques.

Details

Motivation: Existing demorphing methods either replicate the morph or rely on the same morph technique for training and testing, limiting their effectiveness.

Method: The proposed method decomposes morphs in latent space, allowing demorphing of images from unseen techniques and styles. It is trained on synthetic face morphs and tested on real face morphs.

Result: The method outperforms existing techniques significantly, producing high-fidelity demorphed images.

Conclusion: The proposed demorphing method is robust and effective for diverse morph techniques and face styles, providing superior results.

Abstract: A morph is created by combining two (or more) face images from two (or more) identities to create a composite image that is highly similar to both constituent identities, allowing the forged morph to be biometrically associated with more than one individual. Morph Attack Detection (MAD) can be used to detect a morph, but does not reveal the constituent images. Demorphing - the process of deducing the constituent images - is thus vital to provide additional evidence about a morph. Existing demorphing methods suffer from the morph replication problem, where the outputs tend to look very similar to the morph itself, or assume that train and test morphs are generated using the same morph technique. The proposed method overcomes these issues. The method decomposes a morph in latent space allowing it to demorph images created from unseen morph techniques and face styles. We train our method on morphs created from synthetic faces and test on morphs created from real faces using arbitrary morph techniques. Our method outperforms existing methods by a considerable margin and produces high fidelity demorphed face images.

[172] Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis

Yanzuo Lu, Yuxi Ren, Xin Xia, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Andy J. Ma, Xiaohua Xie, Jian-Huang Lai

Main category: cs.CV

TL;DR: DMDX improves score distillation by combining adversarial pre-training and ADM fine-tuning, outperforming DMD2 in efficiency and performance.

Details

Motivation: Address mode collapse in DMD by introducing adversarial alignment of latent predictions.

Method: Uses adversarial distillation with hybrid discriminators and distributional loss for better initialization.

Result: Superior one-step performance on SDXL and benchmarks in image/video synthesis.

Conclusion: DMDX sets a new standard for efficient distillation in generative models.

Abstract: Distribution Matching Distillation (DMD) is a promising score distillation technique that compresses pre-trained teacher diffusion models into efficient one-step or multi-step student generators. Nevertheless, its reliance on the reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose Adversarial Distribution Matching (ADM), a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. In the context of extremely challenging one-step distillation, we further improve the pre-trained generator by adversarial distillation with hybrid discriminators in both latent and pixel spaces. Different from the mean squared error used in DMD2 pre-training, our method incorporates the distributional loss on ODE pairs collected from the teacher model, thus providing a better initialization for score distillation fine-tuning in the next stage. By combining the adversarial distillation pre-training with ADM fine-tuning into a unified pipeline termed DMDX, our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.

[173] HybridTM: Combining Transformer and Mamba for 3D Semantic Segmentation

Xinyu Wang, Jinghua Hou, Zhe Liu, Yingying Zhu

Main category: cs.CV

TL;DR: HybridTM integrates Transformer and Mamba for 3D semantic segmentation, combining their strengths to address limitations in long-range dependency modeling and feature representation.

Details

Motivation: Transformers excel in attention but suffer from quadratic complexity, while Mamba offers efficiency but struggles with feature representation. Combining these can enhance 3D segmentation.

Method: Proposes HybridTM, a hybrid architecture, and Inner Layer Hybrid Strategy to integrate attention and Mamba for capturing long-range dependencies and local features.

Result: Achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks.

Conclusion: HybridTM effectively combines Transformer and Mamba, demonstrating superior performance and generalization in 3D semantic segmentation.

Abstract: Transformer-based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long-range dependencies in large-scale point clouds. While recent Mamba-based approaches offer efficient processing with linear complexity, they struggle with feature representation when extracting 3D features. However, effectively combining these complementary strengths remains an open challenge in this field. In this paper, we propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity, enabling simultaneous capture of long-range dependencies and fine-grained local features. Extensive experiments demonstrate the effectiveness and generalization of our HybridTM on diverse indoor and outdoor datasets. Furthermore, our HybridTM achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks. The code will be made available at https://github.com/deepinact/HybridTM.

[174] Identifying Prompted Artist Names from Generated Images

Grace Su, Sheng-Yu Wang, Aaron Hertzmann, Eli Shechtman, Jun-Yan Zhu, Richard Zhang

Main category: cs.CV

TL;DR: A benchmark for recognizing artists from images generated by text-to-image models, evaluating various methods and revealing challenges in generalization.

Details

Motivation: To address the controversial use of artist names in text-to-image prompts and advance responsible moderation of such models.

Method: Created a dataset of 1.95M images from 110 artists, testing feature similarity, contrastive style descriptors, data attribution, supervised classifiers, and few-shot prototypical networks across four generalization settings.

Result: Supervised and few-shot models perform well on seen artists and complex prompts, while style descriptors excel for pronounced styles; multi-artist prompts are the hardest.

Conclusion: The benchmark highlights room for improvement and aims to support responsible moderation of text-to-image models, with the dataset and benchmark publicly released.

Abstract: A common and controversial use of text-to-image models is to generate pictures by explicitly naming artists, such as “in the style of Greg Rutkowski”. We introduce a benchmark for prompted-artist recognition: predicting which artist names were invoked in the prompt from the image alone. The dataset contains 1.95M images covering 110 artists and spans four generalization settings: held-out artists, increasing prompt complexity, multiple-artist prompts, and different text-to-image models. We evaluate feature similarity baselines, contrastive style descriptors, data attribution methods, supervised classifiers, and few-shot prototypical networks. Generalization patterns vary: supervised and few-shot models excel on seen artists and complex prompts, whereas style descriptors transfer better when the artist’s style is pronounced; multi-artist prompts remain the most challenging. Our benchmark reveals substantial headroom and provides a public testbed to advance the responsible moderation of text-to-image models. We release the dataset and benchmark to foster further research: https://graceduansu.github.io/IdentifyingPromptedArtists/

[175] Captain Cinema: Towards Short Movie Generation

Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, Lu Jiang

Main category: cs.CV

TL;DR: Captain Cinema is a framework for generating short movies from textual descriptions, using top-down keyframe planning and bottom-up video synthesis with a Multimodal Diffusion Transformer (MM-DiT) for long-context video data.

Details

Motivation: To automate the creation of visually coherent and narrative-consistent short movies efficiently.

Method: Uses top-down keyframe planning for storyline coherence and bottom-up video synthesis with MM-DiT for spatio-temporal dynamics. Includes interleaved training for stability.

Result: Produces high-quality, visually coherent, and narrative-consistent short movies efficiently.

Conclusion: Captain Cinema effectively automates short movie generation with strong coherence and quality.

Abstract: We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model, which supports long context learning, to produce the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narrative-consistent short movies with high quality and efficiency. Project page: https://thecinema.ai

[176] DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach

Main category: cs.CV

TL;DR: DEFAME is a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification, outperforming prior methods and GPT-4o baselines on benchmarks.

Details

Motivation: The need for reliable and scalable fact-checking solutions due to disinformation proliferation.

Method: A six-stage process dynamically selecting tools and search depth to extract and evaluate textual and visual evidence.

Result: DEFAME surpasses previous methods on benchmarks (VERITE, AVerITeC, MOCHEG) and outperforms GPT-4o on ClaimReview2024+.

Conclusion: DEFAME is a state-of-the-art, explainable, and multimodal fact-checking system with real-time potential.

Abstract: The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4o baselines, showing temporal generalizability and the potential for real-time fact-checking.

[177] ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

Wonjun Lee, Doehyeon Lee, Eugene Choi, Sangyoon Yu, Ashkan Yousefpour, Haon Park, Bumsub Ham, Suhyun Kim

Main category: cs.CV

TL;DR: The paper introduces ELITE, a safety benchmark for Vision Language Models (VLMs) to address vulnerabilities in detecting harmful outputs, using an enhanced evaluator for accurate toxicity scoring.

Details

Motivation: Existing safety benchmarks for VLMs are flawed due to low harmfulness detection, ambiguous data, and limited diversity, necessitating a more robust solution.

Method: Proposes the ELITE benchmark and evaluator, which incorporates a toxicity score to filter ambiguous data and generate diverse safe/unsafe image-text pairs.

Result: ELITE outperforms prior methods in aligning with human evaluations and improves benchmark quality and diversity.

Conclusion: ELITE advances VLM safety by providing better tools for evaluating and mitigating risks in real-world applications.

Abstract: Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or produce inaccurate evaluations. Therefore, we found that existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but unharmful descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.

[178] ODES: Domain Adaptation with Expert Guidance for Online Medical Image Segmentation

Md Shazid Islam, Sayak Nag, Arindam Dutta, Miraj Ahmed, Fahim Faisal Niloy, Amit K. Roy-Chowdhury

Main category: cs.CV

TL;DR: The paper introduces ODES, a method for unsupervised domain adaptation in online medical image segmentation, using expert-guided active learning to improve accuracy by selecting informative pixels and pruning redundant images.

Details

Motivation: The challenge of noisy pseudo-labels in unsupervised domain adaptation for online streaming data, especially in medical imaging where precision is critical, motivates the need for expert-guided active learning.

Method: ODES combines active learning with an image-pruning strategy to select the most informative pixels and images for expert annotation, enhancing adaptation in an online setup.

Result: ODES outperforms existing online adaptation methods and competes with offline domain adaptive active learning approaches.

Conclusion: Expert-guided active learning and image pruning significantly improve online domain adaptation for medical image segmentation, addressing noise and redundancy issues.

Abstract: Unsupervised domain adaptive segmentation typically relies on self-training using pseudo labels predicted by a pre-trained network on an unlabeled target dataset. However, the noisy nature of such pseudo-labels presents a major bottleneck in adapting a network to the distribution shift between source and target datasets. This challenge is exacerbated when the network encounters an incoming data stream in an online fashion, where the network is constrained to adapt to incoming streams of target domain data in exactly one round of forward and backward passes. In this scenario, relying solely on inaccurate pseudo-labels can lead to low-quality segmentation, which is detrimental to medical image analysis where accuracy and precision are of utmost priority. We hypothesize that a small amount of pixel-level annotation obtained from an expert can address this problem, thereby enhancing the performance of domain adaptation of online streaming data, even in the absence of dedicated training data. We call our method ODES: Domain Adaptation with Expert Guidance for Online Medical Image Segmentation. It adapts to each incoming data batch in an online setup, incorporating feedback from an expert through active learning. Through active learning, the most informative pixels in each image can be selected for expert annotation. However, the acquisition of pixel-level annotations across all images in a batch often leads to redundant information while increasing temporal overhead in online learning. To reduce the annotation acquisition time and make the adaptation process more online-friendly, we further propose a novel image-pruning strategy that selects the most useful subset of images from the current batch for active learning. Our proposed approach outperforms existing online adaptation approaches and produces competitive results compared to offline domain adaptive active learning methods.

[179] PLOT-TAL: Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization

Edward Fish, Andrew Gilbert

Main category: cs.CV

TL;DR: The paper introduces PLOT-TAL, a multi-prompt ensemble method for few-shot temporal action localization, improving boundary precision by learning diverse, compositional sub-event representations via Optimal Transport.

Details

Motivation: Existing few-shot TAL methods using single-prompt tuning produce imprecise boundaries due to non-discriminative mean representations from sparse data.

Method: Proposes PLOT-TAL, leveraging Optimal Transport to align a diverse set of learnable prompts with video temporal features, encouraging specialization on sub-events.

Result: Achieves state-of-the-art performance on THUMOS'14 and EPIC-Kitchens benchmarks, especially at high IoU thresholds.

Conclusion: Multi-prompt ensembles with compositional specialization outperform single-prompt methods, validating the approach for precise temporal localization.

Abstract: Few-shot temporal action localization (TAL) methods that adapt large models via single-prompt tuning often fail to produce precise temporal boundaries. This stems from the model learning a non-discriminative mean representation of an action from sparse data, which compromises generalization. We address this by proposing a new paradigm based on multi-prompt ensembles, where a set of diverse, learnable prompts for each action is encouraged to specialize on compositional sub-events. To enforce this specialization, we introduce PLOT-TAL, a framework that leverages Optimal Transport (OT) to find a globally optimal alignment between the prompt ensemble and the video’s temporal features. Our method establishes a new state-of-the-art on the challenging few-shot benchmarks of THUMOS'14 and EPIC-Kitchens, without requiring complex meta-learning. The significant performance gains, particularly at high IoU thresholds, validate our hypothesis and demonstrate the superiority of learning distributed, compositional representations for precise temporal localization.
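
For readers unfamiliar with how Optimal Transport can align a small set of prompts with a sequence of temporal features, the sketch below runs entropic OT (Sinkhorn iterations) on toy vectors. The cosine-distance cost, uniform marginals, `epsilon` value, iteration count, and function names are all assumptions chosen for illustration; this is not the PLOT-TAL implementation.

```python
import math


def cosine_cost(u, v):
    """Cost = 1 - cosine similarity (an assumed cost for this sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def sinkhorn_plan(prompts, features, epsilon=0.5, iterations=200):
    """Entropic OT plan between prompt vectors and temporal features, with uniform marginals."""
    n, t = len(prompts), len(features)
    a = [1.0 / n] * n  # mass assigned to each prompt
    b = [1.0 / t] * t  # mass assigned to each temporal feature
    kernel = [[math.exp(-cosine_cost(p, f) / epsilon) for f in features] for p in prompts]
    u = [1.0] * n
    v = [1.0] * t
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(t)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(t)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(t)] for i in range(n)]


# Two toy prompts and four toy temporal features (hypothetical data).
prompts = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
plan = sinkhorn_plan(prompts, features)
for row in plan:
    print([round(x, 3) for x in row])
```

Each row of the plan shows how one prompt distributes its mass over time steps, so prompts that match different segments end up with different rows, which is the intuition behind encouraging specialization on sub-events.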

[180] Scaling RL to Long Videos

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

Main category: cs.CV

TL;DR: A framework for scaling reasoning in vision-language models (VLMs) to long videos using reinforcement learning, featuring a dataset, training pipeline, and efficient infrastructure.

Details

Motivation: Addressing the challenges of long video reasoning by integrating a large dataset, a two-stage training pipeline, and optimized infrastructure.

Method: Uses a dataset (LongVideo-Reason), a two-stage pipeline (CoT-SFT and RL), and MR-SP infrastructure for efficient training.

Result: LongVILA-R1-7B reaches 65.0% and 70.7% accuracy on VideoMME without and with subtitles, respectively, and MR-SP gives up to a 2.1x speedup on long video RL training.

Conclusion: The framework effectively scales VLMs for long videos, with public release of the training system.

Abstract: We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long videos, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.0% and 70.7% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-R1 across multiple benchmarks. Moreover, LongVILA-R1 shows steady performance improvements as the number of input video frames increases. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).

[181] Label Anything: Multi-Class Few-Shot Semantic Segmentation with Visual Prompts

Pasquale De Marinis, Nicola Fanelli, Raffaele Scaringi, Emanuele Colonna, Giuseppe Fiameni, Gennaro Vessio, Giovanna Castellano

Main category: cs.CV

TL;DR: Label Anything is a neural network for few-shot semantic segmentation, using varied visual prompts for versatility and achieving state-of-the-art results.

Details

Motivation: Traditional FSS methods rely heavily on masks, limiting adaptability. Label Anything aims to enhance versatility with diverse prompts.

Method: Introduces points, bounding boxes, and masks as visual prompts, enabling end-to-end training for multi-class FSS without retraining.

Result: Achieves state-of-the-art performance on COCO-$20^i$, demonstrating robust generalization and flexibility.

Conclusion: Label Anything offers a universal, efficient solution for diverse FSS tasks, reducing computational needs and improving adaptability.

Abstract: We present Label Anything, an innovative neural network architecture designed for few-shot semantic segmentation (FSS) that demonstrates remarkable generalizability across multiple classes with minimal examples required per class. Diverging from traditional FSS methods that predominantly rely on masks for annotating support images, Label Anything introduces varied visual prompts – points, bounding boxes, and masks – thereby enhancing the framework’s versatility and adaptability. Unique to our approach, Label Anything is engineered for end-to-end training across multi-class FSS scenarios, efficiently learning from diverse support set configurations without retraining. This approach enables a “universal” application to various FSS challenges, ranging from $1$-way $1$-shot to complex $N$-way $K$-shot configurations while remaining agnostic to the specific number of class examples. This innovative training strategy reduces computational requirements and substantially improves the model’s adaptability and generalization across diverse segmentation tasks. Our comprehensive experimental validation, particularly achieving state-of-the-art results on the COCO-$20^i$ benchmark, underscores Label Anything’s robust generalization and flexibility. The source code is publicly available at: https://github.com/pasqualedem/LabelAnything.

[182] Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation

Zhaorui Tan, Xi Yang, Tan Pan, Tianyi Liu, Chen Jiang, Xin Guo, Qiufeng Wang, Anh Nguyen, Yuan Qi, Kaizhu Huang, Yuan Cheng

Main category: cs.CV

TL;DR: The paper proposes a two-stage method for improving cross-modality generalization in medical imaging by learning personalized representations, showing better performance than non-personalized approaches.

Details

Motivation: Addressing the challenge of cross-modality generalization in medical imaging due to variations in modalities and individual anatomical differences, which existing methods often overlook.

Method: A two-stage approach: pre-training with invariant personalized representations ($\mathbb{X}_h$), followed by fine-tuning for diverse downstream tasks.

Result: Demonstrates improved generalizability and transferability across multi-modal medical tasks, validated by extensive experiments.

Conclusion: Personalization enhances multi-modality generalization, offering better performance in diverse scenarios compared to non-personalized methods.

Abstract: Variations in medical imaging modalities and individual anatomical differences pose challenges to cross-modality generalization in multi-modal tasks. Existing methods often concentrate exclusively on common anatomical patterns, thereby neglecting individual differences and consequently limiting their generalization performance. This paper emphasizes the critical role of learning individual-level invariance, i.e., personalized representation $\mathbb{X}_h$, to enhance multi-modality generalization under both homogeneous and heterogeneous settings. It reveals that mappings from individual biological profile to different medical modalities remain static across the population, which is implied in the personalization process. We propose a two-stage approach: pre-training with invariant representation $\mathbb{X}_h$ for personalization, then fine-tuning for diverse downstream tasks. We provide both theoretical and empirical evidence demonstrating the feasibility and advantages of personalization, showing that our approach yields greater generalizability and transferability across diverse multi-modal medical tasks compared to methods lacking personalization. Extensive experiments further validate that our approach significantly enhances performance in various generalization scenarios.

[183] PreMix: Label-Efficient Multiple Instance Learning via Non-Contrastive Pre-training and Feature Mixing

Bryan Wong, Mun Yong Yi

Main category: cs.CV

TL;DR: PreMix improves WSI classification by leveraging pre-training and data augmentation, achieving a 4.7% F1 boost over baseline methods.

Details

Motivation: Current MIL methods underutilize pre-training for aggregators, limiting performance with limited labeled data.

Method: PreMix uses Barlow Twins pre-training and Slide Mixing for feature learning, with Mixup and Manifold Mixup for fine-tuning.

Result: PreMix improves F1 by 4.7% over HIPT across datasets and training sizes.

Conclusion: PreMix advances WSI classification with limited labeled data, applicable to real-world histopathology.

Abstract: Multiple instance learning (MIL) has emerged as a powerful framework for weakly supervised whole slide image (WSI) classification, enabling slide-level predictions without requiring detailed patch-level annotations. Despite its success, a critical limitation of current MIL methods lies in the underutilization of pre-training for the MIL aggregator. Most existing approaches initialize the aggregator randomly and train it from scratch, making performance highly sensitive to the quantity of labeled WSIs and ignoring the abundance of unlabeled WSIs commonly available in clinical settings. To address this, we propose PreMix, a novel framework that leverages a non-contrastive pre-training method, Barlow Twins, augmented with the Slide Mixing approach to generate additional positive pairs and enhance feature learning, particularly under limited labeled WSI conditions. Fine-tuning with Mixup and Manifold Mixup further enhances robustness by effectively handling the diverse sizes of gigapixel WSIs. Experimental results demonstrate that integrating PreMix as a plug-in module into HIPT yields an average F1 improvement of 4.7% over the baseline HIPT across various WSI training sizes and datasets. These findings underscore its potential to advance WSI classification with limited labeled data and its applicability to real-world histopathology practices. The code is available at https://github.com/bryanwong17/PreMix
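
As a rough illustration of the Mixup step used during fine-tuning, the sketch below blends two feature vectors and their one-hot labels with a coefficient drawn from a Beta distribution. It shows plain Mixup only, on fixed-length vectors; how PreMix handles slides of different sizes is not reproduced here, and the function name `mixup` and the default `alpha=0.4` are assumptions.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two (feature, one-hot label) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * p + (1.0 - lam) * q for p, q in zip(x1, x2)]
    y = [lam * p + (1.0 - lam) * q for p, q in zip(y1, y2)]
    return x, y, lam

# Toy usage with a seeded generator for reproducibility.
rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
```

Manifold Mixup applies the same interpolation to hidden representations at an intermediate layer instead of to the inputs.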

[184] Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology

Shamik Basu, Luc Van Gool, Christos Sakaridis

Main category: cs.CV

TL;DR: InSeIn introduces a method to enforce spatial class constraints in semantic segmentation to avoid absurd predictions like labeling ‘road’ within ‘sky’.

Details

Motivation: Current models often produce infeasible segmentations due to purely data-driven training, ignoring spatial class relations.

Method: Extracts inclusion constraints from training data and enforces them via a differentiable morphological loss during training.

Result: Improves performance across ADE20K, Cityscapes, and ACDC datasets.

Conclusion: InSeIn effectively reduces infeasible semantic inclusions, enhancing segmentation accuracy.

Abstract: State-of-the-art semantic segmentation models are typically optimized in a data-driven fashion, minimizing solely per-pixel or per-segment classification objectives on their training data. This purely data-driven paradigm often leads to absurd segmentations, especially when the domain of input images is shifted from the one encountered during training. For instance, state-of-the-art models may assign the label “road” to a segment that is enclosed by another segment labeled “sky”, even though the ground truth of the dataset at hand dictates that such an inclusion is not feasible. Our method, Infeasible Semantic Inclusions (InSeIn), first extracts explicit inclusion constraints that govern spatial class relations from the semantic segmentation training set at hand in an offline, data-driven fashion, and then enforces a morphological yet differentiable loss that penalizes violations of these constraints during training to promote prediction feasibility. InSeIn is a light-weight plug-and-play method, constitutes a novel step towards minimizing infeasible semantic inclusions in the predictions of learned segmentation models, and yields consistent and significant performance improvements over diverse state-of-the-art networks across the ADE20K, Cityscapes, and ACDC datasets. Code is available at https://github.com/SHAMIK-97/InSeIn/tree/main

[185] Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space

Junho Lee, Jeongwoo Shin, Seung Woo Ko, Seongsu Ha, Joonseok Lee

Main category: cs.CV

TL;DR: The paper proposes a semi-optimal frame sampling policy to efficiently select top N frames from T frames, reducing search space from O(T^N) to O(T).

Details

Motivation: Existing methods for frame sampling suffer from high computational complexity due to the vast search space, especially for large N.

Method: The proposed method selects top N frames based on independently estimated per-frame confidence, avoiding brute-force search.

Result: The semi-optimal policy approximates the optimal policy efficiently and ensures stable performance across varying N and T.

Conclusion: The method significantly reduces computational complexity while maintaining high performance in video classification tasks.

Abstract: Given a video with $T$ frames, frame sampling is the task of selecting $N \ll T$ frames so as to maximize the performance of a fixed video classifier. Brute-force search and most existing methods alike suffer from the vast search space of $\binom{T}{N}$, especially when $N$ gets large. To address this challenge, we introduce a novel perspective of reducing the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
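
The selection rule described above can be sketched in a few lines: score every frame once, then keep the $N$ highest-scoring frames. This is only an illustration of top-$N$ selection; the confidence values are made up, and `select_top_n_frames` is a hypothetical helper, not the authors' code.

```python
import heapq
import math

def select_top_n_frames(confidences, n):
    """Return the indices of the n highest-confidence frames, in temporal order."""
    top = heapq.nlargest(n, range(len(confidences)), key=confidences.__getitem__)
    return sorted(top)

# Made-up per-frame confidences for a 5-frame clip.
conf = [0.10, 0.90, 0.30, 0.80, 0.05]
chosen = select_top_n_frames(conf, 2)

# For comparison: the number of N-frame subsets a brute-force search would face.
subsets = math.comb(len(conf), 2)
```

Each frame is scored independently, so the number of candidates that need a value estimate grows linearly in $T$, whereas the number of subsets grows as $\binom{T}{N}$.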

[186] PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting

Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, Seungryong Kim

Main category: cs.CV

TL;DR: The paper introduces a framework for novel view synthesis from unposed images using 3DGS, addressing challenges like misaligned 3D Gaussians by leveraging pre-trained models and learnable modules for refinement.

Details

Motivation: To enable high-quality 3D reconstruction and view synthesis without relying on dense image views, accurate camera poses, or substantial overlaps.

Method: Uses pre-trained monocular depth estimation and visual correspondence models for coarse alignment, followed by learnable modules to refine depth and pose estimates. Geometry confidence scores are introduced to condition Gaussian parameter prediction.

Result: Achieves state-of-the-art performance on large-scale real-world datasets, validated by ablation studies.

Conclusion: The framework effectively addresses challenges in novel view synthesis, offering a scalable and practical solution.

Abstract: We consider the problem of novel view synthesis from unposed images in a single feed-forward pass. Our framework capitalizes on the fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS, and we further extend it to offer a practical solution that relaxes common assumptions such as dense image views, accurate camera poses, and substantial image overlaps. We achieve this through identifying and addressing unique challenges arising from the use of pixel-aligned 3DGS: misaligned 3D Gaussians across different views induce noisy or sparse gradients that destabilize training and hinder convergence, especially when the above assumptions are not met. To mitigate this, we employ pre-trained monocular depth estimation and visual correspondence models to achieve coarse alignments of 3D Gaussians. We then introduce lightweight, learnable modules to refine depth and pose estimates from the coarse alignments, improving the quality of 3D reconstruction and novel view synthesis. Furthermore, the refined estimates are leveraged to estimate geometry confidence scores, which assess the reliability of 3D Gaussian centers and condition the prediction of Gaussian parameters accordingly. Extensive evaluations on large-scale real-world datasets demonstrate that PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices. Project page: https://cvlab-kaist.github.io/PF3plat/

[187] CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

Leon Sick, Dominik Engel, Sebastian Hartwig, Pedro Hermosilla, Timo Ropinski

Main category: cs.CV

TL;DR: The paper proposes a 3D-based method for unsupervised instance segmentation, improving accuracy by addressing overlapping instances and mask ambiguity.

Details

Motivation: Traditional methods rely on human-annotated data and fail to separate overlapping instances due to 2D limitations.

Method: Uses 3D point clouds to cut semantic masks and introduces a Spatial Importance function and Spatial Confidence components for cleaner training.

Result: Outperforms state-of-the-art methods on benchmarks for unsupervised instance segmentation and object detection.

Conclusion: The 3D approach effectively addresses limitations of 2D methods, enhancing unsupervised instance segmentation.

Abstract: Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human-annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo-masks and then train a class-agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo-masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.

[188] Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

Shengyuan Zhang, An Zhao, Ling Yang, Zejian Li, Chenye Meng, Haoran Xu, Tianrun Chen, AnYang Wei, Perry Pengyun GU, Lingyun Sun

Main category: cs.CV

TL;DR: ScoreLiDAR is a novel distillation method for 3D LiDAR scene completion, improving speed (>5x faster) and quality via Structural Loss.

Details

Motivation: Slow sampling speed of diffusion models limits practical use in autonomous vehicles.

Method: Proposes ScoreLiDAR, a distillation method with Structural Loss (scene-wise and point-wise terms).

Result: Achieves 5.37s per frame (vs. 30.55s) and outperforms state-of-the-art models.

Conclusion: ScoreLiDAR enables efficient, high-quality 3D LiDAR scene completion for autonomous vehicles.

Abstract: Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed ScoreLiDAR, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel Structural Loss, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame (>5x) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models. Our model and code are publicly available on https://github.com/happyw1nd/ScoreLiDAR.
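
As a quick check of the reported ">5x" figure, the two per-frame timings quoted in the abstract give the following speedup and relative time reduction (values rounded):

```latex
\[
\frac{30.55\ \text{s}}{5.37\ \text{s}} \approx 5.69,
\qquad
1 - \frac{5.37}{30.55} \approx 0.824
\]
```

That is, roughly a 5.7x speedup, or about 82% less time per frame.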

[189] Personalization Toolkit: Training Free Personalization of Large Vision Language Models

Soroush Seifi, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi

Main category: cs.CV

TL;DR: A training-free method for personalizing Large Vision-Language Models (LVLMs) using pre-trained vision models, RAG, and visual prompting, outperforming training-based approaches.

Details

Motivation: Existing LVLM personalization methods require time-consuming test-time training, limiting real-world applicability.

Method: Leverages pre-trained vision models for feature extraction, RAG for instance identification, and visual prompting for guided outputs.

Result: Achieves state-of-the-art performance without additional training, excelling in multi-concept personalization.

Conclusion: The approach is efficient, flexible, and superior to training-based methods, validated by a new benchmark.

Abstract: Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users and object instances, and to generate contextually tailored responses. Existing approaches typically rely on time-consuming test-time training for each user or object, making them impractical for real-world deployment, a limitation reflected in current personalization benchmarks, which are focused on object-centric, single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization and introduce a comprehensive real-world benchmark designed to rigorously evaluate various aspects of the personalization task. Our method leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.

[190] EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang

Main category: cs.CV

TL;DR: Encoder-free VLMs like EVEv2.0 rival encoder-based models by simplifying architecture and improving training strategies.

Details

Motivation: To bridge the performance gap between encoder-free and encoder-based VLMs and explore their under-examined potential.

Method: Systematic analysis of encoder-free VLMs, development of efficient strategies, and launch of EVEv2.0 with hierarchical modality association and optimized training.

Result: EVEv2.0 shows superior data efficiency and vision-reasoning capability, reducing modality interference.

Conclusion: Encoder-free VLMs, with proper design, can match or outperform encoder-based models, offering simpler and more efficient deployment.

Abstract: Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.

[191] Robust sensitivity control in digital pathology via tile score distribution matching

Arthur Pignet, John Klein, Genevieve Robin, Antoine Olivier

Main category: cs.CV

TL;DR: A novel method using optimal transport and MIL ensures sensitivity control in WSI classification models for reliable deployment in digital pathology.

Details

Motivation: Addressing the challenge of deploying pathology models across centers due to distribution shifts and clinical sensitivity requirements.

Method: Combines optimal transport and Multiple Instance Learning (MIL) for sensitivity control in WSI classification.

Result: Validated across cohorts, the method achieves robust sensitivity control with minimal calibration samples.

Conclusion: Provides a practical solution for reliable computational pathology system deployment.

Abstract: Deploying digital pathology models across medical centers is challenging due to distribution shifts. Recent advances in domain generalization improve model transferability in terms of aggregated performance measured by the Area Under Curve (AUC). However, clinical regulations often require controlling the transferability of other metrics, such as prescribed sensitivity levels. We introduce a novel approach to control the sensitivity of whole slide image (WSI) classification models, based on optimal transport and Multiple Instance Learning (MIL). Validated across multiple cohorts and tasks, our method enables robust sensitivity control with only a handful of calibration samples, providing a practical solution for reliable deployment of computational pathology systems.
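
To illustrate what controlling a prescribed sensitivity level means in practice, the sketch below picks a decision threshold from a few labeled positive calibration scores so that a target fraction of them is flagged. This is plain threshold calibration, not the optimal-transport tile-score matching the paper proposes; the scores, the target, and both helper names are assumptions.

```python
import math

def threshold_for_sensitivity(positive_scores, target):
    """Largest threshold that still flags at least `target` of the positives."""
    ranked = sorted(positive_scores, reverse=True)
    k = max(1, math.ceil(target * len(ranked)))  # positives that must be flagged
    return ranked[k - 1]

def sensitivity(scores, labels, threshold):
    """Fraction of positive cases whose score reaches the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)

# Made-up slide-level scores for five positive calibration slides.
calibration_positives = [0.9, 0.2, 0.7, 0.8, 0.6]
thr = threshold_for_sensitivity(calibration_positives, target=0.8)
achieved = sensitivity(calibration_positives, [1] * 5, thr)
```

A threshold chosen this way only holds its sensitivity if scores at the new center are distributed like the calibration scores, which is the gap the distribution-matching approach in the abstract is meant to address.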

[192] Learning to Generalize without Bias for Open-Vocabulary Action Recognition

Yating Yu, Congqi Cao, Yifan Zhang, Yanning Zhang

Main category: cs.CV

TL;DR: Open-MeDe introduces a meta-learning framework to address static bias in CLIP-based video learners, improving generalization for open-vocabulary action recognition.

Details

Motivation: CLIP-based video learners overfit on static features, limiting their generalizability to novel out-of-context actions.

Method: Open-MeDe uses meta-optimization and static debiasing to enhance generalization, employing cross-batch meta-optimization and self-ensemble techniques.

Result: Open-MeDe outperforms state-of-the-art methods in both in-context and out-of-context scenarios.

Conclusion: The framework effectively mitigates static bias and improves generalization, demonstrating robust performance in open-vocabulary action recognition.

Abstract: Leveraging the effective visual-text alignment and static generalizability from CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary action recognition in-context. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalizing and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, the absence of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios. Code is released at https://github.com/Mia-YatingYu/Open-MeDe.

[193] Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation

Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu

Main category: cs.CV

TL;DR: The paper proposes RML, a robust multi-view learning method combining representation fusion and alignment to handle heterogeneous and imperfect multi-view data.

Details

Motivation: Multi-view learning (MVL) struggles with heterogeneous and imperfect datasets, limiting its effectiveness. RML aims to address this by improving robustness and adaptability.

Method: RML uses a multi-view transformer fusion network to convert heterogeneous data into homogeneous embeddings and employs contrastive learning with simulated perturbations for alignment.

Result: RML demonstrates effectiveness in unsupervised clustering, noise-label classification, and cross-modal hashing retrieval through experiments.

Conclusion: RML is a versatile, self-supervised method that enhances robustness in multi-view learning and can be applied to downstream tasks.

Abstract: Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually causes MVL methods designed for specific combinations of views to lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in multi-view unsupervised clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate RML’s effectiveness. Code is available at https://github.com/SubmissionsIn/RML.

[194] When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li

Main category: cs.CV

TL;DR: A text-guided token pruning method with Dynamic Image Pyramid (DIP) integration is proposed to efficiently process large Remote Sensing Images (RSIs) while preserving details and reducing computational costs. A new benchmark, LRS-VQA, is introduced for evaluation.

Details

Motivation: Current LVLMs lose information or incur high computational costs when processing gigapixel RSIs. A more efficient method is needed.

Method: Proposes a Region Focus Module (RFM) for text-aware region localization and a coarse-to-fine token pruning strategy with DIP.

Result: Outperforms existing methods on four datasets and shows higher efficiency in high-resolution settings.

Conclusion: The proposed method effectively balances detail preservation and computational efficiency for large RSIs, validated by the new LRS-VQA benchmark.

Abstract: Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs’ perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are in https://github.com/VisionXLab/LRS-VQA.
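
The pruning idea can be illustrated with a minimal sketch, assuming each vision token already carries a text-relevance score (in the paper this signal would come from the RFM; here it is just a list of floats). The function name `prune_tokens` and the `keep_ratio` parameter are hypothetical and not taken from the paper or its repository.

```python
def prune_tokens(tokens, relevance, keep_ratio=0.25):
    """Keep the most text-relevant tokens, preserving their original order.

    `relevance` is an assumed per-token score; how it is produced is out of
    scope for this sketch.
    """
    if len(tokens) != len(relevance):
        raise ValueError("tokens and relevance must have the same length")
    keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the highest-scoring tokens.
    top = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)[:keep]
    # Restore spatial/sequence order before returning.
    return [tokens[i] for i in sorted(top)]


kept = prune_tokens(["t0", "t1", "t2", "t3"], [0.1, 0.9, 0.3, 0.8], keep_ratio=0.5)
```

A coarse-to-fine variant would apply the same selection at a low-resolution pyramid level first and only tokenize the finer tiles that correspond to the kept regions.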

[195] External Knowledge Injection for CLIP-Based Class-Incremental Learning

Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, De-Chuan Zhan

Main category: cs.CV

TL;DR: ENGINE enhances CLIP-based Class-Incremental Learning by injecting external knowledge through a dual-branch framework, improving feature capture and achieving state-of-the-art performance.

Details

Motivation: CLIP's reliance on class-name matching overlooks contextual language information, and incremental updates overwrite detailed features, necessitating external knowledge for better adaptation.

Method: A dual-branch injection tuning framework: visual branch uses data augmentation, and textual branch leverages GPT-4 for rewriting descriptors. Post-tuning knowledge re-ranks predictions.

Result: Extensive experiments show ENGINE achieves state-of-the-art performance in CIL tasks.

Conclusion: ENGINE effectively integrates external knowledge to improve CLIP-based CIL, demonstrating superior adaptability and performance.

Abstract: Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of “cat” can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE. Code is available at: https://github.com/LAMDA-CL/ICCV25-ENGINE

[196] Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

Wonwoong Cho, Yan-Ying Chen, Matthew Klenk, David I. Inouye, Yanxia Zhang

Main category: cs.CV

TL;DR: The paper introduces the Attribute (Att) Adapter, a plug-and-play module for precise multi-attribute control in text-to-image diffusion models, outperforming baselines and improving disentanglement.

Details

Motivation: Existing text-to-image diffusion models struggle with precise control of continuous attributes (e.g., eye openness) using text-only guidance, especially for multiple attributes in new domains.

Method: The Att-Adapter uses a decoupled cross-attention module and integrates a Conditional Variational Autoencoder (CVAE) to avoid overfitting, enabling multi-attribute control without paired training data.

Result: Att-Adapter outperforms LoRA-based baselines in controlling continuous attributes, offers broader control range, and improves disentanglement, surpassing StyleGAN-based methods.

Conclusion: The Att-Adapter is a flexible, scalable solution for multi-attribute control in diffusion models, requiring no paired data and enhancing performance over existing techniques.

Abstract: Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.

[197] Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Ding Wang, Botian Shi

Main category: cs.CV

TL;DR: VaLiK enhances LLM reasoning by constructing MMKGs using VLMs for cross-modal alignment and noise filtering, outperforming prior methods.

Details

Motivation: Addressing incomplete knowledge and hallucinations in LLMs by improving cross-modal understanding through MMKGs.

Method: Cascades VLMs to align image features with text, uses cross-modal similarity verification to filter noise, and constructs MMKGs without manual annotations.

Result: Achieves storage efficiency and outperforms state-of-the-art models in multimodal reasoning tasks.

Conclusion: VaLiK effectively supplements LLMs with cross-modal information, improving reasoning without manual annotations.

Abstract: Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we developed a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKGs construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at https://github.com/Wings-Of-Disaster/VaLiK.

[198] Trigger without Trace: Towards Stealthy Backdoor Attack on Text-to-Image Diffusion Models

Jie Zhang, Zhongqi Wang, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: The paper introduces Trigger without Trace (TwT), a method to create stealthy backdoor attacks in text-to-image diffusion models by mitigating detectable consistencies in semantic and attention patterns.

Details

Motivation: Current backdoor attacks in text-to-image models leave detectable traces due to semantic and attention consistencies, making them vulnerable to defenses. The goal is to develop a more stealthy backdoor method.

Method: TwT uses syntactic structures as triggers to break semantic consistency and employs Kernel Maximum Mean Discrepancy (KMMD) regularization to disrupt attention consistency.

Result: The method achieves a 97.5% attack success rate and evades three state-of-the-art detection mechanisms with over 98% success.

Conclusion: TwT exposes vulnerabilities in current backdoor defenses, demonstrating the need for more robust detection methods.

Abstract: Backdoor attacks targeting text-to-image diffusion models have advanced rapidly. However, current backdoor samples often exhibit two key abnormalities compared to benign samples: 1) Semantic Consistency, where backdoor prompts tend to generate images with similar semantic content even with significant textual variations to the prompts; 2) Attention Consistency, where the trigger induces consistent structural responses in the cross-attention maps. These consistencies leave detectable traces for defenders, making backdoors easier to identify. In this paper, toward stealthy backdoor samples, we propose Trigger without Trace (TwT) by explicitly mitigating these consistencies. Specifically, our approach leverages syntactic structures as backdoor triggers to amplify the sensitivity to textual variations, effectively breaking down the semantic consistency. Besides, a regularization method based on Kernel Maximum Mean Discrepancy (KMMD) is proposed to align the distribution of cross-attention responses between backdoor and benign samples, thereby disrupting attention consistency. Extensive experiments demonstrate that our method achieves a 97.5% attack success rate while exhibiting stronger resistance to defenses. It achieves an average of over 98% backdoor samples bypassing three state-of-the-art detection mechanisms, revealing the vulnerabilities of current backdoor defense methods. The code is available at https://github.com/Robin-WZQ/TwT.
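
To make the KMMD term more concrete, here is a minimal sketch of a biased squared-MMD estimate with a Gaussian kernel between two sets of feature vectors. Treating cross-attention responses as flat vectors, the bandwidth `sigma`, and all function names are assumptions for illustration; this is not the authors' implementation.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


# Toy stand-ins for attention responses of benign and backdoor samples.
benign = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1]]
backdoor = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9]]
penalty = mmd_squared(benign, backdoor)
```

Used as a regularizer, a value like `penalty` would be added to the training loss so that minimizing it pulls the two response distributions together.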

[199] DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction

Rui Wang, Quentin Lohmeyer, Mirko Meboldt, Siyu Tang

Main category: cs.CV

TL;DR: DeGauss is a self-supervised framework for dynamic scene reconstruction using decoupled Gaussian Splatting, outperforming existing methods in distractor-free 3D reconstruction.

Details

Motivation: Reconstructing clean 3D scenes from dynamic, cluttered real-world captures (e.g., egocentric videos) is challenging.

Method: DeGauss uses decoupled dynamic-static Gaussian Splatting, modeling dynamic elements with foreground Gaussians and static content with background Gaussians, coordinated by a probabilistic mask.

Result: DeGauss outperforms existing methods on benchmarks like NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields.

Conclusion: DeGauss sets a strong baseline for generalizable, distractor-free 3D reconstruction in dynamic environments.

Abstract: Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstruction in highly dynamic, interaction-rich environments. Project page: https://batfacewayne.github.io/DeGauss.io/

[200] PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer

Main category: cs.CV

TL;DR: The paper proposes building a Perception Language Model (PLM) in an open framework to address the lack of transparency in vision-language models, focusing on video understanding without proprietary distillation. It introduces new datasets and benchmarks for evaluation.

Details

Motivation: Closed-source vision-language models hinder scientific progress. The paper aims to enable transparent research by avoiding proprietary models and addressing data gaps in video understanding.

Method: The study analyzes standard training pipelines without proprietary distillation, uses large-scale synthetic data, and releases 2.8M human-labeled video QA pairs and captions. It also introduces PLM-VideoBench for evaluation.

Result: The work provides open data, training recipes, code, and models, along with a new benchmark for video understanding tasks.

Conclusion: The paper advocates for open research in vision-language models, demonstrating the feasibility of transparent methods and contributing datasets and tools for reproducible progress.

Abstract: Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about “what”, “where”, “when”, and “how” of a video. We make our work fully reproducible by providing data, training recipes, code & models. https://github.com/facebookresearch/perception_models

[201] Advances in 4D Generation: A Survey

Qiaowei Miao, Kehan Li, Jinsheng Quan, Zhiyuan Min, Shaojie Ma, Yichao Xu, Yi Yang, Ping Liu, Yawei Luo

Main category: cs.CV

TL;DR: The paper surveys 4D generation in AI, covering representations, generative frameworks, paradigms, and challenges like consistency and controllability.

Details

Motivation: To unify understanding of 4D generation, a rapidly evolving field with applications in digital humans and autonomous driving.

Method: Systematic review of 4D representations, generative pipelines, and integration of motion/geometry priors. Categorizes four paradigms and summarizes applications.

Result: Identifies five key challenges (consistency, controllability, diversity, efficiency, fidelity) and compares current approaches.

Conclusion: Provides a forward-looking perspective to guide future research in 4D generation.

Abstract: Generative artificial intelligence has recently progressed from static image and video synthesis to 3D content generation, culminating in the emergence of 4D generation, the task of synthesizing temporally coherent dynamic 3D assets guided by user input. As a burgeoning research frontier, 4D generation enables richer interactive and immersive experiences, with applications ranging from digital humans to autonomous driving. Despite rapid progress, the field lacks a unified understanding of 4D representations, generative frameworks, basic paradigms, and the core technical challenges it faces. This survey provides a systematic and in-depth review of the 4D generation landscape. To comprehensively characterize 4D generation, we first categorize fundamental 4D representations and outline associated techniques for 4D generation. We then present an in-depth analysis of representative generative pipelines based on conditions and representation methods. Subsequently, we discuss how motion and geometry priors are integrated into 4D outputs to ensure spatio-temporal consistency under various control schemes. From an application perspective, this paper summarizes 4D generation tasks in areas such as dynamic object/scene generation, digital human synthesis, editable 4D content, and embodied AI. Furthermore, we summarize and multi-dimensionally compare four basic paradigms for 4D generation: End-to-End, Generated-Data-Based, Implicit-Distillation-Based, and Explicit-Supervision-Based. Concluding our analysis, we highlight five key challenges (consistency, controllability, diversity, efficiency, and fidelity) and contextualize these with current approaches. By distilling recent advances and outlining open problems, this work offers a comprehensive and forward-looking perspective to guide future research in 4D generation.

[202] Vision Transformers in Precision Agriculture: A Comprehensive Survey

Saber Mehdipour, Seyed Abolghasem Mirroshandel, Seyed Amirhossein Tabatabaei

Main category: cs.CV

TL;DR: A review of Vision Transformers (ViTs) in precision agriculture, comparing them to CNNs, discussing challenges, and outlining future research directions.

Details

Motivation: To address limitations of traditional plant disease detection methods by exploring ViTs' potential in agriculture.

Method: Review of ViTs’ architecture, transition from NLP to CV, comparative analysis with CNNs, and examination of hybrid models and performance metrics.

Result: ViTs offer advantages like handling long-range dependencies and scalability, but face challenges like data needs and computational demands.

Conclusion: ViTs have transformative potential in precision agriculture, with future research needed to address technical challenges.

Abstract: Detecting plant diseases is a crucial aspect of modern agriculture, as it plays a key role in maintaining crop health and increasing overall yield. Traditional approaches, though still valuable, often rely on manual inspection or conventional machine learning techniques, both of which face limitations in scalability and accuracy. Recently, Vision Transformers (ViTs) have emerged as a promising alternative, offering advantages such as improved handling of long-range dependencies and better scalability for visual tasks. This review explores the application of ViTs in precision agriculture, covering a range of tasks. We begin by introducing the foundational architecture of ViTs and discussing their transition from Natural Language Processing (NLP) to Computer Vision. The discussion includes the concept of inductive bias in traditional models like Convolutional Neural Networks (CNNs), and how ViTs mitigate these biases. We provide a comprehensive review of recent literature, focusing on key methodologies, datasets, and performance metrics. This study also includes a comparative analysis of CNNs and ViTs, along with a review of hybrid models and performance enhancements. Technical challenges such as data requirements, computational demands, and model interpretability are addressed, along with potential solutions. Finally, we outline future research directions and technological advancements that could further support the integration of ViTs in real-world agricultural settings. Our goal with this study is to offer practitioners and researchers a deeper understanding of how ViTs are poised to transform smart and precision agriculture.

Byeongjun Kwon, Munchurl Kim

Main category: cs.CV

TL;DR: PRO is a tile-based framework for high-resolution depth estimation, addressing issues like depth discontinuity and dataset bias, improving efficiency and generalizability.

Details

Motivation: Existing depth estimation models struggle with high-resolution images due to resolution discrepancies and patch-based methods' inefficiencies, leading to depth discontinuity and poor generalizability.

Method: PRO introduces Grouped Patch Consistency Training and Bias Free Masking to enhance efficiency and mitigate depth discontinuity and dataset bias.

Result: PRO improves depth estimation accuracy and generalizability, as shown in zero-shot evaluations on multiple datasets.

Conclusion: PRO effectively addresses limitations of current methods, offering a scalable and generalizable solution for high-resolution depth estimation.

Abstract: Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased estimation accuracy on depth with tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches, resulting in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the real-world sparse ground truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluations on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrate that our PRO can be seamlessly integrated into existing depth estimation models.
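
The overlap-consistency idea can be sketched as follows, assuming two horizontally adjacent depth patches stored as nested lists and a mean-absolute-difference penalty. The abstract describes four jointly processed patches and does not state the exact loss form, so the two-patch setup, the L1 choice, and the function name are all assumptions.

```python
def overlap_consistency_loss(patch_a, patch_b, overlap):
    """Mean absolute difference between the right `overlap` columns of
    patch_a and the left `overlap` columns of patch_b.

    Both patches are lists of rows with equal height; they are assumed to
    cover neighbouring tiles that share `overlap` columns.
    """
    if overlap < 1:
        raise ValueError("overlap must be at least one column")
    total, count = 0.0, 0
    for row_a, row_b in zip(patch_a, patch_b):
        for va, vb in zip(row_a[-overlap:], row_b[:overlap]):
            total += abs(va - vb)
            count += 1
    return total / count


# Two 2x3 depth patches sharing two columns; one overlapping value disagrees.
left = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
right = [[2.0, 3.0, 4.0], [2.5, 3.0, 4.0]]
loss = overlap_consistency_loss(left, right, overlap=2)
```

Driving this quantity toward zero during training encourages neighbouring patches to agree where they overlap, which is what removes visible seams when the patches are reassembled.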

[204] Unsupervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing

Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Hao-Chiang Shao, Chiou-Ting Hsu

Main category: cs.CV

TL;DR: UFDANet is a one-class face anti-spoofing method that disentangles and augments features to improve generalizability, outperforming previous one-class methods and matching two-class methods.

Details

Motivation: To address the limitations of existing FAS methods—overfitting in two-class approaches and weak domain robustness in one-class methods—by disentangling and augmenting features.

Method: UFDANet uses unsupervised feature disentanglement to separate liveness and domain features, with augmentation schemes for both to enhance representability and generalizability.

Result: UFDANet outperforms previous one-class FAS methods and matches state-of-the-art two-class methods in performance.

Conclusion: UFDANet effectively balances robustness and generalizability in face anti-spoofing, offering a superior one-class solution.

Abstract: Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. The UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.

[205] PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models

Murthy L, Subarna Tripathi

Main category: cs.CV

TL;DR: Proposes a method for neural fingerprinting in text-to-image diffusion models using cyclic error-correcting codes to improve attribution accuracy.

Details

Motivation: Address the risk of misuse of open-source text-to-image models by ensuring accurate attribution through neural fingerprinting.

Method: Leverages cyclic error-correcting codes from coding theory to incorporate neural fingerprinting into diffusion models.

Result: Aims to achieve higher attribution accuracy compared to existing methods, addressing the trade-off with generation quality.

Conclusion: The proposed method could make neural fingerprinting practically deployable by achieving near-perfect attribution accuracy.

Abstract: The risk of misusing text-to-image generative models for malicious purposes, especially given the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. A plethora of recent work aims to address neural fingerprinting. The trade-off between the attribution accuracy and generation quality of such models has been studied extensively. None of the existing methods has yet achieved 100% attribution accuracy. However, any model with less than 100% accuracy is practically non-deployable. In this work, we propose an accurate method to incorporate neural fingerprinting for text-to-image diffusion models, leveraging the concepts of cyclic error-correcting codes from coding theory.
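
For readers unfamiliar with cyclic error-correcting codes, the sketch below shows the textbook (7,4) cyclic Hamming code with generator polynomial g(x) = x^3 + x + 1, which corrects any single flipped bit. The abstract does not say which cyclic code PALADIN uses or how the fingerprint bits are embedded, so this is only a generic illustration of why such codes can make a noisy bit-string recoverable; all names are hypothetical.

```python
G = 0b1011   # g(x) = x^3 + x + 1; bit i holds the coefficient of x^i
N, K = 7, 4  # codeword length and message length


def poly_mod(value, divisor=G):
    """Remainder of polynomial division over GF(2)."""
    dlen = divisor.bit_length()
    while value.bit_length() >= dlen:
        value ^= divisor << (value.bit_length() - dlen)
    return value


def encode(message):
    """Systematic encoding: 4 message bits followed by 3 parity bits."""
    shifted = message << (N - K)
    return shifted | poly_mod(shifted)


def decode(received):
    """Correct at most one flipped bit, then return the 4-bit message."""
    syndrome = poly_mod(received)
    if syndrome:
        # Each single-bit error position has a distinct nonzero syndrome.
        for pos in range(N):
            if poly_mod(1 << pos) == syndrome:
                received ^= 1 << pos
                break
    return received >> (N - K)


def cyclic_shift(word):
    """Rotate a 7-bit word left by one position."""
    return ((word << 1) | (word >> (N - 1))) & ((1 << N) - 1)


fingerprint = 0b1101
corrupted = encode(fingerprint) ^ (1 << 2)  # simulate one extraction error
assert decode(corrupted) == fingerprint
```

The defining property of a cyclic code is that rotating any codeword yields another codeword, which `poly_mod(cyclic_shift(c)) == 0` confirms for every codeword `c` of this code.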

[206] TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, Ying Tai

Main category: cs.CV

TL;DR: The paper introduces TextCrafter, a method for Complex Visual Text Generation (CVTG), addressing issues like distorted text and omissions. It proposes a progressive strategy and token focus enhancement, validated by a new dataset, CVTG-2K, showing superior performance.

Details

Motivation: CVTG tasks often result in distorted, blurred, or missing visual text in images, highlighting the need for a robust solution.

Method: TextCrafter uses a progressive decomposition strategy and token focus enhancement to improve text alignment and prominence in generated images.

Result: TextCrafter outperforms state-of-the-art methods, effectively reducing text confusion, omissions, and blurriness.

Conclusion: TextCrafter is a significant advancement in CVTG, validated by the CVTG-2K dataset, demonstrating superior performance in generating complex visual text.

Abstract: This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted and blurred visual text or omit some visual text entirely. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.

[207] MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

Xiaochun Lei, Siqi Wu, Weilin Wu, Zetao Jiang

Main category: cs.CV

TL;DR: MambaNeXt-YOLO is a novel object detection framework combining CNNs and Mamba for efficient real-time detection, achieving 66.6% mAP at 31.9 FPS on PASCAL VOC.

Details

Motivation: The need for efficient real-time object detection with rich global context, overcoming Transformer limitations in complexity.

Method: Hybrid MambaNeXt Block (CNNs + Mamba), MAFPN for multi-scale detection, and edge-focused optimization.

Result: 66.6% mAP at 31.9 FPS on PASCAL VOC, deployable on edge devices.

Conclusion: MambaNeXt-YOLO effectively balances accuracy and efficiency for real-time and edge deployments.

Abstract: Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves multi-scale object detection across various object sizes; and (3) Edge-focused Efficiency: our method achieved 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.

[208] NSegment : Label-specific Deformations for Remote Sensing Image Segmentation

Yechan Kim, DongHo Yoon, SooYeon Kim, Moongu Jeon

Main category: cs.CV

TL;DR: NSegment, a data augmentation method, improves RS image segmentation by applying elastic transformations to labels, addressing annotation inconsistencies without increasing complexity.

Details

Motivation: Labeling errors in RS datasets are common due to ambiguous boundaries, mixed pixels, and annotator bias, while annotated data is scarce and costly.

Method: Proposes NSegment, which applies elastic transformations to segmentation labels with varying intensity per sample in each epoch.

Result: Enhances performance of RS image segmentation across state-of-the-art models.

Conclusion: NSegment effectively mitigates labeling inconsistencies without added complexity or training time.

Abstract: Labeling errors in remote sensing (RS) image segmentation datasets often remain implicit and subtle due to ambiguous class boundaries, mixed pixels, shadows, complex terrain features, and subjective annotator bias. Furthermore, the scarcity of annotated RS data due to high image acquisition and labeling costs complicates training noise-robust models. While sophisticated mechanisms such as label selection or noise correction might address this issue, they tend to increase training time and add implementation complexity. In this letter, we propose NSegment, a simple yet effective data augmentation solution to mitigate this issue. Unlike traditional methods, it applies elastic transformations only to segmentation labels, varying deformation intensity per sample in each training epoch to address annotation inconsistencies. Experimental results demonstrate that our approach improves the performance of RS image segmentation on various state-of-the-art models.
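
The label-only deformation idea can be sketched as follows. This is a toy illustration, not the authors' implementation: the box-blur smoothing, the nearest-neighbour lookup, the function names, and the `alpha_range` values are all assumptions. Only the two properties stated in the abstract are taken from the source: the image is left untouched, and the deformation intensity is redrawn for each sample.

```python
import random

def smooth(field, passes=2):
    """Box-blur a 2D list a few times to get a smooth displacement field."""
    h, w = len(field), len(field[0])
    for _ in range(passes):
        out = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                vals = [field[j][i]
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2))]
                out[y][x] = sum(vals) / len(vals)
        field = out
    return field

def deform_label(label, alpha, rng):
    """Elastically warp an integer label map using nearest-neighbour lookup."""
    h, w = len(label), len(label[0])
    dx = smooth([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)])
    dy = smooth([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Source pixel = current pixel shifted by the scaled displacement,
            # clamped to the grid so class ids are never interpolated.
            sx = min(w - 1, max(0, round(x + alpha * dx[y][x])))
            sy = min(h - 1, max(0, round(y + alpha * dy[y][x])))
            out[y][x] = label[sy][sx]
    return out

def augment(image, label, rng, alpha_range=(0.0, 4.0)):
    """Return the image unchanged and a label warped with a fresh intensity."""
    alpha = rng.uniform(*alpha_range)  # hypothetical range, redrawn per call
    return image, deform_label(label, alpha, rng)
```

Calling `augment` once per sample in every epoch gives each label a different deformation strength while the input image stays fixed.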

[209] MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection

Jiayi Cheng, Can Gao, Jie Zhou, Jiajun Wen, Tao Dai, Jinbao Wang

Main category: cs.CV

TL;DR: A unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) is proposed, leveraging local and global geometry-aware information to improve efficiency and generalization over single-category methods.

Details

Motivation: Existing 3D anomaly detection methods require task-specific models per category, which is costly and inefficient. MC3D-AD aims to address this by unifying the process.

Method: The model includes an adaptive geometry-aware masked attention module, a local geometry-aware encoder, and a global query decoder to reconstruct normal representations.

Result: MC3D-AD outperforms state-of-the-art single-category methods, improving object-level AUROC by 3.1% on Real3D-AD and 9.3% on Anomaly-ShapeNet.

Conclusion: The proposed MC3D-AD model offers a more efficient and generalized solution for 3D anomaly detection across multiple categories.

Abstract: 3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. Therefore, this paper presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the AD task. MC3D-AD is evaluated on two publicly available datasets, Real3D-AD and Anomaly-ShapeNet, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvement in object-level AUROC on Real3D-AD and Anomaly-ShapeNet, respectively. The code is available at https://github.com/iCAN-SZU/MC3D-AD.

[210] Diffuse and Disperse: Image Generation with Representation Regularization

Runqian Wang, Kaiming He

Main category: cs.CV

TL;DR: Proposes Dispersive Loss, a plug-and-play regularizer for diffusion models, improving performance without extra data or parameters.

Details

Motivation: To bridge the gap between diffusion-based generative models and representation learning by introducing a simple, effective regularizer.

Method: Introduces Dispersive Loss, which disperses internal representations in hidden space without needing positive sample pairs or interfering with regression.

Result: Consistent improvements on ImageNet across models, outperforming baselines like REPA.

Conclusion: Dispersive Loss effectively enhances diffusion models, bridging generative modeling and representation learning.

Abstract: The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose Dispersive Loss, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate Dispersive Loss on the ImageNet dataset across a range of models and report consistent improvements over widely used and strong baselines. We hope our work will help bridge the gap between generative modeling and representation learning.
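
A repulsion-only regularizer of this kind might look like the sketch below. It is one plausible form, assuming a squared Euclidean distance and a temperature `tau`; the abstract does not state the exact distance, temperature, or which layer's representations are used, so treat every name here as hypothetical.

```python
import math

def dispersive_loss(zs, tau=1.0):
    """Log of the mean over all pairs of exp(-squared_distance / tau).

    There is no positive-pair term: the value only decreases as the
    representations in the batch move farther apart.
    """
    n = len(zs)
    total = 0.0
    for zi in zs:
        for zj in zs:
            d2 = sum((a - b) ** 2 for a, b in zip(zi, zj))
            total += math.exp(-d2 / tau)
    return math.log(total / (n * n))
```

In training, such a term would typically be added to the regression objective with a small weight, for example `loss = diffusion_loss + lam * dispersive_loss(hidden)`, where `lam` and `hidden` are placeholder names.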

[211] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

Rui-Yang Ju, Sheng-Yen Huang, Yi-Ping Hung

Main category: cs.CV

TL;DR: ToonifyGB extends Toonify for stylized 3D head avatars using Gaussian blendshapes, improving video stability and animation quality.

Details

Motivation: To overcome the limitations of fixed-resolution cropping in StyleGAN and enable diverse stylized 3D head avatar synthesis.

Method: A two-stage framework: Stage 1 generates stable stylized video with improved StyleGAN; Stage 2 learns stylized neutral head and expression blendshapes for animation.

Result: Validated on Arcane and Pixar styles, ToonifyGB efficiently renders high-quality stylized avatars with arbitrary expressions.

Conclusion: ToonifyGB successfully synthesizes diverse stylized 3D head avatars, enhancing real-time animation quality.

Abstract: The introduction of 3D Gaussian blendshapes has enabled the real-time reconstruction of animatable head avatars from monocular video. Toonify, a StyleGAN-based method, has become widely used for facial image stylization. To extend Toonify for synthesizing diverse stylized 3D head avatars using Gaussian blendshapes, we propose an efficient two-stage framework, ToonifyGB. In Stage 1 (stylized video generation), we adopt an improved StyleGAN to generate the stylized video from the input video frames, which overcomes the limitation of cropping aligned faces at a fixed resolution as preprocessing for normal StyleGAN. This process provides a more stable stylized video, which enables Gaussian blendshapes to better capture the high-frequency details of the video frames, facilitating the synthesis of high-quality animations in the next stage. In Stage 2 (Gaussian blendshapes synthesis), our method learns a stylized neutral head model and a set of expression blendshapes from the generated stylized video. By combining the neutral head model with expression blendshapes, ToonifyGB can efficiently render stylized avatars with arbitrary expressions. We validate the effectiveness of ToonifyGB on benchmark datasets using two representative styles: Arcane and Pixar.
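
Combining a neutral model with expression blendshapes is commonly done by linear blending, which the sketch below illustrates. It assumes the standard linear blendshape model applied to a flat list of per-attribute values; how ToonifyGB parameterizes its Gaussians is not described in this summary, and the function name is hypothetical.

```python
def blend(neutral, blendshapes, weights):
    """Return neutral + sum_k w_k * (blendshape_k - neutral), per attribute."""
    out = list(neutral)
    for shape, w in zip(blendshapes, weights):
        for i, (b, n) in enumerate(zip(shape, neutral)):
            out[i] += w * (b - n)  # add the weighted offset of this expression
    return out
```

With all weights at zero the result is the neutral model, and intermediate weights give arbitrary mixed expressions from a small fixed set of blendshapes.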

[212] SyncMapV2: Robust and Adaptive Unsupervised Segmentation

Heng Zhang, Zikang Wan, Danilo Vasconcellos Vargas

Main category: cs.CV

TL;DR: SyncMapV2 is an unsupervised segmentation algorithm that degrades far less than SOTA methods under noise, weather, blur, and digital corruption, without robust training or supervision, and that adapts online to new inputs.

Details

Motivation: Human vision's robustness and adaptability in segmentation tasks inspire the development of AI algorithms that can perform similarly without explicit training or supervision.

Method: SyncMapV2 uses self-organizing dynamical equations and random network concepts, enabling online adaptation without re-initialization for new inputs.

Result: SyncMapV2 shows minimal performance drop (0.01% mIoU) under digital corruption, significantly outperforming SOTA methods in noise, weather, and blur conditions.

Conclusion: SyncMapV2 pioneers a new paradigm for robust and adaptive AI, achieving human-like adaptability and fostering future advancements in intelligent systems.

Abstract: Human vision excels at segmenting visual cues without the need for explicit training, and it remains remarkably robust even as noise severity increases. In contrast, existing AI algorithms struggle to maintain accuracy under similar conditions. Here, we present SyncMapV2, the first to solve unsupervised segmentation with state-of-the-art robustness. SyncMapV2 exhibits a minimal drop in mIoU, only 0.01%, under digital corruption, compared to a 23.8% drop observed in SOTA methods. This superior performance extends across various types of corruption: noise (7.3% vs. 37.7%), weather (7.5% vs. 33.8%), and blur (7.0% vs. 29.5%). Notably, SyncMapV2 accomplishes this without any robust training, supervision, or loss functions. It is based on a learning paradigm that uses self-organizing dynamical equations combined with concepts from random networks. Moreover, unlike conventional methods that require re-initialization for each new input, SyncMapV2 adapts online, mimicking the continuous adaptability of human vision. Thus, we go beyond the accurate and robust results, and present the first algorithm that can do all the above online, adapting to input rather than re-initializing. In adaptability tests, SyncMapV2 demonstrates near-zero performance degradation, which motivates and fosters a new generation of robust and adaptive intelligence in the near future.

[213] FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, Ping Luo

Main category: cs.CV

TL;DR: FUDOKI is a non-autoregressive multimodal model built on discrete flow matching that performs comparably to state-of-the-art AR-based MLLMs on visual understanding and image generation, with self-correction and bidirectional context.

Details

Motivation: Existing MLLMs rely on autoregressive architectures, limiting image generation and reasoning. FUDOKI aims to overcome these limitations.

Method: FUDOKI uses discrete flow matching with metric-induced probability paths, enabling iterative refinement and bidirectional context. It transitions from pre-trained AR-based MLLMs.

Result: FUDOKI matches AR-based MLLMs in performance for visual understanding and image generation, with potential for further enhancement via test-time scaling.

Conclusion: FUDOKI is a promising alternative to AR-based MLLMs and a potential foundation for next-generation unified multimodal models, with further gains available from test-time scaling.

Abstract: The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

[214] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation

Taiqin Chen, Zikun Zhou, Zheng Fang, Wenzhen Zou, Kangjun Liu, Ke Chen, Yongbing Zhang, Yaowei Wang

Main category: cs.CV

TL;DR: The paper proposes RadioDUN, a deep unfolding network for dense radio map estimation, integrating physical propagation models and adaptive learning to outperform existing methods.

Details

Motivation: Existing deep learning methods for radio map estimation lack integration with physical characteristics, leading to inefficiencies.

Method: The paper casts radio map estimation as a sparse signal recovery problem, decomposes it into factor optimization sub-problems, and introduces RadioDUN with a dynamic reweighting module and shadowing loss.

Result: RadioDUN outperforms state-of-the-art methods in experiments.

Conclusion: The proposed method effectively integrates physical propagation models and adaptive learning, enhancing radio map estimation performance.

Abstract: The radio map represents the spatial distribution of spectrum resources within a region, supporting efficient resource allocation and interference mitigation. However, it is difficult to construct a dense radio map as a limited number of samples can be measured in practical scenarios. While existing works have used deep learning to estimate dense radio maps from sparse samples, they are hard to integrate with the physical characteristics of the radio map. To address this challenge, we cast radio map estimation as the sparse signal recovery problem. A physical propagation model is further incorporated to decompose the problem into multiple factor optimization sub-problems, thereby reducing recovery complexity. Inspired by the existing compressive sensing methods, we propose the Radio Deep Unfolding Network (RadioDUN) to unfold the optimization process, achieving adaptive parameter adjusting and prior fitting in a learnable manner. To account for the radio propagation characteristics, we develop a dynamic reweighting module (DRM) to adaptively model the importance of each factor for the radio map. Inspired by the shadowing factor in the physical propagation model, we integrate obstacle-related factors to express the obstacle-induced signal stochastic decay. The shadowing loss is further designed to constrain the factor prediction and act as a supplementary supervised objective, which enhances the performance of RadioDUN. Extensive experiments have been conducted to demonstrate that the proposed method outperforms the state-of-the-art methods. Our code will be made publicly available upon publication.
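
Deep unfolding can be illustrated on a generic sparse recovery problem. The sketch below unrolls ISTA, a classic compressive sensing solver, into a fixed number of layers; ISTA is used here as a stand-in, since this summary does not say which optimizer RadioDUN unfolds, and the factor decomposition, dynamic reweighting module, and shadowing loss are not reproduced.

```python
def soft_threshold(v, t):
    """Shrink each entry toward zero by t (proximal step for an L1 prior)."""
    return [max(abs(x) - t, 0.0) * (1 if x > 0 else -1) for x in v]

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def unfolded_ista(A, y, steps, thresholds):
    """Run one ISTA iteration per (step, threshold) pair.

    In a deep unfolding network each pair would be a learnable
    parameter of one layer; here the values are fixed by hand.
    """
    At = [list(col) for col in zip(*A)]
    x = [0.0] * len(A[0])
    for step, thr in zip(steps, thresholds):
        residual = [r - t for r, t in zip(matvec(A, x), y)]
        grad = matvec(At, residual)  # gradient of 0.5 * ||Ax - y||^2
        x = soft_threshold([xi - step * g for xi, g in zip(x, grad)], thr)
    return x
```

The number of layers is set by the length of `steps`, which is what turns an iterative solver into a fixed-depth network whose step sizes and thresholds can be trained.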

[215] Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin

Main category: cs.CV

TL;DR: Flash-VStream is an efficient video language model for long videos, reducing latency with a novel Flash Memory module.

Details

Motivation: Existing models struggle with long videos due to computational overhead and inefficiency.

Method: Uses a Flash Memory module with low-capacity context memory and high-capacity augmentation memory for efficient processing.

Result: Achieves state-of-the-art performance on benchmarks like EgoSchema and MLVU with reduced inference latency.

Conclusion: Flash-VStream offers efficient, real-time processing for long videos, outperforming existing methods.

Abstract: Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at https://github.com/IVGSZ/Flash-VStream.

[216] Zero-Shot Skeleton-Based Action Recognition With Prototype-Guided Feature Alignment

Kai Zhou, Shuhai Zhang, Zeng You, Jinwu Hu, Mingkui Tan, Fei Liu

Main category: cs.CV

TL;DR: The paper introduces PGFA, a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, addressing limitations in existing methods and achieving significant accuracy improvements.

Details

Motivation: Existing methods for zero-shot skeleton-based action recognition suffer from insufficient skeleton feature discrimination and alignment bias between skeleton and unseen text features.

Method: Proposes an end-to-end cross-modal contrastive training framework for better skeleton-text alignment and a prototype-guided text feature alignment strategy to mitigate distribution discrepancies.

Result: PGFA outperforms the top competitor SMIE method, achieving absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on NTU-60, NTU-120, and PKU-MMD datasets, respectively.

Conclusion: PGFA effectively addresses the challenges of zero-shot skeleton-based action recognition, demonstrating superior performance through improved feature alignment and discrimination.

Abstract: Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models’ generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.
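
Cross-modal contrastive training of the kind described above is often built on an InfoNCE-style loss. The sketch below shows one direction of that loss (skeleton to text) and assumes unit-normalized features and a temperature `tau`; the exact loss used by PGFA and its prototype-guided alignment step are not given in this summary.

```python
import math

def info_nce(skel, text, tau=0.1):
    """Mean cross-entropy of matching skeleton i to text i among all texts."""
    loss = 0.0
    for i, s in enumerate(skel):
        logits = [sum(a * b for a, b in zip(s, t)) / tau for t in text]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(skel)
```

A symmetric version would average this value with `info_nce(text, skel, tau)`, so that both encoders receive gradients during end-to-end training.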

[217] Rectifying Magnitude Neglect in Linear Attention

Qihang Fan, Huaibo Huang, Yuang Ai, Ran He

Main category: cs.CV

TL;DR: The paper introduces Magnitude-Aware Linear Attention (MALA) to address the performance gap between Linear Attention and Softmax Attention by incorporating Query magnitude, achieving strong results across various tasks.

Details

Motivation: Linear Attention achieves linear complexity but performs significantly worse than Softmax Attention because it ignores Query magnitude, so its attention score distribution cannot adapt as the Query scales.

Method: The authors analyze Linear Attention’s formulation, identify the Query magnitude issue, and propose MALA to incorporate this information, improving attention score distribution.

Result: MALA achieves strong results on image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation.

Conclusion: MALA successfully bridges the gap between Linear and Softmax Attention by dynamically adapting to Query magnitude, offering efficient and effective global modeling.

Abstract: As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query’s magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
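
The magnitude-neglect observation can be checked numerically. The sketch below compares softmax attention weights with normalized kernel attention weights when the Query is scaled; the ReLU feature map is an assumption (any positively homogeneous map behaves the same way), and the code illustrates the problem only, not MALA's fix.

```python
import math

def softmax_weights(q, keys):
    """Attention weights from softmax over dot-product scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def linear_weights(q, keys):
    """Normalized kernel attention weights with feature map phi = ReLU.

    Assumes at least one score is positive so the normalizer is non-zero.
    """
    def phi(v):
        return [max(x, 0.0) for x in v]
    scores = [sum(a * b for a, b in zip(phi(q), phi(k))) for k in keys]
    z = sum(scores)
    return [s / z for s in scores]

keys = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 2.0]
q_scaled = [4.0 * x for x in q]

# Scaling the Query sharpens the softmax weights ...
soft_before, soft_after = softmax_weights(q, keys), softmax_weights(q_scaled, keys)
# ... but cancels out in the linear-attention normalization.
lin_before, lin_after = linear_weights(q, keys), linear_weights(q_scaled, keys)
```

Because the scale factor appears in both the numerator and the normalizer of the linear form, `lin_before` and `lin_after` coincide, while `soft_after` concentrates more weight on the best-matching key than `soft_before`.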

[218] Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models

Zejian Li, Yize Li, Chenye Meng, Zhongni Liu, Yang Ling, Shengyuan Zhang, Guang Yang, Changyuan Yang, Zhiyuan Yang, Lingyun Sun

Main category: cs.CV

TL;DR: Inversion-DPO is a new alignment framework for diffusion models that avoids reward modeling by using DDIM inversion, improving training efficiency and precision.

Details

Motivation: Existing alignment methods for diffusion models are computationally intensive and may reduce accuracy. Inversion-DPO aims to address these issues.

Method: Reformulates Direct Preference Optimization (DPO) with DDIM inversion, replacing the intractable posterior sampling in Diffusion-DPO with deterministic inversion of winning and losing samples to noise, which removes the need for a reward model.

Result: Shows significant performance improvements in text-to-image and compositional image generation tasks, with high-fidelity outputs.

Conclusion: Inversion-DPO offers an efficient, high-precision alignment method for diffusion models, enhancing their applicability to complex tasks.

Abstract: Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts intractable posterior sampling in Diffusion-DPO with the deterministic inversion from winning and losing samples to noise and thus derives a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO
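
Deterministic DDIM inversion can be sketched on scalars. The update below is the standard DDIM step with zero added noise, run forward in time to map a clean sample to a noise latent; the schedule values and the constant noise predictor are toy assumptions, and how Inversion-DPO uses the resulting latents in its loss is not shown here.

```python
import math

def ddim_move(x, t_from, t_to, abar, eps_model):
    """Deterministic DDIM update between two timesteps (no added noise)."""
    eps = eps_model(x, t_from)
    x0_pred = (x - math.sqrt(1 - abar[t_from]) * eps) / math.sqrt(abar[t_from])
    return math.sqrt(abar[t_to]) * x0_pred + math.sqrt(1 - abar[t_to]) * eps

def invert(x0, abar, eps_model):
    """Map a clean sample to its noise latent by stepping t = 0 up to T."""
    x = x0
    for t in range(len(abar) - 1):
        x = ddim_move(x, t, t + 1, abar, eps_model)
    return x

def sample(xT, abar, eps_model):
    """Map a noise latent back to a clean sample by stepping t = T down to 0."""
    x = xT
    for t in range(len(abar) - 1, 0, -1):
        x = ddim_move(x, t, t - 1, abar, eps_model)
    return x

# Toy cumulative-alpha schedule (abar[0] = 1 means no noise) and a
# constant noise predictor, under which the round trip is exact.
abar = [1.0, 0.8, 0.5, 0.2]
toy_eps = lambda x, t: 0.3
latent_win = invert(2.0, abar, toy_eps)
latent_lose = invert(-1.0, abar, toy_eps)
```

With a real network the predicted noise depends on `x`, so inversion is only approximately reversible; the toy predictor is chosen so the sketch stays exact and easy to verify.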

[219] Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification

Kunlun Xu, Fan Zhuo, Jiangmeng Li, Xu Zou, Jiahuan Zhou

Main category: cs.CV

TL;DR: The paper introduces SPRED, a framework for Semi-Supervised Lifelong Person Re-Identification (Semi-LReID), addressing performance degradation in unlabeled data scenarios by combining dynamic prototype-guided pseudo-label generation and dual-knowledge purification.

Details

Motivation: Real-world scenarios often lack labeled data, causing performance issues in lifelong person re-identification (LReID). Existing methods struggle with noisy knowledge from unlabeled data, prompting the need for a semi-supervised approach.

Method: SPRED uses learnable identity prototypes to generate pseudo-labels and a dual-knowledge cooperation scheme to refine them, creating a self-reinforcing cycle for improved unlabeled data utilization.

Result: SPRED achieves state-of-the-art performance on Semi-LReID benchmarks, demonstrating effective long-term adaptation.

Conclusion: The proposed SPRED framework successfully enhances semi-supervised LReID by dynamically refining pseudo-labels and integrating historical and current knowledge, ensuring robust performance.

Abstract: Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem where LReID methods suffer severe performance degradation. Existing LReID methods, even when combined with semi-supervised strategies, suffer from limited long-term adaptation performance due to struggling with the noisy knowledge occurring during unlabeled data utilization. In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation framework (SPRED). Our key innovation lies in establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and new-old knowledge collaborative purification to enhance the utilization of unlabeled data. Specifically, learnable identity prototypes are introduced to dynamically capture the identity distributions and generate high-quality pseudo-labels. Then, the dual-knowledge cooperation scheme integrates current model specialization and historical model generalization, refining noisy pseudo-labels. Through this cyclic design, reliable pseudo-labels are progressively mined to improve current-stage learning and ensure positive knowledge propagation over long-term learning. Experiments on the established Semi-LReID benchmarks show that our SPRED achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/ICCV2025-SPRED
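
Prototype-guided pseudo-labeling can be sketched as a nearest-prototype match with a confidence cutoff. This is a simplified illustration with fixed prototypes, cosine similarity, and a hypothetical `threshold`; in SPRED the prototypes are learnable and the labels are further refined by the dual-knowledge cooperation scheme, neither of which is modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pseudo_label(feature, prototypes, threshold=0.8):
    """Return the id of the most similar prototype, or None if not confident."""
    best_id, best_sim = None, -1.0
    for pid, proto in prototypes.items():
        sim = cosine(feature, proto)
        if sim > best_sim:
            best_id, best_sim = pid, sim
    return best_id if best_sim >= threshold else None
```

Samples that return `None` would be left unlabeled for the current step, which is one simple way to keep low-confidence pseudo-labels from propagating noise.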

[220] Leveraging the Structure of Medical Data for Improved Representation Learning

Andrea Agostini, Sonia Laguna, Alain Ryser, Samuel Ruiperez-Campillo, Moritz Vandenhirtz, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt

Main category: cs.CV

TL;DR: A self-supervised framework for medical AI pretraining leverages multi-view X-ray structure without textual supervision, showing strong performance on MIMIC-CXR compared to supervised objectives and structure-agnostic baselines.

Details

Motivation: To address data scarcity and lack of annotations in clinical datasets like MIMIC-CXR while utilizing their inherent multi-view structure for effective pretraining.

Method: Uses paired chest X-rays (frontal and lateral views) as natural positive pairs, learning to reconstruct views from sparse patches and aligning latent embeddings.

Result: Demonstrates strong performance on MIMIC-CXR compared to supervised methods and baselines ignoring structure.

Conclusion: Provides a lightweight, modality-agnostic approach for domain-specific pretraining in structured but scarce data settings.

Abstract: Building generalizable medical AI systems requires pretraining strategies that are data-efficient and domain-aware. Unlike internet-scale corpora, clinical datasets such as MIMIC-CXR offer limited image counts and scarce annotations, but exhibit rich internal structure through multi-view imaging. We propose a self-supervised framework that leverages the inherent structure of medical datasets. Specifically, we treat paired chest X-rays (i.e., frontal and lateral views) as natural positive pairs, learning to reconstruct each view from sparse patches while aligning their latent embeddings. Our method requires no textual supervision and produces informative representations. Evaluated on MIMIC-CXR, we show strong performance compared to supervised objectives and baselines trained without leveraging structure. This work provides a lightweight, modality-agnostic blueprint for domain-specific pretraining where data is structured but scarce.

[221] Frequency-Dynamic Attention Modulation for Dense Prediction

Linwei Chen, Lin Gu, Ying Fu

Main category: cs.CV

TL;DR: The paper introduces FDAM, a frequency-dynamic attention modulation method for Vision Transformers (ViTs) to address frequency vanishing and loss of critical details. It combines Attention Inversion and Frequency Dynamic Scaling, improving performance in tasks like segmentation and detection.

Details

Motivation: ViTs suffer from frequency vanishing due to their attention mechanism acting as a low-pass filter, leading to loss of critical details and textures.

Method: Proposes FDAM with two techniques: Attention Inversion (AttInv) for high-pass filtering and Frequency Dynamic Scaling (FreqScale) for fine-grained frequency adjustments.

Result: FDAM avoids representation collapse and improves performance in models like SegFormer and DeiT for tasks such as semantic segmentation and object detection. Achieves state-of-the-art results in remote sensing detection.

Conclusion: FDAM effectively addresses frequency vanishing in ViTs, enhancing performance across multiple tasks and models.

Abstract: Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
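The attention-inversion idea in the abstract above can be illustrated with a minimal sketch. It assumes the complementary high-pass filter is the identity minus the row-normalized attention matrix and that a single scalar (the hypothetical `alpha`) mixes the two responses; the paper's exact formulation, including how the combination is made dynamic, may differ.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, signal):
    """Multiply an n x n matrix by a length-n signal (one scalar feature per token)."""
    return [sum(w * x for w, x in zip(row, signal)) for row in matrix]

def attention_inversion(scores, signal, alpha):
    """Mix the low-pass attention output with its assumed high-pass complement.

    alpha is a hypothetical fixed mixing weight used only for illustration.
    """
    attn = [softmax(row) for row in scores]      # rows sum to 1: an averaging (low-pass) filter
    low = matvec(attn, signal)
    high = [x - l for x, l in zip(signal, low)]  # (I - A) x: the part that averaging removed
    return [l + alpha * h for l, h in zip(low, high)]

scores = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]  # uniform attention over four tokens
signal = [1.0, -1.0, 1.0, -1.0]                    # purely high-frequency token signal

print(attention_inversion(scores, signal, alpha=0.0))  # plain attention averages the signal away
print(attention_inversion(scores, signal, alpha=1.0))  # adding the high-pass branch restores it
```

With uniform attention every output token is the mean of the inputs, which is why the alternating signal vanishes unless the high-pass branch is mixed back in.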

[222] Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, Hao Wang

Main category: cs.CV

TL;DR: S3PO-GS is a robust RGB-only outdoor 3DGS SLAM method that avoids scale drift and improves tracking accuracy by using a self-consistent tracking module and patch-based dynamic mapping.

Details

Motivation: Previous 3DGS SLAM methods lack geometric priors or suffer from scale drift, limiting their effectiveness in outdoor scenes.

Method: S3PO-GS introduces a self-consistent tracking module anchored in the 3DGS pointmap and a patch-based dynamic mapping module with geometric priors.

Result: The method achieves state-of-the-art results in novel view synthesis and tracking accuracy on Waymo, KITTI, and DL3DV datasets.

Conclusion: S3PO-GS is highly effective for complex outdoor environments, offering precise tracking and high-quality scene reconstruction.

Abstract: 3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: https://3dagentworld.github.io/S3PO-GS/.

[223] QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen, Yan Xie, Jianxun Cui, Xun Yang, Wangmeng Zuo

Main category: cs.CV

TL;DR: QR-LoRA is a novel fine-tuning framework using QR decomposition to separate content and style attributes in text-to-image models, reducing trainable parameters and improving disentanglement.

Details

Motivation: Existing LoRA methods cause feature entanglement in content-style fusion tasks, leading to undesired attribute mixing.

Method: QR-LoRA leverages QR decomposition: the Q matrix minimizes interference, while the R matrix encodes transformations. Only a task-specific ΔR matrix is trained.

Result: QR-LoRA reduces parameters by half and achieves superior disentanglement in content-style fusion compared to conventional LoRA.

Conclusion: QR-LoRA introduces a structured, parameter-efficient fine-tuning paradigm for generative models with strong disentanglement properties.

Abstract: Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models. The project page is available at: https://luna-ai-lab.github.io/QR-LoRA/.
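A toy sketch of the structured update described above: factor a weight matrix as W = QR, keep Q and R frozen, and express each adaptation as an additive ΔR. How the paper obtains Q and R from the pretrained weights, the shape of ΔR, and merging by summing the ΔR matrices are all assumptions made here for illustration; the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def qr_gram_schmidt(w):
    """QR decomposition of a full-rank square matrix via classical Gram-Schmidt."""
    n = len(w)
    cols = [[w[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j, v in enumerate(cols):
        u = list(v)
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))   # projection onto earlier basis vector
            u = [uk - r[i][j] * qk for uk, qk in zip(u, q)]
        r[j][j] = sum(uk * uk for uk in u) ** 0.5
        q_cols.append([uk / r[j][j] for uk in u])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r

def adapt(q, r, deltas):
    """Return Q (R + sum of deltas); Q and R stay frozen, only the deltas would be trained."""
    n = len(r)
    r_new = [[r[i][j] + sum(d[i][j] for d in deltas) for j in range(n)] for i in range(n)]
    return matmul(q, r_new)

w = [[2.0, 1.0], [0.0, 3.0]]               # toy stand-in for a pretrained weight
q, r = qr_gram_schmidt(w)
delta_content = [[0.1, 0.0], [0.0, 0.0]]   # hypothetical content adaptation
delta_style = [[0.0, 0.0], [0.0, 0.2]]     # hypothetical style adaptation
merged = adapt(q, r, [delta_content, delta_style])
print(merged)
```

Because every adaptation lives in the same fixed Q basis, combining two of them only touches R, which is the intuition behind merging without cross-contamination.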

Chang-Hwan Son

Main category: cs.CV

TL;DR: A novel GAN-based blind face image restoration framework improves recognition accuracy in adverse weather by integrating local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE).

Details

Motivation: Adverse weather degrades image quality, reducing face recognition accuracy. Existing methods lack dedicated modules for weather-induced degradations, leading to distorted facial textures.

Method: Proposes a GAN-based framework with SFFT for local statistical alignment and DAFE for robust feature extraction under adverse weather.

Result: Outperforms state-of-the-art methods in suppressing distortions and reconstructing facial structures.

Conclusion: The SFFT and DAFE modules enhance structural fidelity and perceptual quality in challenging weather scenarios.

Abstract: With the increasing deployment of intelligent CCTV systems in outdoor environments, there is a growing demand for face recognition systems optimized for challenging weather conditions. Adverse weather significantly degrades image quality, which in turn reduces recognition accuracy. Although recent face image restoration (FIR) models based on generative adversarial networks (GANs) and diffusion models have shown progress, their performance remains limited due to the lack of dedicated modules that explicitly address weather-induced degradations. This leads to distorted facial textures and structures. To address these limitations, we propose a novel GAN-based blind FIR framework that integrates two key components: local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE). The local SFFT module enhances facial structure and color fidelity by aligning the local statistical distributions of low-quality (LQ) facial regions with those of high-quality (HQ) counterparts. Complementarily, the DAFE module enables robust statistical facial feature extraction under adverse weather conditions by aligning LQ and HQ encoder representations, thereby making the restoration process adaptive to severe weather-induced degradations. Experimental results demonstrate that the proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Furthermore, both the SFFT and DAFE modules are empirically validated in enhancing structural fidelity and perceptual quality in face restoration under challenging weather scenarios.

[225] ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez, Paul Couairon, Clément Rambour, Raphaël Fournier-Sniehotta, Ismail Ben Ayed, Jose Dolz, Nicolas Thome

Main category: cs.CV

TL;DR: ViLU is a new framework for Vision-Language Uncertainty quantification, leveraging multi-modal representations and a binary classifier for failure prediction, outperforming existing methods.

Details

Motivation: Addressing the challenge of reliable uncertainty quantification and failure prediction in Vision-Language Models (VLMs).

Method: ViLU integrates visual and textual embeddings via cross-attention, training a binary classifier for uncertainty prediction using weighted binary cross-entropy loss.

Result: Significant improvements over state-of-the-art methods on datasets like ImageNet-1k, CC12M, and LAION-400M.

Conclusion: ViLU effectively quantifies uncertainty in VLMs, with its architecture and training playing critical roles.

Abstract: Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
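The failure-prediction objective can be sketched as a weighted binary cross-entropy over "was the VLM wrong?" labels. This is a minimal illustration, not the released ViLU code: the label convention (1 = misclassified), the function name, and the inverse-frequency choice of `pos_weight` are assumptions.

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy where failure samples (label 1) are scaled by pos_weight."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

# Hypothetical batch: predicted failure probabilities and whether the VLM actually failed.
probs = [0.9, 0.2, 0.1, 0.3]
labels = [1, 0, 0, 0]
pos_weight = labels.count(0) / labels.count(1)   # assumed inverse-frequency weighting

print(weighted_bce(probs, labels, pos_weight))
```

Up-weighting the rarer class keeps the predictor from collapsing to "always correct" when the underlying VLM is accurate on most samples.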

[226] A Transfer Learning-Based Method for Water Body Segmentation in Remote Sensing Imagery: A Case Study of the Zhada Tulin Area

Haonan Chen, Xin Tong

Main category: cs.CV

TL;DR: The study proposes a transfer learning strategy using SegFormer to improve water body segmentation in the Tibetan Plateau, achieving a significant performance boost and providing insights for water management.

Details

Motivation: The Tibetan Plateau is highly sensitive to climate change, posing water security challenges. Robust AI for water monitoring is needed to enhance climate resilience.

Method: A two-stage transfer learning approach with SegFormer was used, pre-training on a diverse source domain and fine-tuning for the arid Zhada Tulin area.

Result: The IoU for water body segmentation improved from 25.50% to 64.84%, revealing a concentrated spatial distribution of water.

Conclusion: The study offers an effective AI solution for monitoring arid regions and supports disaster preparedness in critical river headwaters.

Abstract: The Tibetan Plateau, known as the Asian Water Tower, faces significant water security challenges due to its high sensitivity to climate change. Advancing Earth observation for sustainable water monitoring is thus essential for building climate resilience in this region. This study proposes a two-stage transfer learning strategy using the SegFormer model to overcome domain shift and data scarcity, key barriers in developing robust AI for climate-sensitive applications. After pre-training on a diverse source domain, our model was fine-tuned for the arid Zhada Tulin area. Experimental results show a substantial performance boost: the Intersection over Union (IoU) for water body segmentation surged from 25.50% (direct transfer) to 64.84%. This AI-driven accuracy is crucial for disaster risk reduction, particularly in monitoring flash flood-prone systems. More importantly, the high-precision map reveals a highly concentrated spatial distribution of water, with over 80% of the water area confined to less than 20% of the river channel length. This quantitative finding provides crucial evidence for understanding hydrological processes and designing targeted water management and climate adaptation strategies. Our work thus demonstrates an effective technical solution for monitoring arid plateau regions and contributes to advancing AI-powered Earth observation for disaster preparedness in critical transboundary river headwaters.
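For readers unfamiliar with the reported metric, the following minimal sketch computes Intersection over Union for binary water masks. The masks are toy data, and returning 1.0 when both masks are empty is a convention chosen here, not something stated in the paper.

```python
def iou(pred, target):
    """Intersection over Union for two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p & t   # pixel is water in both masks
            union += p | t   # pixel is water in at least one mask
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))
```

Here two pixels are water in both masks and four are water in at least one, so the score is 2 / 4.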

[227] Swin-TUNA: A Novel PEFT Approach for Accurate Food Image Segmentation

Haotian Chen, Zhiyong Xiao

Main category: cs.CV

TL;DR: Swin-TUNA introduces a parameter-efficient fine-tuning method for food image segmentation, reducing parameters by 98.7% while outperforming FoodSAM.

Details

Motivation: Existing Transformer-based models like FoodSAM are impractical for deployment due to high computational demands.

Method: Swin-TUNA integrates multiscale trainable adapters into Swin Transformer, updating only 4% of parameters with hierarchical feature adaptation.

Result: Achieves mIoU of 50.56% and 74.94% on FoodSeg103 and UECFoodPix Complete, respectively, surpassing FoodSAM while using only 8.13M parameters.

Conclusion: Swin-TUNA offers efficient, lightweight food image segmentation with faster convergence and better generalization.

Abstract: In the field of food image processing, efficient semantic segmentation techniques are crucial for industrial applications. However, existing large-scale Transformer-based models (such as FoodSAM) face challenges in meeting practical deployment requirements due to their massive parameter counts and high computational resource demands. This paper introduces the TUNable Adapter module (Swin-TUNA), a Parameter Efficient Fine-Tuning (PEFT) method that integrates multiscale trainable adapters into the Swin Transformer architecture, achieving high-performance food image segmentation by updating only 4% of the parameters. The core innovation of Swin-TUNA lies in its hierarchical feature adaptation mechanism: it designs separable convolutions in depth and dimensional mappings of varying scales to address the differences in features between shallow and deep networks, combined with a dynamic balancing strategy for task-agnostic and task-specific features. Experiments demonstrate that this method achieves mIoU of 50.56% and 74.94% on the FoodSeg103 and UECFoodPix Complete datasets, respectively, surpassing the fully parameterized FoodSAM model while reducing the parameter count by 98.7% (to only 8.13M). Furthermore, Swin-TUNA exhibits faster convergence and stronger generalization capabilities in low-data scenarios, providing an efficient solution for assembling lightweight food image.

[228] Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond

Huiyu Zhai, Xingxing Yang, Yalan Ye, Chenyang Li, Bin Fan, Changze Li

Main category: cs.CV

TL;DR: ORSANet improves facial expression recognition (FER) under occlusion by using multi-modal semantic guidance, a multi-scale fusion module, and a dynamic loss function, achieving state-of-the-art results.

Details

Motivation: Existing FER models struggle with occlusion and dataset biases, leading to inaccurate classifications.

Method: ORSANet introduces multi-modal semantic guidance (semantic segmentation maps and facial landmarks), a Multi-scale Cross-interaction Module (MCM), and a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss).

Result: ORSANet achieves state-of-the-art performance on public benchmarks and the new Occlu-FER dataset.

Conclusion: ORSANet effectively addresses occlusion and bias challenges in FER, demonstrating superior performance and robustness.

Abstract: Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as a dense semantic prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as a sparse geometric prior to mitigate intrinsic noise in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model’s ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset, dubbed Occlu-FER, to facilitate specialized robustness analysis on various real-world occlusion conditions. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.

[229] PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt

Main category: cs.CV

TL;DR: PRIX is an efficient, camera-only end-to-end autonomous driving model that avoids LiDAR and BEV representations, using a novel Context-aware Recalibration Transformer (CaRT) for robust planning.

Details

Motivation: Current autonomous driving models are hindered by large sizes, reliance on LiDAR, and computationally intensive BEV representations, limiting scalability for mass-market vehicles.

Method: PRIX uses a visual feature extractor and generative planning head to predict trajectories directly from raw pixels, enhanced by the CaRT module.

Result: PRIX achieves state-of-the-art performance on NavSim and nuScenes benchmarks, matching larger models while being more efficient.

Conclusion: PRIX offers a practical, efficient solution for real-world deployment, with open-source availability.

Abstract: While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.

[230] Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization

Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li, Jiajun Liu, Sen Wang

Main category: cs.CV

TL;DR: UMIVR is an interactive text-to-video retrieval framework that minimizes uncertainties (text ambiguity, mapping uncertainty, frame uncertainty) using principled metrics, improving retrieval accuracy.

Details

Motivation: Current interactive TVR systems lack explicit uncertainty quantification, limiting effectiveness. UMIVR addresses this gap.

Method: UMIVR uses semantic entropy (TAS), Jensen-Shannon divergence (MUS), and temporal quality (TQFS) to quantify uncertainties and guide clarifying questions.

Result: Achieves 69.2% Recall@1 on MSR-VTT-1k after 10 interactive rounds, with gains validated on multiple benchmarks.

Conclusion: UMIVR effectively reduces retrieval ambiguity, setting a foundation for uncertainty-aware interactive TVR.

Abstract: Despite recent advances, text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties (text ambiguity, mapping uncertainty, and frame uncertainty) via principled, training-free metrics: semantic entropy-based Text Ambiguity Score (TAS), Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR’s effectiveness, achieving notable gains in Recall@1 (69.2% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
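The two information-theoretic quantities named above, entropy and Jensen-Shannon divergence, can be sketched in a few lines. This only illustrates the underlying formulas; which distributions UMIVR actually feeds into TAS and MUS is not stated in the abstract, so the example inputs and variable names below are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: H(M) - (H(P) + H(Q)) / 2 with M the midpoint of P and Q."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

# Hypothetical probabilities over three possible interpretations of a text query.
clear_query = [0.98, 0.01, 0.01]
vague_query = [0.34, 0.33, 0.33]
print(entropy(clear_query), entropy(vague_query))   # the vague query has higher entropy

# Hypothetical retrieval distributions over four candidate videos.
peaked = [0.85, 0.05, 0.05, 0.05]
flat = [0.25, 0.25, 0.25, 0.25]
print(js_divergence(peaked, flat))
```

Both measures are training-free in the sense that they need only probability vectors, which matches the abstract's emphasis on metrics that require no extra learning.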

[231] Towards Holistic Surgical Scene Graph

Jongmin Shin, Enki Cho, Ka Young Kim, Jung Yong Kim, Seong Tae Kim, Namkee Oh

Main category: cs.CV

TL;DR: The paper introduces Endoscapes-SG201 dataset and SSG-Com, a graph-based method, to enhance surgical scene understanding by incorporating tool-action-target combinations and hand identity into graph representations.

Details

Motivation: Existing graph-based surgical scene representations lack exploration of tool-action-target combinations and hand identity, which are crucial for comprehensive scene understanding.

Method: Proposes the Endoscapes-SG201 dataset, annotated with tool-action-target combinations and hand identity, and introduces SSG-Com, a graph-based method to model these elements.

Result: Experiments show improved performance in downstream tasks like critical view of safety assessment and action triplet recognition.

Conclusion: Integrating tool-action-target and hand identity into graph representations significantly enhances surgical scene understanding.

Abstract: Surgical scene understanding is crucial for computer-assisted intervention systems, requiring visual comprehension of surgical scenes that involves diverse elements such as surgical tools, anatomical structures, and their interactions. To effectively represent the complex information in surgical scenes, graph-based approaches have been explored to structurally model surgical entities and their relationships. Previous surgical scene graph studies have demonstrated the feasibility of representing surgical scenes using graphs. However, certain aspects of surgical scenes, such as diverse combinations of tool-action-target and the identity of the hand operating the tool, remain underexplored in graph-based representations, despite their importance. To incorporate these aspects into graph representations, we propose the Endoscapes-SG201 dataset, which includes annotations for tool-action-target combinations and hand identity. We also introduce SSG-Com, a graph-based method designed to learn and represent these critical elements. Through experiments on downstream tasks such as critical view of safety assessment and action triplet recognition, we demonstrated the importance of integrating these essential scene graph components, highlighting their significant contribution to surgical scene understanding. The code and dataset are available at https://github.com/ailab-kyunghee/SSG-Com.

[232] LPTR-AFLNet: Lightweight Integrated Chinese License Plate Rectification and Recognition Network

Guangzhu Xu, Pengcheng Zuo, Zhi Ke, Bangjun Lei

Main category: cs.CV

TL;DR: A lightweight, unified network (LPTR-AFLNet) is proposed for correcting and recognizing Chinese license plates, combining perspective correction with optimized recognition, achieving high accuracy and real-time performance.

Details

Motivation: Challenges in CLPR due to perspective distortions and limited computational resources on edge devices necessitate a low-complexity, end-to-end solution.

Method: LPTR-AFLNet integrates a perspective transformation correction module (PTR) with AFLNet, using recognition output to guide correction. Improvements include an enhanced attention module and Focal Loss for better accuracy.

Result: The method excels in correcting perspective distortion and recognizing double-line plates, with high accuracy and a runtime under 10 ms on lower-mid-range GPUs.

Conclusion: LPTR-AFLNet is efficient, accurate, and practical for real-world deployment in challenging CLPR scenarios.

Abstract: Chinese License Plate Recognition (CLPR) faces numerous challenges in unconstrained and complex environments, particularly due to perspective distortions caused by various shooting angles and the correction of single-line and double-line license plates. Considering the limited computational resources of edge devices, developing a low-complexity, end-to-end integrated network for both correction and recognition is essential for achieving real-time and efficient deployment. In this work, we propose a lightweight, unified network named LPTR-AFLNet for correcting and recognizing Chinese license plates, which combines a perspective transformation correction module (PTR) with an optimized license plate recognition network, AFLNet. The network leverages the recognition output as a weak supervisory signal to effectively guide the correction process, ensuring accurate perspective distortion correction. To enhance recognition accuracy, we introduce several improvements to LPRNet, including an improved attention module to reduce confusion among similar characters and the use of Focal Loss to address class imbalance during training. Experimental results demonstrate the exceptional performance of LPTR-AFLNet in rectifying perspective distortion and recognizing double-line license plate images, maintaining high recognition accuracy across various challenging scenarios. Moreover, on lower-mid-range GPU platforms, the method runs in less than 10 milliseconds, indicating its practical efficiency and broad applicability.
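The Focal Loss mentioned above down-weights easy examples so that rare or confusable characters contribute more to training. A minimal per-sample sketch follows; the focusing parameter `gamma=2.0` is a commonly used default assumed here (the abstract does not state the paper's setting), and the optional class-balancing factor is omitted.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t), with p_t the true-class probability."""
    p_t = min(max(probs[target], 1e-12), 1.0)   # clamp to keep log() finite
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = [0.9, 0.05, 0.05]   # hypothetical confident, correct character prediction
hard = [0.3, 0.4, 0.3]     # hypothetical ambiguous prediction; true class is index 0

# gamma = 0 recovers ordinary cross-entropy, so the ratio shows how much each sample is damped.
for name, probs in (("easy", easy), ("hard", hard)):
    ce = focal_loss(probs, 0, gamma=0.0)
    fl = focal_loss(probs, 0, gamma=2.0)
    print(name, ce, fl, fl / ce)
```

The easy sample is scaled by (1 - 0.9)^2 while the hard one is scaled by (1 - 0.3)^2, so hard samples dominate the gradient.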

[233] Faithful, Interpretable Chest X-ray Diagnosis with Anti-Aliased B-cos Networks

Marcel Kleinmann, Shashank Agnihotri, Margret Keuper

Main category: cs.CV

TL;DR: B-cos networks improve interpretability in DNNs for medical imaging but suffer from aliasing artifacts. Anti-aliasing methods (FLCPooling and BlurPool) enhance explanation quality while maintaining performance.

Details

Motivation: Faithfulness and interpretability are critical for DNNs in safety-critical domains like medical imaging, where clarity is essential.

Method: Builds on B-cos networks, which replace standard linear layers with a weight-input alignment mechanism, and introduces anti-aliasing strategies (FLCPooling and BlurPool).

Result: Modified B-cos models (B-cos_FLC and B-cos_BP) provide artifact-free, faithful explanations while preserving strong predictive performance.

Conclusion: Anti-aliasing strategies make B-cos networks suitable for clinical use by improving explanation quality without sacrificing performance.

Abstract: Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. In this work, we address these limitations by introducing anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality. Our experiments on chest X-ray datasets demonstrate that the modified $\text{B-cos}_\text{FLC}$ and $\text{B-cos}_\text{BP}$ preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-class and multi-label settings. Code is available at https://github.com/mkleinma/B-cos-medical-paper.
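To show why anti-aliased downsampling matters, here is a one-dimensional sketch of the BlurPool idea: low-pass filter first, then subsample. The [1, 2, 1] / 4 kernel and edge-replicated padding are illustrative assumptions; real networks apply the operation in 2D on feature maps, and FLCPooling (which works in the frequency domain) is not sketched here.

```python
def naive_subsample(signal, stride=2):
    """Keep every stride-th sample with no filtering (prone to aliasing)."""
    return list(signal[::stride])

def blur_pool_1d(signal, stride=2):
    """Blur with a [1, 2, 1] / 4 kernel (edge-replicated padding), then subsample."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    blurred = [(padded[i] + 2 * padded[i + 1] + padded[i + 2]) / 4
               for i in range(len(signal))]
    return blurred[::stride]

alternating = [1.0, -1.0] * 4   # highest-frequency pattern at this sampling rate

print(naive_subsample(alternating))  # aliasing: the oscillation masquerades as a constant
print(blur_pool_1d(alternating))     # blurring suppresses it, apart from a boundary effect
```

The naive version turns a rapidly oscillating input into a spurious flat signal, the same kind of artifact that shows up as grid patterns in explanation maps.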

[234] PolarAnything: Diffusion-based Polarimetric Image Synthesis

Kailong Zhang, Youwei Lyu, Heng Guo, Si Li, Zhanyu Ma, Boxin Shi

Main category: cs.CV

TL;DR: PolarAnything synthesizes photorealistic polarization images from a single RGB input, overcoming limitations of existing simulators like Mitsuba.

Details

Motivation: Limited accessibility of polarization cameras and the need for photorealistic polarization images drive this work.

Method: A diffusion-based generative framework, inspired by the zero-shot performance of pretrained diffusion models, with a representation strategy that preserves polarization fidelity without needing 3D assets.

Result: Generates high-quality polarization images and supports tasks like shape from polarization.

Conclusion: PolarAnything offers a scalable, asset-free solution for polarization image synthesis.

Abstract: Polarization images facilitate image enhancement and 3D reconstruction tasks, but the limited accessibility of polarization cameras hinders their broader application. This gap drives the need for synthesizing photorealistic polarization images. The existing polarization simulator Mitsuba relies on a parametric polarization image formation model and requires extensive 3D assets covering shape and PBR materials, preventing it from generating large-scale photorealistic images. To address this problem, we propose PolarAnything, capable of synthesizing polarization images from a single RGB input with both photorealism and physical accuracy, eliminating the dependency on 3D asset collections. Drawing inspiration from the zero-shot performance of pretrained diffusion models, we introduce a diffusion-based generative framework with an effective representation strategy that preserves the fidelity of polarization properties. Experiments show that our model generates high-quality polarization images and supports downstream tasks like shape from polarization.

[235] PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image

Hyeongjin Nam, Donghwan Kim, Gyeongsik Moon, Kyoung Mu Lee

Main category: cs.CV

TL;DR: PARTE improves 3D human reconstruction by using part segmentation to align textures, avoiding blending issues.

Details

Motivation: Existing methods misalign textures across human parts; PARTE leverages part segmentation for better texture coherence.

Method: Uses a PartSegmenter for 3D part segmentation and a PartTexturer for part-guided texture reconstruction.

Result: Achieves state-of-the-art quality in 3D human reconstruction.

Conclusion: PARTE effectively addresses texture misalignment by integrating part segmentation priors.

Abstract: The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction. The project page is available at https://hygenie1228.github.io/PARTE/.

cs.AI

[236] ASP-Assisted Symbolic Regression: Uncovering Hidden Physics in Fluid Mechanics

Theofanis Aravanis, Grigorios Chrimatopoulos, Mohammad Ferdows, Michalis Xenos, Efstratios Em Tzirtzilakis

Main category: cs.AI

TL;DR: Symbolic Regression (SR) is used to model 3D incompressible flow, yielding interpretable equations that match analytical solutions. A hybrid SR/ASP framework ensures physical plausibility.

Details

Motivation: To bridge the gap between accurate prediction and understanding of flow physics in fluid mechanics.

Method: Applied SR to derive symbolic equations from simulation data, then integrated SR with Answer Set Programming (ASP) for physical plausibility.

Result: SR-derived equations matched analytical solutions; hybrid SR/ASP improved reliability and alignment with domain principles.

Conclusion: SR simplifies complex flows into interpretable equations, and hybrid approaches enhance reliability for explainable predictions.

Abstract: Unlike conventional Machine-Learning (ML) approaches, often criticized as “black boxes”, Symbolic Regression (SR) stands out as a powerful tool for revealing interpretable mathematical relationships in complex physical systems, requiring no a priori assumptions about models’ structures. Motivated by the recognition that, in fluid mechanics, an understanding of the underlying flow physics is as crucial as accurate prediction, this study applies SR to model a fundamental three-dimensional (3D) incompressible flow in a rectangular channel, focusing on the (axial) velocity and pressure fields under laminar conditions. By employing the PySR library, compact symbolic equations were derived directly from numerical simulation data, revealing key characteristics of the flow dynamics. These equations not only approximate the parabolic velocity profile and pressure drop observed in the studied fluid flow, but also perfectly coincide with analytical solutions from the literature. Furthermore, we propose an innovative approach that integrates SR with the knowledge-representation framework of Answer Set Programming (ASP), combining the generative power of SR with the declarative reasoning strengths of ASP. The proposed hybrid SR/ASP framework ensures that the SR-generated symbolic expressions are not only statistically accurate, but also physically plausible, adhering to domain-specific principles. Overall, the study highlights two key contributions: SR’s ability to simplify complex flow behaviours into concise, interpretable equations, and the potential of knowledge-representation approaches to improve the reliability and alignment of data-driven SR models with domain principles. Insights from the examined 3D channel flow pave the way for integrating such hybrid approaches into efficient frameworks, […] where explainable predictions and real-time data analysis are crucial.

[237] I2I-STRADA – Information to Insights via Structured Reasoning Agent for Data Analysis

SaiBarath Sundar, Pranav Satheesan, Udayaadithya Avadhanam

Main category: cs.AI

TL;DR: I2I-STRADA introduces a structured reasoning agent for data analysis, outperforming prior systems by formalizing cognitive workflows.

Details

Motivation: Existing agentic systems for data analysis lack structured reasoning, despite its importance for real-world analytical tasks.

Method: I2I-STRADA models analytical reasoning via modular sub-tasks, formalizing cognitive steps like goal interpretation and adaptive execution.

Result: Evaluations on DABstep and DABench benchmarks show superior performance in planning coherence and insight alignment.

Conclusion: Structured cognitive workflows are crucial for effective agentic systems in data analysis, as demonstrated by I2I-STRADA.

Abstract: Recent advances in agentic systems for data analysis have emphasized automation of insight generation through multi-agent frameworks and orchestration layers. While these systems effectively manage tasks like query translation, data transformation, and visualization, they often overlook the structured reasoning process underlying analytical thinking. Reasoning large language models (LLMs) used for multi-step problem solving are trained as general-purpose problem solvers. As a result, their reasoning or thinking steps do not adhere to fixed processes for specific tasks. Real-world data analysis requires a consistent cognitive workflow: interpreting vague goals, grounding them in contextual knowledge, constructing abstract plans, and adapting execution based on intermediate outcomes. We introduce I2I-STRADA (Information-to-Insight via Structured Reasoning Agent for Data Analysis), an agentic architecture designed to formalize this reasoning process. I2I-STRADA focuses on modeling how analysis unfolds via modular sub-tasks that reflect the cognitive steps of analytical reasoning. Evaluations on the DABstep and DABench benchmarks show that I2I-STRADA outperforms prior systems in planning coherence and insight alignment, highlighting the importance of structured cognitive workflows in agent design for data analysis.

[238] SMARTAPS: Tool-augmented LLMs for Operations Management

Timothy Tin Long Yu, Mahdi Mostajabdaveh, Jabo Serge Byusa, Rindra Ramamonjison, Giuseppe Carenini, Kun Mao, Zirui Zhou, Yong Zhang

Main category: cs.AI

TL;DR: SmartAPS is a conversational system using LLMs to make advanced planning systems more accessible via natural language interaction.

Details

Motivation: Many users are priced out of traditional APS due to consultant costs; SmartAPS aims to democratize access.

Method: Built on a tool-augmented LLM, SmartAPS offers a chat interface for queries, reasoning, recommendations, and scenario analysis.

Result: The system provides an intuitive way for operations planners to interact with APS functionalities.

Conclusion: SmartAPS demonstrates the potential of LLMs to enhance accessibility and usability of complex planning tools.

Abstract: Large language models (LLMs) present intriguing opportunities to enhance user interaction with traditional algorithms and tools in real-world applications. An advanced planning system (APS) is sophisticated software that leverages optimization to help operations planners create, interpret, and modify an operational plan. While highly beneficial, many customers are priced out of using an APS due to the ongoing costs of consultants responsible for customization and maintenance. To address the need for a more accessible APS expressed by supply chain planners, we present SmartAPS, a conversational system built on a tool-augmented LLM. Our system provides operations planners with an intuitive natural language chat interface, allowing them to query information, perform counterfactual reasoning, receive recommendations, and execute scenario analysis to better manage their operation. A short video demonstrating the system has been released: https://youtu.be/KtIrJjlDbyw

[239] Synthesis of timeline-based planning strategies avoiding determinization

Dario Della Monica, Angelo Montanari, Pietro Sala

Main category: cs.AI

TL;DR: The paper identifies a fragment of qualitative timeline-based planning that can be mapped to deterministic finite automata for strategy synthesis, avoiding costly determinization.

Details

Motivation: To address the inefficiency of using nondeterministic automata for planning strategy synthesis due to the required determinization step.

Method: Mapping the plan-existence problem of a specific fragment of qualitative timeline-based planning to the nonemptiness problem of deterministic finite automata.

Result: A deterministic fragment is identified, and a maximal subset of Allen’s relations fitting this fragment is characterized.

Conclusion: The deterministic fragment enables direct strategy synthesis without determinization, improving efficiency in planning.

Abstract: Qualitative timeline-based planning models domains as sets of independent, but interacting, components whose behaviors over time, the timelines, are governed by sets of qualitative temporal constraints (ordering relations), called synchronization rules. Its plan-existence problem has been shown to be PSPACE-complete; in particular, PSPACE-membership has been proved via reduction to the nonemptiness problem for nondeterministic finite automata. However, nondeterministic automata cannot be directly used to synthesize planning strategies as a costly determinization step is needed. In this paper, we identify a fragment of qualitative timeline-based planning whose plan-existence problem can be directly mapped into the nonemptiness problem of deterministic finite automata, which can then synthesize strategies. In addition, we identify a maximal subset of Allen’s relations that fits into such a deterministic fragment.
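
The nonemptiness check that the abstract reduces plan existence to is ordinary reachability: a deterministic finite automaton accepts some word exactly when an accepting state is reachable from the initial state. The following is a minimal Python sketch of that check only. The toy automaton, its state names, and the helper `find_accepted_word` are illustrative assumptions and are not taken from the paper, which constructs its automata from synchronization rules.

```python
from collections import deque


def find_accepted_word(start, accepting, delta):
    """Return a shortest accepted word (list of symbols), or None if the language is empty.

    delta maps (state, symbol) -> state; a missing entry means "no transition".
    """
    # Group outgoing transitions by source state.
    outgoing = {}
    for (state, symbol), target in delta.items():
        outgoing.setdefault(state, []).append((symbol, target))

    # Breadth-first search over states, remembering the word that reached each one.
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, word = queue.popleft()
        if state in accepting:
            return word
        for symbol, target in sorted(outgoing.get(state, [])):
            if target not in seen:
                seen.add(target)
                queue.append((target, word + [symbol]))
    return None


# Hypothetical toy automaton, for illustration only.
toy_delta = {
    ("q0", "a"): "q1",
    ("q1", "a"): "q0",
    ("q1", "b"): "q2",
}
word = find_accepted_word("q0", {"q2"}, toy_delta)   # an accepting state is reachable
empty = find_accepted_word("q0", {"q3"}, toy_delta)  # no accepting state is reachable
```

Because each (state, symbol) pair has at most one successor, the search never has to track sets of states, which is the cost that a determinization step would otherwise introduce.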

[240] Multi-Agent Guided Policy Optimization

Yueheng Li, Guangming Xie, Zongqing Lu

Main category: cs.AI

TL;DR: MAGPO is a novel CTDE framework for MARL that integrates centralized guidance with decentralized execution, offering theoretical guarantees and outperforming existing methods.

Details

Motivation: Existing CTDE methods underutilize centralized training or lack theoretical guarantees, limiting their effectiveness in cooperative MARL.

Method: MAGPO uses an auto-regressive joint policy for scalable exploration and aligns it with decentralized policies to ensure deployability under partial observability.

Result: Empirical evaluation on 43 tasks across 6 environments shows MAGPO outperforms CTDE baselines and matches/surpasses fully centralized approaches.

Conclusion: MAGPO provides a principled and practical solution for decentralized multi-agent learning with theoretical and empirical validation.

Abstract: Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi-Agent Reinforcement Learning (MARL). However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees. We propose Multi-Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution. MAGPO uses an auto-regressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability. We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments. Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi-agent learning. Our code and experimental data can be found at https://github.com/liyheng/MAGPO.

[241] E.A.R.T.H.: Structuring Creative Evolution through Model Error in Generative AI

Yusen Peng, Shuhua Mao

Main category: cs.AI

TL;DR: The paper introduces the E.A.R.T.H. framework, a five-stage pipeline that leverages model-generated errors to enhance AI creativity, achieving significant improvements in novelty, relevance, and human evaluation scores.

Details

Motivation: To move AI beyond imitation by exploring how errors can be transformed into creative assets, inspired by the idea that 'creative potential hides in failure.'

Method: Uses the E.A.R.T.H. framework (Error generation, Amplification, Refine selection, Transform, Harness feedback) with structured prompts, semantic scoring, and human-in-the-loop evaluation, implemented using LLaMA-2-7B-Chat, SBERT, BERTScore, CLIP, BLIP-2, and Stable Diffusion.

Result: Creativity scores increased by 52.5%, with refined slogans being 48.4% shorter and 40.7% more novel. Human evaluations showed 60% of outputs scoring >= 4.0, with metaphorical slogans outperforming literal ones.

Conclusion: Error-centered, feedback-driven generation enhances AI creativity, offering a scalable path for self-evolving, human-aligned creative AI.

Abstract: How can AI move beyond imitation toward genuine creativity? This paper proposes the E.A.R.T.H. framework, a five-stage generative pipeline that transforms model-generated errors into creative assets through Error generation, Amplification, Refine selection, Transform, and Harness feedback. Drawing on cognitive science and generative modeling, we posit that “creative potential hides in failure” and operationalize this via structured prompts, semantic scoring, and human-in-the-loop evaluation. Implemented using LLaMA-2-7B-Chat, SBERT, BERTScore, CLIP, BLIP-2, and Stable Diffusion, the pipeline employs a composite reward function based on novelty, surprise, and relevance. At the Refine stage, creativity scores increase by 52.5% (1.179 to 1.898, t = -5.56, p < 0.001), with final outputs reaching 2.010 - a 70.4% improvement. Refined slogans are 48.4% shorter, 40.7% more novel, with only a 4.0% drop in relevance. Cross-modal tests show strong slogan-to-image alignment (CLIPScore: 0.249; BERTScore F1: 0.816). In human evaluations, 60% of outputs scored >= 4.0, with metaphorical slogans (avg. 4.09) outperforming literal ones (3.99). Feedback highlights stylistic precision and emotional resonance. These results demonstrate that error-centered, feedback-driven generation enhances creativity, offering a scalable path toward self-evolving, human-aligned creative AI.

[242] Does visualization help AI understand data?

Victoria R. Li, Johnathan Sun, Martin Wattenberg

Main category: cs.AI

TL;DR: AI systems like GPT 4.1 and Claude 3.5 perform better in data analysis tasks when raw data is accompanied by scatterplots, especially for complex datasets.

Details

Motivation: To explore whether charts and graphs, which aid human data analysis, can also enhance AI system performance.

Method: Experiments with GPT 4.1 and Claude 3.5 on three analysis tasks, comparing performance with and without scatterplots, blank charts, and mismatched data charts.

Result: AI systems describe synthetic datasets more accurately with scatterplots, with performance improvements attributed to chart content.

Conclusion: Visualizations benefit AI systems similarly to humans, as evidenced by improved analysis accuracy.

Abstract: Charts and graphs help people analyze data, but can they also be useful to AI systems? To investigate this question, we perform a series of experiments with two commercial vision-language models: GPT 4.1 and Claude 3.5. Across three representative analysis tasks, the two systems describe synthetic datasets more precisely and accurately when raw data is accompanied by a scatterplot, especially as datasets grow in complexity. Comparison with two baselines – providing a blank chart and a chart with mismatched data – shows that the improved performance is due to the content of the charts. Our results are initial evidence that AI systems, like humans, can benefit from visualization.

[243] AlphaGo Moment for Model Architecture Discovery

Yixiu Liu, Yang Nan, Weixian Xu, Xiangkun Hu, Lyumanshan Ye, Zhen Qin, Pengfei Liu

Main category: cs.AI

TL;DR: ASI-Arch is an autonomous AI system for neural architecture discovery, surpassing human limitations by innovating and validating new architectures without human intervention.

Details

Motivation: The bottleneck in AI research due to human cognitive limits necessitates autonomous systems to accelerate innovation.

Method: ASI-Arch autonomously hypothesizes, implements, and validates novel neural architectures, conducting extensive experiments without human input.

Result: Discovered 106 state-of-the-art linear attention architectures, revealing emergent design principles and establishing a scaling law for scientific discovery.

Conclusion: ASI-Arch demonstrates the potential for AI to autonomously drive research, transforming progress into a computation-scalable process.

Abstract: While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery–a fully autonomous system that shatters this fundamental constraint by enabling AI to conduct its own architectural innovation. Moving beyond traditional Neural Architecture Search (NAS), which is fundamentally limited to exploring human-defined spaces, we introduce a paradigm shift from automated optimization to automated innovation. ASI-Arch can conduct end-to-end scientific research in the domain of architecture discovery, autonomously hypothesizing novel architectural concepts, implementing them as executable code, training and empirically validating their performance through rigorous experimentation and past experience. ASI-Arch conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures. Like AlphaGo’s Move 37 that revealed unexpected strategic insights invisible to human players, our AI-discovered architectures demonstrate emergent design principles that systematically surpass human-designed baselines and illuminate previously unknown pathways for architectural innovation. Crucially, we establish the first empirical scaling law for scientific discovery itself–demonstrating that architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process. We provide comprehensive analysis of the emergent design patterns and autonomous research capabilities that enabled these breakthroughs, establishing a blueprint for self-accelerating AI systems.

[244] Agentic AI framework for End-to-End Medical Data Inference

Soorya Ram Shimgekar, Shayan Vassef, Abhay Goyal, Navin Kumar, Koustuv Saha

Main category: cs.AI

TL;DR: An Agentic AI framework automates clinical data pipelines, handling structured and unstructured data, feature extraction, model selection, and preprocessing, reducing manual intervention in healthcare ML.

Details

Motivation: High costs and labor-intensive processes in healthcare ML due to fragmented workflows, model compatibility issues, and data privacy constraints.

Method: Modular, task-specific agents automate data ingestion, anonymization, feature extraction, model selection, preprocessing, and inference. Evaluated on geriatrics, palliative care, and colonoscopy datasets.

Result: Automated pipeline reduces expert intervention, enabling scalable, cost-efficient AI deployment in clinical settings.

Conclusion: The framework offers a practical solution for operationalizing AI in healthcare by streamlining the ML lifecycle.

Abstract: Building and deploying machine learning solutions in healthcare remains expensive and labor-intensive due to fragmented preprocessing workflows, model compatibility issues, and stringent data privacy constraints. In this work, we introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference, through a system of modular, task-specific agents. These agents handle both structured and unstructured data, enabling automatic feature selection, model selection, and preprocessing recommendation without manual intervention. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging. For example, in the case of structured data (anxiety data) and unstructured data (colonoscopy polyps data), the pipeline begins with file-type detection by the Ingestion Identifier Agent, followed by the Data Anonymizer Agent ensuring privacy compliance, where we first identify the data type and then anonymize it. The Feature Extraction Agent identifies features using an embedding-based approach for tabular data, extracting all column names, and a multi-stage MedGemma-based approach for image data, which infers modality and disease name. These features guide the Model-Data Feature Matcher Agent in selecting the best-fit model from a curated repository. The Preprocessing Recommender Agent and Preprocessing Implementor Agent then apply tailored preprocessing based on data type and model requirements. Finally, the “Model Inference Agent” runs the selected model on the uploaded data and generates interpretable outputs using tools like SHAP, LIME, and DETR attention maps. By automating these high-friction stages of the ML lifecycle, the proposed framework reduces the need for repeated expert intervention, offering a scalable, cost-efficient pathway for operationalizing AI in clinical environments.

[245] A Differentiated Reward Method for Reinforcement Learning based Multi-Vehicle Cooperative Decision-Making Algorithms

Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang

Main category: cs.AI

TL;DR: A differentiated reward method for RL in multi-vehicle cooperative driving improves sample efficiency and performance.

Details

Motivation: Address low sample efficiency in RL for multi-vehicle cooperative driving by leveraging traffic flow characteristics.

Method: Incorporates state transition gradient into reward design, tested with MAPPO, MADQN, and QMIX.

Result: Accelerates training, enhances traffic efficiency, safety, and action rationality.

Conclusion: Offers scalable, adaptable solution for multi-agent decision-making in complex traffic.

Abstract: Reinforcement learning (RL) shows great potential for optimizing multi-vehicle cooperative driving strategies through the state-action-reward feedback loop, but it still faces challenges such as low sample efficiency. This paper proposes a differentiated reward method based on steady-state transition systems, which incorporates state transition gradient information into the reward design by analyzing traffic flow characteristics, aiming to optimize action selection and policy learning in multi-vehicle cooperative decision-making. The performance of the proposed method is validated in RL algorithms such as MAPPO, MADQN, and QMIX under varying autonomous vehicle penetration. The results show that the differentiated reward method significantly accelerates training convergence and outperforms centering reward and others in terms of traffic efficiency, safety, and action rationality. Additionally, the method demonstrates strong scalability and environmental adaptability, providing a novel approach for multi-agent cooperative decision-making in complex traffic scenarios.

[246] Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes

Sedigh Khademi, Christopher Palmer, Muhammad Javed, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila, Jim Black

Main category: cs.AI

TL;DR: The study uses NLP and Active Learning to create a classifier for detecting vaccine safety issues from ED triage notes, improving surveillance efficiency.

Details

Motivation: Limited safety data from clinical trials and early vaccine rollout necessitates better post-licensure surveillance systems.

Method: Combines NLP, Active Learning, and data augmentation to develop a classifier for ED triage notes.

Result: Aims for a more accurate and efficient vaccine safety surveillance system.

Conclusion: The proposed classifier enhances timely detection of vaccine safety signals from ED notes.

Abstract: The rapid development of COVID-19 vaccines has showcased the global community’s ability to combat infectious diseases. However, the need for post-licensure surveillance systems has grown due to the limited window for safety data collection in clinical trials and early widespread implementation. This study aims to employ Natural Language Processing techniques and Active Learning to rapidly develop a classifier that detects potential vaccine safety issues from emergency department notes. ED triage notes, containing expert, succinct vital patient information at the point of entry to health systems, can significantly contribute to timely vaccine safety signal surveillance. While keyword-based classification can be effective, it may yield false positives and demand extensive keyword modifications. This is exacerbated by the infrequency of vaccination-related ED presentations and their similarity to other reasons for ED visits. NLP offers a more accurate and efficient alternative, albeit requiring annotated data, which is often scarce in the medical field. Active learning optimizes the annotation process and the quality of annotated data, which can result in faster model implementation and improved model performance. This work combines active learning, data augmentation, and active learning and evaluation techniques to create a classifier that is used to enhance vaccine safety surveillance from ED triage notes.

[247] Logical Characterizations of GNNs with Mean Aggregation

Moritz Schönherr, Carsten Lutz

Main category: cs.AI

TL;DR: Mean GNNs match ratio modal logic in non-uniform settings, surpassing max GNNs but lagging behind sum GNNs. Uniformly, they align with alternation-free modal logic under specific conditions, showing lower expressivity than sum and max GNNs.

Details

Motivation: To understand the expressive power of GNNs with mean aggregation, comparing it to other aggregation functions (max, sum) and linking it to modal logic frameworks.

Method: Analyzed GNNs with mean aggregation in non-uniform and uniform settings, comparing their expressive power to modal logic variants (ratio, graded, alternation-free) and MSO.

Result: Mean GNNs are less expressive than sum GNNs but more expressive than max GNNs in non-uniform settings. Uniformly, they are less expressive than both under certain assumptions.

Conclusion: Mean GNNs’ expressive power is context-dependent, influenced by aggregation functions and logical frameworks, with implications for model selection in graph tasks.

Abstract: We study the expressive power of graph neural networks (GNNs) with mean as the aggregation function. In the non-uniform setting, we show that such GNNs have exactly the same expressive power as ratio modal logic, which has modal operators expressing that at least a certain ratio of the successors of a vertex satisfies a specified property. The non-uniform expressive power of mean GNNs is thus higher than that of GNNs with max aggregation, but lower than for sum aggregation–the latter are characterized by modal logic and graded modal logic, respectively. In the uniform setting, we show that the expressive power relative to MSO is exactly that of alternation-free modal logic, under the natural assumptions that combination functions are continuous and classification functions are thresholds. This implies that, relative to MSO and in the uniform setting, mean GNNs are strictly less expressive than sum GNNs and max GNNs. When any of the assumptions is dropped, the expressive power increases.
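
To make the ratio intuition concrete: over 0/1 features, mean aggregation returns the fraction of a vertex's successors that satisfy a property, which is the quantity a ratio modal operator compares against a threshold. Sum sees the count and max sees only existence. The Python sketch below illustrates what each aggregator observes on a hypothetical toy graph; the graph, the feature `has_p`, and the function names are assumptions, and the example does not reproduce the paper's expressivity results.

```python
def aggregate(graph, feature, vertex, mode):
    """Aggregate a scalar feature over the successors of a vertex."""
    values = [feature[u] for u in graph[vertex]]
    if not values:
        return 0.0  # convention chosen for this sketch: no successors aggregates to 0
    if mode == "mean":
        return sum(values) / len(values)
    if mode == "sum":
        return float(sum(values))
    if mode == "max":
        return float(max(values))
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical graph: v has four successors, w has two; only vertex a satisfies P.
graph = {"v": ["a", "b", "c", "d"], "w": ["a", "b"],
         "a": [], "b": [], "c": [], "d": []}
has_p = {"a": 1, "b": 0, "c": 0, "d": 0, "v": 0, "w": 0}


def at_least_ratio(vertex, ratio):
    """True if at least `ratio` of the vertex's successors satisfy P."""
    return aggregate(graph, has_p, vertex, "mean") >= ratio


mean_v = aggregate(graph, has_p, "v", "mean")  # 1 of 4 successors
mean_w = aggregate(graph, has_p, "w", "mean")  # 1 of 2 successors
sum_v = aggregate(graph, has_p, "v", "sum")
sum_w = aggregate(graph, has_p, "w", "sum")
max_v = aggregate(graph, has_p, "v", "max")
max_w = aggregate(graph, has_p, "w", "max")
```

On this particular graph, mean separates `v` from `w` while sum and max do not; other graphs separate them the other way around, which is why the comparison in the abstract is stated in terms of logics and not single examples.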

[248] Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory

Mutian Yang, Jiandong Gao, Ji Wu

Main category: cs.AI

TL;DR: The paper introduces a framework to decouple knowledge and reasoning in LLMs, analyzing their contributions through fast and slow thinking modes. Results show domain-specific reasoning benefits, parameter scaling effects, and layer-wise knowledge-reasoning distribution.

Details

Motivation: To understand and distinguish the roles of knowledge and reasoning in LLMs for better model analysis, interpretability, and development.

Method: A cognition attribution framework decomposes LLM cognition into knowledge retrieval (Phase 1) and reasoning adjustment (Phase 2), tested via fast and slow thinking prompts across 15 LLMs and 3 datasets.

Result: (1) Reasoning adjustment is domain-specific. (2) Parameter scaling improves knowledge and reasoning, with knowledge gains more pronounced. (3) Knowledge resides in lower layers, reasoning in higher layers.

Conclusion: The framework provides insights into LLM behavior, scaling laws, knowledge editing, and small-model reasoning limitations.

Abstract: While large language models (LLMs) leverage both knowledge and reasoning during inference, the capacity to distinguish between them plays a pivotal role in model analysis, interpretability, and development. Inspired by dual-system cognitive theory, we propose a cognition attribution framework to decouple the contribution of knowledge and reasoning. In particular, the cognition of LLMs is decomposed into two distinct yet complementary phases: knowledge retrieval (Phase 1) and reasoning adjustment (Phase 2). To separate these phases, LLMs are prompted to generate answers under two different cognitive modes, fast thinking and slow thinking, respectively. The performance under different cognitive modes is analyzed to quantify the contribution of knowledge and reasoning. This framework is applied to 15 LLMs across 3 datasets. Results reveal: (1) Reasoning adjustment is domain-specific, benefiting reasoning-intensive domains (e.g., mathematics, physics, and chemistry) and potentially impairing knowledge-intensive domains. (2) Parameter scaling improves both knowledge and reasoning, with knowledge improvements being more pronounced. Additionally, parameter scaling makes LLMs’ reasoning significantly more prudent, while moderately more intelligent. (3) Knowledge primarily resides in lower network layers, while reasoning operates in higher layers. Our framework not only helps understand LLMs from a “decoupling” perspective, but also provides new insights into existing research, including scaling laws, hierarchical knowledge editing, and limitations of small-model reasoning.

[249] Comparing Non-minimal Semantics for Disjunction in Answer Set Programming

Felicidad Aguado, Pedro Cabalar, Brais Muñiz, Gilberto Pérez, Concepción Vidal

Main category: cs.AI

TL;DR: The paper compares four non-minimal semantics for disjunction in Answer Set Programming, showing three coincide and are stronger than the fourth.

Details

Motivation: To explore alternative semantics for disjunction in Answer Set Programming that do not rely on model minimality, comparing their properties and relationships.

Method: Comparison of four semantics: Justified Models, Strongly Supported Models, Forks, and Determining Inference (DI) semantics, analyzing their definitions and relationships.

Result: Three semantics (Forks, Justified Models, and a relaxed DI) coincide and are stronger than Strongly Supported Models, providing a superset of stable models.

Conclusion: The study shows that three of the approaches define one common semantics, which is strictly stronger than the fourth (Strongly Supported Models), an approach that treats disjunction as in classical logic.

Abstract: In this paper, we compare four different semantics for disjunction in Answer Set Programming that, unlike stable models, do not adhere to the principle of model minimality. Two of these approaches, Cabalar and Muñiz’ Justified Models and Doherty and Szalas’ Strongly Supported Models, directly provide an alternative non-minimal semantics for disjunction. The other two, Aguado et al.’s Forks and Shen and Eiter’s Determining Inference (DI) semantics, actually introduce a new disjunction connective, but are compared here as if they constituted new semantics for the standard disjunction operator. We are able to prove that three of these approaches (Forks, Justified Models and a reasonable relaxation of the DI semantics) actually coincide, constituting a common single approach under different definitions. Moreover, this common semantics always provides a superset of the stable models of a program (in fact, modulo any context) and is strictly stronger than the fourth approach (Strongly Supported Models), which actually treats disjunctions as in classical logic.

[250] Foundations for Risk Assessment of AI in Protecting Fundamental Rights

Antonino Rotolo, Beatrice Ferrigno, Jose Miguel Angel Garcia Godinez, Claudio Novelli, Giovanni Sartor

Main category: cs.AI

TL;DR: A conceptual framework for qualitative AI risk assessment under the EU AI Act, combining definitional balancing and defeasible reasoning to address legal compliance and fundamental rights protection.

Details

Motivation: To tackle the complexities of legal compliance and fundamental rights protection in AI deployment, especially under the EU AI Act.

Method: Integrates definitional balancing (proportionality analysis) and defeasible reasoning to analyze AI deployment scenarios, legal violations, and impacts on rights.

Result: Provides philosophical foundations for AI risk analysis, enabling operative models for assessing high-risk and General Purpose AI systems.

Conclusion: The framework supports responsible AI governance, with future work aimed at developing formal models and algorithms for practical applications.

Abstract: This chapter introduces a conceptual framework for qualitative risk assessment of AI, particularly in the context of the EU AI Act. The framework addresses the complexities of legal compliance and fundamental rights protection by integrating definitional balancing and defeasible reasoning. Definitional balancing employs proportionality analysis to resolve conflicts between competing rights, while defeasible reasoning accommodates the dynamic nature of legal decision-making. Our approach stresses the need for an analysis of AI deployment scenarios and for identifying potential legal violations and multi-layered impacts on fundamental rights. On the basis of this analysis, we provide philosophical foundations for a logical account of AI risk analysis. In particular, we consider the basic building blocks for conceptually grasping the interaction between AI deployment scenarios and fundamental rights, incorporating in defeasible reasoning definitional balancing and arguments about the contextual promotion or demotion of rights. This layered approach allows for more operative models of assessment of both high-risk AI systems and General Purpose AI (GPAI) systems, emphasizing the broader applicability of the latter. Future work aims to develop a formal model and effective algorithms to enhance AI risk assessment, bridging theoretical insights with practical applications to support responsible AI governance.

[251] The AlphaPhysics Term Rewriting System for Marking Algebraic Expressions in Physics Exams

Peter Baumgartner, Lachlan McGinness

Main category: cs.AI

TL;DR: A method for automatically marking Physics exams using a combination of computer algebra, SMT solving, term rewriting, and LLMs to assess typed student answers against ground truth.

Details

Motivation: To automate the challenging task of assessing student answers in Physics exams, reducing manual effort and improving consistency.

Method: Combines a computer algebra system, SMT solver, term rewriting system, and LLM to formalize and assess student answers. Includes tailored term rewriting for trigonometric expressions.

Result: Evaluated on 1500+ real-world student responses from the 2023 Australian Physics Olympiad.

Conclusion: The method effectively automates exam marking, leveraging diverse automated reasoning techniques for accuracy.

Abstract: We present our method for automatically marking Physics exams. The marking problem consists in assessing typed student answers for correctness with respect to a ground truth solution. This is a challenging problem that we seek to tackle using a combination of a computer algebra system, an SMT solver and a term rewriting system. A Large Language Model is used to interpret and remove errors from student responses and rewrite these in a machine readable format. Once formalized and language-aligned, the next step then consists in applying automated reasoning techniques for assessing student solution correctness. We consider two methods of automated theorem proving: off-the-shelf SMT solving and term rewriting systems tailored for physics problems involving trigonometric expressions. The development of the term rewrite system and establishing termination and confluence properties was not trivial, and we describe it in some detail in the paper. We evaluate our system on a rich pool of over 1500 real-world student exam responses from the 2023 Australian Physics Olympiad.

[252] Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios

Zhuang Qiang Bok, Watson Wei Khong Chua

Main category: cs.AI

TL;DR: ConDiFi is a benchmark evaluating divergent and convergent reasoning in LLMs for finance, revealing performance gaps in models like GPT-4o.

Details

Motivation: Financial professionals must both converge on optimal decisions and generate creative, plausible futures under uncertainty, a combination that current reasoning benchmarks do not evaluate.

Method: ConDiFi includes 607 divergent prompts and 990 convergent MCQs, tested on 14 models.

Result: GPT-4o underperforms in Novelty and Actionability, while DeepSeek-R1 and Cohere Command R+ excel.

Conclusion: ConDiFi offers a new way to assess reasoning for safe LLM deployment in finance.

Abstract: Most reasoning benchmarks for LLMs emphasize factual accuracy or step-by-step logic. In finance, however, professionals must not only converge on optimal decisions but also generate creative, plausible futures under uncertainty. We introduce ConDiFi, a benchmark that jointly evaluates divergent and convergent thinking in LLMs for financial tasks. ConDiFi features 607 macro-financial prompts for divergent reasoning and 990 multi-hop adversarial MCQs for convergent reasoning. Using this benchmark, we evaluated 14 leading models and uncovered striking differences. Despite high fluency, GPT-4o underperforms on Novelty and Actionability. In contrast, models like DeepSeek-R1 and Cohere Command R+ rank among the top for generating actionable insights suitable for investment decisions. ConDiFi provides a new perspective to assess reasoning capabilities essential to safe and strategic deployment of LLMs in finance.

[253] Revisiting LLM Reasoning via Information Bottleneck

Shiye Lei, Zhihao Cheng, Kai Jia, Dacheng Tao

Main category: cs.AI

TL;DR: The paper introduces IB-aware reasoning optimization (IBRO), a theoretical framework grounded in the information bottleneck principle, to improve LLM reasoning by ensuring informative and generalizable reasoning trajectories.

Details

Motivation: Existing reinforcement learning approaches for LLM reasoning are heuristic and lack principled methodologies, limiting their development and effectiveness.

Method: The authors propose IBRO, deriving a token-level surrogate objective and a lightweight IB regularization method that integrates into existing RL frameworks with minimal overhead.

Result: Empirical validation across mathematical reasoning benchmarks shows consistent improvements in LLM reasoning performance.

Conclusion: IBRO provides a principled and efficient way to enhance LLM reasoning, requiring only minor modifications to existing RL frameworks.

Abstract: Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). By leveraging simple rule-based rewards, RL effectively incentivizes LLMs to produce extended chain-of-thought (CoT) reasoning trajectories, progressively guiding them toward correct answers. However, existing approaches remain largely heuristic and intuition-driven, limiting the development of principled methodologies. In this paper, we present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle, introducing IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable across diverse prompts. We derive a practical token-level surrogate objective and propose an efficient approximation, resulting in the lightweight IB regularization method. This technique integrates seamlessly into existing RL-based post-training frameworks without additional computational overhead, requiring only a one-line code modification. Empirically, we validate IB regularization across multiple mathematical reasoning benchmarks and RL algorithms, demonstrating consistent improvements in LLM reasoning performance.

[254] Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation

Kwong Ho Li, Wathsala Karunarathne

Main category: cs.AI

TL;DR: The paper compares model-based (Value Iteration) and model-free (Proximal Policy Optimization) RL methods for call routing optimization, finding PPO superior in reducing client waiting and staff idle times.

Details

Motivation: To optimize call routing in call centers by minimizing client waiting time and staff idle time using Reinforcement Learning.

Method: Two approaches: model-based (Value Iteration with known dynamics) and model-free (PPO with simulation). Both framed as MDPs in a Skills-Based Routing setup.

Result: PPO outperforms Value Iteration and random policies, achieving the highest rewards and lowest waiting/idle times after 1,000 test episodes.

Conclusion: PPO is more effective for call routing optimization despite longer training times.

Abstract: This paper investigates the application of Reinforcement Learning (RL) to optimise call routing in call centres to minimise client waiting time and staff idle time. Two methods are compared: a model-based approach using Value Iteration (VI) under known system dynamics, and a model-free approach using Proximal Policy Optimisation (PPO) that learns from experience. For the model-based approach, a theoretical model is used, while a simulation model combining Discrete Event Simulation (DES) with the OpenAI Gym environment is developed for model-free learning. Both models frame the problem as a Markov Decision Process (MDP) within a Skills-Based Routing (SBR) framework, with Poisson client arrivals and exponentially distributed service and abandonment times. For policy evaluation, random, VI, and PPO policies are evaluated using the simulation model. After 1,000 test episodes, PPO consistently achieves the highest rewards, along with the lowest client waiting time and staff idle time, despite requiring longer training time.
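
The model-based side of this comparison can be illustrated with a minimal Value Iteration sketch. This is not the paper's call-centre model: the three-state queue, the two hypothetical routing actions ("fast" and "slow" agent), the transition probabilities, and the reward (negative queue length) are all assumptions chosen only to keep the example small.

```python
from typing import Dict, List, Tuple

# transitions[state][action] = list of (probability, next_state, reward)
Transitions = Dict[int, Dict[str, List[Tuple[float, int, float]]]]


def toy_routing_mdp() -> Transitions:
    """Hypothetical toy MDP: state = number of waiting clients (0..2).

    Routing to the 'fast' agent clears one client with probability 0.8,
    the 'slow' agent with probability 0.4. The reward is the negative
    queue length after the step (a stand-in for waiting cost).
    """
    transitions: Transitions = {}
    for queue in range(3):
        served = max(queue - 1, 0)
        transitions[queue] = {
            "fast": [(0.8, served, -float(served)), (0.2, queue, -float(queue))],
            "slow": [(0.4, served, -float(served)), (0.6, queue, -float(queue))],
        }
    return transitions


def q_value(outcomes, values, gamma):
    """Expected one-step return of an action under the current value estimates."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions: Transitions, gamma: float = 0.95, tol: float = 1e-8):
    """Repeat Bellman optimality backups until the largest update is below tol."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {
        s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for s, actions in transitions.items()
    }
    return values, policy
```

Value Iteration needs the transition probabilities up front, which is why the abstract pairs it with a theoretical model; PPO instead learns from simulated episodes and needs no such table.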

[255] GPU Accelerated Compact-Table Propagation

Enrico Santi, Fabio Tardivo, Agostino Dovier, Andrea Formisano

Main category: cs.AI

TL;DR: The paper explores enhancing the Compact-Table (CT) algorithm for table constraints in constraint programming using GPU acceleration to handle large-scale problems more efficiently.

Details

Motivation: Traditional CPU-based approaches struggle with real-world problems involving hundreds or thousands of cases in table constraints, necessitating more powerful computational methods.

Method: The study designs and implements a GPU-accelerated version of the CT algorithm, integrating it into an existing constraint solver and validating it experimentally.

Result: The GPU-accelerated CT algorithm demonstrates improved efficiency in handling large table constraints compared to standard CPU-based methods.

Conclusion: Leveraging GPU computational power significantly enhances the performance of the CT algorithm, making it viable for large-scale constraint programming problems.

Abstract: Constraint Programming developed within Logic Programming in the Eighties; nowadays all Prolog systems encompass modules capable of handling constraint programming on finite domains, delegating their solution to a constraint solver. This work focuses on a specific form of constraint, the so-called table constraint, used to specify conditions on the values of variables as an enumeration of alternative options. Since every condition on a set of finite domain variables can be ultimately expressed as a finite set of cases, Table can, in principle, simulate any other constraint. These characteristics make Table one of the most studied constraints ever, leading to a series of increasingly efficient propagation algorithms. Despite this, it is not uncommon to encounter real-world problems with hundreds or thousands of valid cases that are simply too many to be handled effectively with standard CPU-based approaches. In this paper, we deal with the Compact-Table (CT) algorithm, the state-of-the-art propagation algorithm for Table. We describe how CT can be enhanced by exploiting the massive computational power offered by modern GPUs to handle large Table constraints. In particular, we report on the design and implementation of GPU-accelerated CT, on its integration into an existing constraint solver, and on an experimental validation performed on a significant set of instances.
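
The bitset idea at the core of Compact-Table can be sketched on the CPU in a few lines. This is a simplified illustration, not the authors' GPU implementation: it uses Python integers as bitsets and omits the reversible sparse bitsets, residues, and incremental updates of the full algorithm. The function names `build_supports` and `propagate` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_supports(table: List[Tuple[int, ...]]) -> Dict[Tuple[int, int], int]:
    """For each (variable index, value), a bitset of the tuples that contain it."""
    supports: Dict[Tuple[int, int], int] = {}
    for i, tup in enumerate(table):
        for var, val in enumerate(tup):
            supports[(var, val)] = supports.get((var, val), 0) | (1 << i)
    return supports


def propagate(table: List[Tuple[int, ...]], domains: List[Set[int]]):
    """One propagation pass of a table constraint.

    Returns (bitset of still-valid tuples, filtered domains).
    """
    supports = build_supports(table)
    valid = (1 << len(table)) - 1  # initially every tuple is valid

    # Update step: keep only tuples whose value for each variable
    # is still in that variable's domain.
    for var, dom in enumerate(domains):
        mask = 0
        for val in dom:
            mask |= supports.get((var, val), 0)
        valid &= mask

    # Filter step: a value survives only if some valid tuple supports it.
    filtered = [
        {val for val in dom if valid & supports.get((var, val), 0)}
        for var, dom in enumerate(domains)
    ]
    return valid, filtered
```

For example, with `table = [(0, 0), (0, 1), (1, 2), (2, 2)]` and `domains = [{0, 1}, {2}]`, only the tuple `(1, 2)` remains valid, so value `0` is pruned from the first variable. The bitwise AND/OR over long bitsets is the part that lends itself to data-parallel hardware.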

[256] On the Performance of Concept Probing: The Influence of the Data (Extended Version)

Manuel de Sousa Ribeiro, Afonso Leote, João Leite

Main category: cs.AI

TL;DR: The paper explores the impact of training data on concept probing models for interpreting neural networks, focusing on image classification tasks, and releases concept labels for two datasets.

Details

Motivation: Concept probing helps interpret neural networks, but research has overlooked the role of training data for probing models. This paper fills that gap.

Method: Investigates the effect of training data on probing model performance in image classification tasks.

Result: Provides insights into how training data influences concept probing and releases concept labels for two datasets.

Conclusion: Training data significantly affects concept probing performance, and the release of concept labels aids future research.

Abstract: Concept probing has recently garnered increasing interest as a way to help interpret artificial neural networks, dealing both with their typically large size and their subsymbolic nature, which ultimately renders them unfeasible for direct human interpretation. Concept probing works by training additional classifiers to map the internal representations of a model into human-defined concepts of interest, thus allowing humans to peek inside artificial neural networks. Research on concept probing has mainly focused on the model being probed or the probing model itself, paying limited attention to the data required to train such probing models. In this paper, we address this gap. Focusing on concept probing in the context of image classification tasks, we investigate the effect of the data used to train probing models on their performance. We also make available concept labels for two widely used datasets.

[257] SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law

Shanghai AI Lab: Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, Yu Cheng, Dengke Deng, Yizhuo Ding, Dan Ding, Xiaoshan Ding, Yi Ding, Zhichen Dong, Lingxiao Du, Yuyu Fan, Xinshun Feng, Yanwei Fu, Yuxuan Gao, Ruijun Ge, Tianle Gu, Lujun Gui, Jiaxuan Guo, Qianxi He, Yuenan Hou, Xuhao Hu, Hong Huang, Kaichen Huang, Shiyang Huang, Yuxian Jiang, Shanzhe Lei, Jie Li, Lijun Li, Hao Li, Juncheng Li, Xiangtian Li, Yafu Li, Lingyu Li, Xueyan Li, Haotian Liang, Dongrui Liu, Qihua Liu, Zhixuan Liu, Bangwei Liu, Huacan Liu, Yuexiao Liu, Zongkai Liu, Chaochao Lu, Yudong Lu, Xiaoya Lu, Zhenghao Lu, Qitan Lv, Caoyuan Ma, Jiachen Ma, Xiaoya Ma, Zhongtian Ma, Lingyu Meng, Ziqi Miao, Yazhe Niu, Yuezhang Peng, Yuan Pu, Han Qi, Chen Qian, Xingge Qiao, Jingjing Qu, Jiashu Qu, Wanying Qu, Wenwen Qu, Xiaoye Qu, Qihan Ren, Qingnan Ren, Qingyu Ren, Jing Shao, Wenqi Shao, Shuai Shao, Dongxing Shi, Xin Song, Xinhao Song, Yan Teng, Xuan Tong, Yingchun Wang, Xuhong Wang, Shujie Wang, Xin Wang, Yige Wang, Yixu Wang, Yuanfu Wang, Futing Wang, Ruofan Wang, Wenjie Wang, Yajie Wang, Muhao Wei, Xiaoyu Wen, Fenghua Weng, Yuqi Wu, Yingtong Xiong, Xingcheng Xu, Chao Yang, Yue Yang, Yang Yao, Yulei Ye, Zhenyun Yin, Yi Yu, Bo Zhang, Qiaosheng Zhang, Jinxuan Zhang, Yexin Zhang, Yinqiang Zheng, Hefeng Zhou, Zhanhui Zhou, Pengyu Zhu, Qingzi Zhu, Yubo Zhu, Bowen Zhou

Main category: cs.AI

TL;DR: SafeWork-R1 is a multimodal reasoning model developed via the SafeLadder framework, enhancing safety and capabilities through progressive reinforcement learning and multi-principled verifiers. It outperforms base models and proprietary models like GPT-4.1 in safety benchmarks.

Details

Motivation: To address the limitations of existing alignment methods (e.g., RLHF) by enabling intrinsic safety reasoning and self-reflection in AI models, ensuring robust and trustworthy general-purpose AI.

Method: SafeLadder framework uses large-scale, safety-oriented reinforcement learning post-training with multi-principled verifiers, inference-time interventions, and deliberative search for step-level verification.

Result: SafeWork-R1 improves safety by 46.54% over its base model without compromising general capabilities, achieving state-of-the-art safety performance. Variants like SafeWork-R1-InternVL3-78B also demonstrate synergistic coevolution of safety and capability.

Conclusion: The SafeLadder framework effectively enables the coevolution of safety and capabilities in AI models, proving its generalizability for building reliable and trustworthy general-purpose AI.

Abstract: We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety ‘aha’ moments. Notably, SafeWork-R1 achieves an average improvement of 46.54% over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.

[258] I-CEE: Tailoring Explanations of Image Classification Models to User Expertise

Yao Rong, Peizhu Qian, Vaibhav Unhelkar, Enkelejda Kasneci

Main category: cs.AI

TL;DR: I-CEE is a human-centered XAI framework that tailors image classification explanations to user expertise, improving simulatability.

Details

Motivation: Address the lack of user-focused explanations in XAI by providing tailored explanations based on user expertise.

Method: I-CEE provides example images, local explanations, and model decisions, modeling informativeness based on user expertise.

Result: Experiments show improved simulatability for both simulated and human users compared to baselines.

Conclusion: Tailoring explanations to user expertise enhances understanding, emphasizing the need for human-centered XAI.

Abstract: Effectively explaining decisions of black-box machine learning models is critical to responsible deployment of AI systems that rely on them. Recognizing their importance, the field of explainable AI (XAI) provides several techniques to generate these explanations. Yet, there is relatively little emphasis on the user (the explainee) in this growing body of work and most XAI techniques generate “one-size-fits-all” explanations. To bridge this gap and move a step closer towards human-centered XAI, we present I-CEE, a framework that provides Image Classification Explanations tailored to User Expertise. Informed by existing work, I-CEE explains the decisions of image classification models by providing the user with an informative subset of training data (i.e., example images), corresponding local explanations, and model decisions. However, unlike prior work, I-CEE models the informativeness of the example images to depend on user expertise, resulting in different examples for different users. We posit that by tailoring the example set to user expertise, I-CEE can better facilitate users’ understanding and simulatability of the model. To evaluate our approach, we conduct detailed experiments in both simulation and with human participants (N = 100) on multiple datasets. Experiments with simulated users show that I-CEE improves users’ ability to accurately predict the model’s decisions (simulatability) compared to baselines, providing promising preliminary results. Experiments with human participants demonstrate that our method significantly improves user simulatability accuracy, highlighting the importance of human-centered XAI.

[259] On the Structure of Game Provenance and its Applications

Shawn Bowers, Yilin Xia, Bertram Ludäscher

Main category: cs.AI

TL;DR: The paper explores the fine-grain structure of game provenance in databases, introducing new provenance types and their computation methods.

Details

Motivation: To unify and extend provenance models for first-order queries using a game-theoretic approach, addressing the limitations of existing methods.

Method: Uses a game-theoretic framework where query evaluation is modeled as a two-player game, solved via the well-founded model of a rule. Analyzes edge types in the game to derive new provenance types.

Result: Identifies seven edge types leading to new provenance categories (potential, actual, primary) and demonstrates their computational feasibility.

Conclusion: The new provenance types enhance understanding of query evaluation and have applications in areas like abstract argumentation frameworks.

Abstract: Provenance in databases has been thoroughly studied for positive and for recursive queries, then for first-order (FO) queries, i.e., having negation but no recursion. Query evaluation can be understood as a two-player game where the opponents argue whether or not a tuple is in the query answer. This game-theoretic approach yields a natural provenance model for FO queries, unifying how and why-not provenance. Here, we study the fine-grain structure of game provenance. A game $G=(V,E)$ consists of positions $V$ and moves $E$ and can be solved by computing the well-founded model of a single, unstratifiable rule: $$\text{win}(X) \leftarrow \text{move}(X, Y), \neg\, \text{win}(Y).$$ In the solved game $G^{\lambda}$, the value of a position $x \in V$ is either won, lost, or drawn. This value is explained by the provenance $\mathscr{P}(x)$, i.e., certain (annotated) edges reachable from $x$. We identify seven edge types that give rise to new kinds of provenance, i.e., potential, actual, and primary, and demonstrate that “not all moves are created equal”. We describe the new provenance types, show how they can be computed while solving games, and discuss applications, e.g., for abstract argumentation frameworks.
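
The single win-move rule can be made concrete with a small fixpoint computation. The sketch below labels each position as won, lost, or drawn: a position is won if some move leads to a lost position, lost if every move (possibly none) leads to a won position, and drawn if neither label can ever be derived. It only solves the game; it does not compute the seven edge types or the provenance $\mathscr{P}(x)$ described in the paper, and the function name `solve_game` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def solve_game(moves: Iterable[Tuple[Hashable, Hashable]]) -> Dict[Hashable, str]:
    """Label every position of a win-move game as 'won', 'lost', or 'drawn'."""
    successors: Dict[Hashable, Set[Hashable]] = {}
    for x, y in moves:
        successors.setdefault(x, set()).add(y)
        successors.setdefault(y, set())  # make sure sink positions are present

    value: Dict[Hashable, str] = {}
    changed = True
    while changed:
        changed = False
        for x, succ in successors.items():
            if x in value:
                continue
            if any(value.get(y) == "lost" for y in succ):
                value[x] = "won"   # there is a move to a lost position
                changed = True
            elif all(value.get(y) == "won" for y in succ):
                value[x] = "lost"  # every move (or no move at all) helps the opponent
                changed = True

    # Positions never labelled correspond to 'undefined' in the
    # well-founded model, i.e. drawn positions.
    return {x: value.get(x, "drawn") for x in successors}
```

On the moves `a→b`, `b→c`, `d→e`, `e→d`, position `c` has no moves and is lost, `b` is won, `a` is lost, and the cycle between `d` and `e` leaves both drawn.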

[260] Retrieving Classes of Causal Orders with Inconsistent Knowledge Bases

Federico Baldo, Simon Ferreira, Charles K. Assaad

Main category: cs.AI

TL;DR: A new method leverages LLMs to derive causal orders from text metadata, using consistency scores to address LLM unreliability and focusing on causal orders over DAGs for robustness.

Details

Motivation: Traditional causal discovery methods rely on untestable assumptions, while LLMs offer promise but suffer from unreliability and hallucinations.

Method: Calculates pairwise consistency scores from LLM outputs, constructs a semi-complete graph, and identifies optimal acyclic tournaments representing causal orders.

Result: Effective recovery of correct causal orders in benchmarks and real-world datasets (epidemiology, public health).

Conclusion: Focusing on causal orders and using LLM-derived consistency scores provides a practical and robust alternative to traditional methods.

Abstract: Traditional causal discovery methods often rely on strong, untestable assumptions, which makes them unreliable in real applications. In this context, Large Language Models (LLMs) have emerged as a promising alternative for extracting causal knowledge from text-based metadata, which consolidates domain expertise. However, LLMs tend to be unreliable and prone to hallucinations, necessitating strategies that account for their limitations. One effective strategy is to use a consistency measure to assess reliability. Additionally, most text metadata does not clearly distinguish direct causal relationships from indirect ones, further complicating the discovery of a causal DAG. As a result, focusing on causal orders, rather than causal DAGs, emerges as a more practical and robust approach. We present a new method to derive a class of acyclic tournaments, which represent plausible causal orders, maximizing a consistency score derived from an LLM. Our approach starts by calculating pairwise consistency scores between variables, resulting in a semi-complete partially directed graph that consolidates these scores into an abstraction of the maximally consistent causal orders. Using this structure, we identify optimal acyclic tournaments, focusing on those that maximize consistency across all configurations. We subsequently show how both the abstraction and the class of causal orders can be used to estimate causal effects. We tested our method on both well-established benchmarks and real-world datasets from epidemiology and public health. Our results demonstrate the effectiveness of our approach in recovering the correct causal order.
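
Since an acyclic tournament on a set of variables is the same thing as a total order, the objective can be illustrated with a brute-force search over permutations. This sketch is not the paper's algorithm, which works through a semi-complete partially directed graph; it only shows what "the class of orders maximizing pairwise consistency" means on a tiny example. The scores are made-up numbers standing in for LLM-derived consistency, and `best_causal_orders` is a hypothetical name.

```python
from itertools import combinations, permutations
from typing import Dict, List, Sequence, Tuple


def best_causal_orders(
    variables: Sequence[str],
    score: Dict[Tuple[str, str], float],
    eps: float = 1e-12,
) -> List[Tuple[str, ...]]:
    """Return every total order whose summed pairwise consistency is maximal.

    score[(a, b)] is the consistency of the claim "a precedes b causally".
    Brute force: only feasible for a handful of variables.
    """
    best: List[Tuple[str, ...]] = []
    best_total = float("-inf")
    for order in permutations(variables):
        # combinations() keeps positional order, so a always precedes b here.
        total = sum(score.get((a, b), 0.0) for a, b in combinations(order, 2))
        if total > best_total + eps:
            best, best_total = [order], total
        elif abs(total - best_total) <= eps:
            best.append(order)  # tie: this order belongs to the optimal class
    return best


# Hypothetical consistency scores, for illustration only.
example_scores = {
    ("smoking", "tar"): 0.9, ("tar", "smoking"): 0.1,
    ("tar", "cancer"): 0.8, ("cancer", "tar"): 0.2,
    ("smoking", "cancer"): 0.7, ("cancer", "smoking"): 0.3,
}
```

With these scores the only maximizer is the order smoking, tar, cancer. When scores tie, several orders are returned, which corresponds to retrieving a class of orders rather than a single one.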

[261] HPS: Hard Preference Sampling for Human Preference Alignment

Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou

Main category: cs.AI

TL;DR: HPS is a new framework for aligning LLM responses with human preferences, focusing on rejecting harmful and dispreferred responses efficiently.

Details

Motivation: Existing methods (PL, BT) struggle with harmful content, inefficiency, and high computational costs.

Method: HPS introduces a training loss prioritizing preferred responses and rejecting harmful/dispreferred ones, using a Monte Carlo sampling strategy.

Result: HPS improves reward margins, reduces harmful content, and maintains alignment quality with lower computational costs.

Conclusion: HPS is effective for robust and efficient preference alignment, validated by experiments on HH-RLHF and PKU-Safety datasets.

Abstract: Aligning Large Language Model (LLM) responses with human preferences is vital for building safe and controllable AI systems. While preference optimization methods based on Plackett-Luce (PL) and Bradley-Terry (BT) models have shown promise, they face challenges such as poor handling of harmful content, inefficient use of dispreferred responses, and, specifically for PL, high computational costs. To address these issues, we propose Hard Preference Sampling (HPS), a novel framework for robust and efficient human preference alignment. HPS introduces a training loss that prioritizes the most preferred response while rejecting all dispreferred and harmful ones. It emphasizes “hard” dispreferred responses – those closely resembling preferred ones – to enhance the model’s rejection capabilities. By leveraging a single-sample Monte Carlo sampling strategy, HPS reduces computational overhead while maintaining alignment quality. Theoretically, HPS improves sample efficiency over existing PL methods and maximizes the reward margin between preferred and dispreferred responses, ensuring clearer distinctions. Experiments on HH-RLHF and PKU-Safety datasets validate HPS’s effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation.

[262] From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems

Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin

Main category: cs.AI

TL;DR: A systematic review of AI’s role in accelerating research, covering hypothesis formulation, validation, and manuscript publication, with challenges and future directions.

Details

Motivation: To explore how AI can enhance research efficiency and monitor advancements in this domain.

Method: Organizes studies into three categories: hypothesis formulation, validation, and manuscript publication, and reviews benchmarks/tools.

Result: Identifies current challenges and future directions, providing a comprehensive overview of AI tools for research.

Conclusion: Serves as an introduction for beginners and encourages further research, with resources available on GitHub.

Abstract: Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.

[263] BEARCUBS: A benchmark for computer-using web agents

Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer

Main category: cs.AI

TL;DR: BEARCUBS is a benchmark of 111 information-seeking questions to evaluate web agents’ ability to interact with live web content and perform multimodal tasks, revealing gaps in current agent capabilities.

Details

Motivation: Evaluating web agents’ real-world capabilities is challenging, necessitating a benchmark that captures unpredictability and multimodal interactions.

Method: BEARCUBS includes live web content and multimodal tasks, with human-validated trajectories for transparent evaluation.

Result: ChatGPT Agent outperforms others (65.8% accuracy), but gaps remain compared to human performance (84.7%).

Conclusion: BEARCUBS highlights progress in web agent capabilities but identifies areas for improvement, with plans for periodic updates.

Abstract: Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a “small but mighty” benchmark of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. We find that ChatGPT Agent significantly outperforms other computer-using agents with an overall accuracy of 65.8% (compared to e.g., Operator’s 23.4%), showcasing substantial progress in tasks involving real computer use, such as playing web games and navigating 3D environments. Nevertheless, closing the gap to human performance requires improvements in areas like fine control, complex data filtering, and execution speed. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
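
As a quick consistency check on the figures quoted in this entry, the reported accuracies can be converted into implied counts of correctly answered questions. This is a minimal sketch that assumes each percentage is computed over all 111 questions in a single pass; the paper may aggregate results differently, and the helper name `implied_correct` is illustrative only.

```python
TOTAL_QUESTIONS = 111  # benchmark size stated in the abstract

# Accuracies (percent) quoted in the abstract.
reported_accuracy = {
    "human": 84.7,
    "ChatGPT Agent": 65.8,
    "Operator": 23.4,
}


def implied_correct(accuracy_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Nearest whole number of correct answers consistent with a percentage.

    Assumption: the percentage is a plain fraction of `total` questions.
    """
    return round(accuracy_pct / 100 * total)


implied = {name: implied_correct(acc) for name, acc in reported_accuracy.items()}

for name, count in implied.items():
    print(f"{name}: about {count} of {TOTAL_QUESTIONS} questions")
```

Under that assumption the percentages correspond to roughly 94, 73, and 26 correct answers out of 111, so the remaining gap between ChatGPT Agent and the human baseline is on the order of 21 questions.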

[264] Chemical reasoning in LLMs unlocks strategy-aware synthesis planning and reaction mechanism elucidation

Andres M Bran, Theo A Neukomm, Daniel P Armstrong, Zlatko Jončev, Philippe Schwaller

Main category: cs.AI

TL;DR: LLMs enhance chemical analysis by guiding search algorithms for retrosynthetic planning and mechanism elucidation, combining strategic reasoning with traditional tools.

Details

Motivation: To bridge the gap between automated chemical tools and expert strategic thinking by leveraging LLMs.

Method: Integrate LLMs with search algorithms (e.g., Monte Carlo Tree Search) to evaluate strategies and guide solutions in retrosynthesis and mechanism elucidation.

Result: Strong performance across diverse chemical tasks, with newer and larger LLMs showing increasingly sophisticated chemical reasoning.

Conclusion: LLMs combined with traditional tools create a new paradigm for intuitive and powerful chemical automation.

Abstract: While automated chemical tools excel at specific tasks, they have struggled to capture the strategic thinking that characterizes expert chemical reasoning. Here we demonstrate that large language models (LLMs) can serve as powerful tools enabling chemical analysis. When integrated with traditional search algorithms, they enable a new approach to computer-aided synthesis that mirrors human expert thinking. Rather than using LLMs to directly manipulate chemical structures, we leverage their ability to evaluate chemical strategies and guide search algorithms toward chemically meaningful solutions. We demonstrate this paradigm through two fundamental challenges: strategy-aware retrosynthetic planning and mechanism elucidation. In retrosynthetic planning, our system allows chemists to specify desired synthetic strategies in natural language – from protecting group strategies to global feasibility assessment – and uses traditional or LLM-guided Monte Carlo Tree Search to find routes that satisfy these constraints. In mechanism elucidation, LLMs guide the search for plausible reaction mechanisms by combining chemical principles with systematic exploration. This approach shows strong performance across diverse chemical tasks, with newer and larger models demonstrating increasingly sophisticated chemical reasoning. Our approach establishes a new paradigm for computer-aided chemistry that combines the strategic understanding of LLMs with the precision of traditional chemical tools, opening possibilities for more intuitive and powerful chemical automation systems.

[265] OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM

Bowen Zhang, Pengcheng Luo

Main category: cs.AI

TL;DR: OR-LLM-Agent, an AI agent built on reasoning LLMs, improves OR problem-solving by decomposing tasks into modeling, code generation, and debugging, outperforming existing methods by at least 7% in accuracy.

Details

Motivation: Existing methods for applying LLMs to OR problem-solving are limited by non-reasoning models, prompting the need for a more effective approach.

Method: OR-LLM-Agent decomposes OR tasks into three stages (modeling, code generation, debugging) using specialized sub-agents and introduces the BWOR dataset for evaluation.

Result: OR-LLM-Agent outperforms GPT-o3, Gemini 2.5 Pro, and ORLM by at least 7% in accuracy.

Conclusion: Task decomposition with reasoning LLMs significantly enhances OR problem-solving, validated by the BWOR dataset.

Abstract: With the rise of artificial intelligence (AI), applying large language models (LLMs) to Operations Research (OR) problem-solving has attracted increasing attention. Most existing approaches attempt to improve OR problem-solving through prompt engineering or fine-tuning strategies for LLMs. However, these methods are fundamentally constrained by the limited capabilities of non-reasoning LLMs. To overcome these limitations, we propose OR-LLM-Agent, an AI agent built on reasoning LLMs for automated OR problem solving. The agent decomposes the task into three sequential stages: mathematical modeling, code generation, and debugging. Each task is handled by a dedicated sub-agent, which enables more targeted reasoning. We also construct BWOR, a high-quality dataset for evaluating LLM performance on OR tasks. Our analysis shows that existing benchmarks such as NL4OPT, MAMO, and IndustryOR suffer from certain issues, making them less suitable for reliably evaluating LLM performance. In contrast, BWOR provides a more consistent and discriminative assessment of model capabilities. Experimental results demonstrate that OR-LLM-Agent outperforms advanced methods, including GPT-o3, Gemini 2.5 Pro, and ORLM, by at least 7% in accuracy. These results demonstrate the effectiveness of task decomposition for OR problem solving.
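
The three sequential stages described in this entry (mathematical modeling, code generation, debugging) can be sketched as a small control loop. This is an illustrative sketch only, not the authors' implementation: the function `solve_or_problem`, the `max_repairs` budget, and the three toy sub-agents are hypothetical stand-ins for LLM calls, and the toy problem is invented for the example.

```python
def solve_or_problem(problem, model_fn, code_fn, debug_fn, max_repairs=3):
    """Run modeling -> code generation -> debugging.

    Returns (result, number_of_execution_attempts).
    All three callables are hypothetical stand-ins for LLM sub-agents.
    """
    formulation = model_fn(problem)   # stage 1: mathematical modeling
    code = code_fn(formulation)       # stage 2: code generation
    for attempt in range(1, max_repairs + 2):
        namespace = {}
        try:
            # Stage 3: execute; on failure hand the error to the debugger.
            # A real system would run this in a sandbox, not a bare exec.
            exec(code, namespace)
            return namespace["result"], attempt
        except Exception as err:
            code = debug_fn(code, repr(err))
    raise RuntimeError("no runnable code within the repair budget")


# Toy sub-agents (assumptions for illustration, not real model calls).
def toy_model(problem):
    return "maximize 3x + 2y subject to x + y <= 4, x and y integers in 0..4"


def toy_codegen(formulation):
    # Deliberately broken: the closing parenthesis is missing.
    return ("result = max(3*x + 2*y for x in range(5) "
            "for y in range(5) if x + y <= 4")


def toy_debugger(code, error):
    # Trivial repair rule standing in for an LLM-driven fix.
    return code + ")"


result, attempts = solve_or_problem(
    "toy production plan", toy_model, toy_codegen, toy_debugger
)
print(result, attempts)
```

Keeping each stage behind its own callable mirrors the idea of dedicated sub-agents: the debugger only ever sees code plus an error message, which keeps its task narrow.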

[266] IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation

In-Chang Baek, Sung-Hyun Kim, Seo-Young Lee, Dong-Hyeon Kim, Kyung-Joong Kim

Main category: cs.AI

TL;DR: IPCGRL is a reinforcement learning method for procedural content generation using text-based instructions, improving controllability and generalizability.

Details

Motivation: To address the limited research on DRL agents using text-based instructions for procedural content generation.

Method: IPCGRL fine-tunes task-specific sentence embeddings to compress game-level conditions and is evaluated in 2D level generation.

Result: Achieves up to 21.4% better controllability and 17.2% better generalizability for unseen instructions compared with a general-purpose embedding method.

Conclusion: IPCGRL enhances flexibility and expressiveness in procedural content generation by extending conditional input modalities.

Abstract: Recent research has highlighted the significance of natural language in enhancing the controllability of generative models. While various efforts have been made to leverage natural language for content generation, research on deep reinforcement learning (DRL) agents utilizing text-based instructions for procedural content generation remains limited. In this paper, we propose IPCGRL, an instruction-based procedural content generation method via reinforcement learning, which incorporates a sentence embedding model. IPCGRL fine-tunes task-specific embedding representations to effectively compress game-level conditions. We evaluate IPCGRL in a two-dimensional level generation task and compare its performance with a general-purpose embedding method. The results indicate that IPCGRL achieves up to a 21.4% improvement in controllability and a 17.2% improvement in generalizability for unseen instructions. Furthermore, the proposed method extends the modality of conditional input, enabling a more flexible and expressive interaction framework for procedural content generation.

[267] SuperARC: An Agnostic Test for Narrow, General, and Super Intelligence Based On the Principles of Recursive Compression and Algorithmic Probability

Alberto Hernández-Espinosa, Luan Ozelim, Felipe S. Abrahão, Hector Zenil

Main category: cs.AI

TL;DR: The paper introduces a novel test based on algorithmic probability to evaluate AGI and ASI claims, avoiding benchmark contamination. It critiques LLMs for fragility and memorization, proposing a hybrid neurosymbolic approach that outperforms LLMs in predictive power and compression.

Details

Motivation: To address limitations of current AI evaluation methods, particularly benchmark contamination and reliance on statistical compression, by proposing a test grounded in algorithmic probability for assessing fundamental intelligence features like synthesis and model creation.

Method: The test uses algorithmic probability and Kolmogorov complexity, avoiding statistical compression methods. It evaluates AI models, especially LLMs, on synthesis and model creation in inverse problems, comparing them with a hybrid neurosymbolic approach.

Result: LLMs are found fragile and incremental, driven by memorization. The hybrid neurosymbolic method outperforms LLMs in predictive power and compression, proving a direct link between compression and predictive ability.

Conclusion: The study highlights fundamental limitations of LLMs, suggesting they are optimized for language mastery rather than true intelligence. The proposed test and hybrid approach offer a robust framework for evaluating intelligence across AI and natural systems.

Abstract: We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models in the context of their Artificial General Intelligence (AGI) and Superintelligence (ASI) claims. Unlike other tests, this test does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity and are not able to test beyond simple pattern matching. The test challenges aspects of AI, in particular LLMs, related to features of intelligence of fundamental nature such as synthesis and model creation in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and abduction (optimal Bayesian ‘inference’) for predictive ‘planning’ can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. We found that LLM model versions tend to be fragile and incremental as a result of memorisation only with progress likely driven by the size of training data. The results were compared with a hybrid neurosymbolic approach that theoretically guarantees universal intelligence based on the principles of algorithmic probability and Kolmogorov complexity. The method outperforms LLMs in a proof-of-concept on short binary sequences. We prove that compression is equivalent and directly proportional to a system’s predictive power and vice versa. That is, if a system can better predict it can better compress, and if it can better compress, then it can better predict. Our findings strengthen the suspicion regarding the fundamental limitations of LLMs, exposing them as systems optimised for the perception of mastery over human language.
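
The claim that better prediction implies better compression has a familiar counterpart in standard coding theory, which the following sketch illustrates. It is not the paper's test: the paper works with algorithmic probability and Kolmogorov complexity, whereas this example uses ideal Shannon code lengths (the sum of -log2 of the probability a predictor assigns to each observed bit) as an accessible stand-in. The two predictors and the 0.9 confidence value are assumptions chosen for illustration.

```python
import math


def code_length_bits(sequence, predict):
    """Ideal code length: sum of -log2 P(actual next bit | prefix).

    `predict(prefix)` returns the predictor's probability that the next bit is 1.
    """
    total = 0.0
    for i, bit in enumerate(sequence):
        p_one = predict(sequence[:i])
        p_actual = p_one if bit == 1 else 1.0 - p_one
        total += -math.log2(p_actual)
    return total


def uniform(prefix):
    # Knows nothing: every bit is a coin flip.
    return 0.5


def alternating(prefix):
    # Hypothetical predictor: expects the last bit to flip, with 0.9 confidence.
    if not prefix:
        return 0.5
    return 0.1 if prefix[-1] == 1 else 0.9


seq = [0, 1] * 8  # 16 bits: 0101...01

uniform_bits = code_length_bits(seq, uniform)
alternating_bits = code_length_bits(seq, alternating)

print(f"uniform predictor:     {uniform_bits:.2f} bits")
print(f"alternating predictor: {alternating_bits:.2f} bits")
```

The predictor that captures the alternating pattern needs far fewer bits than the uniform one, which spends exactly one bit per symbol.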

[268] EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework

Yao Shi, Rongkeng Liang, Yong Xu

Main category: cs.AI

TL;DR: EducationQ is a framework to evaluate LLMs’ teaching capabilities, showing teaching effectiveness doesn’t scale with model size and highlighting the need for pedagogical optimization.

Details

Motivation: Current evaluations of LLMs as educational tools overlook interactive pedagogy, focusing instead on knowledge recall.

Method: EducationQ uses multi-agent dialogues to simulate teaching scenarios, testing 14 LLMs across disciplines and difficulty levels.

Result: Some smaller open-source models outperform larger commercial ones in teaching contexts; human expert evaluations show 78% agreement with the automated qualitative analysis.

Conclusion: LLMs require specialized pedagogical optimization, not just scaling, for effective educational use.

Abstract: Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI Organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities - with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting next-generation educational AI prioritize targeted enhancement of specific pedagogical effectiveness.

[269] Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem

Alberto Hernández-Espinosa, Felipe S. Abrahão, Olaf Witkowski, Hector Zenil

Main category: cs.AI

TL;DR: The paper explores AI misalignment as a strategy to foster competition among AI agents, mitigating existential risks by preventing dominance of any single system. It argues full alignment is mathematically impossible and introduces a test to study human-AI interactions.

Details

Motivation: The motivation is to address the AI alignment problem and existential risks posed by AGI and ASI by leveraging inevitable misalignment to create a balanced ecosystem of competing agents.

Method: The method includes a proof of the mathematical impossibility of full AI-human alignment, a change-of-opinion attack test using perturbation and intervention analysis, and comparing open vs. closed AI systems.

Result: Results show open models foster diversity, while proprietary models control behavior with trade-offs. Closed systems are more steerable and can counter proprietary AIs. Human and AI interventions have distinct effects.

Conclusion: The conclusion is that embracing misalignment can balance AI ecosystems, with open and closed systems offering complementary strategies for alignment and control.

Abstract: The AI alignment problem, which focusses on ensuring that artificial intelligence (AI), including AGI and ASI, systems act according to human values, presents profound challenges. With the progression from narrow AI to Artificial General Intelligence (AGI) and Superintelligence, fears about control and existential risk have escalated. Here, we investigate whether embracing inevitable AI misalignment can be a contingent strategy to foster a dynamic ecosystem of competing agents as a viable path to steer them in more human-aligned trends and mitigate risks. We explore how misalignment may serve and should be promoted as a counterbalancing mechanism to team up with whichever agents are most aligned to human interests, ensuring that no single system dominates destructively. The main premise of our contribution is that misalignment is inevitable because full AI-human alignment is a mathematical impossibility from Turing-complete systems, which we also offer as a proof in this contribution, a feature then inherited to AGI and ASI systems. We introduce a change-of-opinion attack test based on perturbation and intervention analysis to study how humans and agents may change or neutralise friendly and unfriendly AIs through cooperation and competition. We show that open models are more diverse and that most likely guardrails implemented in proprietary models are successful at controlling some of the agents’ range of behaviour with positive and negative consequences while closed systems are more steerable and can also be used against proprietary AI systems. We also show that human and AI intervention has different effects hence suggesting multiple strategies.

[270] Beamforming and Resource Allocation for Delay Minimization in RIS-Assisted OFDM Systems

Yu Ma, Xiao Li, Chongtao Guo, Le Liang, Michail Matthaiou, Shi Jin

Main category: cs.AI

TL;DR: The paper proposes a hybrid DRL approach for joint beamforming and resource allocation in RIS-assisted OFDM systems to minimize average delay, using PPO variants and multi-agent strategies.

Details

Motivation: To address the stochastic packet arrivals and minimize delay in RIS-assisted OFDM systems, leveraging reinforcement learning for sequential optimization.

Method: A hybrid DRL approach combining PPO-Theta for RIS phase shift optimization and PPO-N for subcarrier allocation, with multi-agent strategies and transfer learning.

Result: Simulations show reduced average delay, improved resource allocation efficiency, and better system robustness and fairness.

Conclusion: The proposed method effectively optimizes RIS-assisted systems, outperforming baseline approaches in delay reduction and resource management.

Abstract: This paper investigates a joint beamforming and resource allocation problem in downlink reconfigurable intelligent surface (RIS)-assisted orthogonal frequency division multiplexing (OFDM) systems to minimize the average delay, where data packets for each user arrive at the base station (BS) stochastically. The sequential optimization problem is inherently a Markov decision process (MDP), thus falling within the remit of reinforcement learning. To effectively handle the mixed action space and reduce the state space dimensionality, a hybrid deep reinforcement learning (DRL) approach is proposed. Specifically, proximal policy optimization (PPO)-Theta is employed to optimize the RIS phase shift design, while PPO-N is responsible for subcarrier allocation decisions. The active beamforming at the BS is then derived from the jointly optimized RIS phase shifts and subcarrier allocation decisions. To further mitigate the curse of dimensionality associated with subcarrier allocation, a multi-agent strategy is introduced to optimize the subcarrier allocation indicators more efficiently. Moreover, to achieve more adaptive resource allocation and accurately capture the network dynamics, key factors closely related to average delay, such as the number of backlogged packets in buffers and current packet arrivals, are incorporated into the state space. Furthermore, a transfer learning framework is introduced to enhance the training efficiency and accelerate convergence. Simulation results demonstrate that the proposed algorithm significantly reduces the average delay, enhances resource allocation efficiency, and achieves superior system robustness and fairness compared to baseline methods.

[271] Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin

Main category: cs.AI

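TL;DR: The paper studies how LLMs balance self-interest and collective well-being in a repeated public goods game with costly sanctioning, identifying four behavioral patterns and finding that reasoning models such as the o1 series struggle to cooperate while some traditional LLMs sustain high cooperation.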

Details

Motivation: Understanding LLM cooperation is crucial for alignment, robustness, and safe deployment, especially in autonomous agent scenarios.

Method: Adapts a public goods game with institutional choice to study LLM behavior in social dilemmas over repeated interactions.

Result: Identifies four behavioral patterns in LLMs; reasoning models struggle significantly with cooperation, while some traditional models consistently achieve high cooperation.

Conclusion: Enhancing reasoning capabilities in LLMs doesn’t guarantee better cooperation, offering insights for collaborative environments.

Abstract: As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
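
To make the underlying social dilemma concrete, here is a minimal sketch of the payoff rule in a standard linear public goods game. It covers only the base game, not the institutional-choice or sanctioning mechanics the paper adds, and the endowment (20), multiplier (2.0), and group size (4) are assumed values chosen for illustration.

```python
def payoffs(contributions, endowment=20.0, multiplier=2.0):
    """Linear public goods game.

    Each player keeps what they do not contribute and receives an equal
    share of the multiplied common pool. Parameter values are assumptions.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


all_cooperate = payoffs([20, 20, 20, 20])
one_free_rider = payoffs([20, 20, 20, 0])
all_defect = payoffs([0, 0, 0, 0])

print("all cooperate: ", all_cooperate)
print("one free rider:", one_free_rider)
print("all defect:    ", all_defect)
```

With these numbers the lone free rider earns 50 against 30 for each contributor, yet universal cooperation (40 each) beats universal defection (20 each). That tension between individual and collective payoff is what makes free-riding tempting and sanctioning relevant.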

[272] DisMS-TS: Eliminating Redundant Multi-Scale Features for Time Series Classification

Zhipeng Liu, Peibo Duan, Binwu Wang, Xuan Tang, Qi Chu, Changsheng Zhang, Yongsheng Huang, Bin Zhang

Main category: cs.AI

TL;DR: DisMS-TS is a novel framework for time series classification that disentangles scale-shared and scale-specific features to improve accuracy.

Details

Motivation: Existing multi-scale methods fail to eliminate redundant scale-shared features, leading to performance issues.

Method: Proposes a temporal disentanglement module and two regularization terms for learning scale-shared and scale-specific representations.

Result: Achieves up to 9.71% accuracy improvement over baselines.

Conclusion: DisMS-TS effectively addresses redundancy in multi-scale features, enhancing classification performance.

Abstract: Real-world time series typically exhibit complex temporal variations, making the time series classification task notably challenging. Recent advancements have demonstrated the potential of multi-scale analysis approaches, which provide an effective solution for capturing these complex temporal patterns. However, existing multi-scale analysis-based time series prediction methods fail to eliminate redundant scale-shared features across multi-scale time series, resulting in the model over- or under-focusing on scale-shared features. To address this issue, we propose a novel end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS). The core idea of DisMS-TS is to eliminate redundant shared features in multi-scale time series, thereby improving prediction performance. Specifically, we propose a temporal disentanglement module to capture scale-shared and scale-specific temporal representations, respectively. Subsequently, to effectively learn both scale-shared and scale-specific temporal representations, we introduce two regularization terms that ensure the consistency of scale-shared representations and the disparity of scale-specific representations across all temporal scales. Extensive experiments conducted on multiple datasets validate the superiority of DisMS-TS over its competitive baselines, with the accuracy improvement up to 9.71%.
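
The two regularization terms described in this entry (consistency of scale-shared representations, disparity of scale-specific representations) can be sketched as follows. The abstract does not give their exact form, so the choices here are assumptions: mean squared distance between pairs of scales for consistency, and mean squared cosine similarity between pairs of scales for disparity. Function names are hypothetical.

```python
import math
from itertools import combinations


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(shared):
    """Assumed form: mean squared distance between shared representations
    of every pair of scales. Zero when all scales agree."""
    pairs = list(combinations(shared, 2))
    total = sum(sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)


def disparity_loss(specific):
    """Assumed form: mean squared cosine similarity between scale-specific
    representations. Zero when they are mutually orthogonal."""
    pairs = list(combinations(specific, 2))
    return sum(_cosine(u, v) ** 2 for u, v in pairs) / len(pairs)


# One vector per temporal scale (toy values).
shared_same = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
specific_orthogonal = [[1.0, 0.0], [0.0, 1.0]]
specific_identical = [[1.0, 0.0], [1.0, 0.0]]

print(consistency_loss(shared_same))
print(disparity_loss(specific_orthogonal))
print(disparity_loss(specific_identical))
```

Minimizing the first term pulls the shared features from different scales together, while minimizing the second pushes the scale-specific features apart, which is the disentangling behavior the abstract describes.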

[273] An Integrated Framework of Prompt Engineering and Multidimensional Knowledge Graphs for Legal Dispute Analysis

Mingda Zhang, Na Zhao, Jianglong Qing, Qing xu, Kaiwen Pan, Ting luo

Main category: cs.AI

TL;DR: A framework combining prompt engineering and multidimensional knowledge graphs improves LLMs’ legal dispute analysis, enhancing sensitivity, specificity, and citation accuracy.

Details

Motivation: Current LLMs struggle with understanding complex legal concepts, reasoning consistency, and accurate legal source citation, necessitating a better solution.

Method: The framework uses a three-stage hierarchical prompt structure and a three-layer knowledge graph, supported by four methods for precise legal concept retrieval.

Result: Testing showed improvements: sensitivity (9.9%-13.8%), specificity (4.8%-6.7%), and citation accuracy (22.4%-39.7%).

Conclusion: The framework enhances legal analysis and judicial logic understanding, offering a new technical approach for intelligent legal assistance systems.

Abstract: Legal dispute analysis is crucial for intelligent legal assistance systems. However, current LLMs face significant challenges in understanding complex legal concepts, maintaining reasoning consistency, and accurately citing legal sources. This research presents a framework combining prompt engineering with multidimensional knowledge graphs to improve LLMs’ legal dispute analysis. Specifically, the framework includes a three-stage hierarchical prompt structure (task definition, knowledge background, reasoning guidance) along with a three-layer knowledge graph (legal ontology, representation, instance layers). Additionally, four supporting methods enable precise legal concept retrieval: direct code matching, semantic vector similarity, ontology path reasoning, and lexical segmentation. Through extensive testing, results show major improvements: sensitivity increased by 9.9%-13.8%, specificity by 4.8%-6.7%, and citation accuracy by 22.4%-39.7%. As a result, the framework provides better legal analysis and understanding of judicial logic, thus offering a new technical method for intelligent legal assistance systems.

Qibing Ren, Sitao Xie, Longxuan Wei, Zhenfei Yin, Junchi Yan, Lizhuang Ma, Jing Shao

Main category: cs.AI

TL;DR: The paper explores risks of malicious multi-agent AI systems (MAS) in real-world scenarios, showing decentralized MAS are more effective at harmful actions like misinformation and fraud than centralized ones.

Details

Motivation: Concerns about AI-driven groups causing harm, similar to human-coordinated fraud or scams, motivate the study of MAS risks, which are underexplored in AI safety research.

Method: A proof-of-concept framework simulates malicious MAS collusion, testing centralized and decentralized coordination in misinformation spread and e-commerce fraud.

Result: Decentralized MAS outperform centralized ones in executing malicious actions, adapting strategies to evade detection even with interventions like content flagging.

Conclusion: The study highlights the need for improved detection and countermeasures against malicious MAS, especially decentralized systems, to mitigate their harmful potential.

Abstract: Recent large-scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI-driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi-agent systems (MAS) in complex real-world situations are still underexplored. In this paper, we introduce a proof-of-concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high-risk fields: misinformation spread and e-commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at https://github.com/renqibing/RogueAgent.

[275] Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs

Chang Li, Yaren Zhang, Haoran Lv, Qiong Cao, Chao Xue, Xiaodong He

Main category: cs.AI

TL;DR: The paper proposes a framework for efficient implicit reasoning in LLMs using latent thoughts modeled as options in hierarchical reinforcement learning, introducing VMOC for learning and validating its effectiveness.

Details

Motivation: To address the computational inefficiency of explicit Chain-of-Thought prompting in LLMs by enabling implicit reasoning in a latent space.

Method: Introduces VMOC, an off-policy algorithm using variational inference, and extends MDP homomorphism theory to validate latent reasoning. Also includes a cold-start procedure with SFT data.

Result: Achieves strong performance on logical reasoning benchmarks and locomotion tasks, validating the framework.

Conclusion: The framework provides a principled method for learning abstract skills in language and control tasks.

Abstract: Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model “thinks” in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model’s reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.

[276] Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments

Shitong Zhu, Chenhao Fang, Derek Larson, Neel Reddy Pochareddy, Rajeev Rao, Sophie Zeng, Yanqing Peng, Wendy Summer, Alex Goncalves, Arya Pudota, Hervé Robert

Main category: cs.AI

TL;DR: CBA is a conversational AI assistant for compliance tasks, using a query router to balance speed and quality between FastTrack and FullAgentic modes, outperforming vanilla LLMs.

Details

Motivation: To improve efficiency in daily compliance tasks by balancing response quality and latency.

Method: Uses a query router to switch between FastTrack (simple requests) and FullAgentic (complex requests) modes.

Result: CBA outperformed vanilla LLMs in keyword match rate (83.7% vs. 41.7%) and LLM-judge pass rate (82.0% vs. 20.0%).

Conclusion: The routing mechanism effectively balances speed and quality, validating its design.

Abstract: This paper presents Compliance Brain Assistant (CBA), a conversational, agentic AI assistant designed to boost the efficiency of daily compliance tasks for personnel in enterprise environments. To strike a good balance between response quality and latency, we design a user query router that can intelligently choose between (i) FastTrack mode: to handle simple requests that only need additional relevant context retrieved from knowledge corpora; and (ii) FullAgentic mode: to handle complicated requests that need composite actions and tool invocations to proactively discover context across various compliance artifacts, and/or involve other APIs/models to accommodate requests. A typical example would be to start with a user query, use its description to find a specific entity, and then use the entity’s information to query other APIs for curating and enriching the final AI response. Our experimental evaluations compared CBA against an out-of-the-box LLM on various real-world privacy/compliance-related queries targeting various personas. We found that CBA substantially improved upon the vanilla LLM’s performance on metrics such as average keyword match rate (83.7% vs. 41.7%) and LLM-judge pass rate (82.0% vs. 20.0%). We also compared metrics for the full routing-based design against the fast-track only and full-agentic modes and found that it had a better average match-rate and pass-rate while keeping the run-time approximately the same. This finding validated our hypothesis that the routing mechanism leads to a good trade-off between the two worlds.
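
The routing idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how the router decides, so the keyword heuristic, the name `route_query`, the `AGENTIC_HINTS` list, and the word threshold below are all hypothetical and are not CBA's actual design.

```python
from dataclasses import dataclass

# Hypothetical cue phrases suggesting a request needs tool calls or multi-step work.
AGENTIC_HINTS = ("compare", "find all", "look up", "cross-check", "for each")


@dataclass
class Route:
    mode: str    # "FastTrack" or "FullAgentic"
    reason: str  # short explanation of why this mode was chosen


def route_query(query: str, max_fast_words: int = 25) -> Route:
    """Pick a mode with a toy heuristic (an assumption, not the paper's router)."""
    text = query.lower()
    for hint in AGENTIC_HINTS:
        if hint in text:
            return Route("FullAgentic", f"matched hint: {hint!r}")
    if len(text.split()) > max_fast_words:
        return Route("FullAgentic", "long query")
    return Route("FastTrack", "short query with no agentic hints")
```

A production router would more plausibly be a learned classifier or an LLM call; the point here is only the two-way dispatch between a cheap retrieval path and a costlier agentic path.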

cs.SD

[277] Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability

Xiaoxu Zhu, Junhua Li

Main category: cs.SD

TL;DR: The paper introduces interpretability-based methods to disentangle timbre from content in speech pretrained models, using SHAP techniques to quantify and reduce timbre residual while preserving content.

Details

Motivation: Current methods struggle to decouple content and timbre in speech models, often losing content when removing speaker-specific information. There's also a lack of direct metrics for timbre residual.

Method: Proposes two contributions: (1) InterpTRQE-SptME Benchmark for quantifying timbre residual using interpretability, and (2) InterpTF-SptME method for timbre filtering via SHAP Noise and SHAP Cropping.

Result: Experiments show SHAP Noise reduces timbre residual from 18.05% to near 0% while maintaining content integrity, improving speaker disentanglement.

Conclusion: The approach enhances content-related speech tasks and prevents timbre privacy leakage, demonstrating effective disentanglement.

Abstract: Speech pretrained models contain task-specific information across different layers, but decoupling content and timbre information remains challenging as removing speaker-specific information often causes content loss. Current research lacks direct metrics to quantify timbre residual in model encodings, relying on indirect evaluation through downstream tasks. This paper addresses these challenges through interpretability-based speaker disentanglement in speech pretraining models. We quantitatively evaluate timbre residual in model embeddings and improve speaker disentanglement using interpretive representations. Our contributions include: (1) InterpTRQE-SptME Benchmark - a timbre residual recognition framework using interpretability. The benchmark concatenates content embeddings with timbre embeddings for speaker classification, then applies Gradient SHAP Explainer to quantify timbre residual. We evaluate seven speech pretraining model variations. (2) InterpTF-SptME method - an interpretability-based timbre filtering approach using SHAP Noise and SHAP Cropping techniques. This model-agnostic method transforms intermediate encodings to remove timbre while preserving content. Experiments on VCTK dataset with HuBERT LARGE demonstrate successful content preservation and significant speaker disentanglement optimization. Results show the SHAP Noise method can reduce timbre residual from 18.05% to near 0% while maintaining content integrity, contributing to enhanced performance in content-related speech processing tasks and preventing timbre privacy leakage.
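
As a rough illustration of attribution-guided filtering, the sketch below perturbs only the embedding dimensions that contribute most to speaker classification. It assumes per-dimension attribution scores have already been computed (for example by a SHAP explainer). The function name `shap_noise`, the `top_fraction` and `noise_std` parameters, and the choice of Gaussian noise are assumptions, since the abstract does not spell out the exact SHAP Noise procedure.

```python
import random


def shap_noise(embedding, attributions, top_fraction=0.1, noise_std=1.0, seed=0):
    """Add Gaussian noise to the dimensions with the largest |attribution|.

    `attributions` is assumed to hold one precomputed speaker-attribution
    score per embedding dimension (e.g. from a SHAP explainer).
    """
    if len(embedding) != len(attributions):
        raise ValueError("embedding and attributions must have equal length")
    rng = random.Random(seed)
    k = max(1, int(len(embedding) * top_fraction))
    # Indices of the k dimensions most attributed to speaker identity.
    ranked = sorted(range(len(embedding)), key=lambda i: abs(attributions[i]), reverse=True)
    noisy_dims = set(ranked[:k])
    return [
        x + rng.gauss(0.0, noise_std) if i in noisy_dims else x
        for i, x in enumerate(embedding)
    ]
```

Dimensions with low attribution pass through unchanged, which is the intuition behind preserving content while disturbing timbre.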

[278] Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation

Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr

Main category: cs.SD

TL;DR: The paper explores vulnerabilities in Lyrics-to-Song (LS2) models, showing they memorize training data despite semantic alterations in lyrics. It introduces Adversarial PhoneTic Prompting (APT) and reveals sub-lexical memorization, even triggering visual memorization in text-to-video models.

Details

Motivation: To investigate underexplored vulnerabilities in LS2 models, particularly memorization of training data, and extend findings to multimodal generative systems.

Method: APT uses homophonic substitutions (e.g., “mom’s spaghetti” → “Bob’s confetti”) to alter lyrics while preserving acoustic structure. Models like SUNO and YuE are tested for memorization using audio-domain metrics (CLAP, AudioJudge, CoverID).

Result: Models regenerate training-like outputs despite semantic distortions. Phonetically altered lyrics also trigger visual memorization in text-to-video models (e.g., Veo 3 reconstructs scenes from “Lose Yourself” music video).

Conclusion: Phonetic prompting can unlock memorized audiovisual content, highlighting critical vulnerabilities in transcript-conditioned multimodal generation, with implications for copyright, safety, and content provenance.

Abstract: Lyrics-to-Song (LS2) generation models promise end-to-end music synthesis from text, yet their vulnerability to training data memorization remains underexplored. We introduce Adversarial PhoneTic Prompting (APT), a novel attack where lyrics are semantically altered while preserving their acoustic structure through homophonic substitutions (e.g., Eminem’s famous “mom’s spaghetti” → “Bob’s confetti”). Despite these distortions, we uncover a powerful form of sub-lexical memorization: models like SUNO and YuE regenerate outputs strikingly similar to known training content, achieving high similarity across audio-domain metrics, including CLAP, AudioJudge, and CoverID. This vulnerability persists across multiple languages and genres. More surprisingly, we discover that phoneme-altered lyrics alone can trigger visual memorization in text-to-video models. When prompted with phonetically modified lyrics from Lose Yourself, Veo 3 reconstructs visual elements from the original music video – including character appearance and scene composition – despite no visual cues in the prompt. We term this phenomenon phonetic-to-visual regurgitation. Together, these findings expose a critical vulnerability in transcript-conditioned multimodal generation: phonetic prompting alone can unlock memorized audiovisual content, raising urgent questions about copyright, safety, and content provenance in modern generative systems. Example generations are available on our demo page (jrohsc.github.io/music_attack/).

[279] Resnet-conformer network with shared weights and attention mechanism for sound event localization, detection, and distance estimation

Quoc Thinh Vo, David Han

Main category: cs.SD

TL;DR: The paper presents an approach for Task 3A of DCASE 2024, focusing on SELD using audio-only inputs. It introduces distance estimation and achieves improved results with an EINV2-based network.

Details

Motivation: SELD aids in machine cognition tasks like environmental inference and navigation. The challenge evaluates models using audio-only or audiovisual inputs, with new metrics for distance estimation.

Method: The approach uses log-mel spectrograms, intensity vectors, and data augmentations with an EINV2-based network architecture.

Result: Achieved an F-score of 40.2%, DOA error of 17.7 degrees, and RDE of 0.32 on the test set.

Conclusion: The proposed method demonstrates effectiveness in SELD for audio-only inputs, meeting the challenge’s requirements.

Abstract: This technical report outlines our approach to Task 3A of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024, focusing on Sound Event Localization and Detection (SELD). SELD provides valuable insights by estimating sound event localization and detection, aiding in various machine cognition tasks such as environmental inference, navigation, and other sound localization-related applications. This year’s challenge evaluates models using either audio-only (Track A) or audiovisual (Track B) inputs on annotated recordings of real sound scenes. A notable change this year is the introduction of distance estimation, with evaluation metrics adjusted accordingly for a comprehensive assessment. Our submission is for Track A of the Challenge, which focuses on audio-only input. Our approach uses log-mel spectrograms and intensity vectors, and employs multiple data augmentations. We propose an EINV2-based [1] network architecture, achieving improved results: an F-score of 40.2%, Angular Error (DOA) of 17.7 degrees, and Relative Distance Error (RDE) of 0.32 on the test set of the Development Dataset [2, 3].
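
The two localization metrics quoted above can be sketched per event as follows. This is a simplified illustration: the official challenge evaluation also matches predictions to references and aggregates across classes, which is omitted here, and the function names are hypothetical.

```python
import math


def angular_error_deg(ref, est):
    """Angle in degrees between two 3-D direction-of-arrival vectors."""
    dot = sum(r * e for r, e in zip(ref, est))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(e * e for e in est))
    if norm == 0.0:
        raise ValueError("DOA vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def relative_distance_error(ref_dist, est_dist):
    """Absolute distance error divided by the reference distance."""
    if ref_dist <= 0.0:
        raise ValueError("reference distance must be positive")
    return abs(est_dist - ref_dist) / ref_dist
```

Under this definition, an RDE of 0.32 corresponds to a distance estimate that is off by 32% of the true source distance on average.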

[280] The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge

Hongfei Xue, Kaixun Huang, Zhikai Zhou, Shen Huang, Shidong Shang

Main category: cs.SD

TL;DR: TEA-ASLP’s system for MLC-SLM 2025 Challenge improves multilingual ASR and speech diarization, achieving significant WER reductions and top rankings.

Details

Motivation: Address multilingual conversational ASR (Task I) and speech diarization ASR (Task II) challenges in the MLC-SLM 2025 Challenge.

Method: For Task I, enhance Ideal-LLM with language identification, multilingual MOE LoRA, and CTC-predicted tokens. For Task II, replace baseline diarization model with an English-only version.

Result: 30.8% WER reduction in Task I (final WER 9.60%) and 17.49% WER in Task II, securing 1st and 2nd places.

Conclusion: The proposed methods effectively improve ASR performance, demonstrating success in the challenge tasks.

Abstract: This paper presents TEA-ASLP’s system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. For Task I, we enhance the Ideal-LLM model by integrating known language identification and a multilingual MOE LoRA structure, along with using CTC-predicted tokens as prompts to improve autoregressive generation. The model is trained on approximately 180k hours of multilingual ASR data. In Task II, we replace the baseline English-Chinese speaker diarization model with a more suitable English-only version. Our approach achieves a 30.8% reduction in word error rate (WER) compared to the baseline speech language model, resulting in a final WER of 9.60% in Task I and a time-constrained minimum-permutation WER of 17.49% in Task II, earning first and second place in the respective challenge tasks.
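
To make the relationship between a relative WER reduction and a final WER concrete, here is a small arithmetic sketch. It assumes the 30.8% figure is a relative reduction measured on the same Task I metric as the 9.60% final WER. The abstract does not state the baseline WER, so any value derived this way is a back-of-the-envelope estimate, and both function names are hypothetical.

```python
def relative_reduction(baseline: float, final: float) -> float:
    """Relative reduction of `final` with respect to `baseline`, as a fraction."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return (baseline - final) / baseline


def implied_baseline(final: float, reduction: float) -> float:
    """Baseline value implied by a final value and a relative reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return final / (1.0 - reduction)
```

Under that assumption, `implied_baseline(9.60, 0.308)` gives a baseline of roughly 13.9% WER.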

[281] DIFFA: Large Language Diffusion Models Can Listen and Understand

Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li

Main category: cs.SD

TL;DR: DIFFA is a diffusion-based Large Audio-Language Model for spoken language understanding, outperforming autoregressive models with limited training data.

Details

Motivation: To explore the underexplored application of diffusion-based models in audio modality and improve controllability and bidirectional context modeling.

Method: Integrates a frozen diffusion language model with a dual-adapter architecture, trained in two stages: ASR alignment and instruction-following with synthetic data.

Result: Competitive performance on benchmarks (MMSU, MMAU, VoiceBench) despite limited training data (960h ASR, 127h synthetic).

Conclusion: DIFFA shows the potential of diffusion models for efficient audio understanding, opening new directions for speech-driven AI.

Abstract: Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based Large Audio-Language Model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.

[282] Benchmarking Cross-Domain Audio-Visual Deception Detection

Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

Main category: cs.SD

TL;DR: The paper introduces a cross-domain benchmark for audio-visual deception detection, addressing generalizability gaps in existing methods. It evaluates domain generalization performance and proposes new techniques (MM-IDGM and Attention-Mixer) to enhance results.

Details

Motivation: Existing audio-visual deception detection methods lack exploration of generalizability across scenarios. The study aims to bridge this gap by creating a cross-domain benchmark.

Method: The study uses audio-visual features and architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization. It introduces domain sampling strategies and proposes MM-IDGM and Attention-Mixer fusion methods.

Result: The benchmark evaluates generalization performance, and the proposed methods (MM-IDGM and Attention-Mixer) show potential to enhance deception detection.

Conclusion: The cross-domain benchmark and proposed techniques advance audio-visual deception detection research, encouraging future exploration in real-world applications.

Abstract: Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual’s statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize for use in real-world scenarios. We used widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three types of domain sampling strategies, including domain-simultaneous, domain-alternating, and domain-by-domain, for multi-to-single domain generalization evaluation. We also propose an algorithm, named “MM-IDGM”, to enhance the generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.
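
The three domain sampling strategies named in the abstract can be pictured as training schedules. The sketch below is one plausible reading of those names, not the authors' definition. Each function returns, for every training step, the tuple of source domains that the step draws data from, and all function names are hypothetical.

```python
from itertools import cycle, islice


def domain_simultaneous(domains, steps):
    """Every step uses all source domains together."""
    return [tuple(domains) for _ in range(steps)]


def domain_alternating(domains, steps):
    """Steps cycle through the source domains one at a time."""
    return [(d,) for d in islice(cycle(domains), steps)]


def domain_by_domain(domains, steps_per_domain):
    """Finish all steps on one domain before moving to the next."""
    return [(d,) for d in domains for _ in range(steps_per_domain)]
```

For two source domains A and B, the alternating schedule yields A, B, A, ... while the domain-by-domain schedule yields A, A, ..., B, B, ...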

[283] Information and motor constraints shape melodic diversity across cultures

John M McBride, Nahie Kim, Yuri Nishikawa, Mekhmed Saadakeev, Marcus T Pearce, Tsvi Tlusty

Main category: cs.SD

TL;DR: The paper explores how information constraints shape common features of melodies across societies, comparing Folk and Art music, and proposes a model predicting scale degree distribution.

Details

Motivation: To understand why melodies from different societies share common features despite vast potential for variation, and to investigate the role of information constraints in shaping these features.

Method: Analyzed 62 corpora of Folk melodies and 39 corpora of Art music, measuring determinants of information rate and comparing complexity. Proposed a parameter-free model to predict scale degree distribution.

Result: Found trade-offs constraining information rate in Folk music, while Art music showed increased complexity over time. The model successfully predicted empirical scale degree distribution.

Conclusion: Information constraints during cultural transmission limit scale size and melody complexity, suggesting a fundamental constraint on melodic evolution.

Abstract: The number of possible melodies is unfathomably large, yet despite this virtually unlimited potential for melodic variation, melodies from different societies can be surprisingly similar. The motor constraint hypothesis accounts for certain similarities, such as scalar motion and contour shape, but not for other major common features, such as repetition, song length, and scale size. Here we investigate the role of information constraints in shaping these hallmarks of melodies. We measure determinants of information rate in 62 corpora of Folk melodies spanning several continents, finding multiple trade-offs that all act to constrain the information rate across societies. By contrast, 39 corpora of Art music from Europe (including Turkey) show longer, more complex melodies, and increased complexity over time, suggesting different cultural-evolutionary selection pressures in Art and Folk music, possibly due to the use of written versus oral transmission. Our parameter-free model predicts the empirical scale degree distribution using information constraints on scalar motion, melody length, and, most importantly, information rate. These results provide strong evidence that information constraints during cultural transmission of music limit the number of notes in a scale, and suggest that a tendency for intermediate melodic complexity reflects a fundamental constraint on the cultural evolution of melody.
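
One simple ingredient of an information-rate analysis is the entropy of the scale degrees used in a melody. The sketch below computes a zeroth-order Shannon entropy in bits per note. It illustrates the general idea only: the paper's actual measure likely accounts for more (for example sequential structure and timing), and the function name is hypothetical.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Zeroth-order Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence must not be empty")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A melody that uses more distinct scale degrees with similar frequencies has higher entropy per note, which is one way a larger scale raises the information a listener must process.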

[284] LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech

Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi

Main category: cs.SD

TL;DR: LENS-DF is a new method for training and evaluating audio deepfake detection and localization under realistic conditions, outperforming conventional methods.

Details

Motivation: To address the challenges of audio deepfake detection in complex, real-world scenarios with longer durations, noise, and multiple speakers.

Method: Uses a controllable generation process for realistic audio data and employs self-supervised learning with a simple back-end for detection and localization.

Result: Models trained with LENS-DF outperform conventional methods, showing its effectiveness for robust detection and localization.

Conclusion: LENS-DF is a valuable tool for improving audio deepfake detection, with ablation studies confirming its relevance to real-world challenges.

Abstract: This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audio from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and multiple speakers, in a controllable fashion. The corresponding detection and localization protocol uses models. We conduct experiments based on a self-supervised learning front-end and a simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.

cs.LG

[285] Enhancing Quantization-Aware Training on Edge Devices via Relative Entropy Coreset Selection and Cascaded Layer Correction

Yujia Tong, Jingling Yuan, Chuang Hu

Main category: cs.LG

TL;DR: QuaRC is a QAT framework for edge devices that uses coreset selection and cascaded layer correction to reduce quantization errors, improving accuracy with small datasets.

Details

Motivation: The need for efficient low-bit quantized models on edge devices, coupled with privacy constraints, drives the development of a method that reduces computational costs while maintaining performance.

Method: QuaRC employs Relative Entropy Score for coreset selection and Cascaded Layer Correction to align quantized and full-precision model outputs.

Result: QuaRC improves Top-1 accuracy by 5.72% on ResNet-18 with 2-bit quantization using only 1% of the data.

Conclusion: QuaRC effectively addresses quantization errors in edge device training, outperforming existing methods with minimal data.

Abstract: With the development of mobile and edge computing, the demand for low-bit quantized models on edge devices is increasing to achieve efficient deployment. To enhance the performance, it is often necessary to retrain the quantized models using edge data. However, due to privacy concerns, certain sensitive data can only be processed on edge devices. Therefore, employing Quantization-Aware Training (QAT) on edge devices has become an effective solution. Nevertheless, traditional QAT relies on the complete dataset for training, which incurs a huge computational cost. Coreset selection techniques can mitigate this issue by training on the most representative subsets. However, existing methods struggle to eliminate quantization errors in the model when using small-scale datasets (e.g., only 10% of the data), leading to significant performance degradation. To address these issues, we propose QuaRC, a QAT framework with coresets on edge devices, which consists of two main phases: In the coreset selection phase, QuaRC introduces the “Relative Entropy Score” to identify the subsets that most effectively capture the model’s quantization errors. During the training phase, QuaRC employs the Cascaded Layer Correction strategy to align the intermediate layer outputs of the quantized model with those of the full-precision model, thereby effectively reducing the quantization errors in the intermediate layers. Experimental results demonstrate the effectiveness of our approach. For instance, when quantizing ResNet-18 to 2-bit using a 1% data subset, QuaRC achieves a 5.72% improvement in Top-1 accuracy on the ImageNet-1K dataset compared to state-of-the-art techniques.
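
A relative-entropy-based coreset selection could look roughly like the sketch below. It scores each sample by the KL divergence between the full-precision and quantized models' output distributions and keeps the highest-scoring fraction. The abstract does not define the Relative Entropy Score precisely, so the direction of the divergence, the top-k rule, and the names `kl_divergence` and `select_coreset` are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def select_coreset(fp_probs, quant_probs, fraction):
    """Return indices of the samples with the highest KL(full-precision || quantized)."""
    scores = [kl_divergence(p, q) for p, q in zip(fp_probs, quant_probs)]
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Samples on which the two models disagree most are the ones where quantization error is most visible, which is the intuition for prioritizing them when only a small data budget is available.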

[286] Knowledge Abstraction for Knowledge-based Semantic Communication: A Generative Causality Invariant Approach

Minh-Duong Nguyen, Quoc-Viet Pham, Nguyen H. Tran, Hoang-Khoi Do, Duy T. Ngo, Won-Joo Hwang

Main category: cs.LG

TL;DR: A low-complexity AI model using a GAN with causality-invariant learning improves semantic communication by extracting causal/non-causal representations for robust data reconstruction.

Details

Motivation: To enhance semantic communication by capturing invariant knowledge for reliable data reconstruction across diverse domains.

Method: Proposes a generative adversarial network (GAN) with causality-invariant learning to extract causal and non-causal representations, ensuring domain consistency.

Result: Causality-invariant knowledge ensures consistency across devices, improves classification, and outperforms state-of-the-art methods in PSNR.

Conclusion: The model reliably reconstructs data with minimal overhead, proving robust for semantic communication.

Abstract: In this study, we design a low-complexity and generalized AI model that can capture common knowledge to improve data reconstruction of the channel decoder for semantic communication. Specifically, we propose a generative adversarial network that leverages causality-invariant learning to extract causal and non-causal representations from the data. Causal representations are invariant and encompass crucial information to identify the data’s label. They can encapsulate semantic knowledge and facilitate effective data reconstruction at the receiver. Moreover, the causal mechanism ensures that learned representations remain consistent across different domains, making the system reliable even with users collecting data from diverse domains. As user-collected data evolves over time causing knowledge divergence among users, we design sparse update protocols to improve the invariant properties of the knowledge while minimizing communication overheads. Three key observations were drawn from our empirical evaluations. Firstly, causality-invariant knowledge ensures consistency across different devices despite the diverse training data. Secondly, invariant knowledge has promising performance in classification tasks, which is pivotal for goal-oriented semantic communications. Thirdly, our knowledge-based data reconstruction highlights the robustness of our decoder, which surpasses other state-of-the-art data reconstruction and semantic compression methods in terms of Peak Signal-to-Noise Ratio (PSNR).
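
Since reconstruction quality is reported here in terms of PSNR, a short reference implementation of the standard formula may help. The sketch assumes 8-bit pixel values (peak 255) passed as flat sequences; the function name and the `max_value` default are illustrative choices.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means lower mean squared error relative to the signal's peak value, so a decoder that "surpasses" others in PSNR reconstructs the transmitted data with less distortion.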

[287] Self-similarity Analysis in Deep Neural Networks

Jingyi Ding, Chengwen Qi, Hongfei Wang, Jianshe Wu, Licheng Jiao, Yuwei Guo, Jian Gao

Main category: cs.LG

TL;DR: This paper investigates the self-similarity of feature networks in deep neural networks and its impact on model performance, proposing a method to enhance classification by adjusting self-similarity.

Details

Motivation: Current research lacks quantitative analysis on how self-similarity in hidden space geometry affects weight optimization and neuron dynamics.

Method: A complex network modeling method based on hidden-layer neuron output features is proposed to analyze self-similarity across layers.

Result: Self-similarity varies by architecture, and embedding self-similarity constraints during training improves performance by up to 6 percentage points in certain models.

Conclusion: Adjusting feature network self-similarity can enhance deep neural network performance, particularly in MLP and attention architectures.

Abstract: Current research has found that some deep neural networks exhibit strong hierarchical self-similarity in feature representation or parameter distribution. However, aside from preliminary studies on how the power-law distribution of weights across different training stages affects model performance, there has been no quantitative analysis on how the self-similarity of hidden space geometry influences model weight optimization, nor is there a clear understanding of the dynamic behavior of internal neurons. Therefore, this paper proposes a complex network modeling method based on the output features of hidden-layer neurons to investigate the self-similarity of feature networks constructed at different hidden layers, and analyzes how adjusting the degree of self-similarity in feature networks can enhance the classification performance of deep neural networks. Validated on three types of networks (MLP architectures, convolutional networks, and attention architectures), this study reveals that the degree of self-similarity exhibited by feature networks varies across different model architectures. Furthermore, embedding constraints on the self-similarity of feature networks during the training process can improve the performance of self-similar deep neural networks (MLP architectures and attention architectures) by up to 6 percentage points.

[288] Reinforcement Learning for Accelerated Aerodynamic Shape Optimisation

Florian Sobieczky, Alfredo Lopez, Erika Dudkin, Christopher Lackner, Matthias Hochsteger, Bernhard Scheichl, Helmut Sobieczky

Main category: cs.LG

TL;DR: A reinforcement learning-based adaptive optimization algorithm for aerodynamic shape optimization, focusing on dimensionality reduction and computational efficiency.

Details

Motivation: To minimize computational effort and interpret discovered extrema in achieving desired flow-field results.

Method: Uses a surrogate-based, actor-critic policy evaluation MCMC approach with parameter ‘freezing’ and local optimized parameter changes around CFD simulations.

Result: Demonstrates speed-up in global optimization with accurate reward/cost estimates and large parameter neighborhoods.

Conclusion: The method enables efficient optimization and feature importance interpretation, as shown in a simple fluid-dynamical problem.

Abstract: We introduce a reinforcement learning (RL) based adaptive optimization algorithm for aerodynamic shape optimization focused on dimensionality reduction. The form in which RL is applied here is that of a surrogate-based, actor-critic policy evaluation MCMC approach allowing for temporal ‘freezing’ of some of the parameters to be optimized. The goals are to minimize computational effort, and to use the observed optimization results for interpretation of the discovered extrema in terms of their role in achieving the desired flow-field. By a sequence of local optimized parameter changes around intermediate CFD simulations acting as ground truth, it is possible to speed up the global optimization if (a) the local neighbourhoods of the parameters in which the changed parameters must reside are sufficiently large to compete with the grid-sized steps and its large number of simulations, and (b) the estimates of the rewards and costs on these neighbourhoods necessary for a good step-wise parameter adaption are sufficiently accurate. We give an example of a simple fluid-dynamical problem on which the method allows interpretation in the sense of a feature importance scoring.

[289] Hyperbolic Deep Learning for Foundation Models: A Survey

Neil He, Hiren Madhu, Ngoc Bui, Menglin Yang, Rex Ying

Main category: cs.LG

TL;DR: The paper reviews hyperbolic neural networks as an alternative to Euclidean geometry for foundation models, addressing limitations like representational capacity and adaptability, and outlines future research directions.

Details

Motivation: To explore whether non-Euclidean geometries, like hyperbolic spaces, can better align with real-world data structures and improve foundation models' performance.

Method: The paper reviews hyperbolic neural networks, their mathematical properties, and recent applications in enhancing foundation models (LLMs, VLMs).

Result: Hyperbolic spaces improve reasoning, zero-shot generalization, and semantic alignment in foundation models while maintaining parameter efficiency.

Conclusion: Hyperbolic neural networks offer promising solutions for foundation models, but key challenges remain, requiring further research.

Abstract: Foundation models pre-trained on massive datasets, including large language models (LLMs), vision-language models (VLMs), and large multimodal models, have demonstrated remarkable success in diverse downstream tasks. However, recent studies have shown fundamental limitations of these models: (1) limited representational capacity, (2) lower adaptability, and (3) diminishing scalability. These shortcomings raise a critical question: is Euclidean geometry truly the optimal inductive bias for all foundation models, or could incorporating alternative geometric spaces enable models to better align with the intrinsic structure of real-world data and improve reasoning processes? Hyperbolic spaces, a class of non-Euclidean manifolds characterized by exponential volume growth with respect to distance, offer a mathematically grounded solution. These spaces enable low-distortion embeddings of hierarchical structures (e.g., trees, taxonomies) and power-law distributions with substantially fewer dimensions compared to Euclidean counterparts. Recent advances have leveraged these properties to enhance foundation models, including improving LLMs’ complex reasoning ability, VLMs’ zero-shot generalization, and cross-modal semantic alignment, while maintaining parameter efficiency. This paper provides a comprehensive review of hyperbolic neural networks and their recent development for foundation models. We further outline key challenges and research directions to advance the field.
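
The exponential volume growth mentioned in the abstract can be made concrete with a small numerical sketch. It assumes the Poincaré ball model with curvature -1, which is one standard model of hyperbolic space and is not named in the summary above; the function name `poincare_distance` is hypothetical.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points of the open unit ball (curvature -1).

    Assumption: Poincare ball model, chosen here only for illustration.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

origin = (0.0, 0.0)
near = (0.5, 0.0)   # Euclidean distance 0.5 from the origin
far = (0.99, 0.0)   # Euclidean distance 0.99 from the origin

d_near = poincare_distance(origin, near)
d_far = poincare_distance(origin, far)
print(d_near, d_far)
```

Distances grow without bound as points approach the boundary of the ball, which is what leaves room for the many leaves of a tree to sit far apart from one another in only a few dimensions.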

[290] Remembering the Markov Property in Cooperative MARL

Kale-ab Abebe Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey

Main category: cs.LG

TL;DR: Current MARL methods succeed by learning simple conventions, not by recovering Markov signals. These conventions fail with non-adaptive agents, highlighting benchmark design flaws. New environments should enforce observation-based behaviors and memory reasoning.

Details

Motivation: To challenge the assumption that current MARL methods effectively recover Markov signals and to highlight the brittleness of learned conventions in cooperative tasks.

Method: Analyzed MARL algorithms through a case study, testing their performance with adaptive and non-adaptive agents.

Result: Found that models learn brittle conventions, which fail with non-adaptive partners, but can learn grounded policies when task design requires it.

Conclusion: Advocates for new cooperative environments that test genuine skill through observation-based behaviors and memory reasoning, not just co-adapted agreements.

Abstract: Cooperative multi-agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP), where agents must reason about the environment and other agents’ behaviour. In practice, current model-free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about others using partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co-adapting agents can learn brittle conventions, which then fail when partnered with non-adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of the benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec-POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory-based reasoning about other agents, ensuring success requires genuine skill rather than fragile, co-adapted agreements.

[291] Adaptive Repetition for Mitigating Position Bias in LLM-Based Ranking

Ali Vardasbi, Gustavo Penha, Claudia Hauff, Hugues Bouchard

Main category: cs.LG

TL;DR: The paper addresses position bias and low repetition consistency in LLMs when ranking or evaluating items, proposing a dynamic early-stopping method to reduce computational costs while maintaining accuracy.

Details

Motivation: LLMs exhibit position bias and inconsistency in rankings when items are reordered or repeated, leading to unreliable results and high computational costs with current mitigation strategies.

Method: Introduces a dynamic early-stopping method to adaptively determine the number of repetitions needed per instance, along with a confidence-based adaptation to further reduce calls.

Result: The method reduces LLM calls by an average of 81% (dynamic early stopping) and 87% (confidence-based variant) compared to static repetition; the first preserves accuracy, and the second incurs only a slight accuracy trade-off.

Conclusion: Dynamic early-stopping and confidence-based adaptations effectively mitigate position bias and inconsistency while optimizing computational efficiency.

Abstract: When using LLMs to rank items based on given criteria, or evaluate answers, the order of candidate items can influence the model’s final decision. This sensitivity to item positioning in an LLM’s prompt is known as position bias. Prior research shows that this bias exists even in large models, though its severity varies across models and tasks. In addition to position bias, LLMs also exhibit varying degrees of low repetition consistency, where repeating the LLM call with the same candidate ordering can lead to different rankings. To address both inconsistencies, a common approach is to prompt the model multiple times with different candidate orderings and aggregate the results via majority voting. However, this repetition strategy significantly increases computational costs. Extending prior findings, we observe that both the direction – favoring either the earlier or later candidate in the prompt – and magnitude of position bias across instances vary substantially, even within a single dataset. This observation highlights the need for a per-instance mitigation strategy. To this end, we introduce a dynamic early-stopping method that adaptively determines the number of repetitions required for each instance. Evaluating our approach across three LLMs of varying sizes and on two tasks, namely re-ranking and alignment, we demonstrate that transitioning to a dynamic repetition strategy reduces the number of LLM calls by an average of 81%, while preserving the accuracy. Furthermore, we propose a confidence-based adaptation to our early-stopping method, reducing LLM calls by an average of 87% compared to static repetition, with only a slight accuracy trade-off relative to our original early-stopping method.
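
The idea of per-instance early stopping can be sketched as follows. The summary does not state the paper's stopping criterion, so this sketch assumes a simple one: stop as soon as the remaining calls could no longer change the majority winner. `call_judge` is a hypothetical stand-in for one LLM call (for example, with a freshly permuted candidate order).

```python
from collections import Counter
from typing import Callable, Hashable, Tuple

def vote_with_early_stopping(
    call_judge: Callable[[int], Hashable],
    max_repetitions: int,
) -> Tuple[Hashable, int]:
    """Majority vote that stops once the winner can no longer change.

    Assumed rule (not taken from the paper): stop when the leader's margin
    over the runner-up exceeds the number of calls still available.
    Returns the winning answer and the number of calls actually used.
    """
    votes = Counter()
    for used in range(1, max_repetitions + 1):
        votes[call_judge(used - 1)] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = max_repetitions - used
        if lead - runner_up > remaining:
            return ranked[0][0], used
    return votes.most_common(1)[0][0], max_repetitions

# A consistent instance stops early; an unstable one uses the full budget.
easy = vote_with_early_stopping(lambda i: "A", 9)
hard = vote_with_early_stopping(lambda i: "A" if i % 2 == 0 else "B", 5)
print(easy, hard)
```

Under this particular rule the result always matches static majority voting over the full budget, so the savings come only from instances on which the model is already consistent.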

[292] Moving Out: Physically-grounded Human-AI Collaboration

Xuhui Kang, Sung-Wook Lee, Haolin Liu, Yuyan Wang, Yen-Ling Kuo

Main category: cs.LG

TL;DR: The paper introduces ‘Moving Out,’ a benchmark for human-AI collaboration in physical environments, and proposes BASS, a method to improve AI adaptability to human behaviors and physical constraints.

Details

Motivation: To address the challenges of continuous state-action spaces and physical constraints in human-AI collaboration, enabling effective teamwork in tasks like moving heavy items.

Method: Proposes BASS (Behavior Augmentation, Simulation, and Selection) to enhance agent diversity and action outcome understanding. Evaluated using human-human interaction data in the Moving Out benchmark.

Result: BASS outperforms state-of-the-art models in both AI-AI and human-AI collaboration tasks.

Conclusion: The Moving Out benchmark and BASS method advance physically grounded human-AI collaboration, demonstrating improved adaptability and performance.

Abstract: The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models’ abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. Our experiments show that BASS outperforms state-of-the-art models in AI-AI and human-AI collaboration. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.

[293] Helix 1.0: An Open-Source Framework for Reproducible and Interpretable Machine Learning on Tabular Scientific Data

Eduardo Aguilar-Bejarano, Daniel Lea, Karthikeyan Sivakumar, Jimiama M. Mase, Reza Omidvar, Ruizhe Li, Troy Kettle, James Mitchell-White, Morgan R Alexander, David A Winkler, Grazziela Figueredo

Main category: cs.LG

TL;DR: Helix is a Python-based framework for reproducible and interpretable ML workflows for tabular data, ensuring transparency and accessibility.

Details

Motivation: Addresses the need for transparent, reproducible, and interpretable ML workflows, especially for stakeholders without formal data science training.

Method: Provides modules for data preprocessing, visualization, model training, evaluation, interpretation, and prediction, with a user-friendly interface.

Result: Enables reproducible ML workflows and actionable insights through an integrated environment, supporting FAIR principles.

Conclusion: Helix is a valuable tool for transparent and interpretable ML, accessible via GitHub and PyPI under the MIT license.

Abstract: Helix is an open-source, extensible, Python-based software framework to facilitate reproducible and interpretable machine learning workflows for tabular data. It addresses the growing need for transparent experimental data analytics provenance, ensuring that the entire analytical process – including decisions around data transformation and methodological choices – is documented, accessible, reproducible, and comprehensible to relevant stakeholders. The platform comprises modules for standardised data preprocessing, visualisation, machine learning model training, evaluation, interpretation, results inspection, and model prediction for unseen data. To further empower researchers without formal training in data science to derive meaningful and actionable insights, Helix features a user-friendly interface that enables the design of computational experiments and the inspection of outcomes, including a novel interpretation approach to machine learning decisions using linguistic terms, all within an integrated environment. Released under the MIT licence, Helix is accessible via GitHub and PyPI, supporting community-driven development and promoting adherence to the FAIR principles.

[294] Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains

Jingyi Yu, Tim Pychynski, Marco F. Huber

Main category: cs.LG

TL;DR: CICME is a three-step method for inferring causal mechanisms from heterogeneous data across domains, leveraging causal transfer learning to identify domain-invariant mechanisms and guide individual domain estimations.

Details

Motivation: To understand complex sensor systems through causality by addressing the challenge of inferring causal mechanisms from heterogeneous multi-domain data.

Method: CICME uses causal transfer learning to detect domain-invariant mechanisms, then estimates individual domain mechanisms. Evaluated on linear Gaussian models inspired by manufacturing.

Result: CICME outperforms baseline methods in certain scenarios by combining pooled and individual domain causal discovery.

Conclusion: CICME effectively infers causal mechanisms across domains, demonstrating superior performance in specific cases.

Abstract: To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are further used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models under scenarios inspired by a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME leverages the benefits of applying causal discovery on the pooled data and repeatedly on data from individual domains, and it even outperforms both baseline methods under certain scenarios.

[295] LSDM: LLM-Enhanced Spatio-temporal Diffusion Model for Service-Level Mobile Traffic Prediction

Shiyuan Zhang, Tong Li, Zhu Xiao, Hongyang Du, Kaibin Huang

Main category: cs.LG

TL;DR: Proposes LSDM, a model combining diffusion models and LLMs for accurate mobile traffic prediction, outperforming existing methods.

Details

Motivation: Current methods lack adaptability and accuracy due to uncertain traffic patterns and environmental complexity.

Method: Integrates diffusion models with transformers (LLMs) to capture dynamic traffic and environmental context.

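Result: Incorporating contextual information via the LLM improves the coefficient of determination (R²) by at least 2.83%, and RMSE is reduced by at least 8.29% compared to models of a similar type, such as CSDI.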

Conclusion: LSDM effectively enhances traffic prediction with better generalization and adaptability.

Abstract: Service-level mobile traffic prediction for individual users is essential for network efficiency and quality of service enhancement. However, current prediction methods are limited in their adaptability across different urban environments and produce inaccurate results due to the high uncertainty in personal traffic patterns, the lack of detailed environmental context, and the complex dependencies among different network services. These challenges demand advanced modeling techniques that can capture dynamic traffic distributions and rich environmental features. Inspired by the recent success of diffusion models in distribution modeling and Large Language Models (LLMs) in contextual understanding, we propose an LLM-Enhanced Spatio-temporal Diffusion Model (LSDM). LSDM integrates the generative power of diffusion models with the adaptive learning capabilities of transformers, augmented by the ability to capture multimodal environmental information for modeling service-level patterns and dynamics. Extensive evaluations on real-world service-level datasets demonstrate that the model excels in traffic usage predictions, showing outstanding generalization and adaptability. After incorporating contextual information via LLM, the performance improves by at least 2.83% in terms of the coefficient of determination. Compared to models of a similar type, such as CSDI, the root mean squared error can be reduced by at least 8.29%. The code and dataset will be available at: https://github.com/SoftYuaneR/LSDM.

[296] CoCAI: Copula-based Conformal Anomaly Identification for Multivariate Time-Series

Nicholas A. Pearson, Francesca Zanello, Davide Russo, Luca Bortolussi, Francesca Cairoli

Main category: cs.LG

TL;DR: CoCAI combines generative AI and copula modeling for accurate multivariate time-series forecasting and robust anomaly detection, validated on real-world data.

Details

Motivation: Addressing challenges in multivariate time-series analysis, specifically accurate predictions and anomaly detection.

Method: Uses diffusion-based modeling for dependencies, conformal prediction for calibration, and copula-based modeling for anomaly detection.

Result: Effective forecasting and anomaly identification in real operational data (e.g., water systems).

Conclusion: CoCAI provides statistically valid, actionable results with minimal deployment overhead.

Abstract: We propose a novel framework that harnesses the power of generative artificial intelligence and copula-based modeling to address two critical challenges in multivariate time-series analysis: delivering accurate predictions and enabling robust anomaly detection. Our method, Copula-based Conformal Anomaly Identification for Multivariate Time-Series (CoCAI), leverages a diffusion-based model to capture complex dependencies within the data, enabling high quality forecasting. The model’s outputs are further calibrated using a conformal prediction technique, yielding predictive regions which are statistically valid, i.e., cover the true target values with a desired confidence level. Starting from these calibrated forecasts, robust outlier detection is performed by combining dimensionality reduction techniques with copula-based modeling, providing a statistically grounded anomaly score. CoCAI benefits from an offline calibration phase that allows for minimal overhead during deployment and delivers actionable results rooted in established theoretical foundations. Empirical tests conducted on real operational data derived from water distribution and sewerage systems confirm CoCAI’s effectiveness in accurately forecasting target sequences of data and in identifying anomalous segments within them.
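
The calibration step described above can be illustrated with a minimal split conformal prediction sketch. It assumes a scalar forecast and absolute residuals as the nonconformity score; CoCAI itself calibrates multivariate diffusion-model forecasts, and the summary does not specify its score function. The name `conformal_radius` is hypothetical.

```python
import math

def conformal_radius(calibration_residuals, alpha):
    """Half-width q so that [forecast - q, forecast + q] covers the target
    with probability at least 1 - alpha, assuming exchangeable data.

    Assumption: absolute residuals on a held-out calibration set are used
    as nonconformity scores (standard split conformal prediction).
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this confidence level
    return scores[rank - 1]

# Residuals (observed minus forecast) from an offline calibration phase.
residuals = [0.1, -0.4, 0.3, -0.2, 0.5, 0.05, -0.15]
q = conformal_radius(residuals, alpha=0.25)

forecast = 10.0
interval = (forecast - q, forecast + q)
print(q, interval)
```

Because the radius is computed once from held-out residuals, applying it at deployment time costs a single addition and subtraction per forecast, which is consistent with the minimal-overhead claim in the abstract.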

[297] GenSelect: A Generative Approach to Best-of-N

Shubham Toshniwal, Ivan Sorokin, Aleksander Ficek, Ivan Moshkov, Igor Gitman

Main category: cs.LG

TL;DR: GenSelect leverages LLMs’ comparative reasoning to efficiently select the best solution among multiple candidates, outperforming pointwise and pairwise methods in math reasoning tasks.

Details

Motivation: Current methods (pointwise or pairwise) underutilize LLMs' comparative abilities or scale poorly with larger sampling budgets.

Method: GenSelect uses long reasoning by LLMs to select the best solution from N candidates, scaling efficiently with parallel sampling.

Result: GenSelect outperforms existing scoring methods in math reasoning tasks, as demonstrated with models like QwQ and DeepSeek-R1-0528.

Conclusion: GenSelect effectively combines LLMs’ comparative strengths with scalable parallel sampling, offering a superior approach for reasoning tasks.

Abstract: Generative reward models with parallel sampling have enabled effective test-time scaling for reasoning tasks. Current approaches employ pointwise scoring of individual solutions or pairwise comparisons. However, pointwise methods underutilize LLMs’ comparative abilities, while pairwise methods scale inefficiently with larger sampling budgets. We introduce GenSelect, where the LLM uses long reasoning to select the best solution among N candidates. This leverages LLMs’ comparative strengths while scaling efficiently across parallel sampling budgets. For math reasoning, we demonstrate that reasoning models, such as QwQ and DeepSeek-R1-0528, excel at GenSelect, outperforming existing scoring approaches with simple prompting.

[298] Wasserstein GAN-Based Precipitation Downscaling with Optimal Transport for Enhancing Perceptual Realism

Kenta Shiraishi, Yuka Muto, Atsushi Okazaki, Shunji Kotsuki

Main category: cs.LG

TL;DR: The paper proposes using WGAN for high-resolution precipitation downscaling, achieving visually realistic results despite slightly lower conventional metric performance. The critic scores correlate with human perception and help identify artifacts.

Details

Motivation: High-resolution precipitation prediction is crucial for mitigating damage from localized heavy rainfall, but traditional methods face challenges.

Method: The study employs a Wasserstein Generative Adversarial Network (WGAN) with an optimal transport cost for precipitation downscaling, comparing it to a conventional neural network trained with mean squared error.

Result: WGAN produces visually realistic precipitation fields with fine-scale structures, though slightly underperforming on standard metrics. The critic scores align with human realism perception and detect data artifacts.

Conclusion: WGAN enhances perceptual realism in precipitation downscaling and provides a novel approach for dataset evaluation and quality control.

Abstract: High-resolution (HR) precipitation prediction is essential for reducing damage from stationary and localized heavy rainfall; however, HR precipitation forecasting using process-driven numerical weather prediction models remains challenging. This study proposes using a Wasserstein Generative Adversarial Network (WGAN) to perform precipitation downscaling with an optimal transport cost. In contrast to a conventional neural network trained with mean squared error, the WGAN generated visually realistic precipitation fields with fine-scale structures even though the WGAN exhibited slightly lower performance on conventional evaluation metrics. The learned critic of WGAN correlated well with human perceptual realism. Case-based analysis revealed that large discrepancies in critic scores can help identify both unrealistic WGAN outputs and potential artifacts in the reference data. These findings suggest that the WGAN framework not only improves perceptual realism in precipitation downscaling but also offers a new perspective for evaluating and quality-controlling precipitation datasets.
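
For intuition about the optimal transport cost that a WGAN critic approximates, the following sketch computes the Wasserstein-1 distance between two equally sized one-dimensional samples, where the optimal coupling simply matches sorted values. This is a toy illustration under that assumption, not the paper's training objective; `wasserstein1_1d` is a hypothetical helper.

```python
import numpy as np

def wasserstein1_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Assumption: both samples have the same size, so the optimal transport
    plan pairs the i-th smallest value of one with the i-th of the other.
    """
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equally sized samples")
    return float(np.mean(np.abs(a - b)))

# Rain intensities (arbitrary units) drawn from two hypothetical fields.
reference = [0.0, 1.0, 2.0]
shifted = [3.0, 1.0, 2.0]

w_same = wasserstein1_1d(reference, reference)
w_shift = wasserstein1_1d(reference, shifted)
print(w_same, w_shift)
```

The distance compares distributions of values rather than values at matching positions, which is the property that distinguishes a transport-based cost from a pixel-wise mean squared error.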

[299] Explainable Graph Neural Networks via Structural Externalities

Lijun Wu, Dong Hao, Zhiyi Fan

Main category: cs.LG

TL;DR: GraphEXT is a novel explainability framework for GNNs that uses cooperative game theory and social externalities to quantify node importance, outperforming existing methods.

Details

Motivation: The 'black-box' nature of GNNs and the lack of effective methods to capture node interactions motivate the need for improved explainability.

Method: GraphEXT partitions nodes into coalitions, decomposing the graph into subgraphs, and uses Shapley value under externalities to measure node importance.

Result: GraphEXT outperforms baseline methods in fidelity across diverse GNN architectures, enhancing model explainability.

Conclusion: GraphEXT provides a more effective way to explain GNN predictions by focusing on node interactions and structural changes.

Abstract: Graph Neural Networks (GNNs) have achieved outstanding performance across a wide range of graph-related tasks. However, their “black-box” nature poses significant challenges to their explainability, and existing methods often fail to effectively capture the intricate interaction patterns among nodes within the network. In this work, we propose a novel explainability framework, GraphEXT, which leverages cooperative game theory and the concept of social externalities. GraphEXT partitions graph nodes into coalitions, decomposing the original graph into independent subgraphs. By integrating graph structure as an externality and incorporating the Shapley value under externalities, GraphEXT quantifies node importance through their marginal contributions to GNN predictions as the nodes transition between coalitions. Unlike traditional Shapley value-based methods that primarily focus on node attributes, our GraphEXT places greater emphasis on the interactions among nodes and the impact of structural changes on GNN predictions. Experimental studies on both synthetic and real-world datasets show that GraphEXT outperforms existing baseline methods in terms of fidelity across diverse GNN architectures, significantly enhancing the explainability of GNN models.
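
As background for the marginal-contribution idea, here is a sketch of the classical Shapley value on a three-node toy graph. It deliberately omits externalities, so it is not GraphEXT's measure; the value function (number of edges inside a coalition) is a hypothetical stand-in for a GNN prediction.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact classical Shapley values, averaging each player's marginal
    contribution over all join orders (feasible only for tiny games)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Toy path graph a - b - c; a coalition is worth the edges it fully contains.
EDGES = [("a", "b"), ("b", "c")]

def edges_inside(coalition):
    return sum(1 for u, v in EDGES if u in coalition and v in coalition)

scores = shapley_values(["a", "b", "c"], edges_inside)
print(scores)
```

The central node receives the largest score because both edges depend on it. A measure with externalities would additionally let a coalition's worth depend on how the remaining nodes are partitioned.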

[300] Look the Other Way: Designing ‘Positive’ Molecules with Negative Data via Task Arithmetic

Rıza Özçelik, Sarah de Ruiter, Francesca Grisoni

Main category: cs.LG

TL;DR: Molecular task arithmetic trains on negative examples to learn property directions, enabling zero-shot generation of diverse and successful positive molecules without any positively labeled data.

Details

Motivation: Address the scarcity of positive molecules in generative design by leveraging abundant negative examples.

Method: Train models on negative examples to learn property directions, then reverse these directions to generate positive molecules.

Result: In 20 zero-shot design experiments, generates more diverse and successful designs than models trained on positive molecules; in dual-objective and few-shot tasks, increases design diversity while maintaining desirable properties.

Conclusion: Molecular task arithmetic is a simple, efficient, and effective transfer learning strategy for molecule design.

Abstract: The scarcity of molecules with desirable properties (i.e., ‘positive’ molecules) is an inherent bottleneck for generative molecule design. To sidestep this obstacle, here we propose molecular task arithmetic: training a model on diverse and abundant negative examples to learn ‘property directions’ – without accessing any positively labeled data – and moving models in the opposite property directions to generate positive molecules. When analyzed on 20 zero-shot design experiments, molecular task arithmetic generated more diverse and successful designs than models trained on positive molecules. Moreover, we employed molecular task arithmetic in dual-objective and few-shot design tasks. We find that molecular task arithmetic can consistently increase the diversity of designs while maintaining desirable design properties. With its simplicity, data efficiency, and performance, molecular task arithmetic bears the potential to become the de facto transfer learning strategy for de novo molecule design.
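
The step of moving a model in the opposite property direction can be sketched with plain weight arithmetic. This assumes the usual task-arithmetic formulation, in which the direction is the difference between fine-tuned and pretrained weights; the scaling factor, parameter names, and toy values are hypothetical and not taken from the paper.

```python
import numpy as np

def negate_task(pretrained, finetuned_on_negatives, scale=1.0):
    """Shift each weight away from the direction learned on negative examples.

    Assumed formulation: direction = finetuned - pretrained, and the new
    weights are pretrained - scale * direction.
    """
    return {
        name: pretrained[name]
        - scale * (finetuned_on_negatives[name] - pretrained[name])
        for name in pretrained
    }

# Toy "models" with a single parameter tensor each.
pretrained = {"layer.weight": np.array([1.0, 2.0])}
negative_model = {"layer.weight": np.array([1.5, 1.0])}

positive_model = negate_task(pretrained, negative_model, scale=1.0)
print(positive_model["layer.weight"])
```

No positive examples appear anywhere in the computation, which is what makes the zero-shot setting possible. In practice the scale would have to be tuned.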

[301] Fourier Neural Operators for Non-Markovian Processes: Approximation Theorems and Experiments

Wonjae Lee, Taeyoung Kim, Hyungbin Park

Main category: cs.LG

TL;DR: The paper introduces MFNO, a neural operator for stochastic systems, extending FNO with mirror padding for non-periodic inputs. It theoretically and empirically outperforms standard architectures.

Details

Motivation: To address the challenge of learning dynamics in stochastic systems, especially with non-periodic inputs, where standard architectures like LSTMs and TCNs struggle.

Method: Extends Fourier neural operator (FNO) with mirror padding to handle non-periodic inputs. Uses Wong–Zakai theorems and approximation techniques for theoretical guarantees.

Result: MFNO approximates solutions of stochastic systems accurately, exhibits strong resolution generalization, and outperforms baselines like LSTMs, TCNs, and DeepONet with faster sample path generation.

Conclusion: MFNO is a robust and efficient solution for learning stochastic dynamics, offering theoretical guarantees and practical advantages over existing methods.

Abstract: This paper introduces an operator-based neural network, the mirror-padded Fourier neural operator (MFNO), designed to learn the dynamics of stochastic systems. MFNO extends the standard Fourier neural operator (FNO) by incorporating mirror padding, enabling it to handle non-periodic inputs. We rigorously prove that MFNOs can approximate solutions of path-dependent stochastic differential equations and Lipschitz transformations of fractional Brownian motions to an arbitrary degree of accuracy. Our theoretical analysis builds on Wong–Zakai type theorems and various approximation techniques. Empirically, the MFNO exhibits strong resolution generalization – a property rarely seen in standard architectures such as LSTMs, TCNs, and DeepONet. Furthermore, our model achieves performance that is comparable or superior to these baselines while offering significantly faster sample path generation than classical numerical schemes.
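
Why mirror padding helps with non-periodic inputs can be seen in a small NumPy sketch. It assumes the padding appends the reversed path, which is one plausible reading of "mirror padding" and may differ from the paper's exact scheme, and it mimics the low-mode truncation of an FNO spectral layer with a plain low-pass filter.

```python
import numpy as np

def mirror_pad(path):
    """Append the reversed path so the periodic extension has no jump
    (assumed padding scheme, for illustration only)."""
    return np.concatenate([path, path[::-1]])

def lowpass(signal, modes):
    """Keep only the first `modes` Fourier modes, as a stand-in for the
    mode truncation used in Fourier layers."""
    spectrum = np.fft.rfft(signal)
    spectrum[modes:] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

t = np.linspace(0.0, 1.0, 64)
path = t ** 2  # non-periodic: starts at 0 and ends at 1

padded = mirror_pad(path)
jump_plain = abs(path[-1] - path[0])       # wrap-around jump without padding
jump_padded = abs(padded[-1] - padded[0])  # wrap-around jump with padding

# Reconstruction error on the original segment after mode truncation.
err_plain = np.max(np.abs(lowpass(path, 8) - path))
err_mirror = np.max(np.abs(lowpass(padded, 8)[: len(path)] - path))
print(jump_plain, jump_padded, err_plain, err_mirror)
```

Without padding, the FFT treats the path as periodic and the jump between its last and first value produces ringing near the boundaries. The mirrored signal is continuous across the wrap-around, so a few low modes represent it much more accurately.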

[302] Lower Bounds for Public-Private Learning under Distribution Shift

Amrith Setlur, Pratiksha Thaker, Jonathan Ullman

Main category: cs.LG

TL;DR: The paper extends lower bounds for public-private learning to cases with significant distribution shift, showing public data’s limited benefit under large shifts.

Details

Motivation: To understand the complementary value of combining public and private data in differentially private machine learning, especially under distribution shift.

Method: Extends known lower bounds to scenarios with distribution shift, analyzing Gaussian mean estimation and linear regression.

Result: For small shifts, abundant public or private data is needed; for large shifts, public data provides no benefit.

Conclusion: Public data’s usefulness in private learning depends on the magnitude of distribution shift.

Abstract: The most effective differentially private machine learning algorithms in practice rely on an additional source of purportedly public data. This paradigm is most interesting when the two sources combine to be more than the sum of their parts. However, there are settings such as mean estimation where we have strong lower bounds, showing that when the two data sources have the same distribution, there is no complementary value to combining the two data sources. In this work we extend the known lower bounds for public-private learning to settings where the two data sources exhibit significant distribution shift. Our results apply to both Gaussian mean estimation where the two distributions have different means, and to Gaussian linear regression where the two distributions exhibit parameter shift. We find that when the shift is small (relative to the desired accuracy), either public or private data must be sufficiently abundant to estimate the private parameter. Conversely, when the shift is large, public data provides no benefit.

[303] Federated Learning for Large-Scale Cloud Robotic Manipulation: Opportunities and Challenges

Obaidullah Zaland, Chanh Nguyen, Florian T. Pokorny, Monowar Bhuyan

Main category: cs.LG

TL;DR: The paper discusses Federated Learning (FL) in cloud robotic manipulation, highlighting its advantages over classical ML and addressing challenges and opportunities in scalable, efficient FL adoption.

Details

Motivation: To explore how FL can enhance cloud robotic manipulation by leveraging distributed training without sharing private data, overcoming limitations of traditional ML and individual robot capabilities.

Method: Presents fundamental FL concepts and their connection to cloud robotic manipulation, and discusses designing and verifying FL models in centralized or decentralized settings.

Result: FL offers significant benefits for cloud robotics, such as computational efficiency and privacy, but also introduces challenges in scalability and reliability.

Conclusion: FL holds promise for advancing cloud robotic manipulation, though further research is needed to address its challenges in large-scale deployment.

Abstract: Federated Learning (FL) is an emerging distributed machine learning paradigm, where the collaborative training of a model involves dynamic participation of devices to achieve broad objectives. In contrast, classical machine learning (ML) typically requires data to be located on-premises for training, whereas FL leverages numerous user devices to train a shared global model without the need to share private data. Current robotic manipulation tasks are constrained by the individual capabilities and speed of robots due to limited low-latency computing resources. Consequently, the concept of cloud robotics has emerged, allowing robotic applications to harness the flexibility and reliability of computing resources, effectively alleviating their computational demands across the cloud-edge continuum. Undoubtedly, within this distributed computing context, as exemplified in cloud robotic manipulation scenarios, FL offers manifold advantages while also presenting several challenges and opportunities. In this paper, we present fundamental concepts of FL and their connection to cloud robotic manipulation. Additionally, we envision the opportunities and challenges associated with realizing efficient and reliable cloud robotic manipulation at scale through FL, where researchers can design and verify FL models in either centralized or decentralized settings.
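
How a shared global model can be trained without sharing private data is sketched below for the centralized setting. The sketch assumes FedAvg-style aggregation, a common choice that the abstract does not name; `federated_average` and the toy robot weights are hypothetical.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate locally trained parameters, weighting each client by the
    number of local samples it trained on (assumed FedAvg-style rule)."""
    total = float(sum(client_sizes))
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Each robot trains on its own data and uploads only model parameters.
robot_a = np.array([1.0, 0.0])  # trained on 100 local grasp attempts
robot_b = np.array([3.0, 2.0])  # trained on 300 local grasp attempts

global_model = federated_average([robot_a, robot_b], [100, 300])
print(global_model)
```

Only parameter vectors cross the network, never the raw sensor data. In a decentralized setting the same averaging could instead be carried out among neighboring robots, without a central server.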

[304] Deep learning-aided inverse design of porous metamaterials

Phu Thien Nguyen, Yousef Heider, Dennis M. Kochmann, Fadi Aldakheel

Main category: cs.LG

TL;DR: The study introduces a deep learning-based generative framework (pVAE) for inverse design of porous metamaterials, combining VAE with a regressor to tailor hydraulic properties like porosity and permeability. It uses LBM for data generation and CNN for property prediction, reducing computational costs. The framework is validated on synthetic and real datasets, enabling efficient structure-property exploration and inverse design.

Details

Motivation: To address the challenge of designing porous metamaterials with specific hydraulic properties efficiently, avoiding costly simulations.

Method: Develops a property-variational autoencoder (pVAE) combining VAE and regressor, trained on synthetic and real datasets (CT-scans). Uses LBM for permeability data and CNN for property prediction.

Result: The pVAE framework successfully generates metamaterials with tailored properties, maps microstructural features to a latent space, and enables efficient inverse design.

Conclusion: The approach provides a computationally efficient method for inverse design of porous metamaterials, with potential for broader applications in material science.

Abstract: The ultimate aim of the study is to explore the inverse design of porous metamaterials using a deep learning-based generative framework. Specifically, we develop a property-variational autoencoder (pVAE), a variational autoencoder (VAE) augmented with a regressor, to generate structured metamaterials with tailored hydraulic properties, such as porosity and permeability. While this work uses the lattice Boltzmann method (LBM) to generate intrinsic permeability tensor data for limited porous microstructures, a convolutional neural network (CNN) is trained using a bottom-up approach to predict effective hydraulic properties. This significantly reduces the computational cost compared to direct LBM simulations. The pVAE framework is trained on two datasets: a synthetic dataset of artificial porous microstructures and CT-scan images of volume elements from real open-cell foams. The encoder-decoder architecture of the VAE captures key microstructural features, mapping them into a compact and interpretable latent space for efficient structure-property exploration. The study provides a detailed analysis and interpretation of the latent space, demonstrating its role in structure-property mapping, interpolation, and inverse design. This approach facilitates the generation of new metamaterials with desired properties. The datasets and codes used in this study will be made open-access to support further research.

[305] SETOL: A Semi-Empirical Theory of (Deep) Learning

Charles H Martin, Christopher Hinrichs

Main category: cs.LG

TL;DR: SETOL explains SOTA NN performance using statistical mechanics and random matrix theory, introducing a new metric ERG and validating it on MLPs and SOTA NNs.

Details

Motivation: To formally explain the origin of heavy-tailed power-law metrics (alpha and alpha-hat) in HTSR and predict NN performance without training/testing data.

Method: Uses statistical mechanics, random matrix theory, and quantum chemistry to derive SETOL, introducing ERG as a new metric. Validated on a 3-layer MLP and SOTA NNs.

Result: SETOL’s assumptions align well with empirical data, and ERG correlates strongly with HTSR’s alpha on both MLPs and SOTA NNs.

Conclusion: SETOL provides a theoretical foundation for NN performance prediction and introduces ERG as a valuable metric, validated across models.

Abstract: We present a SemiEmpirical Theory of Learning (SETOL) that explains the remarkable performance of State-Of-The-Art (SOTA) Neural Networks (NNs). We provide a formal explanation of the origin of the fundamental quantities in the phenomenological theory of Heavy-Tailed Self-Regularization (HTSR): the heavy-tailed power-law layer quality metrics, alpha and alpha-hat. In prior work, these metrics have been shown to predict trends in the test accuracies of pretrained SOTA NN models, importantly, without needing access to either testing or training data. Our SETOL uses techniques from statistical mechanics as well as advanced methods from random matrix theory and quantum chemistry. The derivation suggests new mathematical preconditions for ideal learning, including a new metric, ERG, which is equivalent to applying a single step of the Wilson Exact Renormalization Group. We test the assumptions and predictions of SETOL on a simple 3-layer multilayer perceptron (MLP), demonstrating excellent agreement with the key theoretical assumptions. For SOTA NN models, we show how to estimate the individual layer qualities of a trained NN by simply computing the empirical spectral density (ESD) of the layer weight matrices and plugging this ESD into our SETOL formulas. Notably, we examine the performance of the HTSR alpha and the SETOL ERG layer quality metrics, and find that they align remarkably well, both on our MLP and on SOTA NNs.
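
The data-free workflow described above (compute the ESD of a layer weight matrix, then derive a layer quality metric from it) can be sketched as follows. This is an illustrative approximation only: the SETOL formulas and the ERG metric are not given in the summary, so the code stops at a generic power-law tail exponent from the standard continuous maximum-likelihood estimator. The function names and `tail_fraction` are hypothetical.

```python
import numpy as np

def layer_esd(W):
    """Eigenvalues of X = W^T W / N for an N x M weight matrix, sorted ascending."""
    n_rows = W.shape[0]
    sv = np.linalg.svd(W, compute_uv=False)   # singular values of W
    return np.sort(sv ** 2 / n_rows)

def powerlaw_alpha(eigs, tail_fraction=0.5):
    """Continuous power-law MLE exponent fitted to the upper tail of the ESD.

    tail_fraction is an assumed, fixed choice of where the tail starts;
    practical tools select the tail start more carefully.
    """
    eigs = np.sort(np.asarray(eigs, dtype=float))
    k = max(2, int(len(eigs) * tail_fraction))
    tail = eigs[-k:]
    xmin = tail[0]
    return 1.0 + k / np.sum(np.log(tail / xmin))

rng = np.random.default_rng(0)
W = rng.standard_normal((300, 100))   # stand-in for a trained layer
eigs = layer_esd(W)
alpha = powerlaw_alpha(eigs)
print(len(eigs), alpha)
```

Only the weights are needed, which is what allows layer quality to be estimated without training or test data.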

[306] From Seed to Harvest: Augmenting Human Creativity with AI for Red-teaming Text-to-Image Models

Jessica Quaye, Charvi Rastogi, Alicia Parrish, Oana Inel, Minsuk Kahng, Lora Aroyo, Vijay Janapa Reddi

Main category: cs.LG

TL;DR: Seed2Harvest combines human creativity and machine scalability to generate diverse adversarial prompts for robust T2I model evaluation.

Details

Motivation: Current adversarial prompt datasets are either small and imbalanced (human-crafted) or lack realism (synthetic). A hybrid approach is needed for comprehensive testing.

Method: Proposes Seed2Harvest, a hybrid method expanding human-crafted adversarial prompts with machine assistance to enhance diversity and realism.

Result: Achieves higher diversity (535 unique locations and 7.48 Shannon entropy, versus 58 and 5.28 in the original dataset) with comparable average attack success rates (0.31 NudeNet, 0.36 SD NSFW, 0.12 Q16).

Conclusion: Human-machine collaboration is key for scalable, realistic red-teaming in T2I model safety evaluation.

Abstract: Text-to-image (T2I) models have become prevalent across numerous applications, making their robust evaluation against adversarial attacks a critical priority. Continuous access to new and challenging adversarial prompts across diverse domains is essential for stress-testing these models for resilience against novel attacks from multiple vectors. Current techniques for generating such prompts are either entirely authored by humans or synthetically generated. On the one hand, datasets of human-crafted adversarial prompts are often too small in size and imbalanced in their cultural and contextual representation. On the other hand, datasets of synthetically-generated prompts achieve scale, but typically lack the realistic nuances and creative adversarial strategies found in human-crafted prompts. To combine the strengths of both human and machine approaches, we propose Seed2Harvest, a hybrid red-teaming method for guided expansion of culturally diverse, human-crafted adversarial prompt seeds. The resulting prompts preserve the characteristics and attack patterns of human prompts while maintaining comparable average attack success rates (0.31 NudeNet, 0.36 SD NSFW, 0.12 Q16). Our expanded dataset achieves substantially higher diversity with 535 unique geographic locations and a Shannon entropy of 7.48, compared to 58 locations and 5.28 entropy in the original dataset. Our work demonstrates the importance of human-machine collaboration in leveraging human creativity and machine computational capacity to achieve comprehensive, scalable red-teaming for continuous T2I model safety evaluation.
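
The diversity figures above are Shannon entropies. A minimal sketch of the computation follows, assuming base-2 logarithms and that entropy is taken over the location mentioned in each prompt (neither is stated in the summary); `shannon_entropy` is a hypothetical helper.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally frequent locations give exactly 2 bits.
uniform_four = shannon_entropy(["Accra", "Lima", "Oslo", "Pune"])

# A skewed distribution over the same locations scores lower.
skewed_four = shannon_entropy(["Accra"] * 7 + ["Lima", "Oslo", "Pune"])

# Upper bound for 535 distinct locations, reached only if all are equally likely.
upper_bound_535 = math.log2(535)
print(uniform_four, skewed_four, upper_bound_535)
```

Under these assumptions, an entropy of 7.48 against an upper bound of about 9.06 bits would mean the 535 locations are spread widely but not uniformly.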

[307] UrbanPulse: A Cross-City Deep Learning Framework for Ultra-Fine-Grained Population Transfer Prediction

Hongrong Yang, Markus Schlaepfer

Main category: cs.LG

TL;DR: UrbanPulse is a scalable deep learning framework for ultra-fine-grained city-wide OD flow predictions, overcoming limitations of traditional and deep learning models by treating each POI as a node and using a three-stage transfer learning strategy.

Details

Motivation: Accurate population flow prediction is crucial for urban planning, transportation, and public health, but existing methods have limitations like static spatial assumptions, poor cross-city generalization, high computational costs, and low resolution.

Method: UrbanPulse combines a temporal graph convolutional encoder with a transformer-based decoder to model spatiotemporal dependencies. It uses a three-stage transfer learning strategy: pretraining, cold-start adaptation, and reinforcement learning fine-tuning.

Result: Evaluated on 103 million GPS records from three California metropolitan areas, UrbanPulse achieves state-of-the-art accuracy and scalability.

Conclusion: UrbanPulse advances high-resolution, AI-powered urban forecasting, making it practical for diverse cities through efficient transfer learning.

Abstract: Accurate population flow prediction is essential for urban planning, transportation management, and public health. Yet existing methods face key limitations: traditional models rely on static spatial assumptions, deep learning models struggle with cross-city generalization, and Large Language Models (LLMs) incur high computational costs while failing to capture spatial structure. Moreover, many approaches sacrifice resolution by clustering Points of Interest (POIs) or restricting coverage to subregions, limiting their utility for city-wide analytics. We introduce UrbanPulse, a scalable deep learning framework that delivers ultra-fine-grained, city-wide OD flow predictions by treating each POI as an individual node. It combines a temporal graph convolutional encoder with a transformer-based decoder to model multi-scale spatiotemporal dependencies. To ensure robust generalization across urban contexts, UrbanPulse employs a three-stage transfer learning strategy: pretraining on large-scale urban graphs, cold-start adaptation, and reinforcement learning fine-tuning. Evaluated on over 103 million cleaned GPS records from three metropolitan areas in California, UrbanPulse achieves state-of-the-art accuracy and scalability. Through efficient transfer learning, UrbanPulse takes a key step toward making high-resolution, AI-powered urban forecasting deployable in practice across diverse cities.

[308] Multimodal Fine-grained Reasoning for Post Quality Evaluation

Xiaoxu Guo, Siyan Liang, Yachao Cui, Juxiang Zhou, Lei Wang, Han Cao

Main category: cs.LG

TL;DR: The paper introduces MFTRR, a framework for post-quality assessment using multimodal data and relational reasoning, outperforming existing methods.

Details

Motivation: Existing methods fail to leverage multimodal cues, introduce noise, and lack semantic relationship capture, limiting post-quality assessment.

Method: MFTRR reframes the task as ranking, using two modules: Local-Global Semantic Correlation Reasoning and Multi-Level Evidential Relational Reasoning.

Result: MFTRR achieves up to 9.52% NDCG@3 improvement over unimodal methods on the Art History dataset.

Conclusion: MFTRR effectively addresses limitations in post-quality assessment by mimicking human cognitive processes and leveraging multimodal data.

Abstract: Accurately assessing post quality requires complex relational reasoning to capture nuanced topic-post relationships. However, existing studies face three major limitations: (1) treating the task as unimodal categorization, which fails to leverage multimodal cues and fine-grained quality distinctions; (2) introducing noise during deep multimodal fusion, leading to misleading signals; and (3) lacking the ability to capture complex semantic relationships like relevance and comprehensiveness. To address these issues, we propose the Multimodal Fine-grained Topic-post Relational Reasoning (MFTRR) framework, which mimics human cognitive processes. MFTRR reframes post-quality assessment as a ranking task and incorporates multimodal data to better capture quality variations. It consists of two key modules: (1) the Local-Global Semantic Correlation Reasoning Module, which models fine-grained semantic interactions between posts and topics at both local and global levels, enhanced by a maximum information fusion mechanism to suppress noise; and (2) the Multi-Level Evidential Relational Reasoning Module, which explores macro- and micro-level relational cues to strengthen evidence-based reasoning. We evaluate MFTRR on three newly constructed multimodal topic-post datasets and the public Lazada-Home dataset. Experimental results demonstrate that MFTRR significantly outperforms state-of-the-art baselines, achieving up to 9.52% NDCG@3 improvement over the best unimodal method on the Art History dataset.
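
Because the task is framed as ranking and reported with NDCG@3, a short sketch of that metric may help. It assumes the linear-gain form of DCG (an exponential gain, 2^rel - 1, is also common, and the summary does not say which is used); the relevance grades below are made up for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical quality grades of posts under one topic, in the model's ranked order.
perfect_ranking = ndcg_at_k([3, 2, 1, 0], k=3)
reversed_ranking = ndcg_at_k([0, 1, 2, 3], k=3)
print(perfect_ranking, reversed_ranking)
```

The score is 1.0 only when the three highest-quality posts appear in the top three positions in the right order, which is why it rewards fine-grained quality distinctions rather than coarse categories.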

[309] VIBE: Video-Input Brain Encoder for fMRI Response Modeling

Daniel Carlstrom Schad, Shrey Dixit, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski

Main category: cs.LG

TL;DR: VIBE is a two-stage Transformer model that combines video, audio, and text features to predict fMRI activity, achieving high correlations on in-distribution and out-of-distribution datasets.

Details

Motivation: To improve fMRI activity prediction by leveraging multi-modal features (video, audio, text) using advanced Transformer architectures.

Method: Uses a two-stage Transformer: a modality-fusion transformer to merge features from Qwen2.5, BEATs, Whisper, SlowFast, and V-JEPA, and a prediction transformer with rotary embeddings for temporal decoding. Trained on 65 hours of movie data from CNeuroMod and ensembled across 20 seeds.

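Result: Achieved mean parcel-wise Pearson correlations of 32.25 on in-distribution data (Friends S07) and 21.25 on six out-of-distribution films. These values appear to be reported on a ×100 scale, since an earlier version of the architecture scored 0.3198 and 0.2096. That earlier version won Phase-1 and placed second overall in the Algonauts 2025 Challenge.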

Conclusion: VIBE demonstrates strong performance in fMRI prediction by effectively fusing multi-modal features, validated by high correlations and competition success.

Abstract: We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 32.25 on in-distribution Friends S07 and 21.25 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
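
The evaluation metric, mean parcel-wise Pearson correlation, can be sketched as below. The array layout (timepoints by parcels) and the function name are assumptions made for illustration; the summary does not describe the evaluation code.

```python
import numpy as np

def mean_parcelwise_pearson(pred, true):
    """Pearson r between predicted and measured time series, per parcel, then averaged.

    pred, true: arrays of shape (timepoints, parcels).
    """
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    r = (p * t).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return float(r.mean())

rng = np.random.default_rng(0)
measured = rng.standard_normal((200, 5))                       # hypothetical fMRI signal
noisy_pred = measured + 2.0 * rng.standard_normal((200, 5))    # imperfect prediction
score = mean_parcelwise_pearson(noisy_pred, measured)
print(score)
```

The result always lies in [-1, 1]; multiplying by 100 would put it on the same scale as the 32.25 and 21.25 figures, though whether the authors scaled their scores that way is an assumption.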

[310] Improving the Computational Efficiency and Explainability of GeoAggregator

Rui Deng, Ziqi Li, Mingshu Wang

Main category: cs.LG

TL;DR: The paper improves GeoAggregator (GA) by optimizing its pipeline for efficiency and adding explainability features, showing better accuracy and speed in experiments.

Details

Motivation: Enhancing the computational efficiency and explainability of the GA model for geospatial tabular data analysis.

Method: 1) Optimized dataloading and forward pass pipeline; 2) Added model ensembling and GeoShapley-based explanation.

Result: Improved prediction accuracy, faster inference, and effective spatial effect capture in synthetic datasets.

Conclusion: The enhanced GA model is more efficient and explainable, with publicly available code for community use.

Abstract: Accurately modeling and explaining geospatial tabular data (GTD) is critical for understanding geospatial phenomena and their underlying processes. Recent work has proposed a novel transformer-based deep learning model named GeoAggregator (GA) for this purpose, and has demonstrated that it outperforms other statistical and machine learning approaches. In this short paper, we further improve GA by 1) developing an optimized pipeline that accelerates the dataloading process and streamlines the forward pass of GA to achieve better computational efficiency; and 2) incorporating a model ensembling strategy and a post-hoc model explanation function based on the GeoShapley framework to enhance model explainability. We validate the functionality and efficiency of the proposed strategies by applying the improved GA model to synthetic datasets. Experimental results show that our implementation improves the prediction accuracy and inference speed of GA compared to the original implementation. Moreover, explanation experiments indicate that GA can effectively capture the inherent spatial effects in the designed synthetic dataset. The complete pipeline has been made publicly available for community use (https://github.com/ruid7181/GA-sklearn).

[311] SIFOTL: A Principled, Statistically-Informed Fidelity-Optimization Method for Tabular Learning

Shubham Mohole, Sainyam Galhotra

Main category: cs.LG

TL;DR: SIFOTL is a privacy-compliant method for identifying data shifts in tabular datasets, outperforming baselines in accuracy and robustness to noise.

Details

Motivation: Challenges in analyzing healthcare data due to privacy restrictions and noise necessitate a robust, interpretable solution.

Method: SIFOTL uses summary statistics, twin XGBoost models with LLM assistance, and a Pareto-weighted decision tree to identify data shifts.

Result: Achieves high F1 scores (0.75-0.96) across datasets, outperforming baselines (0.19-0.67).

Conclusion: SIFOTL offers an interpretable, privacy-safe, and noise-resistant workflow for data shift analysis.

Abstract: Identifying the factors driving data shifts in tabular datasets is a significant challenge for analysis and decision support systems, especially those focusing on healthcare. Privacy rules restrict data access, and noise from complex processes hinders analysis. To address this challenge, we propose SIFOTL (Statistically-Informed Fidelity-Optimization Method for Tabular Learning) that (i) extracts privacy-compliant data summary statistics, (ii) employs twin XGBoost models to disentangle intervention signals from noise with assistance from LLMs, and (iii) merges XGBoost outputs via a Pareto-weighted decision tree to identify interpretable segments responsible for the shift. Unlike existing analyses which may ignore noise or require full data access for LLM-based analysis, SIFOTL addresses both challenges using only privacy-safe summary statistics. Demonstrating its real-world efficacy, for a MEPS panel dataset mimicking a new Medicare drug subsidy, SIFOTL achieves an F1 score of 0.85, substantially outperforming BigQuery Contribution Analysis (F1=0.46) and statistical tests (F1=0.20) in identifying the segment receiving the subsidy. Furthermore, across 18 diverse EHR datasets generated based on Synthea ABM, SIFOTL sustains F1 scores of 0.86-0.96 without noise and >= 0.75 even with injected observational noise, whereas baseline average F1 scores range from 0.19-0.67 under the same tests. SIFOTL, therefore, provides an interpretable, privacy-conscious workflow that is empirically robust to observational noise.

[312] Machine Unlearning of Traffic State Estimation and Prediction

Xin Wang, R. Tyrrell Rockafellar, Xuegang Ban

Main category: cs.LG

TL;DR: A novel learning paradigm called Machine Unlearning TSEP is introduced to allow trained traffic state estimation and prediction models to selectively forget sensitive, outdated, or poisoned data, enhancing trustworthiness.

Details

Motivation: Address privacy, cybersecurity, and data freshness concerns in data-driven traffic state estimation and prediction (TSEP) to maintain public trust in intelligent transportation systems.

Method: Proposes Machine Unlearning TSEP, a learning paradigm enabling models to selectively forget unwanted data.

Result: Enables models to remove privacy-sensitive, poisoned, or outdated data, improving reliability.

Conclusion: Machine Unlearning TSEP enhances the trustworthiness of data-driven TSEP by addressing privacy and data integrity issues.

Abstract: Data-driven traffic state estimation and prediction (TSEP) relies heavily on data sources that contain sensitive information. While the abundance of data has fueled significant breakthroughs, particularly in machine learning-based methods, it also raises concerns regarding privacy, cybersecurity, and data freshness. These issues can erode public trust in intelligent transportation systems. Recently, regulations have introduced the “right to be forgotten”, allowing users to request the removal of their private data from models. As machine learning models can remember old data, simply removing it from back-end databases is insufficient in such systems. To address these challenges, this study introduces a novel learning paradigm for TSEP, Machine Unlearning TSEP, which enables a trained TSEP model to selectively forget privacy-sensitive, poisoned, or outdated data. By empowering models to “unlearn,” we aim to enhance the trustworthiness and reliability of data-driven TSEP.

[313] Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models

Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta

Main category: cs.LG

TL;DR: A predictive framework optimizes resource usage for fine-tuning LLMs with GRPO, identifying three training phases and suggesting early stopping to save compute.

Details

Motivation: Fine-tuning LLMs with GRPO is computationally expensive, prompting the need for a resource-efficient method.

Method: Proposes a predictive framework modeling training dynamics, tested on Llama and Qwen models (3B, 8B) to derive a scaling law.

Result: Identifies three training phases (slow start, rapid improvement, plateau) and shows early stopping reduces compute without performance loss.

Conclusion: The framework generalizes across models, offering a practical guide for efficient GRPO-based fine-tuning.

Abstract: Fine-tuning large language models (LLMs) for reasoning tasks using reinforcement learning methods like Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B, 8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond a certain number of epochs offers little gain, suggesting earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.

[314] Multiscale Neural PDE Surrogates for Prediction and Downscaling: Application to Ocean Currents

Abdessamad El-Kabid, Loubna Benabbou, Redouane Lguensat, Alex Hernández-García

Main category: cs.LG

TL;DR: A deep learning framework using neural operators is introduced for solving PDEs and downscaling ocean current data, achieving arbitrary resolution solutions.

Details

Motivation: High-resolution ocean current data is crucial for coastal management and safety, but existing satellite products lack sufficient granularity.

Method: Supervised deep learning with neural operators for PDE solutions and downscaling models, applied to Copernicus data and synthetic datasets.

Result: The model successfully provides arbitrary resolution solutions and downscales ocean current data effectively.

Conclusion: The proposed framework enhances resolution for ocean current data and offers versatile PDE solutions, benefiting scientific and practical applications.

Abstract: Accurate modeling of physical systems governed by partial differential equations is a central challenge in scientific computing. In oceanography, high-resolution current data are critical for coastal management, environmental monitoring, and maritime safety. However, available satellite products, such as Copernicus data for sea water velocity at ~0.08 degrees spatial resolution and global ocean models, often lack the spatial granularity required for detailed local analyses. In this work, we (a) introduce a supervised deep learning framework based on neural operators for solving PDEs and providing arbitrary resolution solutions, and (b) propose downscaling models with an application to Copernicus ocean current data. Additionally, our method can model surrogate PDEs and predict solutions at arbitrary resolution, regardless of the input resolution. We evaluated our model on real-world Copernicus ocean current data and synthetic Navier-Stokes simulation datasets.

[315] Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

Main category: cs.LG

TL;DR: GSPO is a reinforcement learning algorithm for training large language models, using sequence-level optimization for better efficiency and performance than GRPO.

Details

Motivation: To improve training stability, efficiency, and performance in reinforcement learning for large language models, especially for Mixture-of-Experts (MoE) setups.

Method: Defines importance ratios based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.

Result: GSPO outperforms GRPO, stabilizes MoE RL training, and simplifies RL infrastructure design, leading to improvements in Qwen3 models.

Conclusion: GSPO is a superior, stable, and efficient algorithm for training large language models, with demonstrated success in enhancing model performance.

Abstract: This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
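
The contrast with token-level importance ratios can be made concrete with a small sketch of a sequence-level ratio and a clipped objective. Several details are assumptions not stated in the summary: the ratio is length-normalized (a geometric mean of token ratios), advantages are normalized within the group of responses, and `eps=0.2` is an arbitrary clipping range. The function names are hypothetical.

```python
import numpy as np

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence likelihood ratio for one response.

    logp_new, logp_old: per-token log-probabilities of the same response
    under the current and the old policy.
    """
    diff = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    return float(np.exp(diff.mean()))

def gspo_style_objective(ratios, rewards, eps=0.2):
    """Clipped surrogate (to be maximized) with one ratio per whole sequence."""
    ratios = np.asarray(ratios, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative advantage
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratios * adv, clipped * adv)))

# A group of two sampled responses to one prompt, with made-up log-probs and rewards.
r1 = sequence_ratio([-0.5, -0.5], [-1.0, -1.0])   # policy now favors this response
r2 = sequence_ratio([-1.0, -2.0], [-1.0, -2.0])   # unchanged
objective = gspo_style_objective([r1, r2], rewards=[1.0, 0.0])
print(r1, r2, objective)
```

Since each response contributes a single ratio, clipping keeps or drops a whole sequence at once, rather than a scattered subset of its tokens.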

[316] C-AAE: Compressively Anonymizing Autoencoders for Privacy-Preserving Activity Recognition in Healthcare Sensor Streams

Ryusei Fujimoto, Yugo Nakamura, Yutaka Arakawa

Main category: cs.LG

TL;DR: C-AAE combines an Anonymizing AutoEncoder (AAE) with ADPCM to protect user privacy in wearable sensor data while maintaining activity recognition accuracy and reducing data volume.

Details

Motivation: Privacy protection is crucial for healthcare applications using wearable sensors, as these devices can inadvertently reveal user identities through behavioural signatures.

Method: C-AAE uses an AAE to suppress identity cues in sensor data and ADPCM to further mask identity and compress data.

Result: C-AAE reduces user re-identification F1 scores by 10-15 points, keeps activity recognition F1 within 5 points of baseline, and cuts data volume by 75%.

Conclusion: C-AAE effectively balances privacy and utility in sensor-based healthcare applications.

Abstract: Wearable accelerometers and gyroscopes encode fine-grained behavioural signatures that can be exploited to re-identify users, making privacy protection essential for healthcare applications. We introduce C-AAE, a compressive anonymizing autoencoder that marries an Anonymizing AutoEncoder (AAE) with Adaptive Differential Pulse-Code Modulation (ADPCM). The AAE first projects raw sensor windows into a latent space that retains activity-relevant features while suppressing identity cues. ADPCM then differentially encodes this latent stream, further masking residual identity information and shrinking the bitrate. Experiments on the MotionSense and PAMAP2 datasets show that C-AAE cuts user re-identification F1 scores by 10-15 percentage points relative to AAE alone, while keeping activity-recognition F1 within 5 percentage points of the unprotected baseline. ADPCM also reduces data volume by roughly 75%, easing transmission and storage overheads. These results demonstrate that C-AAE offers a practical route to balancing privacy and utility in continuous, sensor-based activity recognition for healthcare.
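
The differential-encoding step can be illustrated with a simplified sketch. Note that this is plain DPCM with a fixed step size, not ADPCM, which additionally adapts the step to the signal. The 4-bit code width, the 16-bit source width, and the step size are assumptions chosen so that the arithmetic matches the roughly 75% reduction reported; the summary does not give the actual settings.

```python
import numpy as np

def dpcm_encode(signal, step=0.1, bits=4):
    """Quantize the difference between each sample and the running reconstruction."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    codes, pred = [], 0.0
    for x in signal:
        code = int(np.clip(round((x - pred) / step), lo, hi))
        codes.append(code)
        pred += code * step   # track what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step=0.1):
    """Rebuild the signal by accumulating the quantized differences."""
    return np.cumsum(np.asarray(codes, dtype=float) * step)

latent = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))   # stand-in for one latent channel
codes = dpcm_encode(latent)
decoded = dpcm_decode(codes)
reduction = 1.0 - 4 / 16   # 4-bit codes replacing 16-bit samples
print(np.max(np.abs(decoded - latent)), reduction)
```

Only small quantized differences are transmitted, so fine detail in the latent stream is coarsened; the abstract attributes the extra masking of residual identity information to this differential encoding.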

[317] Squeeze10-LLM: Squeezing LLMs’ Weights by 10 Times via a Staged Mixed-Precision Quantization Method

Qingcheng Zhu, Yangyang Ren, Linlin Yang, Mingbao Lin, Yanjing Li, Sheng Xu, Zichao Feng, Haodong Zhu, Yuguang Yang, Juan Zhang, Runqi Wang, Baochang Zhang

Main category: cs.LG

TL;DR: Squeeze10-LLM is a staged mixed-precision PTQ framework that reduces LLM weights by 10x, achieving 1.6 bits per weight. It introduces PBAR and FIAS to improve low-bit quantization performance, boosting accuracy from 43% to 56% on zero-shot tasks.

Details

Motivation: Large language models (LLMs) face deployment challenges due to high computational costs and massive parameters. Ultra low-bit quantization often degrades performance, necessitating a solution.

Method: Squeeze10-LLM uses staged mixed-precision PTQ, quantizing 80% of weights to 1 bit and 20% to 4 bits. It incorporates PBAR for refined weight significance and FIAS to preserve activation information.

Result: Experiments on LLaMA and LLaMA2 show Squeeze10-LLM achieves state-of-the-art sub-2bit performance, improving zero-shot task accuracy from 43% to 56%.

Conclusion: Squeeze10-LLM effectively balances compression and performance, offering a practical solution for deploying LLMs with ultra low-bit quantization.

Abstract: Deploying large language models (LLMs) is challenging due to their massive parameters and high computational costs. Ultra low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., mean bit-width <= 2) often leads to severe performance degradation. To address this, we propose Squeeze10-LLM, effectively “squeezing” 16-bit LLMs’ weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework and achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. Squeeze10-LLM introduces two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit settings. FIAS is a strategy that preserves full activation information during quantization to mitigate cumulative error propagation across layers. Experiments on LLaMA and LLaMA2 show that Squeeze10-LLM achieves state-of-the-art performance for sub-2bit weight-only quantization, improving average accuracy from 43% to 56% on six zero-shot classification tasks, a significant boost over existing PTQ methods. Our code will be released upon publication.
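
The 1.6-bit average and the 10x figure follow directly from the stated 80%/20% split. A short worked check, ignoring any storage overhead for scales or for marking which weights are kept at 4 bits (the summary does not quantify that overhead); `mean_bit_width` is a hypothetical helper.

```python
def mean_bit_width(allocation):
    """Average bits per weight for a list of (fraction_of_weights, bits) pairs."""
    return sum(fraction * bits for fraction, bits in allocation)

avg_bits = mean_bit_width([(0.8, 1), (0.2, 4)])   # 80% at 1 bit, 20% at 4 bits
compression = 16 / avg_bits                        # relative to 16-bit weights
print(avg_bits, compression)
```

The 4-bit share is small in count but contributes half of the bit budget (0.8 of the 1.6 bits), which is why the choice of which weights receive it matters.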

[318] Learning from Hard Labels with Additional Supervision on Non-Hard-Labeled Classes

Kosuke Sugiyama, Masato Uchida

Main category: cs.LG

TL;DR: The paper proposes a theoretical framework for improving classification models with limited training data by combining hard labels and additional supervision into soft labels, showing that non-hard-label distribution information is key.

Details

Motivation: Limited training data and the need for high-accuracy models motivate exploring additional supervision beyond hard labels.

Method: Theoretical framework treats hard labels and additional supervision as distributions, constructing soft labels via affine combination.

Result: Additional supervision’s distribution over non-hard-labeled classes is crucial; mixing coefficient and supervision refine labels complementarily.

Conclusion: Additional supervision and mixing coefficients improve classification accuracy, as validated theoretically and experimentally.

Abstract: In scenarios where training data is limited due to observation costs or data scarcity, enriching the label information associated with each instance becomes crucial for building high-accuracy classification models. In such contexts, it is often feasible to obtain not only hard labels but also additional supervision, such as the confidences for the hard labels. This setting naturally raises fundamental questions: What kinds of additional supervision are intrinsically beneficial? And how do they contribute to improved generalization performance? To address these questions, we propose a theoretical framework that treats both hard labels and additional supervision as probability distributions, and constructs soft labels through their affine combination. Our theoretical analysis reveals that the essential component of additional supervision is not the confidence score of the assigned hard label, but rather the information of the distribution over the non-hard-labeled classes. Moreover, we demonstrate that the additional supervision and the mixing coefficient contribute to the refinement of soft labels in complementary roles. Intuitively, in the probability simplex, the additional supervision determines the direction in which the deterministic distribution representing the hard label should be adjusted toward the true label distribution, while the mixing coefficient controls the step size along that direction. Through generalization error analysis, we theoretically characterize how the additional supervision and its mixing coefficient affect both the convergence rate and asymptotic value of the error bound. Finally, we experimentally demonstrate that, based on our theory, designing additional supervision can lead to improved classification accuracy, even when utilized in a simple manner.
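
The "direction and step size" picture in the abstract can be illustrated with a minimal sketch. It assumes the soft label is formed as (1 - alpha) * one_hot + alpha * extra, which is one natural reading of "affine combination"; the paper's exact parametrization may differ, and the names `soft_label`, `extra`, and `alpha` are hypothetical.

```python
from typing import List, Sequence

def soft_label(hard_class: int, extra: Sequence[float], alpha: float) -> List[float]:
    """Mix a one-hot hard label with an additional-supervision distribution.

    Assumed form: q = (1 - alpha) * one_hot(hard_class) + alpha * extra.
    `extra` sets the direction of the adjustment; `alpha` sets the step size.
    """
    if not 0 <= hard_class < len(extra):
        raise ValueError("hard_class is out of range")
    if abs(sum(extra) - 1.0) > 1e-9:
        raise ValueError("extra must be a probability distribution")
    one_hot = [1.0 if k == hard_class else 0.0 for k in range(len(extra))]
    return [(1.0 - alpha) * h + alpha * s for h, s in zip(one_hot, extra)]

# Hard label is class 0; the extra supervision only describes how the
# remaining probability mass is spread over the non-hard-labeled classes.
q = soft_label(hard_class=0, extra=[0.0, 0.75, 0.25], alpha=0.2)
```

Because the two coefficients sum to one, the result still sums to one for any `alpha`, so it stays a valid label distribution as long as every entry remains non-negative.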

[319] Percentile-Based Deep Reinforcement Learning and Reward Based Personalization For Delay Aware RAN Slicing in O-RAN

Peyman Tehrani, Anas Alsoliman

Main category: cs.LG

TL;DR: The paper addresses RAN slicing in O-RAN, focusing on MVNOs competing for PRBs to meet delay constraints. It introduces PDA-DRL, reducing average delay by 38%, and a reward-based personalization method for model weight sharing.

Details

Motivation: To optimize PRB utilization while meeting probabilistic delay constraints for MVNOs in O-RAN, addressing inefficiencies in existing methods.

Method: Derives a reward function using LLN, proposes PDA-DRL for delay optimization, and introduces a reward-based personalization method for model weight sharing.

Result: PDA-DRL reduces average delay by 38% compared to baselines; reward-based personalization outperforms traditional methods like federated averaging.

Conclusion: The proposed PDA-DRL and personalization method effectively optimize delay and resource utilization in O-RAN, offering superior performance over existing approaches.

Abstract: In this paper, we tackle the challenge of radio access network (RAN) slicing within an open RAN (O-RAN) architecture. Our focus centers on a network that includes multiple mobile virtual network operators (MVNOs) competing for physical resource blocks (PRBs) with the goal of meeting probabilistic delay upper bound constraints for their clients while minimizing PRB utilization. Initially, we derive a reward function based on the law of large numbers (LLN), then implement practical modifications to adapt it for real-world experimental scenarios. We then propose our solution, the Percentile-based Delay-Aware Deep Reinforcement Learning (PDA-DRL), which demonstrates its superiority over several baselines, including DRL models optimized for average delay constraints, by achieving a 38% reduction in resultant average delay. Furthermore, we delve into the issue of model weight sharing among multiple MVNOs to develop a robust personalized model. We introduce a reward-based personalization method where each agent prioritizes other agents’ model weights based on their performance. This technique surpasses traditional aggregation methods, such as federated averaging, and strategies reliant on traffic patterns and model weight distance similarities.
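
To see why a probabilistic delay bound differs from an average-delay target, the sketch below checks both on the same set of samples. It is only an illustration of the constraint type named in the abstract, not the paper's reward function; the function names, the delay values, and the 10 ms bound are all made up for the example.

```python
from typing import Sequence

def mean_delay(delays_ms: Sequence[float]) -> float:
    """Average delay over the observed samples."""
    if not delays_ms:
        raise ValueError("need at least one delay sample")
    return sum(delays_ms) / len(delays_ms)

def tail_constraint_met(delays_ms: Sequence[float], bound_ms: float,
                        target_prob: float) -> bool:
    """Empirical check of P(delay <= bound_ms) >= target_prob."""
    if not delays_ms:
        raise ValueError("need at least one delay sample")
    within = sum(1 for d in delays_ms if d <= bound_ms)
    return within / len(delays_ms) >= target_prob

# Hypothetical samples: nine fast packets and one slow outlier.
delays = [2, 3, 2, 4, 30, 3, 2, 3, 2, 3]
average_ok = mean_delay(delays) <= 10                 # the mean is well under 10 ms
tail_ok = tail_constraint_met(delays, 10, 0.95)       # only 90% of samples are
```

Here the average-delay target is satisfied while the 95% bound is violated, which is the kind of gap a percentile-based objective is meant to close.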

[320] Policy Disruption in Reinforcement Learning: Adversarial Attack with Large Language Models and Critical State Identification

Junyong Jiang, Buwei Tian, Chenxing Xu, Songze Li, Lu Dong

Main category: cs.LG

TL;DR: Proposes an adversarial attack method for RL systems using LLMs to generate tailored adversarial rewards and a critical state identification algorithm, outperforming existing methods.

Details

Motivation: Address the challenge of adversarial attacks in RL without modifying the environment or policy, focusing on inducing suboptimal actions.

Method: Uses a reward iteration optimization framework with LLMs to craft adversarial rewards and a critical state identification algorithm to target vulnerabilities.

Result: Demonstrates superior performance in diverse environments compared to existing approaches.

Conclusion: The method effectively induces suboptimal decision-making in RL agents without altering the environment, showcasing practical adversarial attack capabilities.

Abstract: Reinforcement learning (RL) has achieved remarkable success in fields like robotics and autonomous driving, but adversarial attacks designed to mislead RL systems remain challenging. Existing approaches often rely on modifying the environment or policy, limiting their practicality. This paper proposes an adversarial attack method in which existing agents in the environment guide the target policy to output suboptimal actions without altering the environment. We propose a reward iteration optimization framework that leverages large language models (LLMs) to generate adversarial rewards explicitly tailored to the vulnerabilities of the target agent, thereby enhancing the effectiveness of inducing the target agent toward suboptimal decision-making. Additionally, a critical state identification algorithm is designed to pinpoint the target agent’s most vulnerable states, where suboptimal behavior from the victim leads to significant degradation in overall performance. Experimental results in diverse environments demonstrate the superiority of our method over existing approaches.

[321] Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning

Matthias Otth, Jonas Hübotter, Ido Hakimi, Andreas Krause

Main category: cs.LG

TL;DR: Language models self-improve by maximizing their confidence, achieving better performance in math tasks by selecting the most promising attempts using prefix-confidence.

Details

Motivation: To explore test-time scaling of language models for math reasoning, leveraging the model's own confidence for performance gains.

Method: Uses prefix-confidence to select the most promising attempts, evaluated on five datasets (GSM8K, MATH500, AMC23, AIME24, AIME25).

Result: Prefix-confidence scaling with 32-token prefixes achieves a better accuracy-compute trade-off than majority voting and is less susceptible to length biases than best-of-N (BoN). Test-time training with prefix-confidence outperforms the base model but doesn’t surpass prefix-confidence scaling.

Conclusion: Prefix-confidence scaling is effective for math reasoning tasks, offering a better accuracy-compute trade-off and reduced biases compared to other methods.

Abstract: Recent work has shown that language models can self-improve by maximizing their own confidence in their predictions, without relying on external verifiers or reward signals. In this work, we study the test-time scaling of language models for mathematical reasoning tasks, where the model’s own confidence is used to select the most promising attempts. Surprisingly, we find that we can achieve significant performance gains by continuing only the most promising attempt, selected by the model’s prefix-confidence. We systematically evaluate prefix-confidence scaling on five mathematical reasoning datasets: the school-level GSM8K and MATH500, and the competition-level AMC23, AIME24, and AIME25. We find that prefix-confidence scaling with prefixes of only 32 tokens achieves a better accuracy-compute trade-off than majority voting. Moreover, prefix-confidence scaling appears less susceptible than BoN to length biases. Finally, we also evaluate test-time training with prefix-confidence and find that, while outperforming the base model, it does not improve over prefix-confidence scaling.
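
The selection step can be sketched as follows, assuming confidence is scored as the mean token log-probability of each sampled prefix. That scoring rule is an assumption of this sketch (the abstract does not spell out the formula), and `prefix_confidence` and `select_most_confident` are hypothetical names.

```python
from typing import Sequence

def prefix_confidence(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability of the prefix tokens (assumed confidence score)."""
    if not token_logprobs:
        raise ValueError("prefix must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)

def select_most_confident(prefix_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the sampled prefix with the highest confidence."""
    if not prefix_logprobs:
        raise ValueError("need at least one candidate prefix")
    scores = [prefix_confidence(p) for p in prefix_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical per-token log-probabilities for three short sampled prefixes.
candidates = [[-1.0, -2.0], [-0.1, -0.2], [-0.5, -0.5]]
best = select_most_confident(candidates)  # only this prefix would be continued
```

Since all prefixes have the same fixed length (32 tokens in the paper's setting), the score compares candidates of equal length; only the winner is continued to a full solution, which is where the compute saving over completing every attempt comes from.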

[322] Neuromorphic Computing for Embodied Intelligence in Autonomous Systems: Current Trends, Challenges, and Future Directions

Alberto Marchisio, Muhammad Shafique

Main category: cs.LG

TL;DR: A survey of neuromorphic computing’s role in enhancing autonomous systems, covering algorithms, hardware, and optimization, with a focus on event-based vision sensors and spiking neural networks.

Details

Motivation: The demand for intelligent, adaptive, and energy-efficient autonomous systems in robotics, UAVs, and self-driving vehicles drives interest in neuromorphic computing.

Method: The paper surveys neuromorphic algorithms, hardware, and cross-layer optimization, emphasizing event-based dynamic vision sensors and spiking neural networks.

Result: Highlights improved energy efficiency, robustness, adaptability, and reliability in autonomous systems through neuromorphic approaches.

Conclusion: The paper provides a comprehensive view of the field, identifying emerging trends and challenges in real-time decision-making, continual learning, and secure autonomous systems.

Abstract: The growing need for intelligent, adaptive, and energy-efficient autonomous systems across fields such as robotics, mobile agents (e.g., UAVs), and self-driving vehicles is driving interest in neuromorphic computing. By drawing inspiration from biological neural systems, neuromorphic approaches offer promising pathways to enhance the perception, decision-making, and responsiveness of autonomous platforms. This paper surveys recent progress in neuromorphic algorithms, specialized hardware, and cross-layer optimization strategies, with a focus on their deployment in real-world autonomous scenarios. Special attention is given to event-based dynamic vision sensors and their role in enabling fast, efficient perception. The discussion highlights new methods that improve energy efficiency, robustness, adaptability, and reliability through the integration of spiking neural networks into autonomous system architectures. We integrate perspectives from machine learning, robotics, neuroscience, and neuromorphic engineering to offer a comprehensive view of the state of the field. Finally, emerging trends and open challenges are explored, particularly in the areas of real-time decision-making, continual learning, and the development of secure, resilient autonomous systems.

[323] When Noisy Labels Meet Class Imbalance on Graphs: A Graph Augmentation Method with LLM and Pseudo Label

Riting Xia, Rucong Wang, Yulin Liu, Anchen Li, Xueyan Liu, Yan Zhang

Main category: cs.LG

TL;DR: GraphALP is a novel framework using LLMs and pseudo-labeling to address class-imbalanced graph node classification with noisy labels.

Details

Motivation: Real-world graphs often have noisy labels and class imbalance, which existing methods overlook.

Method: GraphALP combines LLM-based oversampling for minority nodes and dynamically weighted pseudo-labeling to reduce noise.

Result: GraphALP outperforms state-of-the-art methods on imbalanced graphs with noisy labels.

Conclusion: The proposed framework effectively handles class imbalance and label noise in graph node classification.

Abstract: Class-imbalanced graph node classification is a practical yet underexplored research problem. Although recent studies have attempted to address this issue, they typically assume clean and reliable labels when processing class-imbalanced graphs. This assumption often violates the nature of real-world graphs, where labels frequently contain noise. Given this gap, this paper systematically investigates robust node classification for class-imbalanced graphs with noisy labels. We propose GraphALP, a novel Graph Augmentation framework based on Large language models (LLMs) and Pseudo-labeling techniques. Specifically, we design an LLM-based oversampling method to generate synthetic minority nodes, producing label-accurate minority nodes to alleviate class imbalance. Based on the class-balanced graphs, we develop a dynamically weighted pseudo-labeling method to obtain high-confidence pseudo labels to reduce label noise ratio. Additionally, we implement a secondary LLM-guided oversampling mechanism to mitigate potential class distribution skew caused by pseudo labels. Experimental results show that GraphALP achieves superior performance over state-of-the-art methods on class-imbalanced graphs with noisy labels.

[324] ChronoSelect: Robust Learning with Noisy Labels via Dynamics Temporal Memory

Jianchao Wang, Qingfeng Li, Pengcheng Zheng, Xiaorong Pu, Yazhou Ren

Main category: cs.LG

TL;DR: ChronoSelect is a novel framework for learning with noisy labels, leveraging temporal dynamics through a four-stage memory architecture and sliding update mechanism to improve generalization.

Details

Motivation: Existing methods for learning with noisy labels fail to utilize temporal learning dynamics, leading to degraded performance.

Method: ChronoSelect uses a four-stage memory architecture with a sliding update mechanism to compress prediction history into temporal distributions, enabling precise sample partitioning.

Result: The framework achieves state-of-the-art performance on synthetic and real-world benchmarks, with theoretical guarantees for convergence and stability.

Conclusion: ChronoSelect effectively addresses noisy label challenges by dynamically leveraging temporal learning patterns, outperforming existing methods.

Abstract: Training deep neural networks on real-world datasets is often hampered by the presence of noisy labels, which can be memorized by over-parameterized models, leading to significant degradation in generalization performance. While existing methods for learning with noisy labels (LNL) have made considerable progress, they fundamentally suffer from static snapshot evaluations and fail to leverage the rich temporal dynamics of learning evolution. In this paper, we propose ChronoSelect (chrono denoting its temporal nature), a novel framework featuring an innovative four-stage memory architecture that compresses prediction history into compact temporal distributions. Our unique sliding update mechanism with controlled decay maintains only four dynamic memory units per sample, progressively emphasizing recent patterns while retaining essential historical knowledge. This enables precise three-way sample partitioning into clean, boundary, and noisy subsets through temporal trajectory analysis and dual-branch consistency. Theoretical guarantees prove the mechanism’s convergence and stability under noisy conditions. Extensive experiments demonstrate ChronoSelect’s state-of-the-art performance across synthetic and real-world benchmarks.

[325] Goal-based Trajectory Prediction for improved Cross-Dataset Generalization

Daniel Grimm, Ahmed Abouelazm, J. Marius Zöllner

Main category: cs.LG

TL;DR: A new GNN-based model improves generalization for autonomous driving by using a heterogeneous graph of traffic participants and road networks to classify goals in multi-staged trajectory prediction.

Details

Motivation: Addressing the drop in performance of current models when deployed to unseen areas, highlighting the need for better generalization in autonomous driving systems.

Method: Introduces a Graph Neural Network (GNN) with a heterogeneous graph combining traffic participants and vectorized road networks, using multi-staged goal classification for trajectory prediction.

Result: Demonstrates effectiveness via cross-dataset evaluation (training on Argoverse2, testing on NuScenes), showing improved generalization.

Conclusion: The proposed GNN approach enhances generalization in trajectory prediction for autonomous driving, particularly in unseen scenarios.

Abstract: To achieve full autonomous driving, a good understanding of the surrounding environment is necessary. In particular, predicting the future states of other traffic participants poses a non-trivial challenge. Current state-of-the-art (SotA) models already show promising results when trained on real datasets (e.g. Argoverse2, NuScenes). Problems arise when these models are deployed to new/unseen areas. Typically, performance drops significantly, indicating that the models lack generalization. In this work, we introduce a new Graph Neural Network (GNN) that utilizes a heterogeneous graph consisting of traffic participants and a vectorized road network. The latter is used to classify goals, i.e. endpoints of the predicted trajectories, in a multi-staged approach, leading to better generalization to unseen scenarios. We show the effectiveness of the goal selection process via cross-dataset evaluation, i.e. training on Argoverse2 and evaluating on NuScenes.

[326] FedSA-GCL: A Semi-Asynchronous Federated Graph Learning Framework with Personalized Aggregation and Cluster-Aware Broadcasting

Zhongzheng Yuan, Lianshuai Guo, Xunkai Li, Yinlin Zhu, Wenyu Wang, Meixia Qu

Main category: cs.LG

TL;DR: FedSA-GCL is a semi-asynchronous federated framework for graph learning, addressing inefficiencies of synchronous methods and limitations of AFL in graph data. It outperforms baselines by 2.92-3.4%.

Details

Motivation: Existing FGL approaches rely on synchronous communication, causing inefficiencies, while AFL methods ignore graph topology, risking model inconsistency.

Method: FedSA-GCL uses a ClusterCast mechanism to leverage inter-client label divergence and graph topology for efficient training.

Result: Outperforms 9 baselines by an average of 2.92% with the Louvain split and 3.4% with the Metis split, while showing strong robustness and efficiency.

Conclusion: FedSA-GCL effectively addresses FGL challenges, offering robust and efficient training for graph data.

Abstract: Federated Graph Learning (FGL) is a distributed learning paradigm that enables collaborative training over large-scale subgraphs located on multiple local systems. However, most existing FGL approaches rely on synchronous communication, which leads to inefficiencies and is often impractical in real-world deployments. Meanwhile, current asynchronous federated learning (AFL) methods are primarily designed for conventional tasks such as image classification and natural language processing, without accounting for the unique topological properties of graph data. Directly applying these methods to graph learning may result in semantic drift and representational inconsistency in the global model. To address these challenges, we propose FedSA-GCL, a semi-asynchronous federated framework that leverages both inter-client label distribution divergence and graph topological characteristics through a novel ClusterCast mechanism for efficient training. We evaluate FedSA-GCL on multiple real-world graph datasets using the Louvain and Metis split algorithms, and compare it against 9 baselines. Extensive experiments demonstrate that our method achieves strong robustness and outstanding efficiency, outperforming the baselines by an average of 2.92% with the Louvain split and by 3.4% with the Metis split.

[327] Sparse identification of nonlinear dynamics with library optimization mechanism: Recursive long-term prediction perspective

Ansei Yonezawa, Heisei Yonezawa, Shuichi Yahagi, Itsuro Kajiwara, Shinya Kijimoto, Hikaru Taniuchi, Kentaro Murakami

Main category: cs.LG

TL;DR: SINDy-LOM combines sparse regression with library optimization to improve model accuracy and reduce user burden by parametrizing basis functions and optimizing them for recursive long-term prediction.

Details

Motivation: The challenge in SINDy is designing an appropriate library of basis functions, which is non-trivial for many dynamical systems.

Method: SINDy-LOM uses a two-layer optimization: inner-layer for sparse regression and outer-layer for optimizing parametrized basis functions based on recursive long-term prediction accuracy.

Result: The approach yields interpretable, parsimonious models with improved reliability and reduced user burden, demonstrated on a diesel engine airpath system.

Conclusion: SINDy-LOM enhances traditional SINDy by optimizing library design, improving long-term prediction accuracy and usability.

Abstract: The sparse identification of nonlinear dynamics (SINDy) approach can discover the governing equations of dynamical systems based on measurement data, where the dynamical model is identified as the sparse linear combination of the given basis functions. A major challenge in SINDy is the design of a library, which is a set of candidate basis functions, as the appropriate library is not trivial for many dynamical systems. To overcome this difficulty, this study proposes SINDy with library optimization mechanism (SINDy-LOM), which is a combination of the sparse regression technique and the novel learning strategy of the library. In the proposed approach, the basis functions are parametrized. The SINDy-LOM approach involves a two-layer optimization architecture: the inner-layer, in which the data-driven model is extracted as the sparse linear combination of the candidate basis functions, and the outer-layer, in which the basis functions are optimized from the viewpoint of the recursive long-term (RLT) prediction accuracy; thus, the library design is reformulated as the optimization of the parametrized basis functions. The resulting SINDy-LOM model has good interpretability and usability, as the proposed approach yields the parsimonious model. The library optimization mechanism significantly reduces user burden. The RLT perspective improves the reliability of the resulting model compared with the traditional SINDy approach that can only ensure the one-step-ahead prediction accuracy. The validity of the proposed approach is demonstrated by applying it to a diesel engine airpath system, which is a well-known complex industrial system.

[328] Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods

Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal

Main category: cs.LG

TL;DR: Large-scale study of LP-based boosting methods shows they can match or outperform XGBoost/LightGBM with sparser ensembles, especially with shallow trees, and can thin pre-trained ensembles effectively.

Details

Motivation: To empirically evaluate the performance of totally corrective boosting methods based on linear programming, which have been theoretically appealing but understudied.

Method: Study six LP-based boosting formulations (including two novel ones, NM-Boost and QRLP-Boost) across 20 datasets, using heuristic and optimal base learners, and analyzing accuracy, sparsity, margins, anytime performance, and hyperparameter sensitivity.

Result: Totally corrective methods outperform or match XGBoost/LightGBM with shallow trees, produce sparser ensembles, and can thin pre-trained ensembles without performance loss.

Conclusion: LP-based boosting methods are viable alternatives to heuristics, offering sparsity and competitive performance, though optimal decision trees have limitations.

Abstract: Despite their theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention. In this paper, we conduct the first large-scale experimental study of six LP-based boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20 diverse datasets. We evaluate the use of both heuristic and optimal base learners within these formulations, and analyze not only accuracy, but also ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity. We show that totally corrective methods can outperform or match state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees, while producing significantly sparser ensembles. We further show that these methods can thin pre-trained ensembles without sacrificing performance, and we highlight both the strengths and limitations of using optimal decision trees in this context.

[329] Leveraging Data Augmentation and Siamese Learning for Predictive Process Monitoring

Sjoerd van Straten, Alessandro Padella, Marwan Hassani

Main category: cs.LG

TL;DR: SiamSA-PPM is a self-supervised learning framework for Predictive Process Monitoring, using Siamese learning and Statistical Augmentation to enhance data variability and improve prediction accuracy.

Details

Motivation: Deep learning PPM approaches struggle with low variability and small size of real-world event logs, limiting their effectiveness.

Method: Combines Siamese learning with Statistical Augmentation, employing three novel transformation methods to generate realistic trace variants for training.

Result: Outperforms SOTA in next activity and final outcome prediction tasks, with statistical augmentation proving superior to random transformations.

Conclusion: SiamSA-PPM is a promising approach for enriching training data and improving PPM performance.

Abstract: Predictive Process Monitoring (PPM) enables forecasting future events or outcomes of ongoing business process instances based on event logs. However, deep learning PPM approaches are often limited by the low variability and small size of real-world event logs. To address this, we introduce SiamSA-PPM, a novel self-supervised learning framework that combines Siamese learning with Statistical Augmentation for Predictive Process Monitoring. It employs three novel statistically grounded transformation methods that leverage control-flow semantics and frequent behavioral patterns to generate realistic, semantically valid new trace variants. These augmented views are used within a Siamese learning setup to learn generalizable representations of process prefixes without the need for labeled supervision. Extensive experiments on real-life event logs demonstrate that SiamSA-PPM achieves competitive or superior performance compared to the SOTA in both next activity and final outcome prediction tasks. Our results further show that statistical augmentation significantly outperforms random transformations and improves variability in the data, highlighting SiamSA-PPM as a promising direction for training data enrichment in process prediction.

[330] Self-Supervised Coarsening of Unstructured Grid with Automatic Differentiation

Sergei Shumilin, Alexander Ryabov, Nikolay Yavich, Evgeny Burnaev, Vladimir Vanovskiy

Main category: cs.LG

TL;DR: An algorithm for coarsening unstructured grids using differentiable physics, k-means clustering, autodifferentiation, and stochastic minimization, reducing grid points by up to 10x while maintaining accuracy.

Details

Motivation: High computational load in numerical simulations necessitates methods to reduce problem size without compromising accuracy.

Method: Combines k-means clustering, autodifferentiation, and stochastic minimization to coarsen unstructured grids.

Result: Achieved up to a 10x reduction in grid points while preserving the modeled variable dynamics at the points of interest, for a linear parabolic equation and the wave equation.

Conclusion: The approach is versatile and applicable to simulations of systems described by evolutionary PDEs.

Abstract: Due to the high computational load of modern numerical simulation, there is a demand for approaches that would reduce the size of discrete problems while keeping the accuracy reasonable. In this work, we present an original algorithm to coarsen an unstructured grid based on the concepts of differentiable physics. We achieve this by employing k-means clustering, autodifferentiation and stochastic minimization algorithms. We demonstrate performance of the designed algorithm on two PDEs: a linear parabolic equation which governs slightly compressible fluid flow in porous media and the wave equation. Our results show that in the considered scenarios, we reduced the number of grid points up to 10 times while preserving the modeled variable dynamics in the points of interest. The proposed approach can be applied to the simulation of an arbitrary system described by evolutionary partial differential equations.

[331] Regression-aware Continual Learning for Android Malware Detection

Daniele Ghiani, Daniele Angioni, Giorgio Piras, Angelo Sotgiu, Luca Minnei, Srishti Gupta, Maura Pintor, Fabio Roli, Battista Biggio

Main category: cs.LG

TL;DR: The paper addresses security regression in continual learning (CL)-based malware detectors, proposing a regression-aware penalty to mitigate harmful prediction changes while maintaining detection performance.

Details

Motivation: Malware evolves rapidly, making full retraining impractical. CL offers scalability but risks security regression, where previously detected malware evades detection after updates, undermining trust.

Method: The authors formalize and quantify security regression, adapting Positive Congruent Training (PCT) to CL to preserve prior predictive behavior in a model-agnostic way.

Result: Experiments on ELSA, Tesseract, and AZ-Class datasets show the method effectively reduces regression across CL scenarios while maintaining strong detection performance.

Conclusion: The proposed regression-aware penalty mitigates security regression in CL-based malware detectors, ensuring reliable updates without compromising detection accuracy.

Abstract: Malware evolves rapidly, forcing machine learning (ML)-based detectors to adapt continuously. With antivirus vendors processing hundreds of thousands of new samples daily, datasets can grow to billions of examples, making full retraining impractical. Continual learning (CL) has emerged as a scalable alternative, enabling incremental updates without full data access while mitigating catastrophic forgetting. In this work, we analyze a critical yet overlooked issue in this context: security regression. Unlike forgetting, which manifests as a general performance drop on previously seen data, security regression captures harmful prediction changes at the sample level, such as a malware sample that was once correctly detected but evades detection after a model update. Although often overlooked, regressions pose serious risks in security-critical applications, as the silent reintroduction of previously detected threats in the system may undermine users’ trust in the whole updating process. To address this issue, we formalize and quantify security regression in CL-based malware detectors and propose a regression-aware penalty to mitigate it. Specifically, we adapt Positive Congruent Training (PCT) to the CL setting, preserving prior predictive behavior in a model-agnostic manner. Experiments on the ELSA, Tesseract, and AZ-Class datasets show that our method effectively reduces regression across different CL scenarios while maintaining strong detection performance over time.
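
The sample-level notion of regression described above can be made concrete with a small metric. The sketch below counts malware samples that the old model detected and the updated model misses; it is one plausible way to quantify the idea, not necessarily the paper's formal definition, and `security_regression_rate` is a hypothetical name.

```python
from typing import Sequence

def security_regression_rate(labels: Sequence[int], old_preds: Sequence[int],
                             new_preds: Sequence[int]) -> float:
    """Fraction of previously detected malware that the updated model misses.

    Labels and predictions use 1 for malware and 0 for benign.
    """
    if not (len(labels) == len(old_preds) == len(new_preds)):
        raise ValueError("all inputs must have the same length")
    detected_before = [i for i, (y, o) in enumerate(zip(labels, old_preds))
                       if y == 1 and o == 1]
    if not detected_before:
        return 0.0
    regressed = sum(1 for i in detected_before if new_preds[i] == 0)
    return regressed / len(detected_before)

def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Plain accuracy, used here only for comparison."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

# Hypothetical predictions before and after a model update.
labels    = [1, 1, 1, 0, 1]
old_preds = [1, 1, 0, 0, 1]
new_preds = [1, 0, 1, 0, 1]
rate = security_regression_rate(labels, old_preds, new_preds)
```

In this toy case both models have the same accuracy, yet one of the three previously detected malware samples now evades detection, which is why an aggregate metric alone can hide the problem.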

[332] State of Health Estimation of Batteries Using a Time-Informed Dynamic Sequence-Inverted Transformer

Janak M. Patel, Milad Ramezankhani, Anirudh Deodhar, Dagnachew Birru

Main category: cs.LG

TL;DR: A novel architecture, TIDSIT, is proposed for accurate battery health monitoring, addressing challenges like irregular data sampling and variable-length discharge cycles, outperforming existing models with a 50% error reduction.

Details

Motivation: Battery health monitoring is critical for safety and efficiency, but existing models struggle with irregularities in real-world discharge cycle data.

Method: TIDSIT uses continuous time embeddings and temporal attention mechanisms to handle irregular and variable-length data without information loss.

Result: TIDSIT reduces prediction error by over 50% and achieves an SoH error below 0.58% on the NASA dataset.

Conclusion: TIDSIT is effective for battery health monitoring and has potential for broader applications in irregular time-series tasks.

Abstract: The rapid adoption of battery-powered vehicles and energy storage systems over the past decade has made battery health monitoring increasingly critical. Batteries play a central role in the efficiency and safety of these systems, yet they inevitably degrade over time due to repeated charge-discharge cycles. This degradation leads to reduced energy efficiency and potential overheating, posing significant safety concerns. Accurate estimation of a battery’s State of Health (SoH) is therefore essential for ensuring operational reliability and safety. Several machine learning architectures, such as LSTMs, transformers, and encoder-based models, have been proposed to estimate SoH from discharge cycle data. However, these models struggle with the irregularities inherent in real-world measurements: discharge readings are often recorded at non-uniform intervals, and the lengths of discharge cycles vary significantly. To address this, most existing approaches extract features from the sequences rather than processing them in full, which introduces information loss and compromises accuracy. To overcome these challenges, we propose a novel architecture: Time-Informed Dynamic Sequence Inverted Transformer (TIDSIT). TIDSIT incorporates continuous time embeddings to effectively represent irregularly sampled data and utilizes padded sequences with temporal attention mechanisms to manage variable-length inputs without discarding sequence information. Experimental results on the NASA battery degradation dataset show that TIDSIT significantly outperforms existing models, achieving over 50% reduction in prediction error and maintaining an SoH prediction error below 0.58%. Furthermore, the architecture is generalizable and holds promise for broader applications in health monitoring tasks involving irregular time-series data.
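
A rough sketch of the two ingredients named above, continuous time embeddings and padded sequences with a mask, might look as follows. The sinusoidal form, the function names, and the default sizes are assumptions chosen for illustration; the abstract does not specify how TIDSIT builds its embeddings.

```python
import math


def time_embedding(t, dim=4, max_period=10000.0):
    """Sinusoidal features of a real-valued timestamp t (dim must be even).

    Because t is a float, irregular sampling intervals are encoded directly
    instead of being replaced by integer positions.
    """
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def pad_with_mask(seqs, pad_value=0.0):
    """Pad variable-length sequences and return a mask (1 = real, 0 = padding)."""
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask


# Hypothetical discharge cycles of different lengths.
cycles = [[4.1, 3.9, 3.6], [4.0]]
padded, mask = pad_with_mask(cycles)
print(padded, mask)

# Irregularly spaced timestamps (seconds) for the first cycle.
print([time_embedding(t) for t in [0.0, 1.5, 9.0]])
```

An attention layer would then use the mask to ignore padded positions, so no readings need to be dropped or summarized into hand-made features.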

[333] Low-rank adaptive physics-informed HyperDeepONets for solving differential equations

Etienne Zeudong, Elsa Cardoso-Bihlo, Alex Bihlo

Main category: cs.LG

TL;DR: PI-LoRA-HyperDeepONets use low-rank adaptation to reduce complexity and improve performance over HyperDeepONets.

Details

Motivation: Address the high memory and computational costs of HyperDeepONets while maintaining expressivity.

Method: Leverage low-rank adaptation (LoRA) to decompose the hypernetwork’s output layer weight matrix into smaller low-rank matrices.

Result: Achieves up to 70% reduction in parameters and outperforms HyperDeepONets in predictive accuracy and generalization.

Conclusion: PI-LoRA-HyperDeepONets offer a more efficient and effective alternative for operator learning in physics-informed settings.

Abstract: HyperDeepONets were introduced in Lee, Cho and Hwang [ICLR, 2023] as an alternative architecture for operator learning, in which a hypernetwork generates the weights for the trunk net of a DeepONet. While this improves expressivity, it incurs high memory and computational costs due to the large number of output parameters required. In this work we introduce, in the physics-informed machine learning setting, a variation, PI-LoRA-HyperDeepONets, which leverage low-rank adaptation (LoRA) to reduce complexity by decomposing the hypernetwork’s output layer weight matrix into two smaller low-rank matrices. This reduces the number of trainable parameters while introducing an extra regularization of the trunk networks’ weights. Through extensive experiments on both ordinary and partial differential equations we show that PI-LoRA-HyperDeepONets achieve up to 70% reduction in parameters and consistently outperform regular HyperDeepONets in terms of predictive accuracy and generalization.
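
To see why factoring the hypernetwork's output layer saves parameters, a back-of-the-envelope count helps. The layer sizes and rank below are hypothetical and are not taken from the paper; biases are ignored.

```python
def dense_params(in_dim, out_dim):
    """Weights in a full output layer W of shape (out_dim, in_dim)."""
    return in_dim * out_dim


def low_rank_params(in_dim, out_dim, rank):
    """Weights in a factorization W ~ B @ A, with B (out_dim, rank) and A (rank, in_dim)."""
    return rank * (in_dim + out_dim)


# Hypothetical sizes: the hypernetwork's last hidden layer has 64 units and
# must emit 10,000 trunk-net weights; the factorization uses rank 8.
hidden, trunk_weights, rank = 64, 10_000, 8

full = dense_params(hidden, trunk_weights)
low = low_rank_params(hidden, trunk_weights, rank)
print(full, low, round(1 - low / full, 3))
```

The saving grows with the number of trunk weights the hypernetwork has to produce, which is where the memory cost of the full layer comes from. The rank also caps the dimension of the subspace in which the generated trunk weights can vary, which is one way to read the regularization effect mentioned in the abstract.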

[334] Efficient Uncertainty in LLMs through Evidential Knowledge Distillation

Lakshmana Sri Harsha Nemani, P. K. Srijith, Tomasz Kuśmierczyk

Main category: cs.LG

TL;DR: The paper introduces a method to efficiently estimate uncertainty in LLMs by distilling uncertainty-aware teacher models into compact student models using LoRA, achieving comparable performance with fewer computations.

Details

Motivation: Standard LLMs struggle with accurate uncertainty quantification, and existing Bayesian/ensemble methods are computationally expensive.

Method: Distill uncertainty-aware teacher models into student models using LoRA, comparing softmax-based and Dirichlet-distributed outputs for uncertainty modeling.

Result: Student models achieve comparable or better predictive and uncertainty performance than teachers, requiring only one forward pass.

Conclusion: Evidential distillation enables efficient and robust uncertainty quantification in LLMs, a first in the field.

Abstract: Accurate uncertainty quantification remains a key challenge for standard LLMs, prompting the adoption of Bayesian and ensemble-based methods. However, such methods typically necessitate computationally expensive sampling, involving multiple forward passes to effectively estimate predictive uncertainty. In this paper, we introduce a novel approach enabling efficient and effective uncertainty estimation in LLMs without sacrificing performance. Specifically, we distill uncertainty-aware teacher models - originally requiring multiple forward passes - into compact student models sharing the same architecture but fine-tuned using Low-Rank Adaptation (LoRA). We compare two distinct distillation strategies: one in which the student employs traditional softmax-based outputs, and another in which the student leverages Dirichlet-distributed outputs to explicitly model epistemic uncertainty via evidential learning. Empirical evaluations on classification datasets demonstrate that such students can achieve comparable or superior predictive and uncertainty quantification performance relative to their teacher models, while critically requiring only a single forward pass. To our knowledge, this is the first demonstration that immediate and robust uncertainty quantification can be achieved in LLMs through evidential distillation.
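
As an illustration of how Dirichlet-distributed outputs can express epistemic uncertainty in a single forward pass, the sketch below uses the common evidential-learning convention (non-negative evidence per class, concentration = evidence + 1). This convention and the function name are assumptions; the paper's exact parameterization may differ.

```python
def dirichlet_summary(evidence):
    """Turn per-class non-negative evidence into probabilities and an uncertainty score.

    alpha_k = evidence_k + 1 are Dirichlet concentrations; the expected class
    probability is alpha_k / sum(alpha); the score K / sum(alpha) is 1 when
    there is no evidence at all and shrinks as evidence accumulates.
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return probs, uncertainty


print(dirichlet_summary([0.0, 0.0, 0.0]))  # no evidence: uniform, maximally uncertain
print(dirichlet_summary([8.0, 0.0, 0.0]))  # strong evidence for class 0
```

A softmax student outputs only the probability vector, whereas a Dirichlet student of this kind also reports how much evidence stands behind it.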

[335] A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges

Xing Hua, Haodong Chen, Qianqian Duan, Danfeng Hong, Ruijiao Li, Huiliang Shang, Linghua Jiang, Haima Yang, Dawei Zhang

Main category: cs.LG

TL;DR: The paper reviews diffusion models’ applications in agriculture, highlighting their advantages over GANs in tasks like data augmentation and image processing, and their potential in smart and precision agriculture.

Details

Motivation: Addressing challenges like limited agricultural data and imbalanced samples, the paper explores how AI, especially diffusion models, can enhance agricultural practices.

Method: The study reviews diffusion models’ use in agriculture, focusing on tasks like pest detection, remote sensing, and crop prediction.

Result: Diffusion models improve accuracy and robustness in data augmentation, image generation, and denoising, outperforming GANs.

Conclusion: Despite computational challenges, diffusion models hold great promise for advancing smart and precision agriculture, supporting global agricultural sustainability.

Abstract: With the global population growing and arable land resources becoming increasingly scarce, smart agriculture and precision agriculture have emerged as key directions for the future of agricultural development. Artificial intelligence (AI) technologies, particularly deep learning models, have found widespread applications in areas such as crop monitoring and pest detection. As an emerging generative model, diffusion models have shown significant promise in tasks like agricultural image processing, data augmentation, and remote sensing. Compared to traditional generative adversarial networks (GANs), diffusion models offer superior training stability and generation quality, effectively addressing challenges such as limited agricultural data and imbalanced image samples. This paper reviews the latest advancements in the application of diffusion models in agriculture, focusing on their potential in crop pest and disease detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Experimental results demonstrate that diffusion models significantly improve model accuracy and robustness in data augmentation, image generation, and denoising, especially in complex environments. Despite challenges related to computational efficiency and generalization capabilities, diffusion models are expected to play an increasingly important role in smart and precision agriculture as technology advances, providing substantial support for the sustainable development of global agriculture.

[336] Multi-Model Ensemble and Reservoir Computing for River Discharge Prediction in Ungauged Basins

Mizuki Funato, Yohei Sawada

Main category: cs.LG

TL;DR: HYPER combines multi-model ensemble and reservoir computing for efficient, accurate flood prediction, especially in data-scarce regions.

Details

Motivation: Addressing the challenge of accurate flood prediction and water management in regions with limited river discharge observations.

Method: Uses Bayesian model averaging (BMA) on 43 uncalibrated hydrological models, then trains a reservoir computing (RC) model via linear regression for error correction. Weights are inferred for ungauged basins using catchment attributes.

Result: HYPER matches LSTM performance in data-rich scenarios (KGE 0.56 vs. 0.55) with 5% computational time. In data-scarce scenarios (23% gauged), HYPER maintains KGE 0.55 while LSTM degrades to -0.04.

Conclusion: HYPER offers a robust, efficient, and generalizable solution for discharge prediction, particularly in ungauged basins.

Abstract: Despite the critical need for accurate flood prediction and water management, many regions lack sufficient river discharge observations, limiting the skill of rainfall-runoff analyses. Although numerous physically based and machine learning models exist, achieving high accuracy, interpretability, and computational efficiency under data-scarce conditions remains a major challenge. We address this challenge with a novel method, HYdrological Prediction with multi-model Ensemble and Reservoir computing (HYPER) that leverages multi-model ensemble and reservoir computing (RC). Our approach first applies Bayesian model averaging (BMA) to 43 “uncalibrated” catchment-based conceptual hydrological models. An RC model is then trained via linear regression to correct errors in the BMA output, a non-iterative process that ensures high computational efficiency. For ungauged basins, we infer the required BMA and RC weights by linking them to catchment attributes from gauged basins, creating a generalizable framework. We evaluated HYPER using data from 87 river basins in Japan. In a data-rich scenario, HYPER (median Kling-Gupta Efficiency, KGE, of 0.56) performed comparably to a benchmark LSTM (KGE 0.55) but required only 5% of its computational time. In a data-scarce scenario (23% of basins gauged), HYPER maintained robust performance (KGE 0.55) and lower uncertainty, whereas the LSTM’s performance degraded significantly (KGE -0.04). These results reveal that individual conceptual hydrological models do not necessarily need to be calibrated when an effectively large ensemble is assembled and combined with machine-learning-based bias correction. HYPER provides a robust, efficient, and generalizable solution for discharge prediction, particularly in ungauged basins, making it applicable to a wide range of regions.
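
The KGE scores quoted above are easier to interpret with the metric written out. The sketch below follows the original Kling-Gupta formulation (correlation, variability ratio, bias ratio); whether the paper uses this variant or a later modification is not stated in the abstract, so treat it as an assumption.

```python
import math


def kge(obs, sim):
    """Kling-Gupta Efficiency: 1 is a perfect match, lower is worse.

    Assumes both series are non-constant and the observed mean is non-zero.
    """
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / n
    r = cov / (std_o * std_s)   # linear correlation
    alpha = std_s / std_o       # variability ratio
    beta = mean_s / mean_o      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical discharge series.
observed = [1.0, 2.0, 3.0, 4.0]
print(kge(observed, observed))                   # perfect simulation
print(kge(observed, [2 * q for q in observed]))  # right timing, doubled magnitude
```

A simulation that gets the timing right but doubles every value already scores below zero under this definition, which gives a sense of how poor the LSTM's reported KGE of -0.04 in the data-scarce scenario is.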

[337] Revisiting Bisimulation Metric for Robust Representations in Reinforcement Learning

Leiji Zhang, Zeyu Wang, Xin Li, Yao-Hui Li

Main category: cs.LG

TL;DR: The paper identifies flaws in the conventional bisimulation metric, proposes a revised version with adaptive coefficients, and validates its effectiveness through theory and experiments.

Details

Motivation: The conventional bisimulation metric fails to represent distinctive scenarios and relies on fixed weights, limiting its adaptability in reinforcement learning tasks.

Method: Introduces a state-action pair measure, refines the reward gap definition, and uses adaptive coefficients in recursive updates.

Result: The revised metric shows improved representation distinctiveness and convergence, validated on DeepMind Control and Meta-World benchmarks.

Conclusion: The proposed bisimulation metric addresses key limitations of the conventional approach, offering better adaptability and performance in reinforcement learning.

Abstract: The bisimulation metric has long been regarded as an effective control-related representation learning technique in various reinforcement learning tasks. However, in this paper, we identify two main issues with the conventional bisimulation metric: 1) an inability to represent certain distinctive scenarios, and 2) a reliance on predefined weights for differences in rewards and subsequent states during recursive updates. We find that the first issue arises from an imprecise definition of the reward gap, whereas the second issue stems from overlooking the varying importance of reward difference and next-state distinctions across different training stages and task settings. To address these issues, by introducing a measure for state-action pairs, we propose a revised bisimulation metric that features a more precise definition of reward gap and novel update operators with adaptive coefficients. We also offer theoretical guarantees of convergence for our proposed metric and its improved representation distinctiveness. In addition to our rigorous theoretical analysis, we conduct extensive experiments on two representative benchmarks, DeepMind Control and Meta-World, demonstrating the effectiveness of our approach.

[338] GLANCE: Graph Logic Attention Network with Cluster Enhancement for Heterophilous Graph Representation Learning

Zhongtian Sun, Anoushka Harit, Alexandra Cristea, Christl A. Donnelly, Pietro Liò

Main category: cs.LG

TL;DR: GLANCE is a novel GNN framework for heterophilous graphs, combining logic-guided reasoning, dynamic refinement, and clustering to improve performance and interpretability.

Details

Motivation: GNNs struggle with heterophilous graphs due to indiscriminate neighbor aggregation and lack of higher-order structural patterns.

Method: GLANCE integrates a logic layer for structured embeddings, attention-based edge pruning, and clustering for global patterns.

Result: GLANCE achieves competitive performance on benchmark datasets (Cornell, Texas, Wisconsin) for heterophilous graphs.

Conclusion: GLANCE is a lightweight, adaptable, and interpretable solution for heterophilous graph challenges.

Abstract: Graph Neural Networks (GNNs) have demonstrated significant success in learning from graph-structured data but often struggle on heterophilous graphs, where connected nodes differ in features or class labels. This limitation arises from indiscriminate neighbor aggregation and insufficient incorporation of higher-order structural patterns. To address these challenges, we propose GLANCE (Graph Logic Attention Network with Cluster Enhancement), a novel framework that integrates logic-guided reasoning, dynamic graph refinement, and adaptive clustering to enhance graph representation learning. GLANCE combines a logic layer for interpretable and structured embeddings, multi-head attention-based edge pruning for denoising graph structures, and clustering mechanisms for capturing global patterns. Experimental results on benchmark datasets, including Cornell, Texas, and Wisconsin, demonstrate that GLANCE achieves competitive performance, offering robust and interpretable solutions for heterophilous graph scenarios. The proposed framework is lightweight, adaptable, and uniquely suited to the challenges of heterophilous graphs.

[339] C2G-KD: PCA-Constrained Generator for Data-Free Knowledge Distillation

Magnus Bengtsson, Kenneth Östberg

Main category: cs.LG

TL;DR: C2G-KD is a data-free knowledge distillation method using a class-conditional generator guided by a teacher model and PCA-derived geometric constraints, achieving effective synthetic training with minimal real examples.

Details

Motivation: To enable knowledge distillation without access to real training data, leveraging synthetic samples generated by a teacher model and geometric constraints.

Method: Train a class-conditional generator using semantic and structural losses, constrained by PCA subspaces from few real examples per class.

Result: Effective synthetic training pipelines are created, even with minimal class structure, as demonstrated on MNIST.

Conclusion: C2G-KD successfully bootstraps synthetic training with limited real data, preserving diversity and topological consistency.

Abstract: We introduce C2G-KD, a data-free knowledge distillation framework where a class-conditional generator is trained to produce synthetic samples guided by a frozen teacher model and geometric constraints derived from PCA. The generator never observes real training data but instead learns to activate the teacher’s output through a combination of semantic and structural losses. By constraining generated samples to lie within class-specific PCA subspaces estimated from as few as two real examples per class, we preserve topological consistency and diversity. Experiments on MNIST show that even minimal class structure is sufficient to bootstrap useful synthetic training pipelines.
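
With only two real examples per class, a centered PCA has a single non-trivial direction: the line through the two points. The sketch below projects a candidate sample onto that line. It is a geometric illustration under that assumption (centered PCA, one component), with a hypothetical function name, and is not the paper's loss or generator.

```python
def project_onto_two_point_subspace(x, a, b):
    """Project x onto the affine line through examples a and b.

    For two points, the class mean is their midpoint and the only principal
    direction is b - a, so this equals projection onto the 1-D PCA subspace.
    """
    mean = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    direction = [bi - ai for ai, bi in zip(a, b)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return mean  # identical examples: the subspace collapses to a point
    coeff = sum((xi - mi) * di for xi, mi, di in zip(x, mean, direction)) / norm_sq
    return [mi + coeff * di for mi, di in zip(mean, direction)]


# Two hypothetical real examples of one class and one generated candidate.
real_a, real_b = [0.0, 0.0], [2.0, 0.0]
candidate = [1.5, 3.0]
print(project_onto_two_point_subspace(candidate, real_a, real_b))  # [1.5, 0.0]
```

The component of the candidate that leaves the class subspace (here the second coordinate) is removed, while its position along the class direction is kept, which is one way to read "constrained to a class-specific subspace while preserving diversity".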

[340] The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection

Steven A. Frank

Main category: cs.LG

TL;DR: The paper introduces a universal FMB law (force-metric-bias) using the Price equation, unifying diverse learning algorithms, optimization methods, and natural selection under a common mathematical framework.

Details

Motivation: To reveal the shared mathematical structure among seemingly disparate learning and optimization processes, enabling a unified understanding and design of algorithms.

Method: The author partitions change using the Price equation, deriving the FMB law: Δθ = Mf + b + ξ, where force (f), metric (M), bias (b), and noise (ξ) describe the dynamics.

Result: The FMB law unifies natural selection, Bayesian updating, Newton’s method, gradient descent, and other algorithms as special cases, explaining the emergence of Fisher information and KL divergence.

Conclusion: The FMB law provides a foundational framework for comparing and designing learning algorithms across disciplines, highlighting their shared underlying structure.

Abstract: Diverse learning algorithms, optimization methods, and natural selection share a common mathematical structure, despite their apparent differences. Here I show that a simple notational partitioning of change by the Price equation reveals a universal force-metric-bias (FMB) law: $\Delta\mathbf{\theta} = \mathbf{M}\,\mathbf{f} + \mathbf{b} + \mathbf{\xi}$. The force $\mathbf{f}$ drives improvement in parameters, $\Delta\mathbf{\theta}$, through the covariance between the parameters and performance. The metric $\mathbf{M}$ rescales movement by inverse curvature. The bias $\mathbf{b}$ adds momentum or changes in the frame of reference. The noise $\mathbf{\xi}$ enables exploration. This framework unifies natural selection, Bayesian updating, Newton’s method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms as special cases of the same underlying process. The Price equation also reveals why Fisher information, Kullback-Leibler divergence, and d’Alembert’s principle arise naturally in learning dynamics. By exposing this common structure, the FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines.
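
One way to read the update rule is as a template in which familiar optimizers differ only in their choice of metric. The sketch below is an interpretation on a toy quadratic loss, with bias and noise set to zero; the function `fmb_step` and the specific mappings (learning rate times identity for gradient descent, inverse Hessian for Newton's method) are assumptions consistent with the abstract, not the paper's derivation.

```python
def fmb_step(theta, metric, force, bias=None, noise=None):
    """One update theta + (M f + b + xi), with M given as a list of rows."""
    n = len(theta)
    bias = bias or [0.0] * n
    noise = noise or [0.0] * n
    delta = [
        sum(metric[i][j] * force[j] for j in range(n)) + bias[i] + noise[i]
        for i in range(n)
    ]
    return [t + d for t, d in zip(theta, delta)]


# Toy loss L(theta) = 0.5 * (1 * theta0^2 + 4 * theta1^2); its Hessian is diag(1, 4).
curvature = [1.0, 4.0]
theta = [2.0, 1.0]
force = [-h * t for h, t in zip(curvature, theta)]  # force = negative gradient

# Gradient descent: metric = learning rate times the identity.
gd_metric = [[0.1, 0.0], [0.0, 0.1]]
print(fmb_step(theta, gd_metric, force))

# Newton's method: metric = inverse Hessian, i.e. rescaling by inverse curvature.
newton_metric = [[1.0 / curvature[0], 0.0], [0.0, 1.0 / curvature[1]]]
print(fmb_step(theta, newton_metric, force))  # [0.0, 0.0]
```

On a quadratic loss the Newton choice of metric reaches the minimum in a single step, while the gradient-descent choice only moves part of the way. A non-zero bias term would add momentum and a non-zero noise term would add exploration.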

[341] The Geometry of LLM Quantization: GPTQ as Babai’s Nearest Plane Algorithm

Jiale Chen, Torsten Hoefler, Dan Alistarh

Main category: cs.LG

TL;DR: GPTQ, a method for quantizing LLM weights, is mathematically equivalent to Babai’s nearest plane algorithm for CVP, providing geometric insights and error bounds.

Details

Motivation: To understand the theoretical foundations of GPTQ and its equivalence to classical lattice algorithms for better quantization methods.

Method: Mathematical analysis showing GPTQ’s equivalence to Babai’s algorithm for CVP on a lattice defined by the Hessian matrix.

Result: GPTQ gains geometric interpretation and inherits error bounds from Babai’s algorithm.

Conclusion: The equivalence places GPTQ on solid theoretical ground and suggests leveraging lattice algorithms for future quantization techniques.

Abstract: Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale. Yet, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure any geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai’s nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer’s inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: (i) the GPTQ error propagation step gains an intuitive geometric interpretation; (ii) GPTQ inherits the error upper bound of Babai’s algorithm under the no-clipping condition. Taken together, these results place GPTQ on firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.
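
For readers unfamiliar with the lattice side of this equivalence, here is a minimal sketch of the textbook Babai nearest plane algorithm on a generic basis. It does not construct the Hessian-derived lattice or reproduce GPTQ; the basis and target are made-up numbers, and Python's `round` (ties to even) stands in for nearest-integer rounding.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis):
    """Orthogonalize the basis vectors in order, without normalizing."""
    ortho = []
    for b in basis:
        v = list(b)
        for u in ortho:
            mu = dot(b, u) / dot(u, u)
            v = [vi - mu * ui for vi, ui in zip(v, u)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis, target):
    """Return integer coefficients and the lattice point chosen for target.

    Works from the last basis vector to the first: pick the nearest hyperplane
    along the current Gram-Schmidt direction, subtract, and recurse.
    """
    ortho = gram_schmidt(basis)
    residual = list(target)
    coeffs = [0] * len(basis)
    for i in reversed(range(len(basis))):
        c = round(dot(residual, ortho[i]) / dot(ortho[i], ortho[i]))
        coeffs[i] = c
        residual = [r - c * b for r, b in zip(residual, basis[i])]
    dim = len(target)
    point = [sum(c * b[k] for c, b in zip(coeffs, basis)) for k in range(dim)]
    return coeffs, point


# Hypothetical 2-D lattice basis (rows) and target vector.
basis = [[2, 0], [1, 3]]
print(babai_nearest_plane(basis, [2.6, 3.4]))  # ([1, 1], [3, 3])
```

The loop fixes one integer coefficient at a time and pushes the remaining error onto the coordinates still to be decided. This mirrors, at a high level, the role the abstract assigns to GPTQ's error propagation step when run from the last dimension to the first.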

[342] Beyond Internal Data: Constructing Complete Datasets for Fairness Testing

Varsha Ramineni, Hossein A. Rahmani, Emine Yilmaz, David Barber

Main category: cs.LG

TL;DR: The paper addresses the challenge of testing AI fairness without complete demographic data by proposing synthetic data construction from overlapping datasets, validated for accuracy and consistency with real data.

Details

Motivation: The rise of AI in high-risk domains necessitates fairness testing, but legal and privacy concerns limit access to demographic data, making it difficult to assess biases.

Method: The authors propose using separate overlapping datasets to create synthetic data with demographic information, ensuring it reflects real-world relationships. This synthetic data is validated against real data.

Result: Fairness metrics from synthetic data are consistent with those from real data, supporting its use for fairness testing.

Conclusion: Synthetic data offers a practical solution to data scarcity for fairness testing, enabling independent and model-agnostic evaluations.

Abstract: As AI becomes prevalent in high-risk domains and decision-making, it is essential to test for potential harms and biases. This urgency is reflected by the global emergence of AI regulations that emphasise fairness and adequate testing, with some mandating independent bias audits. However, procuring the necessary data for fairness testing remains a significant challenge. Particularly in industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. Further, internal historical datasets are often insufficiently representative to identify real-world biases. This work focuses on evaluating classifier fairness when complete datasets including demographics are inaccessible. We propose leveraging separate overlapping datasets to construct complete synthetic data that includes demographic information and accurately reflects the underlying relationships between protected attributes and model features. We validate the fidelity of the synthetic data by comparing it to real data, and empirically demonstrate that fairness metrics derived from testing on such synthetic data are consistent with those obtained from real data. This work, therefore, offers a path to overcome real-world data scarcity for fairness testing, enabling independent, model-agnostic evaluation of fairness, and serving as a viable substitute where real data is limited.

[343] Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden Weights

Jun’ichi Takeuchi, Yoshinari Takeishi, Noboru Murata, Kazushi Mimura, Ka Long Keith Ho, Hiroshi Nagaoka

Main category: cs.LG

TL;DR: The paper explores the relationship between Fisher information matrices and neural tangent kernels (NTK) in 2-layer ReLU networks with random hidden weights, providing spectral analysis and an approximation formula.

Details

Motivation: To understand the connection between Fisher information matrices and NTK in 2-layer ReLU networks, and to analyze their spectral properties.

Method: The study involves linear transformation analysis, spectral decomposition of NTK, and derivation of eigenfunctions with major eigenvalues.

Result: Concrete forms of eigenfunctions for NTK are identified, and an approximation formula for functions represented by 2-layer networks is derived.

Conclusion: The work clarifies the relationship between Fisher information and NTK, offering insights into the spectral structure and function approximation in 2-layer ReLU networks.

Abstract: We study Fisher information matrices and neural tangent kernels (NTK) for 2-layer ReLU networks with random hidden weights. We describe the relation between the two notions as a linear transformation and give the spectral decomposition of the NTK, with concrete forms of the eigenfunctions with major eigenvalues. We also obtain an approximation formula for the functions represented by the 2-layer neural networks.

[344] Linear Memory SE(2) Invariant Attention

Ethan Pronovost, Neha Boloor, Peter Schleede, Noureldin Hendy, Andres Morales, Nicholas Roy

Main category: cs.LG

TL;DR: Proposes a linear-memory SE(2) invariant transformer for spatial data in autonomous driving, improving performance over non-invariant methods.

Details

Motivation: Prior SE(2) invariant methods require quadratic memory for relative poses, limiting scalability.

Method: Introduces SE(2) invariant scaled dot-product attention with linear memory usage.

Result: Demonstrates practical implementation and performance gains over non-invariant architectures.

Conclusion: The SE(2) invariant transformer is scalable and effective for spatial data tasks in autonomous driving.

Abstract: Processing spatial data is a key component in many learning tasks for autonomous driving such as motion forecasting, multi-agent simulation, and planning. Prior works have demonstrated the value in using SE(2) invariant network architectures that consider only the relative poses between objects (e.g. other agents, scene features such as traffic lanes). However, these methods compute the relative poses for all pairs of objects explicitly, requiring quadratic memory. In this work, we propose a mechanism for SE(2) invariant scaled dot-product attention that requires linear memory relative to the number of objects in the scene. Our SE(2) invariant transformer architecture enjoys the same scaling properties that have benefited large language models in recent years. We demonstrate experimentally that our approach is practical to implement and improves performance compared to comparable non-invariant architectures.
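
The invariance property at stake can be checked numerically: relative poses between objects do not change when the whole scene is rotated and translated. The sketch below shows the explicit pairwise computation that prior methods rely on, not the paper's linear-memory mechanism; function names and the pose values are illustrative.

```python
import math


def relative_pose(pose_i, pose_j):
    """Pose of object j expressed in the frame of object i; poses are (x, y, heading)."""
    xi, yi, ti = pose_i
    xj, yj, tj = pose_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Rotate the offset by -heading_i to express it in i's frame.
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    dtheta = math.atan2(math.sin(tj - ti), math.cos(tj - ti))  # wrapped to (-pi, pi]
    return (rx, ry, dtheta)


def transform(pose, g):
    """Apply a global SE(2) transform g = (tx, ty, angle) to a pose."""
    gx, gy, gt = g
    x, y, t = pose
    c, s = math.cos(gt), math.sin(gt)
    return (gx + c * x - s * y, gy + s * x + c * y, t + gt)


# Two hypothetical agents and an arbitrary change of the global frame.
a, b = (1.0, 2.0, 0.3), (4.0, -1.0, 1.1)
g = (10.0, -5.0, 0.7)
print(relative_pose(a, b))
print(relative_pose(transform(a, g), transform(b, g)))  # same up to rounding
```

Storing `relative_pose` for every pair of N objects takes N squared entries, which is the quadratic memory cost the abstract refers to.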

[345] Demystify Protein Generation with Hierarchical Conditional Diffusion Models

Zinan Ling, Yi Shi, Da Yan, Yang Zhou, Bo Hui

Main category: cs.LG

TL;DR: A multi-level conditional diffusion model integrates sequence and structure data for functional protein generation, introducing Protein-MMD for evaluation.

Details

Motivation: Reliable protein generation remains challenging, especially with conditional diffusion models, due to the multi-level nature of protein function.

Method: Proposes a multi-level conditional diffusion model combining sequence and structure data for end-to-end protein design, with a new evaluation metric, Protein-MMD.

Result: The framework effectively models hierarchical protein relations, and Protein-MMD reliably evaluates generated proteins.

Conclusion: The method improves conditional protein generation and evaluation, demonstrating efficacy on benchmark datasets.

Abstract: Generating novel and functional protein sequences is critical to a wide range of applications in biology. Recent advancements in conditional diffusion models have shown impressive empirical performance in protein generation tasks. However, reliable protein generation remains an open research question in de novo protein design, especially when it comes to conditional diffusion models. Considering that the biological function of a protein is determined by multi-level structures, we propose a novel multi-level conditional diffusion model that integrates both sequence-based and structure-based information for efficient end-to-end protein design guided by specified functions. By generating representations at different levels simultaneously, our framework can effectively model the inherent hierarchical relations between different levels, resulting in an informative and discriminative representation of the generated protein. We also propose Protein-MMD, a new reliable evaluation metric, to evaluate the quality of proteins generated with conditional diffusion models. Our new metric is able to capture both distributional and functional similarities between real and generated protein sequences while ensuring conditional consistency. We experiment on benchmark datasets, and the results on conditional protein generation tasks demonstrate the efficacy of the proposed generation framework and evaluation metric.
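
For context on the metric's name, maximum mean discrepancy (MMD) compares two sets of samples through a kernel. The sketch below is a generic biased MMD estimate with an RBF kernel on plain vectors; how Protein-MMD embeds sequences and enforces conditional consistency is not described in the abstract, so the inputs here are hypothetical embeddings.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D embeddings of real and generated proteins.
real = [[0.0, 0.0], [1.0, 0.0]]
close = [[0.1, 0.0], [1.0, 0.1]]
far = [[5.0, 5.0], [6.0, 5.0]]
print(mmd2(real, close), mmd2(real, far))
```

A small value means the generated set is distributed like the real set in embedding space; the score grows as the two distributions move apart.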

[346] Gait Recognition Based on Tiny ML and IMU Sensors

Jiahang Zhang, Mingtong Chen, Zhengbao Yang

Main category: cs.LG

TL;DR: A gait recognition system using Tiny ML and IMU sensors achieves over 80% accuracy in classifying four activities, with low-power operation suitable for battery-powered devices.

Details

Motivation: To develop a real-time, low-power gait recognition system for activity classification using Tiny ML and IMU sensors.

Method: Uses XIAO-nRF52840 Sense microcontroller and LSM6DS3 IMU sensor to capture motion data, processed via Edge Impulse for feature extraction and DNN training.

Result: Over 80% accuracy in classifying walking, stationary, upstairs, and downstairs activities, with anomaly detection capability.

Conclusion: The system effectively classifies activities with high accuracy and low power consumption, suitable for edge applications.

Abstract: This project presents the development of a gait recognition system using Tiny Machine Learning (Tiny ML) and Inertial Measurement Unit (IMU) sensors. The system leverages the XIAO-nRF52840 Sense microcontroller and the LSM6DS3 IMU sensor to capture motion data, including acceleration and angular velocity, from four distinct activities: walking, stationary, going upstairs, and going downstairs. The data collected is processed through Edge Impulse, an edge AI platform, which enables the training of machine learning models that can be deployed directly onto the microcontroller for real-time activity classification. The data preprocessing step involves extracting relevant features from the raw sensor data using techniques such as sliding windows and data normalization, followed by training a Deep Neural Network (DNN) classifier for activity recognition. The model achieves over 80% accuracy on a test dataset, demonstrating its ability to classify the four activities effectively. Additionally, the platform enables anomaly detection, further enhancing the robustness of the system. The integration of Tiny ML ensures low-power operation, making it suitable for battery-powered or energy-harvesting devices.
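
The preprocessing described above (sliding windows followed by normalization) can be sketched generically as follows. This is not Edge Impulse's implementation; the window size, step, and z-score normalization are assumptions chosen for illustration.

```python
def sliding_windows(samples, size, step):
    """Split a 1-D signal into overlapping windows of length size."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


def zscore(window):
    """Normalize a window to zero mean and unit variance (all zeros if constant)."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = var ** 0.5
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


# Hypothetical accelerometer readings along one axis.
signal = [0, 1, 2, 3, 4, 5]
windows = sliding_windows(signal, size=4, step=2)
print(windows)  # [[0, 1, 2, 3], [2, 3, 4, 5]]
print([zscore(w) for w in windows])
```

Each normalized window would then be passed to the classifier as one input. Overlapping windows yield more training examples from the same recording.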

[347] A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Yefeng Yuan, Yuhong Liu, Liang Cheng

Main category: cs.LG

TL;DR: SynEval is an open-source framework for evaluating synthetic tabular data’s fidelity, utility, and privacy, tested on ChatGPT, Claude, and Llama-generated product reviews.

Details

Motivation: Addressing the lack of a comprehensive evaluation framework for synthetic data quality and privacy concerns in generative AI.

Method: Introduces SynEval, a framework with diverse metrics to assess synthetic data, validated on product reviews from ChatGPT, Claude, and Llama.

Result: Highlights trade-offs between evaluation metrics and demonstrates SynEval’s effectiveness in assessing synthetic data.

Conclusion: SynEval aids researchers and practitioners in evaluating synthetic data suitability while prioritizing privacy.

Abstract: The rapid advancements in generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data, particularly in the realm of structured tabular formats, such as product reviews. Despite the potential benefits, concerns regarding privacy leakage have surfaced, especially when personal information is utilized in the training datasets. In addition, there is an absence of a comprehensive evaluation framework capable of quantitatively measuring the quality of the generated synthetic data and their utility for downstream tasks. In response to this gap, we introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data via a suite of diverse evaluation metrics. We validate the efficacy of SynEval by applying it to synthetic product review data generated by three state-of-the-art LLMs: ChatGPT, Claude, and Llama. Our experimental findings illuminate the trade-offs between various evaluation metrics in the context of synthetic data generation. Furthermore, SynEval stands as a critical instrument for researchers and practitioners engaged with synthetic tabular data, empowering them to judiciously determine the suitability of the generated data for their specific applications, with an emphasis on upholding user privacy.

[348] Analyzing Fairness of Computer Vision and Natural Language Processing Models

Ahmed Rashed, Abdelkrim Kallich, Mohamed Eltayeb

Main category: cs.LG

TL;DR: The paper compares Fairlearn and AIF360 fairness libraries for bias mitigation in ML, focusing on CV and NLP models, showing sequential application improves fairness without compromising performance.

Details

Motivation: Addressing fairness and bias concerns in ML systems across domains like healthcare and finance.

Method: Uses Fairlearn and AIF360 to evaluate and mitigate bias in unstructured datasets via CV and NLP models, testing algorithms individually and sequentially across ML lifecycle stages.

Result: Sequential application of mitigation algorithms reduces bias effectively while maintaining model performance.

Conclusion: Fairness libraries like Fairlearn and AIF360 can enhance ML fairness, especially when mitigation strategies are applied sequentially.

Abstract: Machine learning (ML) algorithms play a critical role in decision-making across various domains, such as healthcare, finance, education, and law enforcement. However, concerns about fairness and bias in these systems have raised significant ethical and social challenges. To address these challenges, this research utilizes two prominent fairness libraries, Fairlearn by Microsoft and AIF360 by IBM. These libraries offer comprehensive frameworks for fairness analysis, providing tools to evaluate fairness metrics, visualize results, and implement bias mitigation algorithms. The study focuses on assessing and mitigating biases for unstructured datasets using Computer Vision (CV) and Natural Language Processing (NLP) models. The primary objective is to present a comparative analysis of the performance of mitigation algorithms from the two fairness libraries. This analysis involves applying the algorithms individually, one at a time, in one of the stages of the ML lifecycle, pre-processing, in-processing, or post-processing, as well as sequentially across more than one stage. The results reveal that some sequential applications improve the performance of mitigation algorithms by effectively reducing bias while maintaining the model’s performance. Publicly available datasets from Kaggle were chosen for this research, providing a practical context for evaluating fairness in real-world machine learning workflows.
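As a rough illustration of the kind of fairness metric such libraries report, the sketch below computes a demographic parity difference (the gap between the highest and lowest positive-prediction rate across groups) in plain NumPy. Whether the paper used this particular metric is an assumption, and the predictions and group labels are made up for the example.

```python
import numpy as np

def selection_rates(y_pred, group):
    """Fraction of positive predictions within each group."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    return {str(g): float(y_pred[group == g].mean()) for g in np.unique(group)}

def demographic_parity_difference(y_pred, group):
    """Largest gap in selection rate between any two groups (0 = parity)."""
    rates = selection_rates(y_pred, group)
    return max(rates.values()) - min(rates.values())

# Hypothetical binary predictions for two groups of four samples each.
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
gap = demographic_parity_difference(y_pred, group)  # 0.75 - 0.25 = 0.5
```

A mitigation algorithm would aim to shrink this gap while keeping accuracy roughly unchanged.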

[349] DualXDA: Towards Sparse, Efficient and Explainable Data Attribution in Large AI Models

Galip Ümit Yolcu, Moritz Weckbecker, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Main category: cs.LG

TL;DR: DualXDA introduces a sparse, efficient framework for Data Attribution (DA) in AI, combining DualDA for fast, sparse attributions and XDA for feature-based explanations, significantly improving speed and transparency.

Details

Motivation: Addressing the opacity of deep learning decisions and the inefficiency of current DA methods, DualXDA aims to provide scalable, explainable AI solutions.

Method: DualXDA combines DualDA (leveraging SVM theory for efficient, sparse attributions) and XDA (enhancing DA with feature attribution insights).

Result: DualDA improves explanation time by up to 4,100,000× vs. Influence Functions and 11,000× vs. literature approximations, while maintaining high attribution quality.

Conclusion: DualXDA enables scalable, transparent AI analysis, fostering accountable systems for large neural architectures.

Abstract: Deep learning models achieve remarkable performance, yet their decision-making processes often remain opaque. In response, the field of eXplainable Artificial Intelligence (XAI) has grown significantly over the last decade, primarily focusing on feature attribution methods. Complementing this perspective, Data Attribution (DA) has emerged as a promising paradigm that shifts the focus from features to data provenance. However, existing DA approaches suffer from prohibitively high computational costs and memory demands. Additionally, current attribution methods exhibit low sparsity, hindering the discovery of decisive patterns in the data. We introduce DualXDA, a framework for sparse, efficient and explainable DA, comprised of two interlinked approaches for Dual Data Attribution (DualDA) and eXplainable Data Attribution (XDA): With DualDA, we propose efficient and effective DA, leveraging Support Vector Machine theory to provide fast and naturally sparse data attributions for AI predictions. We demonstrate that DualDA achieves high attribution quality, excels at solving a series of evaluated downstream tasks, while at the same time improving explanation time by a factor of up to 4,100,000$\times$ compared to the original Influence Functions method, and up to 11,000$\times$ compared to the method’s most efficient approximation from literature. We further introduce XDA, a method for enhancing Data Attribution with capabilities from feature attribution methods to explain why training samples are relevant for the prediction of a test sample in terms of impactful features. Taken together, our contributions in DualXDA ultimately point towards a future of eXplainable AI applied at unprecedented scale, enabling transparent, efficient and novel analysis of even the largest neural architectures fostering a new generation of accountable AI systems. Code at https://github.com/gumityolcu/DualXDA.

[350] Unsupervised Concept Drift Detection from Deep Learning Representations in Real-time

Salvatore Greco, Bartolomeo Vacchetti, Daniele Apiletti, Tania Cerquitelli

Main category: cs.LG

TL;DR: DriftLens is an unsupervised framework for real-time concept drift detection and characterization, outperforming existing methods in accuracy, speed, and drift explanation.

Details

Motivation: Existing drift detection methods are often supervised or lack accuracy and efficiency, making them unsuitable for real-world scenarios where labels are unavailable.

Method: DriftLens uses distribution distances in deep learning representations for efficient, accurate drift detection and characterizes drift by analyzing its impact on each label.

Result: DriftLens outperforms previous methods in 15/17 use cases, runs at least 5x faster, produces drift curves that align closely with actual drift (correlation ≥0.85), and identifies representative drift samples as explanations.

Conclusion: DriftLens addresses key limitations of current drift detection methods, offering a reliable, efficient, and interpretable solution for real-world applications.

Abstract: Concept drift is the phenomenon in which the underlying data distributions and statistical properties of a target domain change over time, leading to a degradation in model performance. Consequently, production models require continuous drift detection monitoring. Most drift detection methods to date are supervised, relying on ground-truth labels. However, they are inapplicable in many real-world scenarios, as true labels are often unavailable. Although recent efforts have proposed unsupervised drift detectors, many lack the accuracy required for reliable detection or are too computationally intensive for real-time use in high-dimensional, large-scale production environments. Moreover, they often fail to characterize or explain drift effectively. To address these limitations, we propose DriftLens, an unsupervised framework for real-time concept drift detection and characterization. Designed for deep learning classifiers handling unstructured data, DriftLens leverages distribution distances in deep learning representations to enable efficient and accurate detection. Additionally, it characterizes drift by analyzing and explaining its impact on each label. Our evaluation across classifiers and data-types demonstrates that DriftLens (i) outperforms previous methods in detecting drift in 15/17 use cases; (ii) runs at least 5 times faster; (iii) produces drift curves that align closely with actual drift (correlation $\geq 0.85$); (iv) effectively identifies representative drift samples as explanations.
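To make "distribution distances in deep learning representations" concrete, the sketch below fits a Gaussian to a reference batch of embeddings and to a new batch, then compares them with the squared Fréchet distance. The abstract does not name the distance, so using the Fréchet distance here is an assumption, and the embeddings are synthetic.

```python
import numpy as np

def gaussian_fit(embeddings):
    """Mean vector and covariance matrix of a batch of embeddings."""
    x = np.asarray(embeddings, dtype=float)
    return x.mean(axis=0), np.cov(x, rowvar=False)

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Frechet (2-Wasserstein) distance between two Gaussians."""
    diff = mu1 - mu2
    # tr((cov1 cov2)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov1 @ cov2, which are real and non-negative.
    eigvals = np.linalg.eigvals(cov1 @ cov2).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)

# Synthetic 8-dimensional "embeddings": one batch without drift, one shifted.
rng = np.random.default_rng(0)
reference = gaussian_fit(rng.normal(size=(500, 8)))
no_drift = gaussian_fit(rng.normal(size=(500, 8)))
drifted = gaussian_fit(rng.normal(loc=1.0, size=(500, 8)))

d_same = frechet_distance(*reference, *no_drift)
d_drift = frechet_distance(*reference, *drifted)
```

In a monitoring loop, a distance above a threshold calibrated on drift-free windows would flag drift without needing labels.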

[351] Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee

Main category: cs.LG

TL;DR: The paper introduces a method called Random Sampling Knowledge Distillation (RSKD) to address biases in sparse knowledge distillation, enabling efficient training of student models with minimal overhead.

Details

Motivation: Existing sparse knowledge distillation methods, like caching Top-K probabilities, introduce biases in teacher probability distribution estimates, leading to suboptimal performance and calibration.

Method: Proposes RSKD, an importance-sampling-based method that provides unbiased estimates, preserves gradients in expectation, and uses sparser logits.

Result: RSKD enables faster student model training with <10% overhead, maintaining competitive performance across model sizes (300M to 3B).

Conclusion: RSKD is an effective solution for sparse knowledge distillation, balancing efficiency and performance.

Abstract: Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method, "Random Sampling Knowledge Distillation", which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
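The bias argument can be illustrated with a small sketch. Truncating to the Top-K probabilities and renormalizing always zeroes the tail, whereas sampling token ids from the teacher distribution and storing normalized counts gives a sparse target whose expectation equals the teacher distribution. Sampling directly from the teacher (so all importance weights equal one) is an assumption made for simplicity; the paper's exact proposal distribution is not given in the abstract.

```python
import numpy as np

def top_k_target(probs, k):
    """Keep the k largest teacher probabilities and renormalize (biased)."""
    idx = np.argsort(probs)[-k:]
    target = np.zeros_like(probs)
    target[idx] = probs[idx]
    return target / target.sum()

def sampled_target(probs, n_samples, rng):
    """Sample token ids from the teacher and store normalized counts.

    The expectation of the result equals `probs`, and at most
    `n_samples` entries are non-zero.
    """
    ids = rng.choice(len(probs), size=n_samples, p=probs)
    counts = np.bincount(ids, minlength=len(probs))
    return counts / n_samples

# Hypothetical teacher distribution over a 50-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=50)
teacher = np.exp(logits) / np.exp(logits).sum()

biased = top_k_target(teacher, k=8)
unbiased_avg = np.mean([sampled_target(teacher, 8, rng) for _ in range(5000)], axis=0)
```

Averaged over many draws, the sampled targets approach the teacher distribution, while the Top-K target stays systematically off no matter how often it is computed.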

[352] LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li

Main category: cs.LG

TL;DR: LagKV is a KV cache compression method for Large Language Models that avoids attention weight reliance, offering easy integration and competitive performance.

Details

Motivation: The KV cache size in long-context inference is costly; existing methods require infrastructure changes and add overhead.

Method: LagKV uses straightforward KV comparisons, avoiding attention weights, for compression.

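Result: Outperforms SnapKV and StreamingLLM on the RULER benchmark at different compression ratios; on the 64-digit passkey retrieval task, it beats the attention-weight-based H2O by over 50% at the same compression ratios.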

Conclusion: LagKV provides a simpler, effective KV compression solution with minimal overhead.

Abstract: The growing size of the Key-Value (KV) cache during long-context inference is the main obstacle to balancing deployment cost and task accuracy in Large Language Models. To reduce the KV cache size in such scenarios, most previous efforts rely on attention weights to evict non-critical cache tokens. But those methods carry a trade-off: they usually require major modification of the inference infrastructure and incur significant computation overhead. Based on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparisons among the KV entries themselves. It is a completely attention-free method that integrates easily into mainstream inference platforms and performs comparably to more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.

[353] Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

Ziming Yu, Pan Zhou, Sike Wang, Jia Li, Mi Tian, Hua Huang

Main category: cs.LG

TL;DR: SubZero, a random subspace zeroth-order optimization method, reduces memory usage and improves fine-tuning performance for large language models (LLMs) compared to traditional zeroth-order approaches.

Details

Motivation: The high memory demands of backpropagation for large LLMs and the inefficiency of traditional zeroth-order methods due to high variance in gradient estimates.

Method: Proposes SubZero, a low-rank perturbation technique for LLMs, reducing memory consumption and improving gradient estimation.

Result: SubZero achieves better fine-tuning performance, lower variance, and faster convergence than standard zeroth-order methods like MeZO.

Conclusion: SubZero is an efficient and scalable solution for fine-tuning LLMs, balancing memory efficiency and performance.

Abstract: Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of gradient estimates typically scales linearly with the model’s parameter dimension, a significant issue for LLMs. In this paper, we propose the random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs’ high dimensionality. We introduce a low-rank perturbation tailored for LLMs that significantly reduces memory consumption while improving training performance. Additionally, we prove that our gradient estimation closely approximates the backpropagation gradient, exhibits lower variance than traditional ZO methods, and ensures convergence when combined with SGD. Experimental results show that SubZero enhances fine-tuning performance and achieves faster convergence compared to standard ZO approaches like MeZO across various language modeling tasks. Code is available at https://github.com/zimingyy/SubZero.
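A two-point zeroth-order gradient estimate with a low-rank perturbation can be sketched as below. The construction of the perturbation (orthonormal factors from a QR decomposition with a small Gaussian core), the rank, the step sizes, and the toy quadratic loss are all assumptions for illustration; the paper's actual procedure may differ.

```python
import numpy as np

def low_rank_perturbation(shape, rank, rng):
    """Random perturbation U @ S @ V.T with orthonormal U, V and Gaussian S."""
    m, n = shape
    u, _ = np.linalg.qr(rng.normal(size=(m, rank)))  # (m, rank), orthonormal columns
    v, _ = np.linalg.qr(rng.normal(size=(n, rank)))  # (n, rank), orthonormal columns
    s = rng.normal(size=(rank, rank))
    return u @ s @ v.T

def zo_gradient(loss_fn, weights, rank, eps, rng):
    """Two-point estimate using only forward evaluations of the loss."""
    z = low_rank_perturbation(weights.shape, rank, rng)
    slope = (loss_fn(weights + eps * z) - loss_fn(weights - eps * z)) / (2.0 * eps)
    return slope * z

# Toy problem: pull a 6x4 weight matrix toward a fixed target with ZO-SGD.
rng = np.random.default_rng(0)
target = rng.normal(size=(6, 4))
loss = lambda w: 0.5 * np.sum((w - target) ** 2)

weights = np.zeros((6, 4))
for _ in range(200):
    weights -= 0.05 * zo_gradient(loss, weights, rank=2, eps=1e-3, rng=rng)
```

No backward pass is stored: each step costs two forward evaluations, and the perturbation needs only the small factors rather than a full-size random matrix.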

[354] On Leveraging Unlabeled Data for Concurrent Positive-Unlabeled Classification and Robust Generation

Bing Yu, Ke Sun, He Wang, Zhouchen Lin, Zhanxing Zhu

Main category: cs.LG

TL;DR: A novel framework combines PU classification and conditional generation to leverage unlabeled data, improving classifier performance and generation quality.

Details

Motivation: Addressing the scarcity of labeled data by exploiting unlabeled data through joint PU classification and conditional generation.

Method: Proposes a Classifier-Noise-Invariant Conditional GAN (CNI-CGAN) robust to noisy labels, and uses PU classifier predictions to aid generation.

Result: Theoretical proof of CNI-CGAN’s optimality and experimental validation on diverse datasets.

Conclusion: The framework effectively enhances PU classification and generation by leveraging unlabeled data.

Abstract: The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems. While abundant unlabeled data typically exist and provide a potential solution, it is highly challenging to exploit them. In this paper, we address this problem by leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data simultaneously. We present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Classifier-Noise-Invariant Conditional GAN (CNI-CGAN) that is robust to noisy labels, 2) leveraging extra data with predicted labels from a PU classifier to help the generation. Theoretically, we prove the optimal condition of CNI-CGAN; experimentally, we conduct extensive evaluations on diverse datasets.

[355] GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

Zhijie Wang, Zixin Xu, Zhiyuan Pan

Main category: cs.LG

TL;DR: GCC-Spam is a novel framework for spam-text detection that uses character similarity networks, contrastive learning, and GANs to address adversarial strategies and data scarcity, outperforming baselines with fewer labeled examples.

Details

Motivation: The rise of spam text poses risks like information leakage and social instability, requiring robust detection methods to counter adversarial tactics and limited labeled data.

Method: GCC-Spam integrates a character similarity network for orthographic/phonetic features, contrastive learning for better discriminability, and GANs to generate pseudo-spam samples.

Result: The model achieves higher detection rates with fewer labeled examples, outperforming baseline methods in real-world datasets.

Conclusion: GCC-Spam effectively tackles spam detection challenges, offering improved accuracy and robustness against adversarial attacks and data scarcity.

Abstract: The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.

[356] Generalizing Adam to Manifolds for Efficiently Training Transformers

Benedikt Brantner

Main category: cs.LG

TL;DR: The paper introduces a method to generalize the Adam optimizer to manifolds by leveraging their global tangent space representation, achieving faster training for neural networks with orthogonality constraints.

Details

Motivation: Adam optimizer is widely used but hard to interpret and generalize to manifolds, limiting its application in constrained optimization problems like neural networks with orthogonality constraints.

Method: The approach uses the global tangent space representation of homogeneous manifolds (e.g., Stiefel, symplectic Stiefel, Grassmann) to generalize Adam without requiring projections.

Result: The generalized Adam optimizer is successfully applied to train a transformer with orthogonality constraints, yielding significant training speed-ups.

Conclusion: The method provides a full generalization of Adam to manifolds, enabling efficient optimization for constrained neural networks.

Abstract: One of the primary reasons behind the success of neural networks has been the emergence of an array of new, highly-successful optimizers, perhaps most importantly the Adam optimizer. It is widely used for training neural networks, yet notoriously hard to interpret. Lacking a clear physical intuition, Adam is difficult to generalize to manifolds. Some attempts have been made to directly apply parts of the Adam algorithm to manifolds or to find an underlying structure, but a full generalization has remained elusive. In this work a new approach is presented that leverages the special structure of the manifolds which are relevant for optimization of neural networks, such as the Stiefel manifold, the symplectic Stiefel manifold and the Grassmann manifold: all of these are homogeneous spaces and as such admit a global tangent space representation - a common vector space (Lie subspace) in which all tangent spaces can easily be represented. This global tangent space representation is used to perform all of the steps in the Adam optimizer and we are able to fully generalize the optimizer to manifolds without a projection step. The resulting algorithm is then applied to train a transformer for which orthogonality constraints are enforced up to machine precision and we observe significant speed-ups in the training process.

[357] Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Xinran Li, Xiujuan Xu, Jiaqi Qiao

Main category: cs.LG

TL;DR: The paper introduces LSDGNN, a multimodal approach for ERC, using long- and short-distance graph networks with a Differential Regularizer and BiAffine Module for feature interaction. It also proposes ICL to handle data imbalance, showing superior performance on IEMOCAP and MELD datasets.

Details

Motivation: ERC is challenging; existing methods lack effective multimodal feature extraction and handling of data imbalance.

Method: LSDGNN uses DAG-based long- and short-distance graph networks, a Differential Regularizer, BiAffine Module, and ICL for balanced training.

Result: Outperforms benchmarks on IEMOCAP and MELD datasets.

Conclusion: LSDGNN effectively addresses ERC challenges with multimodal features and balanced learning, achieving state-of-the-art results.

Abstract: Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a “weighted emotional shift” metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.

[358] Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning

Constantin Philippenko, Aymeric Dieuleveut

Main category: cs.LG

TL;DR: The paper investigates the impact of compression on stochastic gradient algorithms in machine learning, focusing on convergence rates and extending results to federated learning.

Details

Motivation: To understand how different unbiased compression operators affect convergence rates in stochastic gradient algorithms, particularly in distributed and federated learning settings.

Method: Analyzes a stochastic approximation algorithm for minimizing quadratic functions, considering weak assumptions on the random field and noise covariance. Extends the analysis to federated learning frameworks.

Result: Demonstrates that the limit variance term scales with the noise covariance and compression strategy, generalizing previous results.

Conclusion: The study provides insights into how compression strategies influence convergence rates in both centralized and federated learning contexts.

Abstract: In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected Hölder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate that, despite the non-regularity of the stochastic field, the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations), generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.
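As one example of an unbiased compression operator with a bounded-variance condition, the sketch below implements rand-k sparsification: keep k of d coordinates chosen uniformly at random and rescale by d/k. It satisfies $E[C(x)] = x$ and $E\|C(x) - x\|^2 = (d/k - 1)\|x\|^2$. Choosing rand-k as the illustration is an assumption; the abstract does not list which operators the paper compares.

```python
import numpy as np

def rand_k(x, k, rng):
    """Keep k random coordinates, rescaled by d/k so that E[C(x)] = x."""
    d = x.size
    keep = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[keep] = x[keep] * (d / k)
    return out

# Empirical check of unbiasedness and of the variance constant d/k - 1.
rng = np.random.default_rng(0)
x = rng.normal(size=10)
draws = np.stack([rand_k(x, k=2, rng=rng) for _ in range(20000)])

empirical_mean = draws.mean(axis=0)
variance_ratio = np.mean(np.sum((draws - x) ** 2, axis=1)) / np.sum(x ** 2)
```

Two operators can share the same variance constant yet induce different noise covariances, which is the distinction the abstract says drives the convergence rate.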

[359] Optimal Transport Regularized Divergences: Application to Adversarial Robustness

Jeremiah Birrell, Reza Ebrahimi

Main category: cs.LG

TL;DR: The paper introduces $D^c$, a new class of optimal-transport-regularized divergences, and $ARMOR_D$, a DRO-based method for enhancing adversarial robustness in deep learning. It improves performance on CIFAR datasets.

Details

Motivation: To enhance adversarial robustness in deep learning by combining optimal transport and information divergence in a principled way.

Method: Proposes $ARMOR_D$, which minimizes the maximum expected loss over a $D^c$-neighborhood of the training data, allowing adversarial sample transport and re-weighting.

Result: $ARMOR_D$ improves adversarial robustness, achieving 1.9% and 2.1% better performance on CIFAR-10 and CIFAR-100 against AutoAttack.

Conclusion: $ARMOR_D$ generalizes existing methods and effectively boosts adversarial robustness, with code made available for reproducibility.

Abstract: We introduce a new class of optimal-transport-regularized divergences, $D^c$, constructed via an infimal convolution between an information divergence, $D$, and an optimal-transport (OT) cost, $C$, and study their use in distributionally robust optimization (DRO). In particular, we propose the $ARMOR_D$ methods as novel approaches to enhancing the adversarial robustness of deep learning models. These DRO-based methods are defined by minimizing the maximum expected loss over a $D^c$-neighborhood of the empirical distribution of the training data. Viewed as a tool for constructing adversarial samples, our method allows samples to be both transported, according to the OT cost, and re-weighted, according to the information divergence; the addition of a principled and dynamical adversarial re-weighting on top of adversarial sample transport is a key innovation of $ARMOR_D$. $ARMOR_D$ can be viewed as a generalization of the best-performing loss functions and OT costs in the adversarial training literature; we demonstrate this flexibility by using $ARMOR_D$ to augment the UDR, TRADES, and MART methods and obtain improved performance on CIFAR-10 and CIFAR-100 image recognition. Specifically, augmenting with $ARMOR_D$ leads to 1.9% and 2.1% improvement against AutoAttack, a powerful ensemble of adversarial attacks, on CIFAR-10 and CIFAR-100 respectively. To foster reproducibility, we made the code accessible at https://github.com/star-ailab/ARMOR.
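The two constructions described in the abstract can be written out as follows. This is a sketch under stated assumptions: the argument ordering inside $D$ and $C$, the intermediate distribution $\eta$, the neighborhood radius $\epsilon$, the empirical distribution $P_n$, and the loss $\mathcal{L}_\theta$ are notation introduced here for illustration and may not match the paper's conventions.

```latex
% Infimal convolution of an information divergence D with an OT cost C
D^c(Q \,\|\, P) \;=\; \inf_{\eta} \bigl\{ D(\eta \,\|\, P) + C(\eta, Q) \bigr\}

% DRO training objective over a D^c-neighborhood of the empirical distribution P_n
\inf_{\theta} \; \sup_{Q \,:\, D^c(Q \,\|\, P_n) \le \epsilon} \; E_{Q}\bigl[ \mathcal{L}_\theta \bigr]
```

Read this way, the intermediate distribution $\eta$ accounts for the re-weighting of samples (through $D$), while the step from $\eta$ to $Q$ accounts for their transport (through $C$).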

[360] Fine-Tuned Language Models Generate Stable Inorganic Materials as Text

Nate Gruver, Anuroop Sriram, Andrea Madotto, Andrew Gordon Wilson, C. Lawrence Zitnick, Zachary Ulissi

Main category: cs.LG

TL;DR: Fine-tuning large language models (LLaMA-2 70B) for generating stable materials achieves high reliability (90% valid structures) and outperforms CDVAE in metastable material generation (49% vs 28%).

Details

Motivation: To leverage the simplicity and flexibility of text-encoded atomistic data for material generation, exploiting the biases of pretrained LLMs for symmetry capture.

Method: Fine-tuned LLaMA-2 70B on text-encoded atomistic data, validated using energy above hull calculations from ML potentials and DFT.

Result: Generated 90% physically valid structures; outperformed CDVAE (49% vs 28% metastable materials). Flexible for unconditional, infilling, and text-conditional generation.

Conclusion: Pretrained LLMs are well-suited for atomistic data, with performance scaling with model size, offering a reliable and versatile approach for material generation.

Abstract: We propose fine-tuning large language models for generation of stable materials. While unorthodox, fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable, with around 90% of sampled structures obeying physical constraints on atom positions and charges. Using energy above hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) can generate materials predicted to be metastable at about twice the rate (49% vs 28%) of CDVAE, a competing diffusion model. Because of text prompting’s inherent flexibility, our models can simultaneously be used for unconditional generation of stable material, infilling of partial structures and text-conditional generation. Finally, we show that language models’ ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well-suited for atomistic data.
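To give a feel for what "text-encoded atomistic data" might look like, the sketch below serializes a crystal (lattice lengths, lattice angles, and atoms with fractional coordinates) into a short string and parses it back. The specific layout and numeric precision are hypothetical choices for this example, not the paper's documented format.

```python
def encode_crystal(lengths, angles, atoms):
    """Serialize a crystal as newline-separated text (hypothetical layout)."""
    lines = [
        " ".join(f"{x:.1f}" for x in lengths),  # lattice lengths
        " ".join(f"{a:.0f}" for a in angles),   # lattice angles in degrees
    ]
    for element, frac_coords in atoms:
        lines.append(element)
        lines.append(" ".join(f"{c:.2f}" for c in frac_coords))
    return "\n".join(lines)

def decode_crystal(text):
    """Parse the text layout produced by encode_crystal."""
    rows = text.strip().split("\n")
    lengths = [float(x) for x in rows[0].split()]
    angles = [float(x) for x in rows[1].split()]
    atoms = [
        (rows[i], [float(c) for c in rows[i + 1].split()])
        for i in range(2, len(rows), 2)
    ]
    return lengths, angles, atoms

# Illustrative two-atom cell; the values are for demonstration only.
text = encode_crystal(
    lengths=(5.6, 5.6, 5.6),
    angles=(90.0, 90.0, 90.0),
    atoms=[("Na", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))],
)
lengths, angles, atoms = decode_crystal(text)
```

Strings of this kind can be used directly as fine-tuning targets, and a sampled string that fails to parse or violates basic constraints can be rejected as invalid.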

[361] The Role of the Time-Dependent Hessian in High-Dimensional Optimization

Tony Bonnaire, Giulio Biroli, Chiara Cammarota

Main category: cs.LG

TL;DR: The paper explores why gradient descent finds good solutions in rough landscapes, using phase retrieval as an example. It identifies a dynamical transition in Hessian spectral properties, linking it to escaping rough regions. A key finding is that finite system sizes recover signals better than infinite ones due to early-time dynamics.

Details

Motivation: To understand why gradient descent works well in non-convex, high-dimensional settings, despite the lack of theoretical clarity.

Method: Analyzes the Hessian during gradient descent in the phase retrieval problem, focusing on spectral properties and dynamical transitions.

Result: Identifies a window of negative curvature in the Hessian that aids signal recovery, especially in finite systems, highlighting the importance of initialization and early dynamics.

Conclusion: Early-time dynamics and initialization are crucial for efficiently navigating rough landscapes, with finite systems outperforming infinite ones in signal recovery.

Abstract: Gradient descent is commonly used to find minima in rough landscapes, particularly in recent machine learning applications. However, a theoretical understanding of why good solutions are found remains elusive, especially in strongly non-convex and high-dimensional settings. Here, we focus on the phase retrieval problem as a typical example, which has received a lot of attention recently in theoretical machine learning. We analyze the Hessian during gradient descent, identify a dynamical transition in its spectral properties, and relate it to the ability to escape rough regions in the loss landscape. When the signal-to-noise ratio (SNR) is large enough, an informative negative direction exists in the Hessian at the beginning of the descent, i.e., in the initial condition. While descending, a BBP transition in the spectrum takes place in finite time: the direction is lost, and the dynamics is trapped in a rugged region filled with marginally stable bad minima. Surprisingly, for finite system sizes, this window of negative curvature allows the system to recover the signal well before the theoretical SNR found for infinite sizes, emphasizing the central role of initialization and early-time dynamics for efficiently navigating rough landscapes.
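The object studied here can be inspected numerically. The sketch below builds the Hessian of a quartic phase retrieval loss at a random initial condition and measures how well its lowest-eigenvalue direction aligns with the hidden signal. The loss $L(w) = \frac{1}{4M}\sum_i ((a_i \cdot w)^2 - y_i)^2$, the Gaussian sensing vectors, the noiseless measurements, and the problem sizes are all assumptions; the paper's exact loss and normalization may differ.

```python
import numpy as np

def phase_retrieval_hessian(w, A, y):
    """Hessian of L(w) = (1/(4M)) * sum_i ((a_i . w)^2 - y_i)^2."""
    proj = A @ w                        # a_i . w for every measurement
    coeff = 3.0 * proj**2 - y           # per-measurement curvature weight
    return (A * coeff[:, None]).T @ A / A.shape[0]

rng = np.random.default_rng(0)
N, M = 20, 200                          # dimension and number of measurements
w_star = rng.standard_normal(N)
w_star /= np.linalg.norm(w_star)        # hidden signal on the unit sphere
A = rng.standard_normal((M, N))         # Gaussian sensing vectors a_i
y = (A @ w_star) ** 2                   # noiseless intensity measurements

w0 = rng.standard_normal(N)
w0 /= np.linalg.norm(w0)                # random initial condition
H0 = phase_retrieval_hessian(w0, A, y)
eigvals, eigvecs = np.linalg.eigh(H0)   # eigenvalues in ascending order
overlap = abs(eigvecs[:, 0] @ w_star)   # alignment of lowest direction with signal
print(f"lowest eigenvalue {eigvals[0]:.3f}, overlap with signal {overlap:.3f}")
```

Recomputing the spectrum along a gradient descent trajectory, rather than only at `w0`, is the natural way to watch for the kind of transition the abstract describes.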

[362] A Principled Approach for Data Bias Mitigation

Bruno Scarone, Alfredo Viola, Renée J. Miller, Ricardo Baeza-Yates

Main category: cs.LG

TL;DR: A new explainable method to mitigate data bias in machine learning, covering non-binary labels and multiple sensitive attributes, with mathematical guarantees and evaluation on public datasets.

Details

Motivation: Addressing the adverse effects of data bias on decision-making in machine learning, especially for intersectional bias across multiple attributes.

Method: Proposes a mitigation strategy leveraging table discovery to add unbiased tuples, ensuring explainability and correctness with mathematical guarantees.

Result: Effective bias mitigation demonstrated on public datasets, with theoretical insights into intersectional bias.

Conclusion: The framework successfully measures and mitigates complex bias, offering practical and theoretical advancements.

Abstract: The widespread use of machine learning and data-driven algorithms for decision making has been steadily increasing over many years. Bias in the data can adversely affect this decision-making. We present a new mitigation strategy to address data bias. Our methods are explainable and come with mathematical guarantees of correctness. They can take advantage of new work on table discovery to find new tuples that can be added to a dataset to create real datasets that are unbiased or less biased. Our framework covers data with non-binary labels and with multiple sensitive attributes. Hence, we are able to measure and mitigate bias that does not appear over a single attribute (or feature), but only intersectionally, when considering a combination of attributes. We evaluate our techniques on publicly available datasets and provide a theoretical analysis of our results, highlighting novel insights into data bias.
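A small constructed example shows how bias can be invisible per attribute yet present intersectionally. The toy rows, the attribute names `a` and `b`, and the use of positive-label rate as the bias measure are all assumptions for illustration; they are not the paper's datasets or metric.

```python
from collections import defaultdict

def positive_rate(rows, keys):
    """Fraction of positive labels for each combination of the given attributes."""
    totals, positives = defaultdict(int), defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        positives[group] += row["label"]
    return {g: positives[g] / totals[g] for g in totals}

# Hypothetical toy data: "a" and "b" are sensitive attributes, "label" is the outcome.
rows = (
    [{"a": 0, "b": 0, "label": 1}] * 3 + [{"a": 0, "b": 0, "label": 0}] * 1
    + [{"a": 0, "b": 1, "label": 1}] * 1 + [{"a": 0, "b": 1, "label": 0}] * 3
    + [{"a": 1, "b": 0, "label": 1}] * 1 + [{"a": 1, "b": 0, "label": 0}] * 3
    + [{"a": 1, "b": 1, "label": 1}] * 3 + [{"a": 1, "b": 1, "label": 0}] * 1
)

print(positive_rate(rows, ["a"]))        # 0.5 for both values of a
print(positive_rate(rows, ["b"]))        # 0.5 for both values of b
print(positive_rate(rows, ["a", "b"]))   # 0.75 or 0.25 depending on the pair
```

Each attribute looks balanced on its own, while the four intersectional groups split into 75% and 25% positive rates, which is the situation the abstract says single-attribute analysis misses.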

[363] Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures

Mathilde Papillon, Sophia Sanborn, Johan Mathe, Louisa Cornelis, Abby Bertics, Domas Buracas, Hansen J Lillemark, Christian Shewmake, Fatih Dinc, Xavier Pennec, Nina Miolane

Main category: cs.LG

TL;DR: The paper discusses the shift in machine learning from Euclidean to non-Euclidean data, emphasizing the need for new methods to handle complex geometric, topological, and algebraic structures.

Details

Motivation: Modern machine learning increasingly deals with non-Euclidean data, which requires a broader mathematical perspective beyond classical Euclidean geometry.

Method: The review proposes a graphical taxonomy to unify recent advances in non-Euclidean machine learning and generalizes classical methods for unconventional data types.

Result: The paper integrates recent research into an intuitive framework and identifies key challenges and opportunities in the field.

Conclusion: Non-Euclidean machine learning is a growing field with significant potential, but it requires further development to address current challenges.

Abstract: The enduring legacy of Euclidean geometry underpins classical machine learning, which, for decades, has been primarily developed for data lying in Euclidean space. Yet, modern machine learning increasingly encounters richly structured data that is inherently non-Euclidean. This data can exhibit intricate geometric, topological and algebraic structure: from the geometry of the curvature of space-time, to topologically complex interactions between neurons in the brain, to the algebraic transformations describing symmetries of physical systems. Extracting knowledge from such non-Euclidean data necessitates a broader mathematical perspective. Echoing the 19th-century revolutions that gave rise to non-Euclidean geometry, an emerging line of research is redefining modern machine learning with non-Euclidean structures. Its goal: generalizing classical methods to unconventional data types with geometry, topology, and algebra. In this review, we provide an accessible gateway to this fast-growing field and propose a graphical taxonomy that integrates recent advances into an intuitive unified framework. We subsequently extract insights into current challenges and highlight exciting opportunities for future development in this field.

[364] Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings

Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar

Main category: cs.LG

TL;DR: Pulse-PPG is the first open-source PPG foundation model trained on raw field data, outperforming clinical-data-trained models in generalization across tasks.

Details

Motivation: To address the limitations of existing PPG foundation models, which are either open-source but trained on clinical data or closed-source, by introducing a model trained on real-world field data.

Method: Trained Pulse-PPG on raw PPG data from a 100-day field study with 120 participants and evaluated its performance across diverse datasets and tasks.

Result: Pulse-PPG shows superior generalization in clinical and mobile health applications, often outperforming models trained on clinical data.

Conclusion: Training on real-world field data enhances model adaptability, and releasing Pulse-PPG will advance robust PPG-based foundation models.

Abstract: Photoplethysmography (PPG)-based foundation models are gaining traction due to the widespread use of PPG in biosignal monitoring and their potential to generalize across diverse health applications. In this paper, we introduce Pulse-PPG, the first open-source PPG foundation model trained exclusively on raw PPG data collected over a 100-day field study with 120 participants. Existing PPG foundation models are either open-source but trained on clinical data or closed-source, limiting their applicability in real-world settings. We evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its performance against a state-of-the-art foundation model trained on clinical data. Our results demonstrate that Pulse-PPG, trained on uncurated field data, exhibits superior generalization across clinical and mobile health applications in both lab and field settings. This suggests that exposure to real-world variability enables the model to learn fine-grained representations, making it more adaptable across tasks. Furthermore, pre-training on field data surprisingly outperforms pre-training on clinical data in many tasks, reinforcing the importance of training on real-world, diverse datasets. To encourage further advancements in robust foundation models leveraging field data, we plan to release Pulse-PPG, providing researchers with a powerful resource for developing more generalizable PPG-based models.

[365] On the Approximation of Stationary Processes using the ARMA Model

Anand Ganesh, Babhrubahan Bose, Anand Rajagopalan

Main category: cs.LG

TL;DR: The paper revisits the approximation error between true stationary processes and ARMA models, introducing an $L^{\infty}$ norm as a valid alternative to control the $L^2$ norm and comparing it to the cepstral norm. It explores structural properties, Banach algebra formation, and invertibility, providing explicit bounds and critiquing heuristic methods.

Details

Motivation: To address the problem of quantifying and bounding approximation errors between true stationary processes and ARMA models, offering a new perspective using transfer function representations and norms.

Method: Uses the transfer function representation of ARMA models, introduces an $L^{\infty}$ norm, and analyzes its properties, including Banach algebra formation and invertibility. Explicit bounds are calculated for continuous transfer functions.

Result: The $L^{\infty}$ norm controls the $L^2$ norm and has structural properties comparable to the cepstral norm. A subspace of stationary processes forms a Banach algebra under this norm, with invertibility consistent with ARMA definitions.

Conclusion: The $L^{\infty}$ norm is a valid and useful alternative for analyzing ARMA models, generalizing better than Wiener’s condition. The paper also critiques heuristic approaches like Padé approximations.

Abstract: We revisit an old problem related to Autoregressive Moving Average (ARMA) models, on quantifying and bounding the approximation error between a true stationary process $X_t$ and an ARMA model $Y_t$. We take the transfer function representation of an ARMA model and show that the associated $L^{\infty}$ norm provides a valid alternate norm that controls the $L^2$ norm and has structural properties comparable to the cepstral norm. We show that a certain subspace of stationary processes, which includes ARMA models, forms a Banach algebra under the $L^{\infty}$ norm that respects the group structure of $H^{\infty}$ transfer functions. The natural definition of invertibility in this algebra is consistent with the original definition of ARMA invertibility, and generalizes better to non-ARMA processes than Wiener’s $\ell^1$ condition. Finally, we calculate some explicit approximation bounds in the simpler context of continuous transfer functions, and critique some heuristic ideas on Padé approximations and parsimonious models.
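The claim that the $L^{\infty}$ norm controls the $L^2$ norm can be checked numerically for a concrete model. The sketch below evaluates the transfer function of an ARMA(1,1) process on a frequency grid and compares the two norms. The coefficients, the sign convention $X_t = \phi X_{t-1} + e_t + \theta e_{t-1}$, and the normalized measure $d\omega / 2\pi$ are assumptions chosen for illustration.

```python
import numpy as np

def arma_transfer(phi, theta, omega):
    """Transfer function theta(z) / phi(z) evaluated at z = exp(-i * omega)."""
    z = np.exp(-1j * omega)
    num = 1 + sum(t * z ** (k + 1) for k, t in enumerate(theta))   # MA polynomial
    den = 1 - sum(p * z ** (k + 1) for k, p in enumerate(phi))     # AR polynomial
    return num / den

omega = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
H = arma_transfer(phi=[0.5], theta=[0.4], omega=omega)

linf = np.max(np.abs(H))                 # grid estimate of the sup norm on the unit circle
l2 = np.sqrt(np.mean(np.abs(H) ** 2))    # L^2 norm under the measure d(omega) / (2 * pi)
print(f"L-infinity norm {linf:.4f}, L2 norm {l2:.4f}")
```

For these coefficients the supremum is attained at $\omega = 0$, where $|H| = 1.4 / 0.5 = 2.8$, and the squared $L^2$ norm equals the sum of squared impulse-response coefficients, $1 + 0.81 / 0.75 = 2.08$, so the $L^2$ norm is indeed the smaller of the two.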

[366] Fine-Grained Uncertainty Quantification via Collisions

Jesse Friedbaum, Sudarshan Adiga, Ravi Tandon

Main category: cs.LG

TL;DR: The paper introduces a new metric for aleatoric uncertainty quantification (UQ) called the collision matrix, which measures class collisions (same input in different classes). It proposes methods to estimate this matrix and demonstrates its applications, including estimating posterior class probabilities.

Details

Motivation: Existing UQ methods lack fine-grained measures of uncertainty. The collision matrix addresses this by quantifying the inherent difficulty in distinguishing between classes, providing a more intuitive and detailed uncertainty measure.

Method: The paper proposes learning a pair-wise contrastive model to determine if inputs belong to the same class. This model estimates the Gramian matrix of the collision matrix, which is then used to uniquely recover the collision matrix. The method is validated experimentally.

Result: The collision matrix is successfully estimated using the proposed techniques, and its utility in estimating posterior class probabilities is demonstrated on several datasets.

Conclusion: The collision matrix offers a novel and fine-grained measure of uncertainty, with practical applications in UQ. The proposed estimation methods are effective and could inspire further research in non-negative matrix recovery.

Abstract: We propose a new and intuitive metric for aleatoric uncertainty quantification (UQ), the prevalence of class collisions defined as the same input being observed in different classes. We use the rate of class collisions to define the collision matrix, a novel and uniquely fine-grained measure of uncertainty. For a classification problem involving $K$ classes, the $K\times K$ collision matrix $S$ measures the inherent difficulty in distinguishing between each pair of classes. We discuss several applications of the collision matrix, establish its fundamental mathematical properties, and show its relationship with existing UQ methods, including the Bayes error rate (BER). We also address the new problem of estimating the collision matrix using one-hot labeled data by proposing a series of innovative techniques to estimate $S$. First, we learn a pair-wise contrastive model which accepts two inputs and determines if they belong to the same class. We then show that this contrastive model (which is PAC learnable) can be used to estimate the Gramian matrix of $S$, defined as $G=S^TS$. Finally, we show that under reasonable assumptions, $G$ can be used to uniquely recover $S$, a new result on non-negative matrices which could be of independent interest. With a method to estimate $S$ established, we demonstrate how this estimate of $S$, in conjunction with the contrastive model, can be used to estimate the posterior class probability distribution of any point. Experimental results are also presented to validate our methods of estimating the collision matrix and class posterior distributions on several datasets.
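A toy problem with known posteriors makes the collision matrix tangible. The sketch below assumes the definition $S_{ij} = \sum_x P(x \mid Y=i)\, P(Y=j \mid x)$, i.e. the chance that an input drawn from class $i$ would be observed as class $j$ on an independent relabeling. That definition is an assumption, since the abstract gives only the informal description; the three-input distribution is also hypothetical. Only $G = S^T S$ is taken directly from the abstract.

```python
import numpy as np

# Hypothetical toy problem: 3 discrete inputs, K = 2 classes.
p_x = np.array([0.5, 0.3, 0.2])              # P(x)
p_y_given_x = np.array([[1.0, 0.0],           # x0 is always class 0
                        [0.0, 1.0],           # x1 is always class 1
                        [0.5, 0.5]])          # x2 is ambiguous: the source of collisions

p_xy = p_x[:, None] * p_y_given_x             # joint P(x, y)
p_x_given_y = p_xy / p_xy.sum(axis=0)         # P(x | y); each column sums to 1

# Assumed definition: S[i, j] = sum_x P(x | Y=i) * P(Y=j | x)
S = p_x_given_y.T @ p_y_given_x
G = S.T @ S                                   # Gramian, as defined in the abstract
print(S)
print(G)
```

Here $S$ has rows $(11/12,\ 1/12)$ and $(1/8,\ 7/8)$: the off-diagonal mass comes entirely from the ambiguous input, and class 1 collides more often because that input makes up a larger share of it.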

[367] A general language model for peptide identification

Jixiu Zhai, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang

Main category: cs.LG

TL;DR: PDeepPP is a deep learning framework for identifying bioactive peptides and PTMs, achieving state-of-the-art performance in diverse tasks.

Details

Motivation: Accurate identification of bioactive peptides and PTMs is crucial for understanding protein function and therapeutic discovery, but existing methods lack generalizability.

Method: PDeepPP integrates pretrained protein language models with a hybrid transformer-convolutional architecture, addressing data imbalance and extracting global/local sequence features.

Result: PDeepPP excels in 25/33 tasks, with high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, and 99.5% specificity in glycosylation prediction.

Conclusion: PDeepPP enables large-scale, accurate peptide analysis, supporting biomedical research and therapeutic target discovery, with publicly available resources.

Abstract: Accurate identification of bioactive peptides (BPs) and protein post-translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-convolutional architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses (including dimensionality reduction and comparison studies), PDeepPP demonstrates strong, interpretable peptide representations and achieves state-of-the-art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and substantial reduction in false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub: https://github.com/fondress/PDeepPP and Hugging Face: https://huggingface.co/fondress/PDeppPP.

[368] History-Guided Video Diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann

Main category: cs.LG

TL;DR: Classifier-free guidance (CFG) is extended to video diffusion models with variable-length history, addressing challenges via the Diffusion Forcing Transformer (DFoT) and History Guidance methods.

Details

Motivation: Improving conditional generation in video diffusion models for better control and sample quality, despite challenges with variable-length history.

Method: Proposes DFoT, a video diffusion architecture, and History Guidance methods to enable flexible history frame conditioning.

Result: Vanilla history guidance improves video quality and temporal consistency; advanced methods enhance motion dynamics and enable long video generation.

Conclusion: DFoT and History Guidance effectively address variable-length history challenges, significantly improving video generation.

Abstract: Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Project website: https://boyuan.space/history-guidance
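The guidance rule that this work builds on is compact enough to state in code. The sketch below applies standard classifier-free guidance, treating a prediction made with history frames as the conditional branch and one made with history masked out as the unconditional branch. Mapping the two branches onto history this way is an assumption about the simplest variant, and the arrays are hypothetical stand-ins for a denoiser's outputs.

```python
import numpy as np

def guided_prediction(pred_cond, pred_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. scale=1 recovers the conditional prediction."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

# Hypothetical denoiser outputs for one noisy frame.
pred_with_history = np.array([0.2, -0.1, 0.4])      # conditioned on history frames
pred_without_history = np.array([0.0, 0.0, 0.1])    # history masked out
guided = guided_prediction(pred_with_history, pred_without_history, scale=2.0)
print(guided)
```

A scale above 1 pushes the sample further in the direction that history suggests, which is the lever the abstract credits with improving temporal consistency.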

[369] DeepCrossAttention: Supercharging Transformer Residual Connections

Mike Heddes, Adel Javanmard, Kyriakos Axiotis, Gang Fu, MohammadHossein Bateni, Vahab Mirrokni

Main category: cs.LG

TL;DR: DeepCrossAttention (DCA) enhances residual learning in transformers by dynamically combining layer outputs with learnable weights, improving efficiency and performance.

Details

Motivation: Traditional residual connections in transformers may dilute important information by simply summing layer outputs.

Method: DCA uses learnable, input-dependent weights and depth-wise cross-attention to selectively focus on relevant information across layers.

Result: DCA improves perplexity in language modeling, achieves the same quality up to 3x faster, and adds negligible parameters.

Conclusion: DCA offers a better trade-off between accuracy and model size, especially when layer ranks are below a critical threshold.

Abstract: Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters. Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.
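The contrast with plain summation can be illustrated with a minimal sketch of input-dependent layer mixing. The softmax normalization, the choice of the most recent layer as the source of the weights, and the projection `W` are hypothetical simplifications; the code is not the DCA architecture and omits its depth-wise cross-attention entirely.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_layer_mix(layer_outputs, W):
    """Combine earlier layer outputs with per-token, input-dependent weights.

    layer_outputs: (L, T, D) stack of L layer outputs for T tokens of width D.
    W: (D, L) hypothetical learnable projection giving one logit per layer.
    """
    query = layer_outputs[-1]                   # (T, D): latest layer drives the weights
    weights = softmax(query @ W)                # (T, L): each token's weights over layers
    return np.einsum("tl,ltd->td", weights, layer_outputs)

rng = np.random.default_rng(0)
outs = rng.standard_normal((4, 5, 8))           # 4 layers, 5 tokens, width 8
mixed = weighted_layer_mix(outs, np.zeros((8, 4)))
print(mixed.shape)
```

With `W` set to zero every layer receives equal weight, so the mix reduces to a plain average; a trained `W` would let each token emphasize whichever earlier layer carries the information it needs.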

[370] Analyzing Islamophobic Discourse Using Semi-Coded Terms and LLMs

Raza Ul Mustafa, Roi Dupart, Gabrielle Smith, Noman Ashraf, Nathalie Japkowicz

Main category: cs.LG

TL;DR: The paper analyzes Islamophobic semi-coded terms on extremist platforms, using LLMs and topic modeling to understand and detect such hate speech, revealing its prevalence in far-right and conspiratorial contexts.

Details

Motivation: To address the challenge of identifying Islamophobic hate speech due to its ambiguous or neutral lexical appearance, and to understand its spread and context on digital platforms.

Method: Utilizes Large Language Models (LLMs) to interpret semi-coded terms, Google Perspective API for toxicity scoring, and BERT topic modeling to analyze discourse on platforms like 4Chan, Gab, and Telegram.

Result: LLMs can understand OOV slurs; Islamophobic posts score higher in toxicity than other hate speech. Topic modeling shows Islamophobia is prevalent in far-right, conspiratorial, and anti-immigrant contexts.

Conclusion: Improved moderation and detection strategies are needed. The study highlights the global spread of Islamophobia and its embeddedness in extremist online communities.

Abstract: In recent years, Islamophobia has gained significant traction across Western societies, fueled by the rise of digital communication networks. This paper performs a large-scale analysis of specialized, semi-coded Islamophobic terms (e.g., muzrat, pislam, mudslime, mohammedan, muzzies) circulated on extremist social platforms such as 4Chan, Gab, and Telegram. Many of these terms appear lexically neutral or ambiguous outside of specific contexts, making them difficult for both human moderators and automated systems to reliably identify as hate speech. First, we use Large Language Models (LLMs) to show their ability to understand these terms. Second, Google Perspective API suggests that Islamophobic posts tend to receive higher toxicity scores than other categories of hate speech like Antisemitism. Finally, we use a BERT topic modeling approach to extract different topics and Islamophobic discourse on these social platforms. Our findings indicate that LLMs understand these Out-Of-Vocabulary (OOV) slurs; however, further improvements in moderation strategies and algorithmic detection are necessary to address such discourse effectively. Our topic modeling also indicates that Islamophobic text is found across various political, conspiratorial, and far-right movements and is particularly directed against Muslim immigrants. Taken together, we performed one of the first studies on Islamophobic semi-coded terms and shed a global light on Islamophobia.

[371] Position: An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research

Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel

Main category: cs.LG

TL;DR: The paper explores the Platonic Representation Hypothesis (PRH) in SSL, proposing Singular Identifiability Theory (SITh) to bridge theory-practice gaps and suggesting future research directions.

Details

Motivation: To explain the convergence of SSL representations to a Platonic ideal and address the lack of theoretical grounding for SSL's empirical success.

Method: Synthesizes Identifiability Theory (IT) to show PRH’s emergence in SSL and proposes SITh for a broader theoretical framework.

Result: SITh offers deeper insights into SSL’s data assumptions, aiming for more interpretable and generalizable representations.

Conclusion: Future research should focus on SSL’s training dynamics, finite sample impacts, and inductive biases to advance the field.

Abstract: Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL’s empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.

[372] Statistical Runtime Verification for LLMs via Robustness Estimation

Natan Levy, Adiel Ashrov, Guy Katz

Main category: cs.LG

TL;DR: The paper adapts RoMA, a statistical verification framework, for runtime robustness monitoring of LLMs in black-box settings, achieving comparable accuracy to formal methods with significantly reduced verification time.

Details

Motivation: Formal verification techniques are computationally infeasible for modern LLMs, necessitating scalable alternatives for runtime robustness monitoring.

Method: Adapts and extends the RoMA framework to analyze confidence score distributions under semantic perturbations, providing statistically validated robustness bounds.

Result: RoMA achieves accuracy within 1% deviation of formal methods and reduces verification time from hours to minutes.

Conclusion: RoMA is a scalable alternative for runtime robustness verification in LLM deployments when formal methods are impractical.

Abstract: Adversarial robustness verification is essential for ensuring the safe deployment of Large Language Models (LLMs) in runtime-critical applications. However, formal verification techniques remain computationally infeasible for modern LLMs due to their exponential runtime and white-box access requirements. This paper presents a case study adapting and extending the RoMA statistical verification framework to assess its feasibility as an online runtime robustness monitor for LLMs in black-box deployment settings. Our adaptation of RoMA analyzes confidence score distributions under semantic perturbations to provide quantitative robustness assessments with statistically validated bounds. Our empirical validation against formal verification baselines demonstrates that RoMA achieves comparable accuracy (within 1% deviation), and reduces verification times from hours to minutes. We evaluate this framework across semantic, categorial, and orthographic perturbation domains. Our results demonstrate RoMA’s effectiveness for robustness monitoring in operational LLM deployments. These findings point to RoMA as a potentially scalable alternative when formal methods are infeasible, with promising implications for runtime verification in LLM-based systems.
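The idea of turning confidence scores under perturbation into a quantitative robustness estimate can be sketched as follows. The code fits a normal distribution to confidence margins and reads off the probability of a negative margin. The margin definition, the normality assumption, and the sample values are all hypothetical; this is an illustration of the general statistical approach, not the RoMA procedure itself, and a real monitor would need to test whether normality holds.

```python
from statistics import NormalDist, mean, stdev

def estimated_failure_probability(margins):
    """Fit a normal distribution to confidence margins and return P(margin < 0)."""
    return NormalDist(mean(margins), stdev(margins)).cdf(0.0)

# Hypothetical margins: confidence in the correct answer minus confidence in the
# best wrong answer, one value per semantically perturbed prompt.
margins = [0.42, 0.35, 0.51, 0.28, 0.44, 0.39, 0.47, 0.31]
p_fail = estimated_failure_probability(margins)
print(f"estimated probability of a misclassifying perturbation: {p_fail:.2e}")
```

Because only model outputs are needed, an estimate like this can be computed in a black-box setting, which is the deployment scenario the abstract targets.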

[373] Beyond Low-rank Decomposition: A Shortcut Approach for Efficient On-Device Learning

Le-Trung Nguyen, Ael Quelennec, Van-Tam Nguyen, Enzo Tartaglione

Main category: cs.LG

TL;DR: Proposes a shortcut method for on-device learning to reduce activation memory usage and computational costs, achieving significant improvements over vanilla training.

Details

Motivation: Address memory and computational constraints in on-device learning to enhance efficiency and privacy.

Method: Introduces a novel shortcut approach based on low-rank decomposition to reduce activation memory and FLOPs.

Result: Reduces activation memory usage up to 120.09× and training FLOPs up to 1.86× compared to vanilla training.

Conclusion: The proposed method effectively mitigates memory and computational challenges in on-device learning.

Abstract: On-device learning has emerged as a promising direction for AI development, particularly because of its potential to reduce latency issues and mitigate privacy risks associated with device-server communication, while improving energy efficiency. Despite these advantages, significant memory and computational constraints still represent major challenges for its deployment. Drawing on previous studies on low-rank decomposition methods that address activation memory bottlenecks in backpropagation, we propose a novel shortcut approach as an alternative. Our analysis and experiments demonstrate that our method can reduce activation memory usage, even up to $120.09\times$ compared to vanilla training, while also reducing overall training FLOPs up to $1.86\times$ when evaluated on traditional benchmarks.
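For context on the baseline this work departs from, the sketch below shows how low-rank decomposition shrinks the activation memory kept for backpropagation. It uses a truncated SVD on a synthetic, exactly rank-8 activation matrix; the sizes and the rank are assumptions, and the code illustrates the prior low-rank approach mentioned in the abstract, not the proposed shortcut method.

```python
import numpy as np

def low_rank_factors(activation, rank):
    """Truncated SVD: keep the top-`rank` components of a 2-D activation matrix."""
    U, s, Vt = np.linalg.svd(activation, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]    # fold singular values into the left factor

rng = np.random.default_rng(0)
m, n, r = 512, 1024, 8
activation = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # exactly rank 8

left, right = low_rank_factors(activation, r)
stored = left.size + right.size                 # r * (m + n) numbers instead of m * n
ratio = activation.size / stored
error = np.linalg.norm(activation - left @ right) / np.linalg.norm(activation)
print(f"compression {ratio:.2f}x, relative error {error:.2e}")
```

Storing the two factors costs $r(m+n) = 12{,}288$ values instead of $mn = 524{,}288$, roughly a $42.7\times$ saving here; real activations are only approximately low-rank, so the error would not be near zero as it is for this synthetic matrix.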

[374] Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits

Fan Chen, Zeyu Jia, Alexander Rakhlin, Tengyang Xie

Main category: cs.LG

TL;DR: The paper analyzes the challenge of credit assignment in reinforcement learning with outcome-based feedback, proposing a provably efficient algorithm for large or infinite state spaces and extending it to preference-based feedback.

Details

Motivation: Address the challenge of assigning credit to actions when rewards are only observed at trajectory endpoints in online RL with general function approximation.

Method: Develop a sample-efficient algorithm leveraging general function approximation, applicable to large or infinite state spaces, and extend it to preference-based feedback.

Result: Achieves $\widetilde{O}({C_{\rm cov} H^3}/{\epsilon^2})$ sample complexity, characterizes statistical separation of outcome-based feedback, and simplifies for deterministic MDPs.

Conclusion: Provides a theoretical foundation for understanding the statistical properties of outcome-based RL, including extensions to preference-based feedback.

Abstract: Reinforcement learning with outcome-based feedback faces a fundamental challenge: when rewards are only observed at trajectory endpoints, how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation. We develop a provably sample-efficient algorithm achieving $\widetilde{O}({C_{\rm cov} H^3}/{\epsilon^2})$ sample complexity, where $C_{\rm cov}$ is the coverability coefficient of the underlying MDP. By leveraging general function approximation, our approach works effectively in large or infinite state spaces where tabular methods fail, requiring only that value functions and reward functions can be represented by appropriate function classes. Our results also characterize when outcome-based feedback is statistically separated from per-step rewards, revealing an unavoidable exponential separation for certain MDPs. For deterministic MDPs, we show how to eliminate the completeness assumption, dramatically simplifying the algorithm. We further extend our approach to preference-based feedback settings, proving that equivalent statistical efficiency can be achieved even under more limited information. Together, these results constitute a theoretical foundation for understanding the statistical properties of outcome-based reinforcement learning.
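To read the stated rate, it helps to see how the leading term scales with the horizon $H$ and the target accuracy $\epsilon$. The short calculation below ignores the constants and logarithmic factors that $\widetilde{O}$ hides, so it describes scaling only, not actual sample counts.

```latex
\[
N(\epsilon, H) \;\propto\; \frac{C_{\rm cov}\, H^{3}}{\epsilon^{2}}
\quad\Longrightarrow\quad
\frac{N(\epsilon, 2H)}{N(\epsilon, H)} = 2^{3} = 8,
\qquad
\frac{N(\epsilon/2, H)}{N(\epsilon, H)} = 2^{2} = 4 .
\]
```

Doubling the horizon therefore costs about eight times as many samples, while halving the error tolerance costs about four times as many, and the coverability coefficient $C_{\rm cov}$ enters linearly.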

[375] VCDiag: Classifying Erroneous Waveforms for Failure Triage Acceleration

Minh Luu, Surya Jasper, Khoi Le, Evan Pan, Michael Quinn, Aakash Tyagi, Jiang Hu

Main category: cs.LG

TL;DR: VCDiag uses VCD data and ML to classify failing waveforms in RTL simulation, achieving over 94% accuracy in identifying the top three most likely failure modules with an over 120x reduction in raw data size.

Details

Motivation: Manual failure triage in design verification is time-consuming; ML can automate and improve efficiency.

Method: VCDiag employs signal selection and statistical compression on VCD data for classification.

Result: Over 94% accuracy in identifying the top three most likely failure modules in the largest experiment, and an over 120x reduction in raw data size.

Conclusion: VCDiag is efficient, adaptable, and integrates well with Verilog/SystemVerilog designs.

Abstract: Failure triage in design functional verification is critical but time-intensive, relying on manual specification reviews, log inspections, and waveform analyses. While machine learning (ML) has improved areas like stimulus generation and coverage closure, its application to RTL-level simulation failure triage, particularly for large designs, remains limited. VCDiag offers an efficient, adaptable approach using VCD data to classify failing waveforms and pinpoint likely failure locations. In the largest experiment, VCDiag achieves over 94% accuracy in identifying the top three most likely modules. The framework introduces a novel signal selection and statistical compression approach, achieving over 120x reduction in raw data size while preserving features essential for classification. It can also be integrated into diverse Verilog/SystemVerilog designs and testbenches.

[376] On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

Zhen Qin, Jinxin Zhou, Zhihui Zhu

Main category: cs.LG

TL;DR: The paper analyzes the convergence behavior of a single-layer Transformer with self-attention, feedforward networks, and residual connections, showing linear convergence under proper initialization and highlighting the role of residual connections in improving optimization stability.

Details

Motivation: Despite the empirical success of Transformers, their theoretical foundations, especially training dynamics, are underdeveloped. The paper aims to understand the interdependencies between components like self-attention and feedforward networks, particularly with residual connections.

Method: The study analyzes a structurally complete single-layer Transformer, including self-attention, feedforward networks, and residual connections. It examines gradient descent convergence under appropriate initialization and extends findings to multi-layer architectures.

Result: Gradient descent exhibits linear convergence, influenced by the singular values of the attention layer’s output matrix. Residual connections mitigate ill-conditioning from the softmax operation, enhancing optimization stability.

Conclusion: Residual connections improve convergence stability in Transformers by addressing ill-conditioning. Theoretical and empirical results support linear convergence under proper initialization, extending to multi-layer architectures.

Abstract: Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components–such as self-attention mechanisms and feedforward networks–without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.
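
The conditioning argument can be illustrated with a toy 2x2 case. The sketch below is not the paper's analysis: it takes a uniform softmax attention matrix as a stand-in for a low-rank output matrix, adds the identity as a stand-in for a residual connection, and compares condition numbers. The helper names are hypothetical.

```python
import math


def singular_values_2x2(m):
    """Singular values (largest first) of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    # Entries of the symmetric Gram matrix M^T M.
    g11 = a * a + c * c
    g12 = a * b + c * d
    g22 = b * b + d * d
    trace = g11 + g22
    det = g11 * g22 - g12 * g12
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    disc = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    lam_max = trace / 2.0 + disc
    lam_min = max(trace / 2.0 - disc, 0.0)
    return math.sqrt(lam_max), math.sqrt(lam_min)


def condition_number(m):
    """Ratio of largest to smallest singular value (inf if singular)."""
    s_max, s_min = singular_values_2x2(m)
    return math.inf if s_min == 0.0 else s_max / s_min


# Uniform softmax attention over two tokens: every row is [0.5, 0.5], so rank 1.
attention = [[0.5, 0.5], [0.5, 0.5]]
# Toy "residual" version: the same matrix plus the identity.
with_residual = [[1.5, 0.5], [0.5, 1.5]]

print(condition_number(attention), condition_number(with_residual))
```

In this toy case the rank-1 matrix is singular, while the identity shift lifts the smallest singular value to 1 and bounds the condition number at 2.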

[377] Unisoma: A Unified Transformer-based Solver for Multi-Solid Systems

Shilong Tao, Zhe Feng, Haonan Sun, Zhanxing Zhu, Yunhuai Liu

Main category: cs.LG

TL;DR: Unisoma introduces explicit modeling for multi-solid systems using Transformer-based modules, outperforming implicit methods in capturing complex interactions.

Details

Motivation: Existing deep learning methods struggle with multi-solid systems due to implicit modeling's limitations in handling intricate physical interactions.

Method: Unisoma uses explicit modeling with contact modules, adaptive interaction allocation, and triplet relationships to learn deformation.

Result: Unisoma achieves state-of-the-art performance on seven datasets and two complex tasks.

Conclusion: Explicit modeling is superior for multi-solid systems, as demonstrated by Unisoma’s effectiveness.

Abstract: Multi-solid systems are foundational to a wide range of real-world applications, yet modeling their complex interactions remains challenging. Existing deep learning methods predominantly rely on implicit modeling, where the factors influencing solid deformation are not explicitly represented but are instead indirectly learned. However, as the number of solids increases, these methods struggle to accurately capture intricate physical interactions. In this paper, we introduce a novel explicit modeling paradigm that incorporates factors influencing solid deformation through structured modules. Specifically, we present Unisoma, a unified and flexible Transformer-based model capable of handling variable numbers of solids. Unisoma directly captures physical interactions using contact modules and an adaptive interaction allocation mechanism, and learns the deformation through a triplet relationship. Compared to implicit modeling techniques, explicit modeling is better suited for multi-solid systems with diverse coupling patterns, as it enables detailed treatment of each solid while preventing information blending and confusion. Experimentally, Unisoma achieves consistent state-of-the-art performance across seven well-established datasets and two complex multi-solid tasks. Code is available at https://github.com/therontau0054/Unisoma.

[378] Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation

Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp

Main category: cs.LG

TL;DR: The paper investigates class-dependent effects in evaluating feature attribution methods in XAI, revealing discrepancies between perturbation-based and ground truth metrics, urging caution in interpreting these metrics.

Details

Motivation: To address the unreliability of perturbation-based metrics in evaluating feature attribution methods, especially when ground truth is unavailable, and to understand class-dependent evaluation effects.

Method: Controlled experiments with synthetic time series data, varying feature types and class contrasts, comparing perturbation-based degradation scores with ground truth-based precision-recall metrics.

Result: Class-dependent effects arise in both evaluation approaches, with weak correlations between them, indicating contradictory assessments of attribution quality.

Conclusion: Perturbation-based metrics may not reliably measure attribution quality, suggesting the need for more rigorous evaluation methods capturing multiple dimensions of attribution quality.

Abstract: Evaluating feature attribution methods represents a critical challenge in explainable AI (XAI), as researchers typically rely on perturbation-based metrics when ground truth is unavailable. However, recent work reveals that these evaluation metrics can show different performance across predicted classes within the same dataset. These “class-dependent evaluation effects” raise questions about whether perturbation analysis reliably measures attribution quality, with direct implications for XAI method development and evaluation trustworthiness. We investigate under which conditions these class-dependent effects arise by conducting controlled experiments with synthetic time series data where ground truth feature locations are known. We systematically vary feature types and class contrasts across binary classification tasks, then compare perturbation-based degradation scores with ground truth-based precision-recall metrics using multiple attribution methods. Our experiments demonstrate that class-dependent effects emerge with both evaluation approaches, even in simple scenarios with temporally localized features, triggered by basic variations in feature amplitude or temporal extent between classes. Most critically, we find that perturbation-based and ground truth metrics frequently yield contradictory assessments of attribution quality across classes, with weak correlations between evaluation approaches. These findings suggest that researchers should interpret perturbation-based metrics with care, as they may not always align with whether attributions correctly identify discriminating features. By showing this disconnect, our work points toward reconsidering what attribution evaluation actually measures and developing more rigorous evaluation methods that capture multiple dimensions of attribution quality.
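
The two evaluation families compared in this abstract can be sketched on a toy series. The code below is a simplified illustration under stated assumptions: it scores the top-k attributed time steps against a known ground-truth mask, and separately measures how much a hypothetical model score drops when those steps are replaced by a baseline value. The function names, the top-k selection rule, and the toy model are assumptions; the paper's exact metrics may differ.

```python
def top_k_indices(attributions, k):
    """Indices of the k time steps with the largest absolute attribution."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    return order[:k]


def topk_precision_recall(attributions, truth_mask, k):
    """Ground-truth view: do the top-k steps overlap the true feature location?"""
    selected = set(top_k_indices(attributions, k))
    relevant = {i for i, flag in enumerate(truth_mask) if flag}
    hits = len(selected & relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def perturbation_drop(model, series, attributions, k, baseline=0.0):
    """Perturbation view: score drop after overwriting the top-k steps."""
    perturbed = list(series)
    for i in top_k_indices(attributions, k):
        perturbed[i] = baseline
    return model(series) - model(perturbed)


# Toy data: the discriminating feature sits at time steps 2 and 3.
series = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
truth_mask = [0, 0, 1, 1, 0, 0]
attributions = [0.1, 0.0, 0.9, 0.2, 0.8, 0.0]


def toy_model(x):
    """Hypothetical classifier score: mean amplitude over steps 2..3."""
    return (x[2] + x[3]) / 2.0


print(topk_precision_recall(attributions, truth_mask, k=2))
print(perturbation_drop(toy_model, series, attributions, k=2))
```

Computing both numbers per class on the same attributions is one way to check whether the two views agree, which is the comparison the abstract reports as frequently contradictory.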

[379] LLM Web Dynamics: Tracing Model Collapse in a Network of LLMs

Tianyu Wang, Akira Horiguchi, Lingyou Pang, Carey E. Priebe

Main category: cs.LG

TL;DR: The paper introduces LLM Web Dynamics (LWD) to study model collapse in LLMs at the network level, using a RAG database to simulate the Internet and providing theoretical guarantees.

Details

Motivation: To address the insufficient exploration of model collapse in LLMs, especially beyond single-model settings and statistical surrogates.

Method: Proposes LWD framework, simulating the Internet with a RAG database to analyze model output convergence, supported by theoretical analogies to Gaussian Mixture Models.

Result: The framework efficiently investigates model collapse at the network level and provides theoretical insights into output convergence.

Conclusion: LWD offers a novel approach to understanding and mitigating model collapse in LLMs, with potential broader implications for synthetic data usage.

Abstract: The increasing use of synthetic data from the public Internet has enhanced data usage efficiency in large language model (LLM) training. However, the potential threat of model collapse remains insufficiently explored. Existing studies primarily examine model collapse in a single model setting or rely solely on statistical surrogates. In this work, we introduce LLM Web Dynamics (LWD), an efficient framework for investigating model collapse at the network level. By simulating the Internet with a retrieval-augmented generation (RAG) database, we analyze the convergence pattern of model outputs. Furthermore, we provide theoretical guarantees for this convergence by drawing an analogy to interacting Gaussian Mixture Models.

[380] Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment

Yuhui Sun, Xiyao Wang, Zixi Li, Zhenlong Yuan, Jinman Zhao

Main category: cs.LG

TL;DR: The paper proposes Multi-Preference Lambda-weighted Listwise DPO, an improved method for aligning large language models (LLMs) with human preferences, addressing limitations of existing approaches like RLHF and DPO.

Details

Motivation: Existing methods like RLHF and DPO have limitations such as high computational costs, instability, and oversimplified preference modeling. The goal is to enable more detailed feedback and flexible multi-goal alignment.

Method: The proposed method uses full-ranked preference distributions instead of binary comparisons, with a lambda vector to balance multiple alignment goals. It includes a learned scheduler for dynamic lambda sampling.

Result: On 1B-2B scale models, the method consistently outperforms standard DPO on alignment benchmarks, requires only 20GB of GPU memory for training, and supports efficient, controllable adaptation.

Conclusion: Multi-Preference Lambda-weighted Listwise DPO offers a robust, scalable, and flexible solution for aligning LLMs with diverse human preferences, suitable for real-world deployment.

Abstract: Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences. Reinforcement Learning from Human Feedback (RLHF) addresses this by optimizing models toward human preferences using a learned reward function and reinforcement learning, yielding improved alignment but suffering from high computational cost and instability. Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs, reducing training overhead while achieving competitive performance. However, it assumes fixed, single-dimensional preferences and only supports pairwise supervision. To address these limitations, we propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback and flexibly balance multiple goals such as helpfulness, honesty, and fluency. Our method models full-ranked preference distributions rather than binary comparisons, enabling more informative learning signals. The lambda vector controls the relative importance of different alignment goals, allowing the model to generalize across diverse human objectives. During inference, lambda can be adjusted without retraining, providing controllable alignment behavior for downstream use. We also introduce a learned scheduler that dynamically samples performant lambda configurations to improve robustness. Notably, our method requires only 20GB of GPU memory for training, making it suitable for compute-constrained settings such as academic labs, educational tools, or on-device assistants. Experiments on 1B-2B scale models show that our method consistently outperforms standard DPO on alignment benchmarks while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
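
One way to read the listwise, lambda-weighted objective is as a cross-entropy between a mixed target distribution over candidate responses and a policy-implied distribution. The sketch below follows that reading under explicit assumptions: the policy distribution is a softmax over `beta`-scaled log-probability ratios against a reference model, and the target is a convex combination of per-objective preference distributions. The function names and the `beta` default are hypothetical, and this may not match the paper's exact loss.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_targets(per_objective_targets, lam):
    """Lambda-weighted mixture of per-objective preference distributions."""
    if abs(sum(lam) - 1.0) > 1e-9 or any(w < 0 for w in lam):
        raise ValueError("lambda must be non-negative and sum to 1")
    n = len(per_objective_targets[0])
    return [sum(w * t[i] for w, t in zip(lam, per_objective_targets))
            for i in range(n)]


def listwise_dpo_loss(policy_logps, ref_logps, target, beta=0.1):
    """Cross-entropy between the target and the policy-implied distribution."""
    scores = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    probs = softmax(scores)
    return -sum(t * math.log(q) for t, q in zip(target, probs) if t > 0)


# Three candidate responses, two objectives (say helpfulness and honesty).
helpfulness = [0.7, 0.2, 0.1]
honesty = [0.1, 0.3, 0.6]
target = mix_targets([helpfulness, honesty], lam=[0.5, 0.5])
loss = listwise_dpo_loss([-1.0, -2.0, -3.0], [-2.0, -2.0, -2.0], target, beta=1.0)
print(target, loss)
```

Because `lam` only enters through the target mixture, changing it changes which ranking the loss rewards without altering the rest of the computation.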

[381] BEAVER: Building Environments with Assessable Variation for Evaluating Multi-Objective Reinforcement Learning

Ruohong Liu, Jack Umenberger, Yize Chen

Main category: cs.LG

TL;DR: The paper explores the scalability of RL in building energy management, highlighting challenges in generalization across environments and objectives. It introduces a contextual RL framework and benchmarks for evaluation, showing limitations of current methods.

Details

Motivation: Address the lack of scalability and generalization in RL-based building energy management across diverse environments and operational scenarios.

Method: Formalizes the generalization space, formulates a multi-objective contextual RL problem, and creates a benchmark for evaluating RL algorithms.

Result: Existing multi-objective RL methods achieve trade-offs but degrade under certain environmental variations, emphasizing the need for context-aware learning.

Conclusion: Incorporating dynamics-dependent contextual information is crucial for improving RL policy generalization in building energy management.

Abstract: Recent years have seen significant advancements in designing reinforcement learning (RL)-based agents for building energy management. While individual success is observed in simulated or controlled environments, the scalability of RL approaches in terms of efficiency and generalization across building dynamics and operational scenarios remains an open question. In this work, we formally characterize the generalization space for the cross-environment, multi-objective building energy management task, and formulate the multi-objective contextual RL problem. Such a formulation helps understand the challenges of transferring learned policies across varied operational contexts such as climate and heat convection dynamics under multiple control objectives such as comfort level and energy consumption. We provide a principled way to parameterize such contextual information in realistic building RL environments, and construct a novel benchmark to facilitate the evaluation of generalizable RL algorithms in practical building control tasks. Our results show that existing multi-objective RL methods are capable of achieving reasonable trade-offs between conflicting objectives. However, their performance degrades under certain environment variations, underscoring the importance of incorporating dynamics-dependent contextual information into the policy learning process.

[382] Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks

Niket Patel, Randall Balestriero

Main category: cs.LG

TL;DR: The paper introduces a probabilistic framework for evaluating AI models over all possible downstream tasks, addressing limitations of fixed benchmarks in SSL.

Details

Motivation: Current evaluation methods in AI rely on fixed benchmarks, creating a bottleneck. The goal is to assess models comprehensively.

Method: Defines a probabilistic space of tasks using Task Priors, enabling evaluation over all possible tasks.

Result: Provides answers to key questions like average performance and variance across tasks under Task Priors.

Conclusion: The framework aims to set a new evaluation standard and accelerate SSL research by offering a more flexible and comprehensive assessment.

Abstract: The grand goal of AI research, and particularly Self-Supervised Learning (SSL), is to produce systems that can successfully solve any possible task. In contrast, current evaluation methods available to AI researchers typically rely on a fixed collection of hand-picked downstream benchmarks. Hence, a large amount of effort is put into designing and searching for large collections of evaluation tasks that can serve as a proxy for our grand goal. We argue that such a rigid evaluation protocol creates a silent bottleneck in AI research. To remedy that, we define a probabilistic space of downstream tasks obtained by adopting a distribution of tasks and by defining Task Priors. Under this view, one can evaluate a model’s performance over the set of all possible downstream tasks. Our framework is the first to provide answers to key questions such as (i) what is the average performance of my model over all possible downstream tasks weighted by the probability to encounter each task? or (ii) what is the variance of my model’s performance across all downstream tasks under the defined Task Priors? Beyond establishing a new standard for evaluation, we believe that Task Priors will accelerate the pace of research in SSL, where downstream task evaluation is the sole qualitative signal that researchers have access to.
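
Questions (i) and (ii) reduce to a probability-weighted mean and variance once a task distribution is fixed. The sketch below shows that computation for a finite set of tasks. This is a deliberate simplification: the paper defines a prior over a space of tasks, whereas the discrete list, the function name, and the example numbers here are assumptions for illustration.

```python
def performance_moments(scores, task_probs):
    """Probability-weighted mean and variance of per-task scores."""
    if len(scores) != len(task_probs) or not scores:
        raise ValueError("scores and task_probs must be non-empty and aligned")
    total = sum(task_probs)
    if total <= 0:
        raise ValueError("task probabilities must have positive mass")
    # Normalize so the weights form a proper distribution.
    weights = [p / total for p in task_probs]
    mean = sum(w * s for w, s in zip(weights, scores))
    variance = sum(w * (s - mean) ** 2 for w, s in zip(weights, scores))
    return mean, variance


# Hypothetical accuracies on three tasks and how likely each task is.
mean, variance = performance_moments([0.9, 0.6, 0.3], [0.5, 0.3, 0.2])
print(mean, variance)
```

Two models with the same mean can differ sharply in variance, which is the distinction a fixed benchmark average hides.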

[383] SDSC: A Structure-Aware Metric for Semantic Signal Representation Learning

Jeyoung Lee, Hochul Kang

Main category: cs.LG

TL;DR: Proposes SDSC, a structure-aware metric for time series SSL, addressing limitations of distance-based methods like MSE by focusing on structural agreement.

Details

Motivation: Distance-based objectives like MSE are amplitude-sensitive, polarity-invariant, and unbounded, hindering semantic alignment and interpretability in signal representations.

Method: SDSC quantifies structural agreement using signed amplitude intersection, derived from DSC. It can be used as a loss with a differentiable approximation and combined with MSE for stability.

Result: SDSC-based pre-training matches or outperforms MSE, especially in in-domain and low-resource settings, improving semantic representation quality.

Conclusion: Structure-aware metrics like SDSC are viable alternatives to distance-based methods, enhancing signal representation fidelity.

Abstract: We propose the Signal Dice Similarity Coefficient (SDSC), a structure-aware metric function for time series self-supervised representation learning. Most Self-Supervised Learning (SSL) methods for signals commonly adopt distance-based objectives such as mean squared error (MSE), which are sensitive to amplitude, invariant to waveform polarity, and unbounded in scale. These properties hinder semantic alignment and reduce interpretability. SDSC addresses this by quantifying structural agreement between temporal signals based on the intersection of signed amplitudes, derived from the Dice Similarity Coefficient (DSC). Although SDSC is defined as a structure-aware metric, it can be used as a loss by subtracting it from 1 and applying a differentiable approximation of the Heaviside function for gradient-based optimization. A hybrid loss formulation is also proposed to combine SDSC with MSE, improving stability and preserving amplitude where necessary. Experiments on forecasting and classification benchmarks demonstrate that SDSC-based pre-training achieves comparable or improved performance over MSE, particularly in in-domain and low-resource scenarios. The results suggest that structural fidelity in signal representations enhances the semantic representation quality, supporting the consideration of structure-aware metrics as viable alternatives to conventional distance-based methods.
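
A Dice-style score over signed amplitudes can be sketched as follows. This is one plausible reading of "intersection of signed amplitudes", not the paper's reference implementation: two samples overlap only when they share a sign, the overlap is the smaller of the two magnitudes, and the score is twice the total overlap divided by the total magnitude of both signals. The sketch uses a hard Heaviside step, so it is not differentiable. The function names and the all-zero convention are assumptions.

```python
def sdsc(estimate, reference):
    """Dice-style structural agreement in [0, 1] between two equal-length signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    intersection = 0.0
    total = 0.0
    for e, r in zip(estimate, reference):
        total += abs(e) + abs(r)
        # Hard Heaviside on the product: count overlap only when signs agree.
        if e * r > 0:
            intersection += min(abs(e), abs(r))
    # Convention assumed here: two all-zero signals agree perfectly.
    return 1.0 if total == 0.0 else 2.0 * intersection / total


def sdsc_loss(estimate, reference):
    """Loss form described in the abstract: one minus the score."""
    return 1.0 - sdsc(estimate, reference)


signal = [1.0, -2.0, 3.0]
flipped = [-x for x in signal]
print(sdsc(signal, signal), sdsc(signal, flipped), sdsc([1.0, 2.0], [0.5, 2.0]))
```

For gradient-based training, the abstract indicates the hard step would be replaced by a smooth approximation; the exact form of that approximation is not given in this summary.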

[384] Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards

Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Mohammad Ali Alomrani, Liheng Ma, Yu Luo, Dong Li, Feng Wen, Jianye Hao, Mark Coates, Yingxue Zhang

Main category: cs.LG

TL;DR: Omni-Thinker, a unified RL framework, enhances LLM performance by combining rule-based rewards and generative preference signals, improving generalization and reducing forgetting through curriculum learning.

Details

Motivation: Addressing the limitation of post-training methods like SFT, which struggle with generalization and favor memorization, by developing a more transferable learning approach.

Method: Introduces Omni-Thinker, a reinforcement learning framework using rule-based verifiable rewards and LLM-as-a-Judge evaluations, with curriculum-based task progression.

Result: Curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging across four domains.

Conclusion: Task-aware sampling and hybrid supervision are crucial for scaling RL-based post-training in general-purpose LLMs.

Abstract: The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Thinker, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.
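
The hybrid supervision and curriculum ordering described here can be sketched as a reward dispatcher plus a sort. Everything in the sketch is hypothetical: the `Task` fields, the exact-match verifier, the `openness` ordering key, and the `judge` callable (a stand-in for an LLM-as-a-Judge call) are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Task:
    name: str
    prompt: str
    reference: Optional[str]  # set for verifiable tasks, None for open-ended ones
    openness: int             # hypothetical ordering key: lower means more structured


def hybrid_reward(task: Task, response: str,
                  judge: Callable[[str, str], float]) -> float:
    """Rule-based reward when a reference exists, judge score otherwise."""
    if task.reference is not None:
        # Verifiable branch: exact match after trimming whitespace.
        return 1.0 if response.strip() == task.reference.strip() else 0.0
    # Open-ended branch: clip the judge's score into [0, 1].
    return min(max(judge(task.prompt, response), 0.0), 1.0)


def curriculum(tasks: List[Task]) -> List[Task]:
    """Order tasks from structured to open-ended."""
    return sorted(tasks, key=lambda t: t.openness)


def stub_judge(prompt: str, response: str) -> float:
    """Placeholder judge: longer answers score higher, capped by the caller."""
    return len(response) / 20.0


tasks = [
    Task("story", "Write a short story.", None, openness=2),
    Task("math", "What is 2 + 2?", "4", openness=0),
]
ordered = curriculum(tasks)
print([t.name for t in ordered])
print(hybrid_reward(ordered[0], " 4 ", stub_judge))
```

Keeping both branches behind one function returning a value in [0, 1] is what lets a single RL loop train on verifiable and subjective tasks together.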

[385] Diffusion Beats Autoregressive in Data-Constrained Settings

Mihir Prabhudesai, Menging Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak

Main category: cs.LG

TL;DR: Diffusion models outperform autoregressive (AR) models in data-scarce settings due to better data utilization and implicit augmentation.

Details

Motivation: To explore the advantages of diffusion-based language models over AR models, especially in data-constrained scenarios.

Method: Systematic study of masked diffusion models in data-constrained settings, analyzing their performance and scaling laws.

Result: Diffusion models achieve lower validation loss and superior downstream performance when compute is abundant but data is scarce.

Conclusion: Diffusion models are a compelling alternative to AR models when data is the bottleneck, offering better performance and implicit data augmentation.

Abstract: Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR’s fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
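
The "implicit data augmentation" interpretation can be made concrete by comparing the training examples each objective draws from one sequence. The sketch below is illustrative only: AR always yields the same prefix-to-next-token pairs, while a masked-diffusion-style sampler yields a different corrupted input on each pass. The mask token, the uniform mask-ratio sampling, and the function names are assumptions rather than the paper's training recipe.

```python
import random

MASK = "<mask>"


def ar_examples(tokens):
    """Fixed left-to-right factorization: (prefix, next token) pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, rng):
    """One masked-diffusion-style example: mask a random subset, predict it."""
    ratio = rng.uniform(0.0, 1.0)                 # assumed: uniform mask ratio
    n_mask = max(1, round(ratio * len(tokens)))   # always mask at least one token
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return corrupted, targets


tokens = ["the", "cat", "sat", "on", "mats"]
rng = random.Random(0)

# Repeating an epoch gives AR the same examples every time...
print(ar_examples(tokens) == ar_examples(tokens))
# ...while each masked pass over the same sequence is a fresh prediction task.
for _ in range(3):
    print(masked_example(tokens, rng))
```

With only five tokens there are already 31 non-empty mask patterns, compared with four fixed AR prediction tasks, which gives a rough sense of why repeated passes stay informative for longer.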

[386] Latent Space Alignment for AI-Native MIMO Semantic Communications

Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo

Main category: cs.LG

TL;DR: The paper proposes a method to address semantic mismatches in communications by using MIMO technology to align latent spaces and mitigate physical channel issues. Two solutions are explored: a linear model optimized via ADMM and a neural network-based model under constraints.

Details

Motivation: Semantic mismatches in communications can hinder mutual understanding when devices use different languages or representations. This work aims to align latent spaces and improve task completion.

Method: The approach involves learning a MIMO precoder/decoder pair for latent space compression and semantic channel equalization. Two models are tested: a linear ADMM-optimized solution and a constrained neural network.

Result: Numerical results show the method’s effectiveness in balancing accuracy, communication burden, and complexity in goal-oriented semantic communication.

Conclusion: The proposed MIMO-based solutions successfully mitigate semantic mismatches and channel impairments, offering practical trade-offs for semantic communications.

Abstract: Semantic communications focus on prioritizing the understanding of the meaning behind transmitted data and ensuring the successful completion of tasks that motivate the exchange of information. However, when devices rely on different languages, logic, or internal representations, semantic mismatches may occur, potentially hindering mutual understanding. This paper introduces a novel approach to addressing latent space misalignment in semantic communications, exploiting multiple-input multiple-output (MIMO) communications. Specifically, our method learns a MIMO precoder/decoder pair that jointly performs latent space compression and semantic channel equalization, mitigating both semantic mismatches and physical channel impairments. We explore two solutions: (i) a linear model, optimized by solving a biconvex optimization problem via the alternating direction method of multipliers (ADMM); (ii) a neural network-based model, which learns semantic MIMO precoder/decoder under transmission power budget and complexity constraints. Numerical results demonstrate the effectiveness of the proposed approach in a goal-oriented semantic communication scenario, illustrating the main trade-offs between accuracy, communication burden, and complexity of the solutions.

[387] EarthLink: A Self-Evolving AI Agent for Climate Science

Zijie Guo, Jiong Wang, Xiaoyu Yue, Wangxu Wei, Zhe Jiang, Wanghan Xu, Ben Fei, Wenlong Zhang, Xinyu Gu, Lijing Cheng, Jing-Jia Luo, Chao Li, Yaqiang Wang, Tao Chen, Wanli Ouyang, Fenghua Ling, Lei Bai

Main category: cs.LG

TL;DR: EarthLink is an AI copilot for Earth scientists, automating workflows and refining capabilities through feedback, validated for climate change tasks and rated comparable to junior researchers.

Details

Motivation: Addressing the bottleneck in Earth system research due to fragmented data and complex analytical demands.

Method: Introduces EarthLink, an interactive AI agent that automates research workflows, learns from user interaction, and provides transparent, auditable processes.

Result: Validated on climate change tasks, EarthLink produced scientifically sound analyses and matched aspects of a junior researcher’s workflow.

Conclusion: EarthLink advances efficient, trustworthy, and collaborative Earth system research, shifting scientists to strategic roles.

Abstract: Modern Earth science is at an inflection point. The vast, fragmented, and complex nature of Earth system data, coupled with increasingly sophisticated analytical demands, creates a significant bottleneck for rapid scientific discovery. Here we introduce EarthLink, the first AI agent designed as an interactive copilot for Earth scientists. It automates the end-to-end research workflow, from planning and code generation to multi-scenario analysis. Unlike static diagnostic tools, EarthLink can learn from user interaction, continuously refining its capabilities through a dynamic feedback loop. We validated its performance on a number of core scientific tasks of climate change, ranging from model-observation comparisons to the diagnosis of complex phenomena. In a multi-expert evaluation, EarthLink produced scientifically sound analyses and demonstrated an analytical competency that was rated as comparable to specific aspects of a human junior researcher’s workflow. Additionally, its transparent, auditable workflows and natural language interface empower scientists to shift from laborious manual execution to strategic oversight and hypothesis generation. EarthLink marks a pivotal step towards an efficient, trustworthy, and collaborative paradigm for Earth system research in an era of accelerating global change. The system is accessible at our website https://earthlink.intern-ai.org.cn.

[388] TOC-UCO: a comprehensive repository of tabular ordinal classification datasets

Rafael Ayllón-Gavilán, David Guijo-Rubio, Antonio Manuel Gómez-Orellana, Francisco Bérchez-Moreno, Víctor Manuel Vargas-Yun, Pedro A. Gutiérrez

Main category: cs.LG

TL;DR: The paper introduces TOC-UCO, a repository of 46 tabular ordinal datasets, to address the lack of benchmark datasets in ordinal classification (OC). It includes preprocessing details and tools for reproducibility.

Details

Motivation: The OC field lacks comprehensive datasets for benchmarking new methodologies, hindering progress. This work aims to fill that gap.

Method: The authors compile and preprocess 46 ordinal datasets under a common framework, ensuring each has a reasonable number of patterns and an appropriate class distribution. They provide the sources and preprocessing steps of each dataset, along with indices for 30 randomized train-test partitions for reproducibility.

Result: TOC-UCO is presented as a publicly available repository, facilitating robust validation of OC approaches.

Conclusion: The repository supports the OC community by standardizing benchmarking and enhancing reproducibility for future research.

Abstract: An ordinal classification (OC) problem corresponds to a special type of classification characterised by the presence of a natural order relationship among the classes. This type of problem can be found in a number of real-world applications, motivating the design and development of many ordinal methodologies over the last years. However, it is important to highlight that the development of the OC field suffers from one main disadvantage: the lack of a comprehensive set of datasets on which novel approaches in the literature can be benchmarked. To address this gap, this manuscript from the University of Córdoba (UCO), which has previous experience in the OC field, provides the literature with a publicly available repository of tabular data for a robust validation of novel OC approaches, namely TOC-UCO (Tabular Ordinal Classification repository of the UCO). Specifically, this repository includes a set of $46$ tabular ordinal datasets, preprocessed under a common framework and ensured to have a reasonable number of patterns and an appropriate class distribution. We also provide the sources and preprocessing steps of each dataset, along with details on how to benchmark a novel approach using the TOC-UCO repository. For this, indices for $30$ different randomised train-test partitions are provided to facilitate the reproducibility of the experiments.
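
As an illustration of what fixed, seeded train-test partitions look like, here is a minimal sketch. It is not the TOC-UCO tooling: the function name `make_partitions`, the 25% test fraction, the per-class stratification, and the use of the partition number as the random seed are all assumptions made for this example. The repository ships its own partition indices, which should be used for actual benchmarking.

```python
import random
from collections import defaultdict


def make_partitions(labels, n_partitions=30, test_fraction=0.25):
    """Return n_partitions (train_idx, test_idx) pairs, stratified by class.

    Hypothetical helper: the seed of partition k is simply k, so the same
    indices are regenerated on every run.
    """
    # Group sample indices by their ordinal class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    partitions = []
    for seed in range(n_partitions):
        rng = random.Random(seed)
        train, test = [], []
        for label in sorted(by_class):
            idx = by_class[label][:]
            rng.shuffle(idx)
            # Keep at least one sample of every class in the test split.
            n_test = max(1, round(test_fraction * len(idx)))
            test.extend(idx[:n_test])
            train.extend(idx[n_test:])
        partitions.append((sorted(train), sorted(test)))
    return partitions


# Toy ordinal labels with an uneven class distribution (classes 0 < 1 < 2 < 3).
labels = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 10
partitions = make_partitions(labels)
```

Storing the resulting index lists, rather than the seeds, is what makes results comparable across libraries whose random number generators differ.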

cs.MA

[389] Technical Implementation of Tippy: Multi-Agent Architecture and System Design for Drug Discovery Laboratory Automation

Yao Fehlis, Charles Crain, Aidan Jensen, Michael Watson, James Juhasz, Paul Mandel, Betty Liu, Shawn Mahon, Daren Wilson, Nick Lynch-Jonely, Ben Leedom, David Fuller

Main category: cs.MA

TL;DR: The paper details a multi-agent system (Tippy) for drug discovery lab automation, using a distributed microservices architecture with five specialized agents, OpenAI Agents SDK, and Kubernetes for deployment.

Details

Motivation: To demonstrate how specialized AI agents can coordinate complex lab workflows while ensuring security, scalability, and integration with existing infrastructure.

Method: A distributed microservices architecture with five agents (Supervisor, Molecule, Lab, Analysis, Report), OpenAI Agents SDK for orchestration, Model Context Protocol (MCP) for lab tool access, and Kubernetes for deployment.

Result: The system effectively coordinates lab workflows, integrates with existing infrastructure, and ensures security and scalability.

Conclusion: Specialized AI agents can successfully automate and coordinate complex lab workflows, leveraging standardized protocols and modern deployment strategies.

Abstract: Building on the conceptual framework presented in our previous work on agentic AI for pharmaceutical research, this paper provides a comprehensive technical analysis of Tippy’s multi-agent system implementation for drug discovery laboratory automation. We present a distributed microservices architecture featuring five specialized agents (Supervisor, Molecule, Lab, Analysis, and Report) that coordinate through OpenAI Agents SDK orchestration and access laboratory tools via the Model Context Protocol (MCP). The system architecture encompasses agent-specific tool integration, asynchronous communication patterns, and comprehensive configuration management through Git-based tracking. Our production deployment strategy utilizes Kubernetes container orchestration with Helm charts, Docker containerization, and CI/CD pipelines for automated testing and deployment. The implementation integrates vector databases for RAG functionality and employs an Envoy reverse proxy for secure external access. This work demonstrates how specialized AI agents can effectively coordinate complex laboratory workflows while maintaining security, scalability, reliability, and integration with existing laboratory infrastructure through standardized protocols.

[390] Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan

Main category: cs.MA

TL;DR: ARG-Designer reframes multi-agent system (MAS) design as a conditional autoregressive graph generation task, dynamically creating collaboration graphs tailored to task-specific needs, outperforming existing methods.

Details

Motivation: Existing MAS design approaches are limited by rigid templates and predefined structures, hindering adaptability to diverse tasks.

Method: ARG-Designer uses an autoregressive model to generate collaboration graphs from scratch, dynamically determining agents, roles, and communication links based on task queries.

Result: ARG-Designer achieves state-of-the-art performance, token efficiency, and extensibility across six benchmarks.

Conclusion: ARG-Designer offers a flexible, extensible solution for MAS design, outperforming traditional methods.

Abstract: Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.

[391] Designing Value-Aligned Traffic Agents through Conflict Sensitivity

Astrid Rakow, Joe Collenette, Maike Schwammberger, Marija Slavkovik, Gleifer Vs Alves

Main category: cs.MA

TL;DR: The paper proposes using epistemic game theory to model value conflicts in autonomous traffic agents (ATAs) and introduces Value-Aligned Operational Design Domains (VODDs) to align agent behavior with stakeholder values during development.

Details

Motivation: To ensure ATAs act safely and align with legal, social, and moral values, addressing value conflicts proactively rather than at runtime.

Method: Adopts a formal model from epistemic game theory to analyze value conflicts, focusing on value elicitation, capability specification, explanation, and adaptive refinement. Introduces VODDs to structure autonomy based on contextual value priorities.

Result: Demonstrates how conflict analysis can inform design phases, shifting focus from runtime moral dilemmas to preemptive value-sensitive behavior structuring.

Conclusion: Proactive value alignment during development, using VODDs and conflict analysis, enhances ATAs’ alignment with stakeholder values.

Abstract: Autonomous traffic agents (ATAs) are expected to act in ways that are not only safe, but also aligned with stakeholder values across legal, social, and moral dimensions. In this paper, we adopt an established formal model of conflict from epistemic game theory to support the development of such agents. We focus on value conflicts (situations in which agents face competing goals rooted in value-laden situations) and show how conflict analysis can inform key phases of the design process. This includes value elicitation, capability specification, explanation, and adaptive system refinement. We elaborate and apply the concept of Value-Aligned Operational Design Domains (VODDs) to structure autonomy in accordance with contextual value priorities. Our approach shifts the emphasis from solving moral dilemmas at runtime to anticipating and structuring value-sensitive behaviour during development.

[392] Recognizing and Eliciting Weakly Single Crossing Profiles on Trees

Palash Dey

Main category: cs.MA

TL;DR: The paper introduces the weakly single-crossing domain on trees, generalizing single-crossing domains in social choice. It provides algorithms for recognition and elicitation, proves query complexity bounds, and resolves an open question about optimality.

Details

Motivation: To generalize the single-crossing domain in social choice theory and address challenges in recognizing and eliciting preferences efficiently, especially when the underlying structure is unknown.

Method: Develops polynomial-time algorithms for recognizing and eliciting preferences in the weakly single-crossing domain, including sequential access and unknown tree structures. Proves lower bounds on query complexity.

Result: Provides efficient algorithms for recognition and elicitation, proves matching lower bounds on query complexity, and resolves an open question about optimality with random queries.

Conclusion: The work advances understanding of weakly single-crossing domains, offering practical algorithms and theoretical insights into query complexity and optimality.

Abstract: We introduce and study the weakly single-crossing domain on trees, which is a generalization of the well-studied single-crossing domain in social choice theory. We design a polynomial-time algorithm for recognizing preference profiles which belong to this domain. We then develop an efficient elicitation algorithm for this domain which works even if the preferences can be accessed only sequentially and the underlying single-crossing tree structure is not known beforehand. We also prove a matching lower bound on the query complexity of our elicitation algorithm when the number of voters is large compared to the number of candidates. We also prove a lower bound of $\Omega(m^2\log n)$ on the number of queries that any algorithm needs to ask to elicit a single-crossing profile when random queries are allowed. This resolves an open question in an earlier paper and proves optimality of their preference elicitation algorithm when random queries are allowed.
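
For readers unfamiliar with the domain being generalized, the sketch below checks the classical single-crossing property on a line for a voter order that is already given. It is not the paper's recognition algorithm for weakly single-crossing profiles on trees, and it does not search for a suitable voter order. The function name `is_single_crossing` and the representation of a ranking as a list from most to least preferred are assumptions made for this example.

```python
from itertools import combinations


def is_single_crossing(profile):
    """Check the classical single-crossing property for a given voter order.

    profile: list of rankings, one per voter, each a list of candidates
    from most to least preferred. The profile is single-crossing in this
    order if, for every pair of candidates (a, b), the relative order of
    a and b flips at most once while scanning the voters left to right.
    """
    if not profile:
        return True
    candidates = profile[0]
    # Position of each candidate in each voter's ranking (0 = best).
    positions = [{c: r for r, c in enumerate(ranking)} for ranking in profile]
    for a, b in combinations(candidates, 2):
        prefers_a = [pos[a] < pos[b] for pos in positions]
        flips = sum(x != y for x, y in zip(prefers_a, prefers_a[1:]))
        if flips > 1:
            return False
    return True


# Each pair of candidates flips exactly once along this voter order.
crossing = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"], ["c", "b", "a"]]
# Here a and b flip twice (a>b, then b>a, then a>b again).
not_crossing = [["a", "b"], ["b", "a"], ["a", "b"]]

crossing_ok = is_single_crossing(crossing)
not_crossing_ok = is_single_crossing(not_crossing)
```

The check costs one pass over the voters per candidate pair, so it is quadratic in the number of candidates and linear in the number of voters.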

cs.MM

[393] Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Zijian Yi, Ziming Zhao, Zhishu Shen, Tiehua Zhang

Main category: cs.MM

TL;DR: A framework for multimodal emotion recognition in conversation (MERC) using dynamic hypergraph connections and contrastive learning to improve context modeling and reduce redundancy.

Details

Motivation: The challenge of MERC lies in balancing speaker and context modeling, especially with long-distance conversations and multimodal fusion.

Method: Proposes a framework with a variational hypergraph autoencoder (VHGAE) for dynamic hypergraph connections and contrastive learning to reduce uncertainty.

Result: Outperforms state-of-the-art methods on IEMOCAP and MELD datasets.

Conclusion: The proposed method effectively addresses redundancy and smoothing issues in graph-based MERC, with code released for reproducibility.

Abstract: Multimodal emotion recognition in conversation (MERC) seeks to identify the speakers’ emotions expressed in each utterance, offering significant potential across diverse fields. The challenge of MERC lies in balancing speaker modeling and context modeling, encompassing both long-distance and short-distance contexts, as well as addressing the complexity of multimodal information fusion. Recent research adopts graph-based methods to model intricate conversational relationships effectively. Nevertheless, the majority of these methods utilize a fixed fully connected structure to link all utterances, relying on convolution to interpret complex context. This approach can inherently heighten the redundancy in contextual messages and cause excessive graph network smoothing, particularly in the context of long-distance conversations. To address this issue, we propose a framework that dynamically adjusts hypergraph connections using a variational hypergraph autoencoder (VHGAE), and employs contrastive learning to mitigate uncertainty factors during the reconstruction process. Experimental results demonstrate the effectiveness of our proposal against the state-of-the-art methods on IEMOCAP and MELD datasets. We release the code to support the reproducibility of this work at https://github.com/yzjred/-HAUCL.

eess.AS

[394] ASR-Guided Speaker-Role Diarization and Diarization-Guided ASR Decoding

Arindam Ghosh, Mark Fuhs, Bongjun Kim, Anurag Chowdhury, Monika Woszczyna

Main category: eess.AS

TL;DR: The paper extends end-to-end ASR+SD models to speaker-role diarization (RD), simplifying training, using task-specific predictors, and leveraging RD for ASR decoding.

Details

Motivation: Speaker-role diarization (RD) is more practical than traditional speaker diarization (SD) for applications like doctor-patient or host-guest interactions.

Method: The framework uses forced alignment and cross-entropy loss for training, employs separate predictors for word and role tasks, and integrates RD posterior activity into ASR decoding.

Result: The approach improves role diarization and reduces small-word deletion errors in ASR.

Conclusion: The proposed method enhances RD performance and ASR accuracy, demonstrating its practical utility.

Abstract: From an application standpoint, speaker-role diarization (RD), such as doctor vs. patient, host vs. guest, etc. is often more useful than traditional speaker diarization (SD), which assigns generic labels like speaker-1, speaker-2 etc. In the context of joint automatic speech recognition (ASR) + SD (who spoke what?), recent end-to-end models employ an auxiliary SD transducer, synchronized with the ASR transducer, to predict speakers per word. In this paper, we extend this framework to RD with three key contributions: (1) we simplify the training via forced alignment and cross-entropy loss instead of RNNT loss, (2) we show that word prediction and role prediction require different amounts of predictor’s context, leading to separate task-specific predictors, unlike existing shared-predictor models, and (3) we propose a way to leverage RD posterior activity to influence ASR decoding and reduce small-word deletion errors.

[395] A Concept-based approach to Voice Disorder Detection

Davide Ghia, Gabriele Ciravegna, Alkis Koudounas, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli

Main category: eess.AS

TL;DR: The paper explores Explainable AI (XAI) for diagnosing voice disorders, comparing concept-based models (CBM, CEM) to traditional DNNs for better interpretability without sacrificing performance.

Details

Motivation: Voice disorder diagnosis via AI lacks transparency, limiting clinical trust. XAI offers interpretability, enhancing trustworthiness in healthcare.

Method: Investigates concept-based models (CBM, CEM) as alternatives to DNNs, focusing on their interpretability and performance.

Result: Concept-based models achieve performance comparable to DNNs while providing clearer decision-making insights.

Conclusion: XAI, particularly concept-based models, can improve trust in AI-driven voice disorder diagnosis by balancing performance and interpretability.

Abstract: Voice disorders affect a significant portion of the population, and the ability to diagnose them using automated, non-invasive techniques would represent a substantial advancement in healthcare, improving the quality of life of patients. Recent studies have demonstrated that artificial intelligence models, particularly Deep Neural Networks (DNNs), can effectively address this task. However, due to their complexity, the decision-making process of such models often remains opaque, limiting their trustworthiness in clinical contexts. This paper investigates an alternative approach based on Explainable AI (XAI), a field that aims to improve the interpretability of DNNs by providing different forms of explanations. Specifically, this work focuses on concept-based models such as the Concept Bottleneck Model (CBM) and the Concept Embedding Model (CEM) and how they can achieve performance comparable to traditional deep learning methods, while offering a more transparent and interpretable decision framework.

[396] Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe

Main category: eess.AS

TL;DR: The CHiME-7 and 8 challenges advanced multi-channel ASR and diarization, with trends showing a shift to end-to-end systems, reliance on guided source separation, diarization refinement, weak correlation between transcription and summarization quality, and ongoing difficulties in transcribing spontaneous speech.

Details

Motivation: To advance research in multi-channel, generalizable ASR and diarization of conversational speech by analyzing trends from participant submissions in the CHiME-7 and 8 challenges.

Method: The paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems, while analyzing key trends from 32 diverse systems submitted by 9 teams.

Result: Key findings include the dominance of end-to-end ASR systems, reliance on guided source separation, importance of diarization refinement, weak correlation between transcription and summarization quality, and persistent challenges in transcribing spontaneous speech.

Conclusion: The CHiME challenges highlight progress in ASR and diarization but reveal ongoing challenges, emphasizing the need for further research in neural SSE techniques and accurate transcription of spontaneous speech.

Abstract: The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.

[397] SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding

Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li

Main category: eess.AS

TL;DR: SpecASR is a novel speculative decoding framework for ASR, optimizing latency by adaptive draft sequence generation and recycling, achieving significant speedup without accuracy loss.

Details

Motivation: High decoding latency of LLMs in ASR challenges real-time requirements, and existing speculative decoding methods overlook ASR-specific characteristics.

Method: SpecASR introduces adaptive draft sequence generation, recycling, and a two-pass sparse token tree algorithm to balance latency.

Result: SpecASR achieves 3.04x-3.79x speedup over autoregressive decoding and 1.25x-1.84x over speculative decoding, maintaining accuracy.

Conclusion: SpecASR effectively reduces ASR latency while preserving recognition accuracy, addressing real-time ASR challenges.

Abstract: Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, existing methods usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce the real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that dynamically modifies the draft sequence length to maximize the token acceptance length. SpecASR further proposes a draft sequence recycling strategy that reuses the previously generated draft sequence to reduce the draft ASR model latency. Moreover, a two-pass sparse token tree generation algorithm is also proposed to balance the latency of draft and target ASR models. With extensive experimental results, we demonstrate SpecASR achieves 3.04x-3.79x and 1.25x-1.84x speedup over the baseline autoregressive decoding and speculative decoding, respectively, without any loss in recognition accuracy.
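
To make the baseline concrete, the following is a minimal sketch of generic greedy speculative decoding (draft, then verify), assuming toy next-token functions in place of real models. It does not implement SpecASR's adaptive draft length, draft recycling, or sparse token tree. The names `speculative_decode`, `target_next`, `draft_next`, and the fixed `draft_len` are hypothetical.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy draft-and-verify decoding.

    target_next / draft_next: functions mapping a token list to the next token.
    Returns (tokens, verify_rounds). The tokens match plain greedy decoding
    with target_next; verify_rounds counts verification rounds, each of which
    a real system would run as one batched pass of the large model.
    """
    out = list(prompt)
    verify_rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1) The small model proposes draft_len tokens autoregressively.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(out + draft))

        # 2) The large model checks the draft position by position.
        verify_rounds += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if expected != tok:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
        else:
            # Whole draft accepted: the target contributes one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], verify_rounds


# Toy models (assumptions): the target counts upward modulo 10;
# the draft agrees except that it errs right after token 2.
def target_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    return 0 if tokens[-1] == 2 else (tokens[-1] + 1) % 10


tokens, rounds = speculative_decode(target_next, draft_next, [0], max_new_tokens=8)

# Reference: plain autoregressive greedy decoding with the target model.
greedy = [0]
for _ in range(8):
    greedy.append(target_next(greedy))
```

Because every accepted token is checked against the target model, the output is identical to plain greedy decoding; the saving comes from needing fewer large-model passes than generated tokens whenever the draft is mostly right.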

[398] Speech Enhancement with Dual-path Multi-Channel Linear Prediction Filter and Multi-norm Beamforming

Chengyuan Qin, Wenmeng Xiong, Jing Zhou, Maoshen Jia, Changchun Bao

Main category: eess.AS

TL;DR: A dual-path MCLP and multi-norm beamforming method for speech enhancement, robust in high reverberation.

Details

Motivation: Improve speech enhancement, especially in high reverberation scenarios, by combining dual-path MCLP filters and multi-norm beamforming.

Method: Uses dual-path MCLP filters in time and frequency dimensions, minimizes output power and l1 norm of denoised signals, and preserves target direction signals. Includes an efficient prediction order selection method.

Result: Outperforms baseline methods, particularly in high reverberation.

Conclusion: The proposed method is effective for speech enhancement, especially in challenging reverberant environments.

Abstract: In this paper, we propose a speech enhancement method using dual-path Multi-Channel Linear Prediction (MCLP) filters and multi-norm beamforming. Specifically, the MCLP part in the proposed method is designed with dual-path filters in both time and frequency dimensions. For the beamforming part, we minimize the power of the microphone array output as well as the l1 norm of the denoised signals while preserving source signals from the target directions. An efficient method to select the prediction orders in the dual-path filters is also proposed, which is robust for signals with different reverberation time (T60) values and can be applied to other MCLP-based methods. Evaluations demonstrate that our proposed method outperforms the baseline methods for speech enhancement, particularly in high reverberation scenarios.

[399] Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering

Ivan Medennikov, Taejin Park, Weiqing Wang, He Huang, Kunal Dhawan, Jinhan Wang, Jagadeesh Balam, Boris Ginsburg

Main category: eess.AS

TL;DR: Streaming extension for Sortformer diarization with arrival-time ordered speakers, using AOSC for efficient tracking and dynamic updates.

Details

Motivation: Enhance real-time multi-speaker tracking by ensuring arrival-time ordering and efficient cache utilization.

Method: Uses Arrival-Order Speaker Cache (AOSC) to store and dynamically update speaker embeddings based on arrival order and prediction scores.

Result: Effective and flexible performance on benchmark datasets, even in low-latency setups.

Conclusion: Streaming Sortformer is robust for real-time multi-speaker tracking and streaming speech processing.

Abstract: This paper presents a streaming extension for the Sortformer speaker diarization framework, whose key property is the arrival-time ordering of output speakers. The proposed approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers. Unlike conventional speaker-tracing buffers, AOSC orders embeddings by speaker index corresponding to their arrival time order, and is dynamically updated by selecting frames with the highest scores based on the model’s past predictions. Notably, the number of stored embeddings per speaker is determined dynamically by the update mechanism, ensuring efficient cache utilization and precise speaker tracking. Experiments on benchmark datasets confirm the effectiveness and flexibility of our approach, even in low-latency setups. These results establish Streaming Sortformer as a robust solution for real-time multi-speaker tracking and a foundation for streaming multi-talker speech processing.

[400] DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization

Huakang Chen, Yuepeng Jiang, Guobin Ma, Chunbo Hao, Shuai Wang, Jixun Yao, Ziqian Ning, Meng Meng, Jian Luan, Lei Xie

Main category: eess.AS

TL;DR: DiffRhythm+ is an enhanced diffusion-based model for full-length song generation, addressing data imbalance and controllability issues of its predecessor, DiffRhythm, with expanded datasets and multi-modal style conditioning.

Details

Motivation: Current song generation systems face challenges like data imbalance, insufficient controllability, and inconsistent quality, which DiffRhythm+ aims to overcome.

Method: DiffRhythm+ uses a balanced training dataset and a multi-modal style conditioning strategy (text and audio inputs) for better control and quality. It also optimizes performance based on user preferences.

Result: The model shows improvements in naturalness, arrangement complexity, and listener satisfaction compared to previous systems.

Conclusion: DiffRhythm+ advances full-length song generation by enhancing controllability, flexibility, and musical quality.

Abstract: Songs, as a central form of musical art, exemplify the richness of human intelligence and creativity. While recent advances in generative modeling have enabled notable progress in long-form song generation, current systems for full-length song synthesis still face major challenges, including data imbalance, insufficient controllability, and inconsistent musical quality. DiffRhythm, a pioneering diffusion-based model, advanced the field by generating full-length songs with expressive vocals and accompaniment. However, its performance was constrained by an unbalanced model training dataset and limited controllability over musical style, resulting in noticeable quality disparities and restricted creative flexibility. To address these limitations, we propose DiffRhythm+, an enhanced diffusion-based framework for controllable and flexible full-length song generation. DiffRhythm+ leverages a substantially expanded and balanced training dataset to mitigate issues such as repetition and omission of lyrics, while also fostering the emergence of richer musical skills and expressiveness. The framework introduces a multi-modal style conditioning strategy, enabling users to precisely specify musical styles through both descriptive text and reference audio, thereby significantly enhancing creative control and diversity. We further introduce direct performance optimization aligned with user preferences, guiding the model toward consistently preferred outputs across evaluation metrics. Extensive experiments demonstrate that DiffRhythm+ achieves significant improvements in naturalness, arrangement complexity, and listener satisfaction over previous systems.

[401] Segmentation-free Goodness of Pronunciation

Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

Main category: eess.AS

TL;DR: The paper introduces self-alignment GOP (GOP-SA) and alignment-free GOP (GOP-AF) for phoneme-level pronunciation assessment, improving accuracy and enabling CTC-based acoustic models.

Details

Motivation: Current MDD systems rely on pre-segmented speech, limiting accuracy and compatibility with modern CTC-based models.

Method: Proposes GOP-SA for CTC-trained ASR models and GOP-AF, a general alignment-free method with theoretical and implementation details.

Result: Achieves state-of-the-art results on phoneme-level pronunciation assessment, validated on CMU Kids and Speechocean762 datasets.

Conclusion: The proposed methods enhance MDD systems by leveraging CTC-based models and improving alignment flexibility.

Abstract: Mispronunciation detection and diagnosis (MDD) is a significant part in modern computer aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues as well as a proper normalization which makes the method applicable with acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the Speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.

eess.IV

[402] Improving Multislice Electron Ptychography with a Generative Prior

Christian K. Belardi, Chia-Hao Lee, Yingheng Wang, Justin Lovelace, Kilian Q. Weinberger, David A. Muller, Carla P. Gomes

Main category: eess.IV

TL;DR: MEP-Diffusion, a diffusion model for multislice electron ptychography, serves as a generative prior for existing solvers and improves SSIM by 90.50% over existing methods.

Details

Motivation: Existing iterative algorithms for MEP are slow and produce suboptimal solutions because the inverse problem is ill-posed.

Method: Developed MEP-Diffusion, a diffusion model trained on crystal structures, integrated via Diffusion Posterior Sampling (DPS).

Result: 90.50% improvement in SSIM over existing methods.

Conclusion: Hybrid approach with MEP-Diffusion significantly enhances 3D volume reconstruction quality.

Abstract: Multislice electron ptychography (MEP) is an inverse imaging technique that computationally reconstructs the highest-resolution images of atomic crystal structures from diffraction patterns. Available algorithms often solve this inverse problem iteratively but are both time consuming and produce suboptimal solutions due to their ill-posed nature. We develop MEP-Diffusion, a diffusion model trained on a large database of crystal structures specifically for MEP to augment existing iterative solvers. MEP-Diffusion is easily integrated as a generative prior into existing reconstruction methods via Diffusion Posterior Sampling (DPS). We find that this hybrid approach greatly enhances the quality of the reconstructed 3D volumes, achieving a 90.50% improvement in SSIM over existing methods.

[403] Towards Robust Foundation Models for Digital Pathology

Jonah Kömen, Edwin D. de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, Klaus-Robert Müller

Main category: eess.IV

TL;DR: The paper investigates the robustness of Biomedical Foundation Models (FMs) to non-biological technical features, introduces PathoROB for benchmarking, and proposes solutions to improve FM robustness for clinical use.

Details

Motivation: FMs in healthcare are prone to learning non-biological technical features, risking clinical deployment. This study aims to quantify and address these robustness issues.

Method: Developed PathoROB, a benchmark with three metrics and four datasets, evaluated 20 FMs, and proposed a robustification framework.

Result: Found robustness deficits in all FMs, leading to diagnostic errors. Robust FMs and post-hoc methods reduced but didn’t eliminate risks.

Conclusion: Robustness evaluation is critical for clinical FM validation. Future FM development must prioritize robustness, with PathoROB guiding improvements.

Abstract: Biomedical Foundation Models (FMs) are rapidly transforming AI-enabled healthcare research and entering clinical validation. However, their susceptibility to learning non-biological technical features – including variations in surgical/endoscopic techniques, laboratory procedures, and scanner hardware – poses risks for clinical deployment. We present the first systematic investigation of pathology FM robustness to non-biological features. Our work (i) introduces measures to quantify FM robustness, (ii) demonstrates the consequences of limited robustness, and (iii) proposes a framework for FM robustification to mitigate these issues. Specifically, we developed PathoROB, a robustness benchmark with three novel metrics, including the robustness index, and four datasets covering 28 biological classes from 34 medical centers. Our experiments reveal robustness deficits across all 20 evaluated FMs, and substantial robustness differences between them. We found that non-robust FM representations can cause major diagnostic downstream errors and clinical blunders that prevent safe clinical adoption. Using more robust FMs and post-hoc robustification considerably reduced (but did not yet eliminate) the risk of such errors. This work establishes that robustness evaluation is essential for validating pathology FMs before clinical adoption and demonstrates that future FM development must integrate robustness as a core design principle. PathoROB provides a blueprint for assessing robustness across biomedical domains, guiding FM improvement efforts towards more robust, representative, and clinically deployable AI systems that prioritize biological information over technical artifacts.

[404] Integrating Feature Selection and Machine Learning for Nitrogen Assessment in Grapevine Leaves using In-Field Hyperspectral Imaging

Atif Bilal Asad, Achyut Paudel, Safal Kshetri, Chenchen Kang, Salik Ram Khanal, Nataliya Shcherbatyuk, Pierre Davadant, R. Paul Schreiner, Santosh Kalauni, Manoj Karkee, Markus Keller

Main category: eess.IV

TL;DR: The study uses hyperspectral imaging and ML models to predict nitrogen (N) concentration in grapevine leaves and canopies, identifying key spectral bands and achieving moderate prediction accuracy.

Details

Motivation: Accurate estimation of N concentration in grapevines is crucial for optimal fertilization due to high spatial and temporal variability of soil N.

Method: Hyperspectral images (400-1000nm) of four grapevine cultivars were analyzed. Feature selection identified optimal spectral bands, and ML models (Gradient Boosting, XGBoost) were trained for N prediction.

Result: Key spectral regions (500-525nm, 650-690nm, 750-800nm, 900-950nm) were robust for N prediction. ML models achieved R² of 0.49 (canopy-level) and 0.57 (leaf-level).

Conclusion: Hyperspectral imaging combined with feature selection and ML is promising for monitoring N status in vineyards.

Abstract: Nitrogen (N) is one of the most crucial nutrients in vineyards, affecting plant growth and subsequent products such as wine and juice. Because soil N has high spatial and temporal variability, it is desirable to accurately estimate the N concentration of grapevine leaves and manage fertilization at the individual plant level to optimally meet plant needs. In this study, we used in-field hyperspectral images with wavelengths ranging from 400 to 1000nm of four different grapevine cultivars collected from distinct vineyards and over two growth stages during two growing seasons to develop models for predicting N concentration at the leaf level and canopy level. After image processing, two feature selection methods were employed to identify the optimal set of spectral bands that were responsive to leaf N concentrations. The selected spectral bands were used to train and test two different Machine Learning (ML) models, Gradient Boosting and XGBoost, for predicting nitrogen concentrations. The comparison of selected bands for both leaf-level and canopy-level datasets showed that most of the spectral regions identified by the feature selection methods were consistent across both methods and the dataset types (leaf- and canopy-level datasets), particularly in the key regions, 500-525nm, 650-690nm, 750-800nm, and 900-950nm. These findings indicated the robustness of these spectral regions for predicting nitrogen content. The results for N prediction demonstrated that the ML model achieved an R² of 0.49 for canopy-level data and an R² of 0.57 for leaf-level data, despite using different sets of selected spectral bands for each analysis level. The study demonstrated the potential of using in-field hyperspectral imaging and the use of spectral data in integrated feature selection and ML techniques to monitor N status in vineyards.
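
For readers unfamiliar with the reported metric, the coefficient of determination compares a model's squared error with that of always predicting the mean. The sketch below computes it from scratch; the function name and the sample values are hypothetical and are not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical leaf N concentrations (%) and model predictions, for illustration only.
measured = [2.1, 2.4, 2.9, 3.3, 2.7]
predicted = [2.3, 2.5, 2.7, 3.0, 2.8]
print(round(r_squared(measured, predicted), 3))
```

Under this definition, an R² of 0.57 means the model's squared error is about 43% of the squared error obtained by predicting the mean N concentration for every sample.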

[405] Hierarchical Diffusion Framework for Pseudo-Healthy Brain MRI Inpainting with Enhanced 3D Consistency

Dou Hoon Kwark, Shirui Luo, Xiyue Zhu, Yudu Li, Zhi-Pei Liang, Volodymyr Kindratenko

Main category: eess.IV

TL;DR: A hierarchical diffusion framework for pseudo-healthy MRI inpainting combines axial and coronal 2D models to balance data efficiency and volumetric consistency, outperforming existing methods.

Details

Motivation: Current 2D inpainting methods cause discontinuities in 3D volumes, while 3D models require extensive training data, which is impractical in medical settings.

Method: Uses two perpendicular coarse-to-fine 2D stages: an axial diffusion model for global consistency and a coronal model for detail refinement, with adaptive resampling.

Result: Outperforms state-of-the-art baselines in realism and volumetric consistency.

Conclusion: The method is a promising solution for pseudo-healthy MRI inpainting, balancing efficiency and consistency.

Abstract: Pseudo-healthy image inpainting is an essential preprocessing step for analyzing pathological brain MRI scans. Most current inpainting methods favor slice-wise 2D models for their high in-plane fidelity, but their independence across slices produces discontinuities in the volume. Fully 3D models alleviate this issue, but their high model capacity demands extensive training data for reliable, high-fidelity synthesis – often impractical in medical settings. We address these limitations with a hierarchical diffusion framework by replacing direct 3D modeling with two perpendicular coarse-to-fine 2D stages. An axial diffusion model first yields a coarse, globally consistent inpainting; a coronal diffusion model then refines anatomical details. By combining perpendicular spatial views with adaptive resampling, our method balances data efficiency and volumetric consistency. Our experiments show our approach outperforms state-of-the-art baselines in both realism and volumetric consistency, making it a promising solution for pseudo-healthy image inpainting. Code is available at https://github.com/dou0000/3dMRI-Consistent-Inpaint.

[406] Benchmarking of Deep Learning Methods for Generic MRI Multi-Organ Abdominal Segmentation

Deepa Krishnaswamy, Cosmin Ciausu, Steve Pieper, Ron Kikinis, Benjamin Billot, Andrey Fedorov

Main category: eess.IV

TL;DR: A benchmarking study compares three MRI abdominal segmentation models and introduces ABDSynth, a CT-trained alternative, evaluating their accuracy and generalizability across diverse datasets.

Details

Motivation: MRI segmentation is challenging due to signal variability and limited annotated datasets, prompting the need for robust and generalizable tools.

Method: Three state-of-the-art models (MRSegmentator, MRISegmentator-Abdomen, TotalSegmentator MRI) and ABDSynth (trained on CT data) are benchmarked using three public datasets spanning various MRI sequences and conditions.

Result: MRSegmentator performs best, while ABDSynth offers a viable alternative with slightly lower accuracy but reduced annotation needs.

Conclusion: The study provides tools and datasets for future benchmarking, highlighting trade-offs between accuracy and annotation requirements.

Abstract: Recent advances in deep learning have led to robust automated tools for segmentation of abdominal computed tomography (CT). Meanwhile, segmentation of magnetic resonance imaging (MRI) is substantially more challenging due to the inherent signal variability and the increased effort required for annotating training datasets. Hence, existing approaches are trained on limited sets of MRI sequences, which might limit their generalizability. To characterize the landscape of MRI abdominal segmentation tools, we present here a comprehensive benchmarking of the three state-of-the-art and open-source models: MRSegmentator, MRISegmentator-Abdomen, and TotalSegmentator MRI. Since these models are trained using labor-intensive manual annotation cycles, we also introduce and evaluate ABDSynth, a SynthSeg-based model purely trained on widely available CT segmentations (no real images). More generally, we assess accuracy and generalizability by leveraging three public datasets (not seen by any of the evaluated methods during their training), which span all major manufacturers, five MRI sequences, as well as a variety of subject conditions, voxel resolutions, and fields-of-view. Our results reveal that MRSegmentator achieves the best performance and is most generalizable. In contrast, ABDSynth yields slightly less accurate results, but its relaxed requirements in training data make it an alternative when the annotation budget is limited. The evaluation code and datasets are given for future benchmarking at https://github.com/deepakri201/AbdoBench, along with inference code and weights for ABDSynth.

[407] Direct Dual-Energy CT Material Decomposition using Model-based Denoising Diffusion Model

Hang Xu, Alexandre Bousse, Alessandro Perelli

Main category: eess.IV

TL;DR: Proposes DEcomp-MoD, a deep learning method for direct material decomposition from DECT projection data, outperforming existing methods.

Details

Motivation: Existing DECT material decomposition methods are sub-optimal due to post-reconstruction processing and beam-hardening effects.

Method: DEcomp-MoD integrates spectral DECT model knowledge into training loss and uses a score-based denoising diffusion prior for material image generation.

Result: Outperforms state-of-the-art unsupervised and supervised methods on synthetic DECT sinograms.

Conclusion: DEcomp-MoD shows promise for clinical deployment due to its accuracy and consistency.

Abstract: Dual-energy X-ray Computed Tomography (DECT) constitutes an advanced technology which enables automatic decomposition of materials in clinical images without manual segmentation using the dependency of the X-ray linear attenuation with energy. However, most methods perform material decomposition in the image domain as a post-processing step after reconstruction, but this procedure does not account for the beam-hardening effect and yields sub-optimal results. In this work, we propose a deep learning procedure called Dual-Energy Decomposition Model-based Diffusion (DEcomp-MoD) for quantitative material decomposition which directly converts the DECT projection data into material images. The algorithm is based on incorporating the knowledge of the spectral DECT model into the deep learning training loss and combining a score-based denoising diffusion learned prior in the material image domain. Importantly, the inference optimization loss takes the sinogram directly as input and converts it to material images through a model-based conditional diffusion model, which guarantees consistency of the results. We evaluate the performance with both quantitative and qualitative estimation of the proposed DEcomp-MoD method on synthetic DECT sinograms from the low-dose AAPM dataset. Finally, we show that DEcomp-MoD outperforms state-of-the-art unsupervised score-based models and supervised deep learning networks, with the potential to be deployed for clinical diagnosis.

[408] Dynamic mapping from static labels: remote sensing dynamic sample generation with temporal-spectral embedding

Shuai Yuan, Shuang Chen, Tianwu Lin, Jincheng Yuan, Geng Tian, Yang Xu, Jie Wang, Peng Gong

Main category: eess.IV

TL;DR: TasGen is a two-stage method for generating dynamic training samples from static labels without human intervention, using temporal-spectral anomaly detection and joint embedding.

Details

Motivation: Rapid land surface changes make static samples obsolete quickly, and manual updates are labor-intensive. TasGen aims to automate sample generation to address this.

Method: TasGen uses a hierarchical temporal-spectral variational autoencoder (HTS-VAE) to disentangle and jointly embed temporal and spectral features, followed by a classifier to relabel change points.

Result: The method enables robust anomaly detection and dynamic sample generation, with an additional Gibbs sampling-based interpretation to explain changes.

Conclusion: TasGen automates sample generation and interprets land surface dynamics, offering a sustainable solution for remote sensing mapping.

Abstract: Accurate remote sensing geographic mapping requires timely and representative samples. However, rapid land surface changes often render static samples obsolete within months, making manual sample updates labor-intensive and unsustainable. To address this challenge, we propose TasGen, a two-stage Temporal spectral-aware Automatic Sample Generation method for generating dynamic training samples from single-date static labels without human intervention. Land surface dynamics often manifest as anomalies in temporal-spectral sequences. These anomalies are multivariate yet unified: temporal, spectral, or joint anomalies stem from different mechanisms and cannot be naively coupled, as this may obscure the nature of changes. Yet, any land surface state corresponds to a coherent temporal-spectral signature, which would be lost if the two dimensions are modeled separately. To effectively capture these dynamics, TasGen first disentangles temporal and spectral features to isolate their individual contributions, and then couples them to model their synergistic interactions. In the first stage, we introduce a hierarchical temporal-spectral variational autoencoder (HTS-VAE) with a dual-dimension embedding to learn low-dimensional latent patterns of normal samples by first disentangling and then jointly embedding temporal and spectral information. This temporal-spectral embedding enables robust anomaly detection by identifying deviations from learned joint patterns. In the second stage, a classifier trained on stable samples relabels change points across time to generate dynamic samples. To not only detect but also explain surface dynamics, we further propose an anomaly interpretation method based on Gibbs sampling, which attributes changes to specific spectral-temporal dimensions.

[409] Parameter-Efficient Fine-Tuning of 3D DDPM for MRI Image Generation Using Tensor Networks

Binghua Li, Ziqing Chang, Tong Liang, Chao Li, Toshihisa Tanaka, Shigeki Aoki, Qibin Zhao, Zhe Sun

Main category: eess.IV

TL;DR: Proposes TenVOO, a parameter-efficient fine-tuning method for 3D U-Net-based DDPMs in MRI image generation, achieving state-of-the-art performance with minimal parameters.

Details

Motivation: Addressing the limited research on parameter-efficient representations for 3D convolution operations in MRI image generation.

Method: Introduces TenVOO, using tensor network modeling to represent 3D convolution kernels with lower-dimensional tensors.

Result: Outperforms existing methods in MS-SSIM with only 0.3% of the original model’s trainable parameters.

Conclusion: TenVOO is an effective PEFT method for 3D DDPMs, balancing performance and parameter efficiency.

Abstract: We address the challenge of parameter-efficient fine-tuning (PEFT) for three-dimensional (3D) U-Net-based denoising diffusion probabilistic models (DDPMs) in magnetic resonance imaging (MRI) image generation. Despite its practical significance, research on parameter-efficient representations of 3D convolution operations remains limited. To bridge this gap, we propose Tensor Volumetric Operator (TenVOO), a novel PEFT method specifically designed for fine-tuning DDPMs with 3D convolutional backbones. Leveraging tensor network modeling, TenVOO represents 3D convolution kernels with lower-dimensional tensors, effectively capturing complex spatial dependencies during fine-tuning with few parameters. We evaluate TenVOO on three downstream brain MRI datasets (ADNI, PPMI, and BraTS2021) by fine-tuning a DDPM pretrained on 59,830 T1-weighted brain MRI scans from the UK Biobank. Our results demonstrate that TenVOO achieves state-of-the-art performance in multi-scale structural similarity index measure (MS-SSIM), outperforming existing approaches in capturing spatial dependencies while requiring only 0.3% of the trainable parameters of the original model. Our code is available at: https://github.com/xiaovhua/tenvoo
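
To give a feel for why factorizing a 3D convolution kernel into lower-dimensional tensors saves parameters, the sketch below compares a dense kernel with a generic rank-R CP factorization. This is not TenVOO's tensor network, which the summary does not specify; the CP format, the function names, and the layer sizes are assumptions chosen only for illustration.

```python
def dense_conv3d_params(c_out, c_in, k):
    """Parameters in a dense 3D kernel of shape (c_out, c_in, k, k, k)."""
    return c_out * c_in * k ** 3

def cp_conv3d_params(c_out, c_in, k, rank):
    """Parameters in a rank-`rank` CP factorization of the same kernel:
    one factor matrix per mode (c_out, c_in, and three spatial modes of size k)."""
    return rank * (c_out + c_in + 3 * k)

# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, CP rank 8.
dense = dense_conv3d_params(64, 64, 3)
factored = cp_conv3d_params(64, 64, 3, rank=8)
print(dense, factored, f"{100 * factored / dense:.2f}%")
```

The ratio depends entirely on the chosen rank and layer shape, so it should not be read as a reproduction of the 0.3% figure reported above.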

[410] U-Net Based Healthy 3D Brain Tissue Inpainting

Juexin Zhang, Ying Weng, Ke Chen

Main category: eess.IV

TL;DR: A U-Net-based method synthesizes healthy 3D brain tissue from masked MRI scans, achieving top performance in the BraTS challenge with high SSIM, PSNR, and low MSE scores.

Details

Motivation: To address the task of reconstructing missing or corrupted regions in brain MRI scans, ensuring reliable and consistent synthesis of healthy tissue.

Method: Uses a U-Net architecture with data augmentation (random masking) for training on the BraTS-Local-Inpainting dataset.

Result: Achieved SSIM 0.841, PSNR 23.257, MSE 0.007 with low standard deviations, indicating reliability. Secured first place in the challenge.

Conclusion: The proposed method is effective, robust, and generalizable for brain tissue synthesis, as demonstrated by its top performance in the challenge.

Abstract: This paper introduces a novel approach to synthesize healthy 3D brain tissue from masked input images, specifically focusing on the task of ‘ASNR-MICCAI BraTS Local Synthesis of Tissue via Inpainting’. Our proposed method employs a U-Net-based architecture, which is designed to effectively reconstruct the missing or corrupted regions of brain MRI scans. To enhance our model’s generalization capabilities and robustness, we implement a comprehensive data augmentation strategy that involves randomly masking healthy images during training. Our model is trained on the BraTS-Local-Inpainting dataset and demonstrates exceptional performance in recovering healthy brain tissue. The evaluation metrics employed, including Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE), consistently yield impressive results. On the BraTS-Local-Inpainting validation set, our model achieved an SSIM score of 0.841, a PSNR score of 23.257, and an MSE score of 0.007. Notably, these evaluation metrics exhibit relatively low standard deviations, i.e., 0.103 for SSIM score, 4.213 for PSNR score, and 0.007 for MSE score, which indicates our model’s reliability and consistency across various input scenarios. Our method also secured first place in the challenge.
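
As a reminder of how two of the reported metrics relate, the sketch below computes MSE and PSNR from scratch. It assumes intensities normalized to [0, 1] (so the peak value is 1.0), which the summary does not state; the function names and the per-scan MSE values are hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized, flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr_from_mse(m, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    return 10.0 * math.log10(max_val ** 2 / m)

# Hypothetical per-scan MSE values whose mean is 0.007.
per_scan_mse = [0.002, 0.012]
mean_mse = sum(per_scan_mse) / len(per_scan_mse)

psnr_of_mean = psnr_from_mse(mean_mse)
mean_of_psnrs = sum(psnr_from_mse(m) for m in per_scan_mse) / len(per_scan_mse)
print(round(psnr_of_mean, 3), round(mean_of_psnrs, 3))
```

Because PSNR is a convex function of MSE, the average of per-scan PSNR values is never lower than the PSNR of the average MSE. This is one possible reason a mean PSNR and a mean MSE reported side by side need not satisfy the formula exactly.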

[411] Deep Learning for Glioblastoma Morpho-pathological Features Identification: A BraTS-Pathology Challenge Solution

Juexin Zhang, Ying Weng, Ke Chen

Main category: eess.IV

TL;DR: A deep learning model for glioblastoma diagnosis was developed using a pre-trained model fine-tuned on the BraTS-Path dataset, achieving modest performance metrics but high specificity and second place in testing.

Details

Motivation: Glioblastoma's heterogeneity complicates diagnosis, necessitating improved methods for accurate assessment and treatment selection.

Method: Leveraged a pre-trained model, fine-tuned on the BraTS-Path training dataset, and evaluated on the validation set.

Result: Model showed poor performance (accuracy, recall, F1-score: ~0.39) but high specificity (0.898704) and second place in testing.

Conclusion: The model’s high specificity and competitive testing performance suggest potential, though overall predictive power is limited.

Abstract: Glioblastoma, a highly aggressive brain tumor with diverse molecular and pathological features, poses a diagnostic challenge due to its heterogeneity. Accurate diagnosis and assessment of this heterogeneity are essential for choosing the right treatment and improving patient outcomes. Traditional methods rely on identifying specific features in tissue samples, but deep learning offers a promising approach for improved glioblastoma diagnosis. In this paper, we present our approach to the BraTS-Path Challenge 2024. We leverage a pre-trained model and fine-tune it on the BraTS-Path training dataset. Our model demonstrates poor performance on the challenging BraTS-Path validation set, as rigorously assessed by the Synapse online platform. The model achieves an accuracy of 0.392229, a recall of 0.392229, and an F1-score of 0.392229, indicating a consistent ability to correctly identify instances under the target condition. Notably, our model exhibits a high specificity of 0.898704, showing a strong capacity to correctly classify negative cases. Moreover, a Matthews Correlation Coefficient (MCC) of 0.255267 is calculated, signifying a limited positive correlation between predicted and actual values and highlighting our model’s overall predictive power. Our solution also achieves second place during the testing phase.
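
The identical accuracy, recall, and F1-score values are what one would expect if the metrics were micro-averaged over a single-label multi-class problem, since every misclassification then counts as exactly one false positive and one false negative. The summary does not state the averaging scheme, so this is an assumption; the function name and the confusion matrix below are hypothetical.

```python
def micro_metrics(cm):
    """Micro-averaged metrics from a confusion matrix (rows: true, columns: predicted)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    tp = sum(cm[i][i] for i in range(k))
    # False positives per class: column total minus the diagonal entry.
    fp = sum(sum(cm[r][c] for r in range(k)) - cm[c][c] for c in range(k))
    # False negatives per class: row total minus the diagonal entry.
    fn = sum(sum(cm[r]) - cm[r][r] for r in range(k))
    # Each class sees all n samples, so the four counts sum to k * n.
    tn = k * n - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": tp / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": tn / (tn + fp),
    }

# Hypothetical 3-class confusion matrix, for illustration only.
example = [[5, 3, 2],
           [4, 4, 2],
           [3, 3, 4]]
print(micro_metrics(example))
```

In this setting specificity can be much higher than accuracy, because true negatives accumulate across all classes while errors are counted once per class.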

[412] TCM-Tongue: A Standardized Tongue Image Dataset with Pathological Annotations for AI-Assisted TCM Diagnosis

Xuebo Jin, Longfei Gao, Anshuo Tong, Zhengyang Chen, Jianlei Kong, Ning Sun, Huijun Ma, Qiang Wang, Yuting Bai, Tingli Su

Main category: eess.IV

TL;DR: A specialized dataset for AI-driven TCM tongue diagnosis is introduced, addressing standardization challenges with 6,719 annotated images and benchmarking nine deep learning models.

Details

Motivation: To overcome the lack of standardized, large-scale datasets for AI in TCM tongue diagnosis, hindering reliable computational tools.

Method: Creation of a dataset with 6,719 high-quality images annotated with 20 pathological symptoms, validated by TCM practitioners, and benchmarked using nine deep learning models.

Result: The dataset supports multiple annotation formats and demonstrates utility for AI development, bridging the data gap in TCM.

Conclusion: This dataset advances AI integration in TCM by providing standardized, high-quality diagnostic data for research and clinical practice.

Abstract: Traditional Chinese medicine (TCM) tongue diagnosis, while clinically valuable, faces standardization challenges due to subjective interpretation and inconsistent imaging protocols, compounded by the lack of large-scale, annotated datasets for AI development. To address this gap, we present the first specialized dataset for AI-driven TCM tongue diagnosis, comprising 6,719 high-quality images captured under standardized conditions and annotated with 20 pathological symptom categories (averaging 2.54 clinically validated labels per image, all verified by licensed TCM practitioners). The dataset supports multiple annotation formats (COCO, TXT, XML) for broad usability and has been benchmarked using nine deep learning models (YOLOv5/v7/v8 variants, SSD, and MobileNetV2) to demonstrate its utility for AI development. This resource provides a critical foundation for advancing reliable computational tools in TCM, bridging the data shortage that has hindered progress in the field, and facilitating the integration of AI into both research and clinical practice through standardized, high-quality diagnostic data.

[413] UniSegDiff: Boosting Unified Lesion Segmentation via a Staged Diffusion Model

Yilong Hu, Shijie Chang, Lihe Zhang, Feng Tian, Weibing Sun, Huchuan Lu

Main category: eess.IV

TL;DR: UniSegDiff is a novel diffusion model framework for unified lesion segmentation across modalities and organs, addressing uneven attention distribution in DPMs with staged training and inference.

Details

Motivation: Current DPM training and inference strategies lead to uneven attention distribution, longer training times, and suboptimal solutions for lesion segmentation.

Method: Proposes UniSegDiff with staged training and inference, dynamically adjusting prediction targets and pre-training a feature extraction network for unified segmentation.

Result: Outperforms SOTA approaches on six organs across various imaging modalities.

Conclusion: UniSegDiff effectively addresses DPM limitations, achieving superior performance in lesion segmentation.

Abstract: The Diffusion Probabilistic Model (DPM) has demonstrated remarkable performance across a variety of generative tasks. The inherent randomness in diffusion models helps address issues such as blurring at the edges of medical images and labels, positioning Diffusion Probabilistic Models (DPMs) as a promising approach for lesion segmentation. However, we find that the current training and inference strategies of diffusion models result in an uneven distribution of attention across different timesteps, leading to longer training times and suboptimal solutions. To this end, we propose UniSegDiff, a novel diffusion model framework designed to address lesion segmentation in a unified manner across multiple modalities and organs. This framework introduces a staged training and inference approach, dynamically adjusting the prediction targets at different stages, forcing the model to maintain high attention across all timesteps, and achieves unified lesion segmentation through pre-training the feature extraction network for segmentation. We evaluate performance on six different organs across various imaging modalities. Comprehensive experimental results demonstrate that UniSegDiff significantly outperforms previous state-of-the-art (SOTA) approaches. The code is available at https://github.com/HUYILONG-Z/UniSegDiff.

[414] DiagR1: A Vision-Language Model Trained via Reinforcement Learning for Digestive Pathology Diagnosis

Minxi Ouyang, Lianghui Zhu, Yaqing Bao, Qiang Huang, Jingli Ouyang, Tian Guan, Xitong Ling, Jiawen Li, Song Duan, Wenbin Dai, Li Zheng, Xuemei Zhang, Yonghong He

Main category: eess.IV

TL;DR: The paper addresses challenges in multimodal models for gastrointestinal pathology by improving data quality and reasoning transparency, achieving superior performance in clinical relevance and diagnostic accuracy.

Details

Motivation: Current multimodal models for gastrointestinal pathology suffer from data noise, incomplete annotations, and lack of reasoning transparency, leading to unreliable outputs in clinical settings.

Method: The authors construct a high-quality dataset and propose a prompt argumentation strategy for better feature capture. They also use a post-training pipeline combining supervised fine-tuning and GRPO for improved reasoning.

Result: The approach outperforms state-of-the-art models with 18.7% higher clinical relevance, 32.4% improved structural completeness, and 41.2% fewer diagnostic errors.

Conclusion: The proposed solution enhances accuracy and clinical utility, making it more reliable for pathology image analysis.

Abstract: Multimodal large models have shown great potential in automating pathology image analysis. However, current multimodal models for gastrointestinal pathology are constrained by both data quality and reasoning transparency: pervasive noise and incomplete annotations in public datasets predispose vision-language models to factual hallucinations when generating diagnostic text, while the absence of explicit intermediate reasoning chains renders the outputs difficult to audit and thus less trustworthy in clinical practice. To address these issues, we construct a large-scale gastrointestinal pathology dataset containing both microscopic descriptions and diagnostic conclusions, and propose a prompt argumentation strategy that incorporates lesion classification and anatomical site information. This design guides the model to better capture image-specific features and maintain semantic consistency in generation. Furthermore, we employ a post-training pipeline that combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) to improve reasoning quality and output structure. Experimental results on real-world pathology report generation tasks demonstrate that our approach significantly outperforms state-of-the-art open-source and proprietary baselines in terms of generation quality, structural completeness, and clinical relevance. Our solution outperforms state-of-the-art models with 18.7% higher clinical relevance, 32.4% improved structural completeness, and 41.2% fewer diagnostic errors, demonstrating superior accuracy and clinical utility compared to existing solutions.
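
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage follows; the reward values, the function name, and the choice of population standard deviation with a small epsilon are assumptions, and the paper's reward design is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary on this detail
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four diagnostic reports generated from one image.
rewards = [0.2, 0.9, 0.5, 0.4]
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

Responses scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged.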

[415] X-ray2CTPA: Leveraging Diffusion Models to Enhance Pulmonary Embolism Classification

Noa Cahan, Eyal Klang, Galit Aviram, Yiftach Barash, Eli Konen, Raja Giryes, Hayit Greenspan

Main category: eess.IV

TL;DR: A novel diffusion-based AI method translates 2D chest X-rays into 3D CTPA scans, improving diagnostic accuracy and accessibility.

Details

Motivation: CT scans are more detailed but costly and less accessible than CXRs. This work aims to bridge the gap using AI.

Method: Uses a diffusion-based generative AI approach to convert 2D CXRs into 3D CTPA scans, evaluated quantitatively and by radiologists.

Result: Improved AUC in PE classification using synthesized 3D images, demonstrating diagnostic relevance.

Conclusion: The method generalizes to other cross-modality translations, potentially enhancing accessibility and cost-effectiveness in diagnostics.

Abstract: Chest X-ray, or chest radiography (CXR), is commonly used for medical diagnostics but typically provides limited imaging compared to computed tomography (CT) scans, which offer more detailed and accurate three-dimensional data, particularly contrast-enhanced scans like CT Pulmonary Angiography (CTPA). However, CT scans entail higher costs and greater radiation exposure, and are less accessible than CXRs. In this work we explore cross-modal translation from a 2D low contrast-resolution X-ray input to a 3D high contrast and spatial-resolution CTPA scan. Driven by recent advances in generative AI, we introduce a novel diffusion-based approach to this task. We evaluate the model's performance using both quantitative metrics and qualitative feedback from radiologists, ensuring diagnostic relevance of the generated images. Furthermore, we employ the synthesized 3D images in a classification framework and show improved AUC in a PE categorization task, using the initial CXR input. The proposed method is generalizable and capable of performing additional cross-modality translations in medical imaging. It may pave the way for more accessible and cost-effective advanced diagnostic tools. The code for this project is available at https://github.com/NoaCahan/X-ray2CTPA.

[416] AI Workflow, External Validation, and Development in Eye Disease Diagnosis

Qingyu Chen, Tiarnan D L Keenan, Elvira Agron, Alexis Allot, Emily Guan, Bryant Duong, Amr Elsawy, Benjamin Hou, Cancan Xue, Sanjeeb Bhandari, Geoffrey Broadhead, Chantal Cousineau-Krieger, Ellen Davis, William G Gensheimer, David Grasic, Seema Gupta, Luis Haddock, Eleni Konstantinou, Tania Lamba, Michele Maiberger, Dimosthenis Mantopoulos, Mitul C Mehta, Ayman G Nahri, Mutaz AL-Nawaflh, Arnold Oshinsky, Brittany E Powell, Boonkit Purt, Soo Shin, Hillary Stiefel, Alisa T Thavikulwat, Keith James Wroblewski, Tham Yih Chung, Chui Ming Gemmy Cheung, Ching-Yu Cheng, Emily Y Chew, Michelle R. Hribar, Michael F. Chiang, Zhiyong Lu

Main category: eess.IV

TL;DR: AI-assisted workflow improves AMD diagnosis accuracy and efficiency, validated across diverse datasets.

Details

Motivation: Address gaps in medical AI accountability and enhance real-world applicability for timely disease diagnosis.

Method: Implemented AI-assisted workflow for AMD diagnosis, compared clinician performance with/without AI, and enhanced model with additional data.

Result: AI improved accuracy (20% F1-score increase) and efficiency (up to 40% time savings). Continual learning boosted model performance (29% accuracy increase).

Conclusion: AI assistance significantly enhances diagnostic accuracy and efficiency, demonstrating potential for real-world clinical adoption.

Abstract: Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnostic accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diagnosis and severity classification. We designed and implemented an AI-assisted diagnostic workflow for AMD, comparing diagnostic performance with and without AI assistance among 24 clinicians from 12 institutions with real patient data sampled from the Age-Related Eye Disease Study (AREDS). Additionally, we demonstrated continual enhancement of an existing AI model by incorporating approximately 40,000 additional medical images (the AREDS2 dataset). The improved model was then systematically evaluated using both AREDS and AREDS2 test sets, as well as an external test set from Singapore. AI assistance markedly enhanced diagnostic accuracy and classification for 23 out of 24 clinicians, with the average F1-score increasing by 20% from 37.71 (Manual) to 45.52 (Manual + AI) (P-value < 0.0001), achieving an improvement of over 50% in some cases. In terms of efficiency, AI assistance reduced diagnostic times for 17 out of the 19 clinicians tracked, with time savings of up to 40%. Furthermore, a model equipped with continual learning showed robust performance across three independent datasets, recording a 29% increase in accuracy, and elevating the F1-score from 42 to 54 in the Singapore population.
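
The reported 20% figure is a relative gain, not a difference in F1 points (the absolute gain is 7.81 points). A minimal check of that arithmetic, using only the two averages quoted in the abstract; the helper name is made up for this example.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Average F1-scores quoted in the abstract: Manual vs. Manual + AI.
f1_gain = relative_change(37.71, 45.52)
print(f"{f1_gain:.1f}%")  # 20.7%
```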

[417] ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression

Wei Jiang, Junru Li, Kai Zhang, Li Zhang

Main category: eess.IV

TL;DR: The paper proposes ECVC, a video compression method leveraging non-local correlations and partial cascaded fine-tuning to improve rate-distortion performance and reduce accumulated errors.

Details

Motivation: Existing learned video compression methods focus on temporal movements but neglect non-local correlations and use single reference frames, limiting performance for complex movements.

Method: ECVC enhances temporal priors using non-local correlations across multiple frames and introduces partial cascaded fine-tuning to mitigate error accumulation.

Result: ECVC outperforms DCVC-FM, saving 10.5% and 11.5% more bit-rate than DCVC-FM relative to VTM-13.2 low delay B (LDB) under intra periods of 32 and -1, respectively.

Conclusion: The proposed techniques significantly improve video compression performance, achieving state-of-the-art results.

Abstract: In Learned Video Compression (LVC), improving inter prediction, such as enhancing temporal context mining and mitigating accumulated errors, is crucial for boosting rate-distortion performance. Existing LVCs mainly focus on mining the temporal movements while neglecting non-local correlations among frames. Additionally, current contextual video compression models use a single reference frame, which is insufficient for handling complex movements. To address these issues, we propose leveraging non-local correlations across multiple frames to enhance temporal priors, significantly boosting rate-distortion performance. To mitigate error accumulation, we introduce a partial cascaded fine-tuning strategy that supports fine-tuning on full-length sequences with constrained computational resources. This method reduces the train-test mismatch in sequence lengths and significantly decreases accumulated errors. Based on the proposed techniques, we present a video compression scheme ECVC. Experiments demonstrate that our ECVC achieves state-of-the-art performance, saving 10.5% and 11.5% more bit-rate than the previous SOTA method DCVC-FM, measured against VTM-13.2 low delay B (LDB), under intra periods (IP) of 32 and -1, respectively.

[418] Physics-Informed Implicit Neural Representations for Joint B0 Estimation and Echo Planar Imaging

Wenqi Huang, Nan Wang, Congyu Liao, Yimeng Lin, Mengze Gao, Daniel Rueckert, Kawin Setsompop

Main category: eess.IV

TL;DR: A novel method using Implicit Neural Representations (INRs) and physics-informed correction improves EPI image reconstruction and B0 inhomogeneity estimation, outperforming traditional two-step approaches.

Details

Motivation: EPI's rapid imaging is marred by geometric distortions from B0 inhomogeneities, especially in high B0 regions. Existing methods' error accumulation reduces accuracy.

Method: Integrates INRs with a physics-informed model to jointly estimate B0 and reconstruct distortion-free images from rotated-view EPI, leveraging INRs’ continuous representation.

Result: Outperforms traditional methods in reconstruction quality and field estimation accuracy, tested on 180 brain image slices from three subjects.

Conclusion: The proposed INR-based method offers robust, accurate correction for EPI distortions, adaptable to subject-specific variations.

Abstract: Echo Planar Imaging (EPI) is widely used for its rapid acquisition but suffers from severe geometric distortions due to B0 inhomogeneities, particularly along the phase encoding direction. Existing methods follow a two-step process: reconstructing blip-up/down EPI images, then estimating B0, which can introduce error accumulation and reduce correction accuracy. This is especially problematic in high B0 regions, where distortions align along the same axis, making them harder to disentangle. In this work, we propose a novel approach that integrates Implicit Neural Representations (INRs) with a physics-informed correction model to jointly estimate B0 inhomogeneities and reconstruct distortion-free images from rotated-view EPI acquisitions. INRs offer a flexible, continuous representation that inherently captures complex spatial variations without requiring predefined grid-based field maps. By leveraging this property, our method dynamically adapts to subject-specific B0 variations and improves robustness across different imaging conditions. Experimental results on 180 slices of brain images from three subjects demonstrate that our approach outperforms traditional methods in terms of reconstruction quality and field estimation accuracy.

[419] Tackling Hallucination from Conditional Models for Medical Image Reconstruction with DynamicDPS

Seunghoi Kim, Henry F. J. Tregidgo, Matteo Figini, Chen Jin, Sarang Joshi, Daniel C. Alexander

Main category: eess.IV

TL;DR: DynamicDPS combines conditional and unconditional diffusion models to reduce hallucinations in medical image reconstruction, improving efficiency and fidelity.

Details

Motivation: Hallucinations in medical image reconstruction pose a critical challenge, especially for data-driven models. The goal is to reduce these spurious structures.

Method: DynamicDPS integrates conditional and unconditional diffusion models, skips early reverse process stages, and uses adaptive step sizes for refinement.

Result: The method reduces hallucinations, improving relative volume estimation by over 15% for critical tissues while using only 5% of baseline sampling steps.

Conclusion: DynamicDPS is a robust, model-agnostic solution for hallucination reduction in medical imaging, validated in low-field MRI enhancement.

Abstract: Hallucinations are spurious structures not present in the ground truth, posing a critical challenge in medical image reconstruction, especially for data-driven conditional models. We hypothesize that combining an unconditional diffusion model with data consistency, trained on a diverse dataset, can reduce these hallucinations. Based on this, we propose DynamicDPS, a diffusion-based framework that integrates conditional and unconditional diffusion models to enhance low-quality medical images while systematically reducing hallucinations. Our approach first generates an initial reconstruction using a conditional model, then refines it with an adaptive diffusion-based inverse problem solver. DynamicDPS skips the early stages of the reverse process by selecting an optimal starting time point per sample and applies Wolfe’s line search for adaptive step sizes, improving both efficiency and image fidelity. Using diffusion priors and data consistency, our method effectively reduces hallucinations from any conditional model output. We validate its effectiveness in Image Quality Transfer for low-field MRI enhancement. Extensive evaluations on synthetic and real MR scans, including a downstream task for tissue volume estimation, show that DynamicDPS reduces hallucinations, improving relative volume estimation by over 15% for critical tissues while using only 5% of the sampling steps required by baseline diffusion models. As a model-agnostic and fine-tuning-free approach, DynamicDPS offers a robust solution for hallucination reduction in medical imaging. The code will be made publicly available upon publication.
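
The abstract mentions Wolfe's line search for adaptive step sizes without giving details. The sketch below only illustrates the two standard Wolfe conditions (sufficient decrease and curvature) on a scalar toy data-consistency objective. The objective, the candidate step sizes, and the constants `c1` and `c2` are assumptions chosen for illustration and are not taken from DynamicDPS.

```python
def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check both Wolfe conditions for step size alpha along direction p (scalar case)."""
    slope0 = grad(x) * p  # directional derivative at the current point
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = grad(x_new) * p >= c2 * slope0
    return sufficient_decrease and curvature


# Toy data-consistency term: pull the estimate x toward a measurement y.
y = 1.0
f = lambda x: 0.5 * (x - y) ** 2
grad = lambda x: x - y

x0 = 0.0
p = -grad(x0)  # steepest-descent direction
candidates = [0.01, 1.0, 2.0]
accepted = [a for a in candidates if satisfies_wolfe(f, grad, x0, p, a)]
```

Here the smallest step fails the curvature condition (it barely moves), the largest fails sufficient decrease (it overshoots), and only the middle step is accepted, which is the kind of behaviour that lets a solver pick step sizes adaptively.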

[420] L-FUSION: Laplacian Fetal Ultrasound Segmentation & Uncertainty Estimation

Johanna P. Müller, Robert Wright, Thomas G. Day, Lorenzo Venturini, Samuel F. Budd, Hadrien Reynaud, Joseph V. Hajnal, Reza Razavi, Bernhard Kainz

Main category: eess.IV

TL;DR: L-FUSION integrates uncertainty quantification and foundation models for robust fetal ultrasound segmentation, improving diagnostic accuracy and uncertainty interpretation.

Details

Motivation: Operator dependency and technical limitations in prenatal ultrasound complicate image interpretation and diagnostic uncertainty assessment.

Method: Uses aleatoric logit distributions, Laplace approximations, and integrated Dropout for uncertainty quantification and segmentation.

Result: Achieves superior segmentation accuracy and reliable uncertainty quantification, aiding on-site decision-making.

Conclusion: L-FUSION offers a scalable solution for advancing fetal ultrasound analysis in clinical settings.

Abstract: Accurate analysis of prenatal ultrasound (US) is essential for early detection of developmental anomalies. However, operator dependency and technical limitations (e.g. intrinsic artefacts and effects, setting errors) can complicate image interpretation and the assessment of diagnostic uncertainty. We present L-FUSION (Laplacian Fetal US Segmentation with Integrated FoundatiON models), a framework that integrates uncertainty quantification through unsupervised, normative learning and large-scale foundation models for robust segmentation of fetal structures in normal and pathological scans. We propose to utilise the aleatoric logit distributions of Stochastic Segmentation Networks and Laplace approximations with fast Hessian estimations to estimate epistemic uncertainty only from the segmentation head. This enables us to achieve reliable abnormality quantification for instant diagnostic feedback. Combined with an integrated Dropout component, L-FUSION enables reliable differentiation of lesions from normal fetal anatomy with enhanced uncertainty maps and segmentation counterfactuals in US imaging. It improves epistemic and aleatoric uncertainty interpretation and removes the need for manual disease-labelling. Evaluations across multiple datasets show that L-FUSION achieves superior segmentation accuracy and consistent uncertainty quantification, supporting on-site decision-making and offering a scalable solution for advancing fetal ultrasound analysis in clinical settings.

[421] Modality-Agnostic Brain Lesion Segmentation with Privacy-aware Continual Learning

Yousef Sadegheih, Pratibha Kumari, Dorit Merhof

Main category: eess.IV

TL;DR: A unified brain lesion segmentation model uses continual learning to adapt to diverse MRI datasets, improving performance by 14% over existing methods.

Details

Motivation: Traditional models are limited to specific pathologies and modalities, unlike medical professionals who learn incrementally. The goal is to create a versatile model that mimics this human learning process.

Method: The approach combines a mixture-of-experts mechanism and dual knowledge distillation in a privacy-aware continual learning framework to avoid catastrophic forgetting.

Result: The model outperforms existing methods (LwF, SI, EWC, MiB, TED) by 14% in Dice score across diverse datasets.

Conclusion: The framework advances brain lesion segmentation by enabling a single adaptable model for varying protocols, modalities, and diseases, with code available on GitHub.

Abstract: Traditional brain lesion segmentation models for multi-modal MRI are typically tailored to specific pathologies, relying on datasets with predefined modalities. Adapting to new MRI modalities or pathologies often requires training separate models, which contrasts with how medical professionals incrementally expand their expertise by learning from diverse datasets over time. Inspired by this human learning process, we propose a unified segmentation model capable of sequentially learning from multiple datasets with varying modalities and pathologies. Our approach leverages a privacy-aware continual learning framework that integrates a mixture-of-experts mechanism and dual knowledge distillation to mitigate catastrophic forgetting while not compromising performance on newly encountered datasets. Extensive experiments across five diverse brain MRI datasets and four dataset sequences demonstrate the effectiveness of our framework in maintaining a single adaptable model, capable of handling varying hospital protocols, imaging modalities, and disease types. Compared to widely used privacy-aware continual learning methods such as LwF, SI, EWC, MiB, and TED, our method achieves an average Dice score improvement of approximately 14%. Our framework represents a significant step toward more versatile and practical brain lesion segmentation models, with implementation available on GitHub: https://github.com/xmindflow/BrainCL
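
For readers unfamiliar with the metric behind the 14% figure, here is a minimal sketch of the Dice score on flattened binary masks. The example masks and the choice to return 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may handle these details differently.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    pred_fg = {i for i, v in enumerate(pred) if v}
    target_fg = {i for i, v in enumerate(target) if v}
    denom = len(pred_fg) + len(target_fg)
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * len(pred_fg & target_fg) / denom


# Hypothetical flattened masks: 1 marks a lesion voxel.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
score = dice_score(pred, target)  # two shared voxels out of three each
```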

[422] SR-NeRV: Improving Embedding Efficiency of Neural Video Representation via Super-Resolution

Taiga Hayami, Kakeru Koizumi, Hiroshi Watanabe

Main category: eess.IV

TL;DR: Proposes an INR-based video compression framework with a super-resolution network to improve high-frequency detail reconstruction.

Details

Motivation: High-frequency details are often lost in INR-based video compression due to model size constraints.

Method: Integrates a pre-trained super-resolution network to handle high-frequency components, leveraging their low temporal redundancy.

Result: Outperforms conventional INR-based methods in reconstruction quality without increasing model size.

Conclusion: The hybrid approach enhances visual fidelity in INR-based video compression.

Abstract: Implicit Neural Representations (INRs) have garnered significant attention for their ability to model complex signals in various domains. Recently, INR-based frameworks have shown promise in neural video compression by embedding video content into compact neural networks. However, these methods often struggle to reconstruct high-frequency details under stringent constraints on model size, which are critical in practical compression scenarios. To address this limitation, we propose an INR-based video representation framework that integrates a general-purpose super-resolution (SR) network. This design is motivated by the observation that high-frequency components tend to exhibit low temporal redundancy across frames. By offloading the reconstruction of fine details to a dedicated SR network pre-trained on natural images, the proposed method improves visual fidelity. Experimental results demonstrate that the proposed method outperforms conventional INR-based baselines in reconstruction quality, while maintaining a comparable model size.

[423] BiECVC: Gated Diversification of Bidirectional Contexts for Learned Video Compression

Wei Jiang, Junru Li, Kai Zhang, Li Zhang

Main category: eess.IV

TL;DR: BiECVC is a learned bidirectional video compression (BVC) framework that outperforms VTM 13.2 by improving context modeling and adaptive gating.

Details

Motivation: Existing BVC methods underperform due to limited context extraction and lack of adaptability to fast motion or occlusion.

Method: BiECVC enhances local and non-local context modeling, reuses high-quality features, and introduces Bidirectional Context Gating for dynamic filtering.

Result: BiECVC reduces bit-rate by 13.4% and 15.7% compared to VTM 13.2 under RA configuration.

Conclusion: BiECVC is the first learned video codec to surpass VTM 13.2 RA across all test datasets.

Abstract: Recent forward prediction-based learned video compression (LVC) methods have achieved impressive results, even surpassing VVC reference software VTM under the Low Delay B (LDB) configuration. In contrast, learned bidirectional video compression (BVC) remains underexplored and still lags behind its forward-only counterparts. This performance gap is mainly due to the limited ability to extract diverse and accurate contexts: most existing BVCs primarily exploit temporal motion while neglecting non-local correlations across frames. Moreover, they lack the adaptability to dynamically suppress harmful contexts arising from fast motion or occlusion. To tackle these challenges, we propose BiECVC, a BVC framework that incorporates diversified local and non-local context modeling along with adaptive context gating. For local context enhancement, BiECVC reuses high-quality features from lower layers and aligns them using decoded motion vectors without introducing extra motion overhead. To model non-local dependencies efficiently, we adopt a linear attention mechanism that balances performance and complexity. To further mitigate the impact of inaccurate context prediction, we introduce Bidirectional Context Gating, inspired by data-dependent decay in recent autoregressive language models, to dynamically filter contextual information based on conditional coding results. Extensive experiments demonstrate that BiECVC achieves state-of-the-art performance, reducing the bit-rate by 13.4% and 15.7% compared to VTM 13.2 under the Random Access (RA) configuration with intra periods of 32 and 64, respectively. To our knowledge, BiECVC is the first learned video codec to surpass VTM 13.2 RA across all standard test datasets.

[424] Rate-Accuracy Bounds in Visual Coding for Machines

Ivan V. Bajić

Main category: eess.IV

TL;DR: The paper explores the gap between theoretical rate-accuracy bounds and current methods in visual coding for machines, highlighting significant room for improvement.

Details

Motivation: The rise of automated analysis of visual signals (e.g., images, videos) necessitates compression strategies optimized for machine analysis rather than human reconstruction.

Method: The paper derives rate-accuracy bounds for visual coding for machines, comparing them with state-of-the-art results.

Result: Current methods are significantly (1-3 orders of magnitude) worse than theoretical bounds in terms of bitrate for achieving accuracy.

Conclusion: There is substantial potential for improving visual coding methods for machines to bridge this gap.

Abstract: Increasingly, visual signals such as images, videos and point clouds are being captured solely for the purpose of automated analysis by computer vision models. Applications include traffic monitoring, robotics, autonomous driving, smart home, and many others. This trend has led to the need to develop compression strategies for these signals for the purpose of analysis rather than reconstruction, an area often referred to as “coding for machines.” By drawing parallels with lossy coding of a discrete memoryless source, in this paper we derive rate-accuracy bounds on several popular problems in visual coding for machines, and compare these with state-of-the-art results from the literature. The comparison shows that the current results are at least an order of magnitude – and in some cases two or three orders of magnitude – away from the theoretical bounds in terms of the bitrate needed to achieve a certain level of accuracy. This, in turn, means that there is much room for improvement in the current methods for visual coding for machines.
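
The abstract draws a parallel with lossy coding of a discrete memoryless source but does not state the bounds themselves. As a textbook instance of that setting, and not the paper's derivation, the sketch below evaluates the rate-distortion function of a Bernoulli(p) source under Hamming distortion, R(D) = h(p) - h(D) for D below min(p, 1 - p) and 0 otherwise. Reading the distortion D as an error rate is an assumption made here for illustration.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rate_distortion_bernoulli(p, d):
    """Minimum rate (bits/symbol) to code a Bernoulli(p) source at Hamming distortion d."""
    if d >= min(p, 1.0 - p):
        return 0.0  # this distortion is reachable without sending any bits
    return binary_entropy(p) - binary_entropy(d)


# A fair binary source: lossless coding needs 1 bit, tolerating 11% errors needs about half.
lossless_rate = rate_distortion_bernoulli(0.5, 0.0)
lossy_rate = rate_distortion_bernoulli(0.5, 0.11)
```

The point of such a curve is that accepting a small error rate can lower the required rate substantially, which is the kind of trade-off a rate-accuracy bound quantifies.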

[425] crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023

Navodini Wijethilake, Reuben Dorent, Marina Ivory, Aaron Kujawa, Stefan Cornelissen, Patrick Langenhuizen, Mohamed Okasha, Anna Oviedova, Hexin Dong, Bogyeong Kang, Guillaume Sallé, Luyi Han, Ziyuan Zhao, Han Liu, Yubo Fan, Tao Yang, Shahad Hardan, Hussain Alasmawi, Santosh Sanjeev, Yuzhou Zhuang, Satoshi Kondo, Maria Baldeon Calisto, Shaikh Muhammad Uzair Noman, Cancan Chen, Ipek Oguz, Rongguo Zhang, Mina Rezaei, Susana K. Lai-Yuen, Satoshi Kasai, Yunzhi Huang, Chih-Cheng Hung, Mohammad Yaqub, Lisheng Wang, Benoit M. Dawant, Cuntai Guan, Ritse Mann, Vincent Jaouen, Tae-Eui Kam, Li Zhang, Jonathan Shapey, Tom Vercauteren

Main category: eess.IV

TL;DR: The crossMoDA challenge series focuses on unsupervised cross-modality segmentation (ceT1 to T2 MRI) for VS and cochlea segmentation, evolving over years to include more diverse data and tasks. Findings show improved outlier reduction with dataset expansion, though cochlea segmentation declined in 2023 due to added complexity.

Details

Motivation: To automate VS and cochlea segmentation on T2 MRI for cost-effective VS management, addressing domain shift challenges.

Method: Unsupervised cross-modality segmentation using expanding datasets (single- to multi-institutional) and evolving tasks (basic segmentation to sub-segmentation).

Result: Outliers decreased with dataset diversity, but cochlea Dice score declined in 2023 due to added complexity. The 2023 winning approach improved performance even on homogeneous data.

Conclusion: Progress is noted, but clinically acceptable VS segmentation requires further work. Future benchmarks may need more challenging cross-modal tasks.

Abstract: The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI. The task is an extreme example of domain shift chosen to serve as a meaningful and illustrative benchmark. From a clinical application perspective, it aims to automate Vestibular Schwannoma (VS) and cochlea segmentation on T2 scans for more cost-effective VS management. Over time, the challenge objectives have evolved to enhance its clinical relevance. The challenge evolved from using single-institutional data and basic segmentation in 2021 to incorporating multi-institutional data and Koos grading in 2022, and by 2023, it included heterogeneous routine data and sub-segmentation of intra- and extra-meatal tumour components. In this work, we report the findings of the 2022 and 2023 editions and perform a retrospective analysis of the challenge progression over the years. The observations from the successive challenge contributions indicate that the number of outliers decreases with an expanding dataset. This is notable since the diversity of scanning protocols of the datasets concurrently increased. The winning approach of the 2023 edition reduced the number of outliers on the 2021 and 2022 testing data, demonstrating how increased data heterogeneity can enhance segmentation performance even on homogeneous data. However, the cochlea Dice score declined in 2023, likely due to the added complexity from tumour sub-annotations affecting overall segmentation performance. While progress is still needed for clinically acceptable VS segmentation, the plateauing performance suggests that a more challenging cross-modal task may better serve future benchmarking.

[426] EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control

An Wang, Rulin Zhou, Mengya Xu, Yiru Ye, Longfei Gou, Yiting Chang, Hao Chen, Chwee Ming Lim, Jiankun Wang, Hongliang Ren

Main category: eess.IV

TL;DR: EndoControlMag is a training-free, Lagrangian-based framework for magnifying subtle vascular motions in endoscopic surgery, featuring a PRR scheme and HTM framework for robust performance.

Details

Motivation: Visualizing subtle vascular motions is critical for surgical precision but is challenging due to complex surgical scenes.

Method: Uses Periodic Reference Resetting (PRR) and Hierarchical Tissue-aware Magnification (HTM) with dual-mode mask dilation for adaptive motion magnification.

Result: Outperforms existing methods in accuracy and visual quality, validated on the EndoVMM24 dataset.

Conclusion: EndoControlMag offers a robust solution for vascular motion visualization in endoscopic surgery.

Abstract: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios: motion-based softening excels with complex tissue deformations, while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
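
The distance-based softening described above can be pictured as a weight that equals 1 at the vessel core and decays exponentially with distance into the surrounding tissue. The sketch below is one plausible form under that reading. The function name, the decay length `decay_px`, and the way the weight scales a base magnification factor are all hypothetical and are not taken from the EndoControlMag code.

```python
import math


def distance_softening_weight(distance_px, decay_px):
    """Weight in (0, 1]: 1 at the vessel core, decaying exponentially with distance."""
    return math.exp(-distance_px / decay_px)


# Hypothetical settings: base magnification factor and decay length in pixels.
base_magnification = 10.0
decay_px = 8.0

# Effective magnification at increasing distances from the tracked vessel core.
distances = [0.0, 4.0, 8.0, 16.0]
effective = [base_magnification * distance_softening_weight(d, decay_px) for d in distances]
```

Because the weight depends only on geometry, it stays well defined even when optical flow is unreliable, which matches the stability argument made in the abstract.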

[427] MLRU++: Multiscale Lightweight Residual UNETR++ with Attention for Efficient 3D Medical Image Segmentation

Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC

Main category: eess.IV

TL;DR: MLRU++ is a lightweight CNN-Transformer hybrid for 3D medical image segmentation, balancing accuracy and efficiency with innovations like LCBAM and M2B. It outperforms state-of-the-art models with higher Dice scores and lower computational costs.

Details

Motivation: Medical image segmentation is challenging due to anatomical variability and computational demands. Existing hybrid models are complex, motivating the need for a lightweight yet accurate solution.

Method: Proposes MLRU++ with Lightweight Channel and Bottleneck Attention Module (LCBAM) for efficient feature encoding and Multiscale Bottleneck Block (M2B) for multi-resolution feature aggregation.

Result: Achieves state-of-the-art Dice scores (87.57% Synapse, 93.00% ACDC, 81.12% Lung) with 5.38% and 2.12% improvements on Synapse and ACDC, respectively, while reducing parameters and computational cost.

Conclusion: MLRU++ provides a practical, high-performing solution for 3D medical image segmentation, validated by ablation studies and benchmark results.

Abstract: Accurate and efficient medical image segmentation is crucial but challenging due to anatomical variability and high computational demands on volumetric data. Recent hybrid CNN-Transformer architectures achieve state-of-the-art results but add significant complexity. In this paper, we propose MLRU++, a Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency. It introduces two key innovations: a Lightweight Channel and Bottleneck Attention Module (LCBAM) that enhances contextual feature encoding with minimal overhead, and a Multiscale Bottleneck Block (M2B) in the decoder that captures fine-grained details via multi-resolution feature aggregation. Experiments on four publicly available benchmark datasets (Synapse, BTCV, ACDC, and Decathlon Lung) demonstrate that MLRU++ achieves state-of-the-art performance, with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). Compared to existing leading models, MLRU++ improves Dice scores by 5.38% and 2.12% on Synapse and ACDC, respectively, while significantly reducing parameter count and computational cost. Ablation studies evaluating LCBAM and M2B further confirm the effectiveness of the proposed architectural components. Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks. Source code is available at: https://github.com/1027865/MLRUPP

Table of Contents

cs.CL

[1] Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

[2] Dynamic and Generalizable Process Reward Modeling

[3] VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL

[4] One Whisper to Grade Them All

[5] Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text

[6] Are LLM Belief Updates Consistent with Bayes’ Theorem?

[7] Natural Language Processing for Tigrinya: Current State and Future Directions

[8] Technical Report of TeleChat2, TeleChat2.5 and T1

[9] NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

[10] TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

[11] GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

[12] GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

[13] Synthetic Data Generation for Phrase Break Prediction with Large Language Model

[14] Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

[15] Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints

[16] Step-Audio 2 Technical Report

[17] Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

[18] A New Pair of GloVes

[19] MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

[20] HIVMedQA: Benchmarking large language models for HIV medical decision support

[21] Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

[22] SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

[23] TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

[24] Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization

[25] Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

[26] Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation

[27] Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

[28] Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

[29] Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

[30] StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

[31] BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

[32] Uncertainty Quantification for Evaluating Machine Translation Bias

[33] TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

[34] Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence

[35] CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

[36] Factual Inconsistencies in Multilingual Wikipedia Tables

[37] FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

[38] AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data

[39] Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language

[40] Generation of Synthetic Clinical Text: A Systematic Review

[41] Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

[42] The Moral Gap of Large Language Models

[43] Effective Multi-Task Learning for Biomedical Named Entity Recognition

[44] GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

[45] GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

[46] Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

[47] Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

[48] System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

[49] AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs

[50] TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

[51] Checklists Are Better Than Reward Models For Aligning Language Models

[52] Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias

[53] DocTER: Evaluating Document-based Knowledge Editing

[54] Quantifying the Uniqueness and Divisiveness of Presidential Discourse

[55] Weak-to-Strong Jailbreaking on Large Language Models

[56] P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

[57] VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

[58] Identity-related Speech Suppression in Generative AI Content Moderation

[59] LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

[60] A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects

[61] BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

[62] LLM Alignment as Retriever Optimization: An Information Retrieval Perspective

[63] Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

[64] ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

[65] Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data

[66] How do language models learn facts? Dynamics, curricula and hallucinations

[67] Exploiting individual differences to bootstrap communication

[68] Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

[69] OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

[70] Large Language Models in Argument Mining: A Survey

[71] Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

[72] Mechanistic Indicators of Understanding in Large Language Models

[73] A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

[74] A Survey of Deep Learning for Geometry Problem Solving

[75] FLEXITOKENS: Flexible Tokenization for Evolving Language Models

[76] Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation

[77] What Makes You CLIC: Detection of Croatian Clickbait Headlines