Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 88]
cs.CV [Total: 166]
cs.AI [Total: 61]
cs.SD [Total: 10]
cs.LG [Total: 145]
cs.MA [Total: 4]
cs.MM [Total: 4]
eess.AS [Total: 12]
eess.IV [Total: 17]

cs.CL

[1] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM

Thomas Thebaud, Yen-Ju Lu, Matthew Wiesner, Peter Viechnicki, Najim Dehak

Main category: cs.CL

TL;DR: The paper proposes enriching dialogue transcriptions with metadata tags for speaker characteristics using frozen audio and language models, achieving competitive performance without fine-tuning.

Details

Motivation: To enhance dialogue transcriptions by adding speaker metadata (e.g., age, gender, emotion) without task-specific fine-tuning.

Method: Combines frozen audio foundation models (Whisper, WavLM) with a frozen LLAMA language model using lightweight connectors to infer speaker attributes.

Result: Competitive performance on speaker profiling tasks; a frozen LLAMA model comparing x-vectors directly reaches an Equal Error Rate of 8.8% in some scenarios.

Conclusion: The approach effectively enriches transcriptions with speaker metadata while maintaining modularity and speed.

Abstract: In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
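The Equal Error Rate (EER) quoted above is the operating point at which the false-acceptance and false-rejection rates of a verification system coincide. The following is a minimal sketch of how such a figure can be computed from trial scores; the function name and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy similarity scores, made up for illustration only.
same_speaker = [0.9, 0.8, 0.75, 0.6, 0.3]
different_speaker = [0.7, 0.4, 0.35, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))
```

A lower EER means the two score distributions overlap less; perfectly separated scores give an EER of zero.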

[2] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

Main category: cs.CL

TL;DR: Parity-aware BPE improves tokenization fairness across languages without sacrificing global compression or downstream performance.

Details

Motivation: Addressing inequalities in tokenization for lower-resource languages, which suffer from longer or implausible tokenizations due to frequency-based algorithms.

Method: Introduces Parity-aware BPE, which prioritizes compression gain for the worst-compressed language at each merge step.

Result: Achieves more equitable token counts across languages with negligible impact on global compression and downstream task performance.

Conclusion: Parity-aware BPE effectively balances cross-lingual fairness and computational efficiency.

Abstract: Tokenization is the first – and often least scrutinized – step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
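The merge rule described above can be sketched on a toy corpus. This is a simplified illustration under stated assumptions, not the authors' implementation: compression is approximated as tokens per character, the "compression gain" of a merge is approximated by the pair's frequency in the worst-compressed language, and all function names are hypothetical.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs across a list of tokenized words."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts


def apply_merge(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def parity_aware_bpe(corpora, num_merges):
    """corpora: dict mapping language name -> list of words (plain strings)."""
    state = {lang: [list(w) for w in words] for lang, words in corpora.items()}
    n_chars = {lang: sum(len(w) for w in words) for lang, words in corpora.items()}
    merges = []
    for _ in range(num_merges):
        # Tokens per character: a higher value means worse compression (assumed proxy).
        def rate(lang):
            return sum(len(w) for w in state[lang]) / n_chars[lang]

        worst = max(state, key=rate)
        counts = pair_counts(state[worst])
        if not counts:
            break
        # Pick the pair that is most frequent in the worst-compressed language.
        pair = counts.most_common(1)[0][0]
        merges.append(pair)
        # The vocabulary is shared, so the merge is applied to every language.
        state = {lang: [apply_merge(w, pair) for w in words] for lang, words in state.items()}
    return merges, state


corpora = {"high": ["abab"] * 4, "low": ["xyz"]}
merges, state = parity_aware_bpe(corpora, num_merges=3)
print(merges)
```

In this toy run the second and third merges go to the low-resource language, whereas a purely frequency-based learner would spend its second merge on the dominant language again.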

[3] Pitch Accent Detection improves Pretrained Automatic Speech Recognition

David Sasu, Natalie Schluter

Main category: cs.CL

TL;DR: Joint ASR and pitch accent detection model improves ASR performance and pitch accent detection, reducing WER by 28.3% and closing F1-score gap by 41%.

Details

Motivation: Enhance ASR systems by integrating prosodic cues like pitch accent, leveraging semi-supervised speech representations.

Method: Introduce a joint model combining ASR and pitch accent detection, fine-tuned under limited resources.

Result: Pitch accent detection closes the gap in F1-score by 41%; ASR WER decreases by 28.3% on LibriSpeech under limited-resource fine-tuning.

Conclusion: Extending pretrained speech models to include prosodic cues like pitch accent significantly boosts performance.

Abstract: We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, the ASR performance in joint training decreases WER by 28.3% on LibriSpeech, under limited resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.

[4] Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, Mahmood Hegazy, Alberto Tosato, David John Lemay, Irina Rish, Guillaume Dumas

Main category: cs.CL

TL;DR: PERSIST evaluates personality stability in large language models (LLMs), revealing significant response variability and instability despite interventions, questioning their suitability for safety-critical applications.

Details

Motivation: To understand and assess the stability of personality-like traits in LLMs, crucial for safe deployment, given their inconsistent behavioral patterns.

Method: PERSIST framework evaluates 25+ open-source models (1B-671B parameters) using traditional and novel LLM-adapted personality instruments, testing variations like question order, paraphrasing, personas, and reasoning modes.

Result: Key findings: (1) Substantial response variability even in large models (400B+), (2) Minor prompt reordering alone shifts personality measurements by up to 20%, (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas, and conversation history, can increase variability, (4) LLM-adapted instruments are as unstable as human-centric ones.

Conclusion: Current LLMs lack behavioral consistency, making personality-based alignment strategies inadequate for safety-critical applications.

Abstract: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.

[5] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory

Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li, Peiyan Dong, Joannah Nanjekye, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

Main category: cs.CL

TL;DR: RCR-Router is a dynamic, role-aware context routing framework for multi-agent LLMs, reducing token usage by up to 30% while maintaining answer quality.

Details

Motivation: Existing coordination schemes in multi-agent LLMs are inefficient due to static or full-context routing, leading to high token consumption and limited adaptability.

Method: RCR-Router dynamically selects relevant memory subsets for each agent based on role and task stage, guided by a lightweight scoring policy and iterative memory refinement.

Result: Experiments on multi-hop QA benchmarks show reduced token usage (up to 30%) with maintained or improved answer quality.

Conclusion: Structured memory routing and output-aware evaluation are crucial for scalable multi-agent LLM systems.

Abstract: Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks – HotPotQA, MuSiQue, and 2WikiMultihop – demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
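The idea of selecting role- and stage-relevant memory under a strict token budget can be sketched as a greedy filter. Everything below is a hypothetical illustration: the `MemoryItem` fields, the scoring rule, and the whitespace token count are assumptions, not the paper's scoring policy.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    roles: frozenset  # roles this item is tagged as relevant to (hypothetical tagging)
    stage: str        # task stage in which the item was produced


def score(item, role, stage):
    """Toy relevance score: a role match counts double, a stage match counts once."""
    return 2 * (role in item.roles) + (item.stage == stage)


def route_context(memory, role, stage, token_budget):
    """Greedily keep the highest-scoring items that still fit in the budget."""
    ranked = sorted(memory, key=lambda m: score(m, role, stage), reverse=True)
    selected, used = [], 0
    for item in ranked:
        cost = len(item.text.split())  # whitespace tokens stand in for a real tokenizer
        if score(item, role, stage) > 0 and used + cost <= token_budget:
            selected.append(item)
            used += cost
    return selected


memory = [
    MemoryItem("Paris is the capital of France", frozenset({"verifier"}), "retrieve"),
    MemoryItem("Plan: find the capital then its population", frozenset({"planner"}), "plan"),
    MemoryItem("Population of Paris is about two million", frozenset({"verifier", "planner"}), "retrieve"),
]
for item in route_context(memory, role="verifier", stage="retrieve", token_budget=8):
    print(item.text)
```

With a budget of 8 tokens only the first item is routed to the verifier; raising the budget to 13 admits the third item as well, while the planner-only item is never sent.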

[6] I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah

Main category: cs.CL

TL;DR: A benchmark for detecting demographic bias in LLMs through linguistic shibboleths, showing hedged responses are penalized by 25.6%.

Details

Motivation: To evaluate how LLMs inadvertently reveal or penalize demographic attributes through subtle linguistic markers.

Method: Uses 100 validated question-response pairs in interview simulations to isolate and measure bias while maintaining semantic equivalence.

Result: Hedged responses receive 25.6% lower ratings, revealing systematic bias in LLMs.

Conclusion: Provides a framework for detecting linguistic discrimination in AI, aiding fairness in automated decision-making.

Abstract: This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark’s effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.

[7] Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering

Louie Hong Yao, Nicholas Jarvis, Tianyu Jiang

Main category: cs.CL

TL;DR: A framework for evaluating visual activity recognition using verb sense clusters addresses ambiguities in verb semantics and image interpretation, outperforming standard exact-match methods.

Details

Motivation: Standard exact-match evaluation fails to capture ambiguities in verb semantics and image interpretation, leading to incomplete model assessments.

Method: Proposes a vision-language clustering framework to construct verb sense clusters for robust evaluation.

Result: Analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, and human alignment analysis suggests cluster-based evaluation aligns better with human judgements.

Conclusion: Cluster-based evaluation offers a more nuanced and accurate assessment of model performance compared to standard methods.

Abstract: Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance.
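The contrast between exact-match and cluster-based scoring can be shown with a small sketch. The cluster assignments and helper names below are invented for illustration (only the verb pairs come from the abstract), and the paper's actual scoring rule may differ.

```python
def exact_match_accuracy(predictions, gold):
    """Fraction of predictions identical to the single gold verb."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def cluster_accuracy(predictions, valid_clusters, verb_to_cluster):
    """Count a prediction as correct if its sense cluster is valid for the image."""
    hits = sum(
        verb_to_cluster.get(pred) in clusters
        for pred, clusters in zip(predictions, valid_clusters)
    )
    return hits / len(predictions)


# Hypothetical sense clusters: synonyms share an id, perspectives get separate ids.
verb_to_cluster = {"brushing": 0, "grooming": 0, "piloting": 1, "operating": 2}
gold = ["brushing", "piloting"]
valid_clusters = [{0}, {1, 2}]  # the second image admits two perspectives
predictions = ["grooming", "operating"]

print(exact_match_accuracy(predictions, gold))
print(cluster_accuracy(predictions, valid_clusters, verb_to_cluster))
```

Both predictions miss under exact match yet land in a valid cluster, which is the kind of case the cluster-based evaluation is meant to credit.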

Song Wang, Yishu Wei, Haotian Ma, Max Lovitt, Kelly Deng, Yuan Meng, Zihan Xu, Jingze Zhang, Yunyu Xiao, Ying Ding, Xuhai Xu, Joydeep Ghosh, Yifan Peng

Main category: cs.CL

TL;DR: A multi-stage LLM framework improves extraction of social determinants of health (SDoH) factors from unstructured text, outperforming state-of-the-art models and enhancing explainability.

Details

Motivation: Address challenges in data-driven SDoH analysis for suicide prevention, including long-tailed factor distributions and limited model explainability.

Method: Proposes a multi-stage LLM framework compared to BioBERT, GPT-3.5-turbo, and DeepSeek-R1, with evaluations via automated metrics and a user study.

Result: Outperforms other models in SDoH extraction and context retrieval, with fine-tuned smaller models achieving comparable performance at lower cost.

Conclusion: Enhances accuracy and transparency in SDoH extraction, aiding early risk identification and suicide prevention strategies.

Abstract: Background: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model’s explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Results: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusions: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies.

[9] Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning

Kun Peng, Cong Cao, Hao Peng, Zhifeng Hao, Lei Jiang, Kongjing Gu, Yanbing Liu, Philip S. Yu

Main category: cs.CL

TL;DR: The paper proposes a method for DiaASQ by partitioning dialogues into semantically independent sub-dialogues using structural entropy minimization, followed by a two-step quadruple extraction framework, achieving state-of-the-art performance with lower computational costs.

Details

Motivation: Existing methods assume uniform sentiment distribution across dialogues, but dialogues often contain independent sub-dialogues, introducing noise. The goal is to improve extraction by focusing on relevant sub-dialogues.

Method: Partition dialogues into sub-dialogues using structural entropy minimization. Then, use a two-step framework: extract sentiment elements at utterance level and match quadruples at sub-dialogue level.

Result: The approach achieves state-of-the-art performance in DiaASQ with significantly lower computational costs.

Conclusion: Focusing on semantically independent sub-dialogues and using a two-step extraction framework improves DiaASQ performance efficiently.

Abstract: Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs.
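To make the partitioning objective more concrete, the sketch below computes the standard two-dimensional structural entropy of an undirected graph under a given partition, where nodes could stand for utterances and modules for sub-dialogues. How the paper builds its dialogue graph and searches for the minimizing partition is not stated in the abstract, so the graph and the helper name here are assumptions.

```python
import math
from collections import defaultdict


def structural_entropy(edges, partition):
    """Two-dimensional structural entropy of an undirected graph.

    edges: list of (u, v) pairs; partition: list of node sets covering all nodes.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    two_m = 2 * len(edges)
    total = 0.0
    for module in partition:
        volume = sum(degree[n] for n in module)
        # Edges with exactly one endpoint inside the module.
        cut = sum((u in module) != (v in module) for u, v in edges)
        for n in module:
            if degree[n]:
                total -= degree[n] / two_m * math.log2(degree[n] / volume)
        if volume:
            total -= cut / two_m * math.log2(volume / two_m)
    return total


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = [{0, 1, 2}, {3, 4, 5}]
scrambled = [{0, 3}, {1, 4}, {2, 5}]
print(structural_entropy(edges, by_triangle))
print(structural_entropy(edges, scrambled))
```

The partition that follows the two triangles yields the lower entropy, so a minimization procedure would prefer it over the scrambled grouping.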

[10] Evaluation of LLMs in AMR Parsing

Shu Han Ho

Main category: cs.CL

TL;DR: Finetuning decoder-only LLMs (Phi 3.5, Gemma 2, LLaMA 3.2, DeepSeek R1) achieves competitive AMR parsing performance, with LLaMA 3.2 matching SOTA parsers (SMATCH F1: 0.804).

Details

Motivation: To explore a straightforward approach for AMR parsing by finetuning decoder-only LLMs, avoiding complex architectures.

Method: Finetuned four LLMs (Phi 3.5, Gemma 2, LLaMA 3.2, DeepSeek R1 LLaMA Distilled) and evaluated them on the full LDC2020T02 Gold AMR3.0 test split.

Result: LLaMA 3.2 achieved SMATCH F1: 0.804, comparable to SOTA parsers. Phi 3.5 excelled in structural validity, while LLaMA 3.2 led in semantic performance.

Conclusion: Finetuning decoder-only LLMs is a viable, simpler alternative to complex AMR parsers, with LLaMA 3.2 showing competitive results.

Abstract: Abstract Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, novel, and straightforward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve performance comparable to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.

[11] Align, Don’t Divide: Revisiting the LoRA Architecture in Multi-Task Learning

Jinda Liu, Bo Cheng, Yi Chang, Yuan Wu

Main category: cs.CL

TL;DR: Simplified multi-head LoRA outperforms complex multi-adapter systems; single-adapter LoRA with higher rank is competitive. Align-LoRA, with explicit representation alignment, surpasses baselines.

Details

Motivation: Challenges the multi-component paradigm in PEFT for LLMs, questioning the need for structural diversity in multi-task learning.

Method: Proposes Align-LoRA, which uses an explicit loss to align task representations in a shared adapter space.

Result: Align-LoRA outperforms complex multi-adapter and multi-head systems, showing robust shared representations are key.

Conclusion: Simpler, single-adapter approaches with aligned representations are more effective for multi-task adaptation of LLMs.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) is essential for adapting Large Language Models (LLMs). In practice, LLMs are often required to handle a diverse set of tasks from multiple domains, a scenario naturally addressed by multi-task learning (MTL). Within this MTL context, a prevailing trend involves LoRA variants with multiple adapters or heads, which advocate for structural diversity to capture task-specific knowledge. Our findings present a direct challenge to this paradigm. We first show that a simplified multi-head architecture with high inter-head similarity substantially outperforms complex multi-adapter and multi-head systems. This leads us to question the multi-component paradigm itself, and we further demonstrate that a standard single-adapter LoRA, with a sufficiently increased rank, also achieves highly competitive performance. These results lead us to a new hypothesis: effective MTL generalization hinges on learning robust shared representations, not isolating task-specific features. To validate this, we propose Align-LoRA, which incorporates an explicit loss to align task representations within the shared adapter space. Experiments confirm that Align-LoRA significantly surpasses all baselines, establishing a simpler yet more effective paradigm for adapting LLMs to multiple tasks. The code is available at https://github.com/jinda-liu/Align-LoRA.

[12] Recent Advances in Speech Language Models: A Survey

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King

Main category: cs.CL

TL;DR: The paper surveys Speech Language Models (SpeechLMs), which are end-to-end models for speech generation, addressing limitations of the ASR+LLM+TTS pipeline.

Details

Motivation: Human interaction relies on speech, but current LLMs focus on text. The ASR+LLM+TTS pipeline has issues like latency and information loss, prompting the need for SpeechLMs.

Method: The paper reviews methodologies for building SpeechLMs, including architecture components and training recipes.

Result: It provides a comprehensive overview of SpeechLMs, their capabilities, evaluation metrics, and challenges.

Conclusion: SpeechLMs are a promising alternative to traditional pipelines, with ongoing research needed to address challenges in this evolving field.

Abstract: Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of “Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)”, where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) – end-to-end models that generate speech without converting from text – have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize their evaluation metrics, and discuss the challenges and future research directions in this rapidly evolving field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey

[13] Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations

Aditya Kishore, Gaurav Kumar, Jasabanta Patro

Main category: cs.CL

TL;DR: MultiCheck is a framework for multimodal fact verification, combining text and image analysis to improve accuracy over traditional text-only methods.

Details

Motivation: Address the challenge of multimodal misinformation by integrating both textual and visual evidence for more robust fact-checking.

Method: Uses dedicated encoders for text and images, a fusion module for cross-modal relationships, and a contrastive learning objective for semantic alignment.

Result: Achieves a weighted F1 score of 0.84 on the Factify 2 dataset, outperforming baselines.

Conclusion: Demonstrates the effectiveness of explicit multimodal reasoning for scalable and interpretable fact-checking.

Abstract: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we have proposed a unified framework for fine-grained multimodal fact verification called “MultiCheck”, designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
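The weighted F1 score reported above averages per-class F1 with weights proportional to each class's share of the true labels. The following is a minimal sketch of the metric itself; the label names and toy predictions are placeholders and do not reproduce the Factify 2 label set or the paper's results.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is at least n here.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n / len(y_true) * f1
    return total


# Toy labels, made up for illustration only.
y_true = ["support", "support", "support", "refute"]
y_pred = ["support", "support", "refute", "refute"]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow class frequency, frequent classes dominate the score, which is worth keeping in mind when comparing results on imbalanced datasets.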

[14] SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He

Main category: cs.CL

TL;DR: The study evaluates LLMs in generating code from algorithm descriptions in NLP papers, introducing SciReplicate-Bench and Sci-Reproducer, a dual-agent framework. The best LLM achieves 39% execution accuracy, with inconsistent algorithm descriptions being a major challenge.

Details

Motivation: To assess LLMs' ability to generate code from algorithm descriptions in academic papers, addressing the need for algorithm comprehension and coding expertise.

Method: Introduces SciReplicate-Bench (100 tasks from 36 NLP papers) and Sci-Reproducer, a dual-agent framework (Paper Agent and Code Agent). Evaluates LLMs using reasoning graph accuracy, execution accuracy, CodeBLEU, and dependency recall.

Result: Best-performing LLM achieves 39% execution accuracy. Inconsistent algorithm descriptions hinder reproduction.

Conclusion: The benchmark is challenging, and LLMs struggle with algorithm comprehension and implementation, highlighting the need for better descriptions and models.

Abstract: This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful non-reasoning and reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark’s difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We make available our benchmark and code at https://github.com/xyzCS/SciReplicate-Bench and project homepage at https://xyzcs.github.io/scireplicate.github.io/.

[15] BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation

Yuhao Wang, Ruiyang Ren, Yucheng Wang, Jing Liu, Wayne Xin Zhao, Hua Wu, Haifeng Wang

Main category: cs.CL

TL;DR: The paper introduces BEE-RAG, a framework addressing performance issues in retrieval-augmented generation (RAG) caused by long context lengths, using entropy invariance to stabilize attention dynamics.

Details

Motivation: Large language models (LLMs) face knowledge limitations, and RAG, while helpful, suffers from performance issues due to unconstrained entropy growth and attention dilution in long retrieval contexts.

Method: Proposes BEE-RAG, leveraging entropy invariance to balance context entropy and reformulate attention dynamics, with zero-shot inference for multi-importance estimation and adaptive fine-tuning.

Result: BEE-RAG demonstrates effectiveness across multiple RAG tasks, improving adaptability to varying context lengths.

Conclusion: BEE-RAG successfully addresses RAG performance issues by stabilizing entropy and attention dynamics, offering a robust solution for diverse settings.

Abstract: With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG.
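
Illustrative sketch (not from the paper): the two problems the abstract names, entropy growth and attention dilution, can be seen in a toy softmax. One relevant key gets a fixed score and every other key scores zero. The scores, context lengths, and function names are assumptions chosen for illustration, and the sketch shows only the problem, not BEE-RAG's balancing mechanism.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def attention_stats(context_len: int, relevant_score: float = 4.0) -> tuple[float, float]:
    """Return (attention mass on the relevant key, attention entropy).

    Toy assumption: one relevant key with score `relevant_score`,
    and `context_len - 1` distractor keys with score 0.
    """
    scores = [relevant_score] + [0.0] * (context_len - 1)
    weights = softmax(scores)
    return weights[0], entropy(weights)


if __name__ == "__main__":
    for n in (16, 128, 1024):
        mass, h = attention_stats(n)
        print(f"context={n:5d}  relevant mass={mass:.3f}  entropy={h:.3f}")
```

As the context grows, the attention mass on the relevant key shrinks and the entropy rises, even though the relevant key's score is unchanged.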

[16] Attention Basin: Why Contextual Position Matters in Large Language Models

Zihao Yi, Delong Zeng, Zhenqing Ling, Haohao Luo, Zhe Xu, Wei Liu, Jian Luan, Wanxia Cao, Ying Shen

Main category: cs.CL

TL;DR: LLMs exhibit positional bias, favoring the start and end of sequences (attention basin). AttnRank reranks inputs to align critical info with high-attention positions, improving performance without model changes.

Details

Motivation: To understand and mitigate positional bias in LLMs, enhancing their performance by optimizing attention allocation.

Method: Introduces AttnRank, a two-stage framework: (i) calibrates positional attention preferences, (ii) reranks inputs to align salient content with high-attention positions.

Result: Substantial improvements in multi-hop QA and few-shot learning across 10 LLMs, without modifying models.

Conclusion: AttnRank effectively addresses positional bias, offering a model-agnostic, training-free solution for better LLM performance.

Abstract: The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
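
Illustrative sketch (not from the paper): the second stage described in the abstract, reordering items so that the most salient content sits at high-attention positions, can be approximated with a greedy match between two rankings. The function name, the relevance scores, and the basin-shaped attention profile are assumptions for illustration; the paper's calibration and reranking procedure may differ.

```python
def attention_guided_order(relevance: list[float], position_attention: list[float]) -> list[int]:
    """Return, for each slot in the prompt, the index of the document to place there.

    Greedy matching: the most relevant document goes to the slot with the
    highest calibrated attention, the second most relevant to the second
    highest, and so on.
    """
    if len(relevance) != len(position_attention):
        raise ValueError("need one attention weight per document slot")
    docs_by_relevance = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    slots_by_attention = sorted(
        range(len(position_attention)), key=lambda p: position_attention[p], reverse=True
    )
    order = [0] * len(relevance)
    for doc, slot in zip(docs_by_relevance, slots_by_attention):
        order[slot] = doc
    return order


if __name__ == "__main__":
    relevance = [0.9, 0.2, 0.5, 0.7, 0.1]               # hypothetical retriever scores
    position_attention = [0.30, 0.12, 0.05, 0.18, 0.35]  # hypothetical basin-shaped profile
    print(attention_guided_order(relevance, position_attention))
```

With these made-up numbers, the two most relevant documents (0 and 3) land in the last and first slots, and the least relevant document (4) lands in the middle slot where attention is lowest.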

[17] Towards Assessing Medical Ethics from Knowledge to Practice

Chang Hong, Minghao Wu, Qingying Xiao, Yuchi Wang, Xiang Wan, Guangjun Yu, Benyou Wang, Yan Hu

Main category: cs.CL

TL;DR: PrinciplismQA is a benchmark with 3,648 questions to evaluate LLMs’ alignment with medical ethics, revealing gaps in practical ethical application, especially in Beneficence.

Details

Motivation: Current benchmarks overlook ethical reasoning in LLMs for healthcare, necessitating a rigorous evaluation tool.

Method: PrinciplismQA includes multiple-choice and open-ended questions from authoritative sources, validated by experts, grounded in Principlism.

Result: LLMs show a gap in ethical knowledge vs. practical application, with frontier closed-source models leading. Medical fine-tuning helps but needs better alignment.

Conclusion: PrinciplismQA provides a scalable framework to diagnose ethical weaknesses, advancing balanced and responsible medical AI.

Abstract: The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs’ alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models’ ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models’ overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.

[18] ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering

Catherine Kobus, François Lancelot, Marion-Cécile Martin, Nawal Ould Amer

Main category: cs.CL

TL;DR: The ATLANTIS team’s work for SemEval-2025 Task 3 focuses on detecting hallucinations in QA systems using LLMs, achieving top results in Spanish and competitive rankings in English and German.

Details

Motivation: LLMs advance NLG but often produce incorrect or misleading content (hallucinations). The study aims to address this issue.

Method: Explored methods with/without external context: few-shot prompting, token-level classification, and fine-tuning LLMs on synthetic data.

Result: Top rankings in Spanish and competitive placements in English and German.

Conclusion: Integrating context is key to reducing hallucinations; fine-tuned models and prompt engineering show promise.

Abstract: This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with an LLM, token-level classification, or an LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.

[19] Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation

Haonan Shangguan, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Ge Yu

Main category: cs.CL

TL;DR: The paper introduces MulCoT-RD, a lightweight model for joint multimodal sentiment reasoning and classification (JMSRC), addressing resource constraints by using a ‘Teacher-Assistant-Student’ distillation paradigm.

Details

Motivation: Current MSA approaches rely on heavy LLMs, ignoring resource-limited environments. The paper focuses on autonomous sentiment reasoning and classification in such settings.

Method: Proposes MulCoT-RD, a distillation model using a high-performance MLLM to generate reasoning data, training an assistant model, and then a lightweight student model for efficient reasoning and classification.

Result: MulCoT-RD achieves strong performance on JMSRC with only 3B parameters, showing robust generalization and interpretability.

Conclusion: The work successfully addresses resource constraints in MSA, offering a lightweight yet effective solution for multimodal sentiment reasoning and classification.

Abstract: The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification only with a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a “Teacher-Assistant-Student” distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.

[20] Pruning Large Language Models by Identifying and Preserving Functional Networks

Yiheng Liu, Junhao Ning, Sichen Xia, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu

Main category: cs.CL

TL;DR: The paper proposes a structured pruning method for LLMs by identifying and preserving functional networks, inspired by neural networks in the human brain, to improve pruning efficiency.

Details

Motivation: Current pruning methods disrupt LLM functionalities by ignoring neuron interactions. The study aims to preserve these interactions for better performance.

Method: Treats LLMs as digital brains, decomposes them into functional networks, and prunes by preserving key neurons within these networks.

Result: Successfully identifies functional networks and key neurons, enabling efficient pruning.

Conclusion: The method improves pruning by preserving functional architecture, with code available for implementation.

Abstract: Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically rely on assessing the importance of the structure units and pruning the units with less importance. Most of them overlook the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption in the macro functional architecture of LLMs and consequently a pruning performance degradation. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, we alleviate this challenge and propose to prune LLMs by identifying and preserving functional networks within LLMs in this study. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.

[21] CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL

Sijie Wang, Quanjiang Guo, Kai Zhao, Yawei Zhang, Xin Li, Xiang Li, Siqi Li, Rui She, Shangshu Yu, Wee Peng Tay

Main category: cs.CL

TL;DR: CodeBoost is a post-training framework for code LLMs that enhances performance using code snippets without human-annotated instructions, addressing scalability and quality issues in instruction collection.

Details

Motivation: Existing code LLMs rely on labor-intensive human-annotated instructions, creating a bottleneck. Code snippets are abundant but underutilized.

Method: CodeBoost uses maximum-clique curation, bi-directional prediction, error-aware prediction, heterogeneous augmentation, and heterogeneous rewarding.

Result: Experiments show CodeBoost consistently improves performance across code LLMs and benchmarks.

Conclusion: CodeBoost provides a scalable and effective training pipeline for code LLMs without human-annotated instructions.

Abstract: Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using “human instruction-final answer” pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline.

[22] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs

Dongxu Zhang, Ning Yang, Jihua Zhu, Jinnan Yang, Miao Xin, Baoliang Tian

Main category: cs.CL

TL;DR: The paper challenges the ‘cascading failure’ hypothesis in Chain-of-Thought (CoT) reasoning, revealing ‘Late-Stage Fragility’—errors in later stages are more harmful. It introduces ASCoT, an adaptive self-correction method, which outperforms standard CoT.

Details

Motivation: To address the unreliability of reasoning chains in LLMs, particularly the overlooked vulnerability of late-stage errors.

Method: Proposes ASCoT with an Adaptive Verification Manager (AVM) and Multi-Perspective Self-Correction Engine (MSCE), prioritizing high-risk steps and applying targeted corrections.

Result: ASCoT achieves superior accuracy on benchmarks like GSM8K and MATH, outperforming standard CoT.

Conclusion: Highlights the need for adaptive, vulnerability-aware correction in LLM reasoning, moving beyond uniform verification.

Abstract: Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held “cascading failure” hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term “Late-Stage Fragility”: errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT. Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms.
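
Illustrative sketch (not from the paper): the abstract describes a Positional Impact Score I(k) that weights steps by their position so that late-stage steps are prioritized, but it does not give the function's form. The power-law weight, the `gamma` parameter, the per-step risk estimates, and the verification budget below are all hypothetical stand-ins chosen only to show position-weighted prioritization.

```python
def positional_impact(k: int, n: int, gamma: float = 2.0) -> float:
    """Hypothetical I(k): weight grows with the step's position k (1-based) in a chain of n steps."""
    return (k / n) ** gamma


def steps_to_verify(step_risks: list[float], budget: int, gamma: float = 2.0) -> list[int]:
    """Pick the `budget` steps with the highest position-weighted risk.

    `step_risks` holds one hypothetical error estimate per reasoning step.
    Returns 1-based step numbers in ascending order.
    """
    n = len(step_risks)
    scored = [
        (positional_impact(k, n, gamma) * risk, k)
        for k, risk in enumerate(step_risks, start=1)
    ]
    top = sorted(scored, reverse=True)[:budget]
    return sorted(k for _, k in top)


if __name__ == "__main__":
    # Four steps with equal raw risk: the weighting alone favors the late steps.
    print(steps_to_verify([0.5, 0.5, 0.5, 0.5], budget=2))
```

With equal raw risk at every step, the position weight alone sends the verification budget to the last two steps, which is the behavior the abstract attributes to the AVM.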

[23] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue

Sukannya Purkayastha, Nils Dycke, Anne Lauscher, Iryna Gurevych

Main category: cs.CL

TL;DR: The paper explores using dialogue agents to assist meta-reviewers by generating synthetic data with LLMs and training tailored agents, showing improved efficiency in meta-reviewing.

Details

Motivation: Meta-reviewing is a decision-making process requiring argument weighing, and prior research suggests dialogue agents can assist effectively.

Method: Generate synthetic data using LLMs with self-refinement, train dialogue agents on this data, and test in real-world meta-reviewing.

Result: The method produces high-quality synthetic data and outperforms off-the-shelf LLM-based assistants in meta-reviewing.

Conclusion: Dialogue agents trained on synthetic data enhance meta-reviewing efficiency, demonstrating practical utility.

Abstract: Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. Code and data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog

[24] SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev

Main category: cs.CL

TL;DR: SONAR-LLM is a decoder-only transformer that combines the semantic abstraction of LCM with token-level cross-entropy training, achieving competitive generation quality across model sizes.

Details

Motivation: To retain the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal.

Method: Uses a hybrid objective: continuous SONAR embedding space with token-level cross-entropy propagated via a frozen SONAR decoder.

Result: Competitive generation quality across model sizes (39M to 1.3B parameters).

Conclusion: SONAR-LLM effectively combines the strengths of LCM and likelihood-based training, with released code and checkpoints for reproducibility.

Abstract: The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that “thinks” in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.

[25] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, Lu Hou

Main category: cs.CL

TL;DR: CGRS reduces overthinking in LRLMs by suppressing reflection triggers when confidence is high, cutting token usage by 18.5%-41.9% without losing accuracy.

Details

Motivation: Reflection behaviors in LRLMs cause overthinking, increasing token usage and costs while reducing utility.

Method: CGRS dynamically suppresses reflection triggers during high-confidence responses, avoiding redundant cycles without retraining.

Result: CGRS reduces token usage significantly (18.5%-41.9%) while maintaining accuracy across benchmarks and model scales.

Conclusion: CGRS offers a practical, model-agnostic solution for efficient reasoning in LRLMs, balancing performance and cost.

Abstract: Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., “Wait” and “Alternatively”) to enhance performance. However, these reflection behaviors can lead to the overthinking problem, in which the model generates redundant reasoning steps that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model’s generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS’s effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS’s practical value for efficient reasoning.
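
Illustrative sketch (not from the paper): the suppression step described in the abstract can be pictured as masking the logits of reflection trigger tokens whenever a certainty estimate is high. The abstract does not say how certainty is measured or whether suppression is a hard mask, so the scalar `certainty` input, the 0.9 threshold, the hard mask, and the toy logits are all assumptions for illustration.

```python
import math

# Trigger words quoted in the abstract; a real system would map them to token ids.
REFLECTION_TRIGGERS = ("Wait", "Alternatively")


def suppress_reflection_triggers(
    logits: dict[str, float],
    certainty: float,
    threshold: float = 0.9,
    triggers: tuple[str, ...] = REFLECTION_TRIGGERS,
) -> dict[str, float]:
    """Return next-token logits with trigger tokens masked when certainty is high.

    Below the (hypothetical) threshold the logits are returned unchanged,
    so the model remains free to reflect when it is unsure.
    """
    if certainty < threshold:
        return dict(logits)
    return {tok: (-math.inf if tok in triggers else val) for tok, val in logits.items()}


def greedy_pick(logits: dict[str, float]) -> str:
    """Greedy decoding over a token-to-logit mapping."""
    return max(logits, key=logits.get)


if __name__ == "__main__":
    step_logits = {"Wait": 2.0, "So": 1.5, "Alternatively": 0.5}  # toy values
    print(greedy_pick(suppress_reflection_triggers(step_logits, certainty=0.95)))
    print(greedy_pick(suppress_reflection_triggers(step_logits, certainty=0.40)))
```

Because the mask touches only the next-token logits, a step like this fits into an existing decoding loop without retraining, which matches the plug-in property the abstract claims.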

[26] Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability

Fenya Wasserroth, Eleftherios Avramidis, Vera Czehmann, Tanja Kojic, Fabrizio Nunnari, Sebastian Möller

Main category: cs.CL

TL;DR: The study examines the impact of adjustable features in a sign language avatar on Hololens 2, finding no significant UX or comprehensibility improvements despite user preference for adjustability. Key issues included missing SL elements and poor implementation.

Details

Motivation: To understand how adjustable features in a sign language avatar affect user experience, comprehensibility, and acceptability among expert users.

Method: Analysis of interactions with adjustable and non-adjustable avatars in a specific use case, focusing on UX, comprehensibility, and acceptability.

Result: No significant UX or comprehensibility improvements; users preferred adjustability but rated hedonic quality higher than pragmatic quality. Stress levels were higher with the adjustable avatar.

Conclusion: Personalisation alone is insufficient; SL avatars must be comprehensible by default. Recommendations include improving animation, interaction interfaces, and using participatory design.

Abstract: This paper presents an investigation into the impact of adding adjustment features to an existing sign language (SL) avatar on a Microsoft Hololens 2 device. Through a detailed analysis of interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars in a specific use case, this study identifies the key factors influencing the comprehensibility, the user experience (UX), and the acceptability of such a system. Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed, which remained at low levels, amid missing SL elements (mouthings and facial expressions) and implementation issues (indistinct hand shapes, lack of feedback and menu positioning). Hedonic quality was rated higher than pragmatic quality, indicating that users found the system more emotionally or aesthetically pleasing than functionally useful. Stress levels were higher for the adjustable avatar, reflecting lower performance, greater effort and more frustration. Additionally, concerns were raised about whether the Hololens adjustment gestures are intuitive and easy to familiarise oneself with. While acceptability of the concept of adjustability was generally positive, it was strongly dependent on usability and animation quality. This study highlights that personalisation alone is insufficient, and that SL avatars must be comprehensible by default. Key recommendations include enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design.

[27] Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025

Samy Ateia, Udo Kruschwitz

Main category: cs.CL

TL;DR: The paper explores the challenges of applying autonomous LLM systems like Agentic RAG to domain-specific professional search, using the BioASQ CLEF 2025 challenge as a testbed. It evaluates reasoning and non-reasoning LLMs with a self-feedback mechanism for iterative refinement.

Details

Motivation: Professional search tasks, such as biomedical research, require high expertise and transparency, but automated systems may misalign with expert needs. The study aims to assess if LLM self-feedback improves performance.

Method: The study tested LLMs (Gemini-Flash 2.0, o3-mini, o4-mini, DeepSeek-R1) with a self-feedback mechanism for query expansion and multiple answer types. It evaluated iterative self-correction and compared reasoning vs. non-reasoning models.

Result: Preliminary results show varied performance of the self-feedback strategy across models and tasks.

Conclusion: The work provides insights into LLM self-correction and suggests future comparisons between LLM-generated feedback and human expert input in professional search systems.

Abstract: Agentic Retrieval Augmented Generation (RAG) and ‘deep research’ systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and non-reasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and whether reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.

[28] The TUB Sign Language Corpus Collection

Eleftherios Avramidis, Vera Czehmann, Fabian Deckert, Lorenz Hufe, Aljoscha Lipski, Yuni Amaloa Quintero Villalobos, Tae Kwon Rhee, Mengqian Shi, Lennart Stölting, Fabrizio Nunnari, Sebastian Möller

Main category: cs.CL

TL;DR: A parallel corpus of 12 sign languages in video format with subtitles, totaling 1,300 hours and 14M tokens, including first-time consistent corpora for 8 Latin American sign languages and a significantly expanded German Sign Language corpus.

Details

Motivation: To address the lack of large-scale, parallel sign language corpora, especially for Latin American sign languages, and to improve resources for research and development in sign language processing.

Method: Data collection from online sources (news shows, governmental bodies, educational channels), involving stages like seeking usage approvals, scraping, and cropping.

Result: A collection of 4,381 video files (1,300 hours) with 1.3M subtitles (14M tokens), notably expanding resources for 8 Latin American sign languages and German Sign Language.

Conclusion: The paper presents a significant contribution to sign language research by providing a large, diverse, and parallel corpus, facilitating advancements in the field.

Abstract: We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.

[29] MyCulture: Exploring Malaysia’s Diverse Culture under Low-Resource Language Constraints

Zhong Ken Hew, Jia Xin Low, Sze Jue Yang, Chee Seng Chan

Main category: cs.CL

TL;DR: MyCulture is a benchmark for evaluating LLMs on Malaysian culture, using an open-ended multiple-choice format to reduce bias and improve fairness.

Details

Motivation: Address cultural biases in LLMs caused by training data dominated by high-resource languages, ensuring better representation of low-resource languages like Bahasa Melayu.

Method: Introduces MyCulture, a benchmark with six cultural pillars, using open-ended multiple-choice questions to mitigate guessing and format bias. Analyzes structural and language biases.

Result: Reveals significant disparities in cultural comprehension among LLMs, emphasizing the need for culturally inclusive benchmarks.

Conclusion: Culturally grounded and linguistically inclusive benchmarks are crucial for fair and accurate LLM development and evaluation.

Abstract: Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars (arts, attire, customs, entertainment, food, and religion), presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.

[30] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, Yilong Wu, Zhihao Zhang, Mingqi Wu, Zhiheng Xi, Mingxu Chai, Tao Liang, Zhihui Fei, Zhen Wang, Mingyang Wan, Guojun Ma, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: LLMEval-3 is a dynamic evaluation framework for LLMs, addressing data contamination and overfitting by using a proprietary question bank and automated integrity measures. It reveals performance ceilings and contamination issues, offering robust and credible evaluation.

Details

Motivation: Static benchmarks for LLMs are flawed due to data contamination and leaderboard overfitting, obscuring true model capabilities.

Method: LLMEval-3 dynamically samples unseen test sets from 220k graduate-level questions, using contamination-resistant curation, anti-cheating architecture, and calibrated LLM-as-a-judge (90% human agreement).

Result: A 20-month study of nearly 50 models revealed a performance ceiling on knowledge memorization and contamination vulnerabilities that static benchmarks miss, with robust ranking stability.

Conclusion: LLMEval-3 provides a credible, dynamic evaluation paradigm, advancing trustworthy LLM assessment beyond static benchmarks.

Abstract: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
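
The per-run sampling idea can be sketched as follows. This is a minimal illustration and not the LLMEval-3 pipeline: the function name `sample_test_set`, the `served` bookkeeping set, and the 1,000-item stand-in bank are assumptions made for the example, and the abstract does not say how "unseen" is tracked.

```python
import random

def sample_test_set(bank_ids, already_served, k, seed):
    """Draw k question ids that were not served in earlier runs (assumed policy)."""
    unseen = sorted(set(bank_ids) - set(already_served))
    if len(unseen) < k:
        raise ValueError("not enough unseen questions left in the bank")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return rng.sample(unseen, k)

bank = list(range(1000))  # stand-in for the 220k-question bank
served = set()

run1 = sample_test_set(bank, served, k=50, seed=1)
served.update(run1)
run2 = sample_test_set(bank, served, k=50, seed=2)
```

Because each run excludes everything in `served`, two runs for the same model never share a question, which is one simple way to keep a test set from leaking between evaluations.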

[31] TASE: Token Awareness and Structured Evaluation for Multilingual Language Models

Chenzhuo Zhao, Xinda Wang, Yue Huang, Junting Lu, Ziqian Liu

Main category: cs.CL

TL;DR: TASE is a benchmark evaluating LLMs’ token-level and structural reasoning across languages, revealing significant gaps compared to human performance.

Details

Motivation: LLMs struggle with fine-grained, token-level understanding and structural reasoning, which are critical for precision-demanding applications.

Method: TASE includes 10 tasks (e.g., character counting, syntactic parsing) across Chinese, English, and Korean, with a 35,927-instance evaluation set and synthetic data generation. Over 30 LLMs are evaluated, and a custom Qwen2.5-14B model is trained using GRPO.

Result: Human performance significantly surpasses LLMs, highlighting persistent weaknesses in token-level reasoning.

Conclusion: TASE identifies LLM limitations and offers a diagnostic tool for improving low-level language understanding and cross-lingual generalization.

Abstract: While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning, capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs’ ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase.

[32] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations

Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng

Main category: cs.CL

TL;DR: The paper evaluates four creativity measures (creativity index, perplexity, syntactic templates, LLM-as-a-Judge) across domains, finding limited consistency and highlighting their limitations. It calls for better evaluation frameworks.

Details

Motivation: To systematically assess and compare existing creativity metrics across diverse creative domains to understand their strengths and weaknesses.

Method: Examination and analysis of four representative creativity measures (creativity index, perplexity, syntactic templates, LLM-as-a-Judge) in creative writing, problem-solving, and research ideation.

Result: Metrics show limited consistency and capture different dimensions of creativity, with specific limitations like lexical focus, model sensitivity, and instability.

Conclusion: There is a need for more robust and generalizable creativity evaluation frameworks that align better with human judgments.

Abstract: We systematically examine, analyze, and compare representative creativity measures (creativity index, perplexity, syntactic templates, and LLM-as-a-Judge) across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index’s focus on lexical diversity, perplexity’s sensitivity to model confidence, and syntactic templates’ inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.

[33] LAG: Logic-Augmented Generation from a Cartesian Perspective

Yilin Xiao, Chuang Zhou, Qinggang Zhang, Su Dong, Shengyuan Chen, Xiao Huang

Main category: cs.CL

TL;DR: The paper introduces Logic-Augmented Generation (LAG), a method to enhance LLMs’ reasoning by decomposing complex questions into logical sub-questions, resolving them sequentially, and synthesizing answers to reduce hallucinations and improve robustness.

Details

Motivation: LLMs struggle with knowledge-intensive tasks and generate hallucinations. Retrieval-augmented generation (RAG) lacks structured reasoning, prompting the need for a more logical approach.

Method: LAG decomposes questions into atomic sub-questions, resolves them in dependency order, uses logical termination to halt unanswerable queries, and synthesizes verified responses.

Result: Experiments show LAG improves reasoning robustness, reduces hallucinations, and aligns LLMs with human cognition better than RAG.

Conclusion: LAG offers a principled, logical alternative to RAG for enhancing LLM performance in complex reasoning tasks.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
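
The control flow described above (decompose, resolve in dependency order, stop at an unanswerable sub-question) can be sketched roughly as below. This is an illustrative outline and not the authors' implementation: `resolve`, `answer_fn`, and the toy `facts` lookup are hypothetical, and a real system would call a retriever and an LLM where the lookup is.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def resolve(sub_questions, deps, answer_fn):
    """Answer sub-questions in dependency order; halt at the first unanswerable one.

    sub_questions: dict mapping id -> question text
    deps: dict mapping id -> set of ids that must be answered first
    answer_fn: hypothetical callable (question, prior_answers) -> str or None
    """
    answers = {}
    graph = {qid: deps.get(qid, set()) for qid in sub_questions}
    for qid in TopologicalSorter(graph).static_order():
        prior = {d: answers[d] for d in deps.get(qid, set())}
        result = answer_fn(sub_questions[qid], prior)
        if result is None:  # logical termination: stop rather than guess
            return answers, False
        answers[qid] = result
    return answers, True

# Toy stand-in for retrieval plus generation.
facts = {
    "Who directed the film?": "Director A",
    "Where was that director born?": "City B",
}

def toy_answer(question, prior_answers):
    return facts.get(question)

sub_questions = {
    "q1": "Who directed the film?",
    "q2": "Where was that director born?",
}
deps = {"q2": {"q1"}}

answers, complete = resolve(sub_questions, deps, toy_answer)
```

Returning a completion flag alongside the partial answers lets the caller distinguish a fully grounded chain from one that was cut short, which is the point of the termination step.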

[34] The World According to LLMs: How Geographic Origin Influences LLMs’ Entity Deduction Capabilities

Harsh Nishant Lalai, Raj Sanjay Shah, Jiaxin Pei, Sashank Varma, Yi-Chia Wang, Ali Emami

Main category: cs.CL

TL;DR: The paper evaluates implicit biases in LLMs using the 20 Questions game, revealing geographic disparities in performance, particularly favoring the Global North and West.

Details

Motivation: To uncover subtle implicit biases in LLMs that persist despite mitigation efforts, using a proactive questioning approach rather than direct probing.

Method: Systematic evaluation using the 20 Questions game (Geo20Q+ dataset) across two gameplay configurations and seven languages.

Result: LLMs show significant geographic disparities, performing better for Global North/West entities. Wikipedia pageviews and corpus frequency don’t fully explain these gaps. Language has minimal impact.

Conclusion: Creative evaluation frameworks like the 20 Questions game reveal hidden biases in LLMs, highlighting geographic and cultural disparities in reasoning processes.

Abstract: Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.

[35] CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation

Santosh T. Y. S. S, Youssef Tarek Elkhayat, Oana Ichim, Pranav Shetty, Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Xiaomo Liu

Main category: cs.CL

TL;DR: CoCoLex, a decoding strategy for legal text generation, improves faithfulness to context by dynamically copying from the source based on model confidence, outperforming existing methods.

Details

Motivation: LLMs in legal domains face issues with unfaithful or hallucinatory outputs, and existing retrieval-augmented methods lack guarantees of effective context integration.

Method: Introduces CoCoLex, which combines model-generated vocabulary with context-derived copying, guided by confidence to ensure fidelity.

Result: Outperforms existing context-aware decoding methods on five legal benchmarks, especially in long-form tasks.

Conclusion: CoCoLex enhances faithfulness in legal text generation by leveraging context dynamically, proving effective in benchmarks.

Abstract: Due to their ability to process long and complex contexts, LLMs can offer key benefits to the legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex), a decoding strategy that dynamically interpolates the model-produced vocabulary distribution with a distribution derived from copying from the context. CoCoLex encourages direct copying based on the model’s confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.
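
The interpolation step can be illustrated with a small sketch. The abstract does not give the formula, so the choices here are assumptions: confidence is taken as the model's top probability, and the copy weight is set to one minus that confidence, so the decoder leans on the context more when the model is less sure. The function name `interpolate` and the toy distributions are hypothetical.

```python
def interpolate(model_probs, copy_probs):
    """Mix a model distribution with a copy distribution over the same vocabulary.

    Assumed rule (not taken from the paper): copy weight = 1 - max model probability.
    """
    confidence = max(model_probs.values())
    copy_weight = 1.0 - confidence
    vocab = set(model_probs) | set(copy_probs)
    return {
        tok: (1.0 - copy_weight) * model_probs.get(tok, 0.0)
        + copy_weight * copy_probs.get(tok, 0.0)
        for tok in vocab
    }

model_probs = {"court": 0.5, "judge": 0.3, "law": 0.2}  # toy next-token distribution
copy_probs = {"judge": 1.0}                              # toy copy distribution from context

mixed = interpolate(model_probs, copy_probs)
next_token = max(mixed, key=mixed.get)
```

In this toy case the model alone would pick "court", but the mixed distribution favors "judge" because the context supports it and the model's confidence is only moderate.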

[36] Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees

Guang Yang, Xinyang Liu

Main category: cs.CL

TL;DR: A frequency-based uncertainty quantification method using conformal prediction improves LLM reliability in MCQA by outperforming logit-based methods and ensuring coverage guarantees.

Details

Motivation: Address the unreliability of LLMs (e.g., hallucination, overconfidence) in high-risk MCQA applications.

Method: Proposes a frequency-based uncertainty method under black-box settings, using multiple independent samplings and predictive entropy (PE) for quantification.

Result: Frequency-based PE outperforms logit-based PE in distinguishing correct/incorrect predictions (AUROC) and controls miscoverage rates effectively.

Conclusion: The method provides a reliable, model-agnostic framework for uncertainty quantification in MCQA, enhancing LLM trustworthiness.

Abstract: Large Language Models (LLMs) have shown remarkable progress in multiple-choice question answering (MCQA), but their inherent unreliability, such as hallucination and overconfidence, limits their application in high-risk domains. To address this, we propose a frequency-based uncertainty quantification method under black-box settings, leveraging conformal prediction (CP) to ensure provable coverage guarantees. Our approach involves multiple independent samplings of the model’s output distribution for each input, with the most frequent sample serving as a reference to calculate predictive entropy (PE). Experimental evaluations across six LLMs and four datasets (MedMCQA, MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC. Furthermore, the method effectively controls the empirical miscoverage rate under user-specified risk levels, validating that sampling frequency can serve as a viable substitute for logit-based probabilities in black-box scenarios. This work provides a distribution-free model-agnostic framework for reliable uncertainty quantification in MCQA with guaranteed coverage, enhancing the trustworthiness of LLMs in practical applications.
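
A minimal sketch of the black-box recipe follows: estimate answer probabilities from sampling frequencies, compute an entropy over them, and build a prediction set with a standard split-conformal threshold. The nonconformity score used here (one minus the frequency of an option) and the plain Shannon entropy are common textbook choices and are assumptions; the paper's exact definitions may differ. All function names and numbers are illustrative.

```python
import math
from collections import Counter

def frequencies(samples, options):
    """Turn repeated sampled answers into per-option relative frequencies."""
    counts = Counter(samples)
    return {o: counts[o] / len(samples) for o in options}

def predictive_entropy(freqs):
    """Shannon entropy (natural log) of the frequency distribution."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every option
    return sorted(cal_scores)[rank - 1]

def prediction_set(freqs, threshold):
    """Keep every option whose score (1 - frequency) is within the threshold."""
    return [o for o, p in freqs.items() if 1 - p <= threshold]

# Calibration scores would be 1 - frequency of the true answer on held-out questions.
cal_scores = [0.0, 0.25, 0.25, 0.5, 0.5, 0.75, 1.0]
threshold = conformal_threshold(cal_scores, alpha=0.25)

freqs = frequencies(list("AAAB"), "ABCD")  # four sampled answers for one test question
entropy = predictive_entropy(freqs)
answer_set = prediction_set(freqs, threshold)
```

Only sampled outputs are needed, so the same code applies to API-only models that do not expose logits.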

[37] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs

Franziska Weeber, Tanise Ceron, Sebastian Padó

Main category: cs.CL

TL;DR: The study examines if political opinions in multilingual large language models (MLLMs) vary across languages, finding minimal differences in unaligned models and uniform shifts post-alignment.

Details

Motivation: To determine if cross-cultural political opinion differences in surveys translate to cross-lingual differences in MLLMs.

Method: Analyzed MLLMs of various sizes across five Western languages by prompting them to agree/disagree with political statements, evaluating pre- and post-alignment with left/right views using direct preference optimization.

Result: Unaligned models showed few significant cross-lingual differences; alignment shifted opinions uniformly across languages.

Conclusion: Political opinions transfer between languages in Western contexts, highlighting challenges in socio-linguistic and political alignment of MLLMs.

Abstract: Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs’ opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.

[38] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tang

Main category: cs.CL

TL;DR: MathSmith is a framework for synthesizing challenging math problems to improve LLM reasoning by generating new problems from scratch, avoiding data contamination, and using reinforcement learning to optimize complexity and validity.

Details

Motivation: The scarcity of high-quality, high-difficulty training data limits LLM progress in mathematical reasoning. Existing methods lack diversity and scalability.

Method: MathSmith constructs problems from scratch using random concept-explanation pairs from PlanetMath, applies nine predefined strategies for difficulty, and uses reinforcement learning to optimize validity, complexity, and consistency.

Result: MathSmith outperforms baselines on five benchmarks (GSM8K, MATH-500, AIME2024, AIME2025, OlympiadBench) in short and long CoT settings.

Conclusion: MathSmith demonstrates scalability, generalization, and transferability, proving the value of high-difficulty synthetic data for advancing LLM reasoning.

Abstract: Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.

[39] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao

Main category: cs.CL

TL;DR: The paper introduces Cooper, an RL framework that jointly optimizes policy and reward models to address limitations of rule-based and model-based rewards, enhancing robustness and mitigating reward hacking.

Details

Motivation: Current reward paradigms (rule-based and model-based) in LLMs have limitations: rule-based lacks robustness, and model-based is prone to reward hacking.

Method: Cooper co-optimizes policy and reward models, leveraging rule-based precision for correct responses and dynamically training the reward model with positive-negative sample pairs. A hybrid annotation strategy and reference-based reward modeling (VerifyRM) are introduced.

Result: Cooper improves end-to-end RL performance, for instance achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct, and VerifyRM outperforms same-size models on VerifyBench.

Conclusion: Dynamically updating the reward model effectively combats reward hacking, offering a better integration of reward models into RL.

Abstract: Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.

[40] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Main category: cs.CL

TL;DR: OmniEAR evaluates language models’ embodied reasoning, revealing performance drops in tool usage and multi-agent coordination, highlighting limitations in current models.

Details

Motivation: To explore how language models handle embodied reasoning tasks like physical interactions, tool usage, and multi-agent coordination, which remain understudied.

Method: OmniEAR uses text-based environments to model physical properties and spatial relationships across 1,500 scenarios, testing dynamic capability acquisition and autonomous coordination.

Result: Models perform well with explicit instructions (85-96% success) but struggle with tool reasoning (56-85%) and implicit collaboration (63-85%). Fine-tuning helps single-agent tasks but not multi-agent ones.

Conclusion: Embodied reasoning poses unique challenges beyond current model capabilities, establishing OmniEAR as a benchmark for advancing embodied AI.

Abstract: Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.

[41] Learning to Reason for Factuality

Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih

Main category: cs.CL

TL;DR: The paper addresses hallucination issues in Reasoning Large Language Models (R-LLMs) by proposing a novel reward function for online RL, improving factuality, detail, and relevance.

Details

Motivation: R-LLMs struggle with factuality and hallucinations in long-form tasks, and existing RL methods face challenges like reward hacking when using automatic evaluation frameworks.

Method: A new reward function is introduced, combining factual precision, response detail, and answer relevance, applied via online RL to enhance factual reasoning.

Result: The model reduces hallucination rates by 23.1 percentage points, increases detail by 23%, and maintains response helpfulness across six benchmarks.

Conclusion: The proposed reward function effectively improves R-LLMs’ factuality and detail without compromising relevance or helpfulness.

Abstract: Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.

[42] How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations

Brandon Jaipersaud, David Krueger, Ekdeep Singh Lubana

Main category: cs.CL

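TL;DR: Linear probes trained on LLM representations capture persuasion success, persuadee personality, and persuasion strategy in multi-turn conversations, and they match or outperform slower prompting-based approaches in some settings.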

Details

Motivation: To understand how LLMs persuade humans in multi-turn conversations using lightweight probes.

Method: Applied linear probes to study persuasion success, persuadee personality, and strategy, leveraging cognitive science insights.

Result: Probes effectively capture persuasion dynamics, identifying key moments and strategies, sometimes outperforming prompting.

Conclusion: Probes offer a scalable method for studying complex behaviors like persuasion, deception, and manipulation in LLMs.

Abstract: Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient.

[43] H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages

Mehrdad Zakershahrak, Samira Ghodratnama

Main category: cs.CL

TL;DR: H-NET++ is a hierarchical dynamic-chunking model for byte-level language models, improving efficiency and performance in morphologically-rich languages like Persian.

Details

Motivation: Address computational challenges in morphologically-rich languages (MRLs) where byte-level models struggle due to long word spans.

Method: Proposes H-NET++ with innovations like a lightweight Transformer, two-level latent hyper-prior, specialized orthographic handling, and curriculum training.

Result: Achieves state-of-the-art results on a 1.4B-token Persian corpus: 0.159 BPB reduction versus BPE-based GPT-2-fa, 5.4pp ParsGLUE gain, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries.

Conclusion: H-NET++ provides an effective tokenizer-free solution for MRLs, aligning with morphology without supervision while maintaining efficiency.

Abstract: Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.

[44] A Latent-Variable Model for Intrinsic Probing

Karolina Stańczak, Lucas Torroba Hennigen, Adina Williams, Ryan Cotterell, Isabelle Augenstein

Main category: cs.CL

TL;DR: The paper proposes a latent-variable model for intrinsic probing to analyze linguistic attributes in pre-trained representations, showing tighter mutual information estimates and cross-lingual morphosyntax evidence.

Details

Motivation: To understand where and how linguistic knowledge is encoded in pre-trained representations, given their empirical success in NLP tasks.

Method: Introduces a novel latent-variable formulation for intrinsic probing with a tractable variational approximation to the log-likelihood.

Result: The model is versatile, outperforms prior intrinsic probes in mutual information estimates, and reveals cross-lingually entangled morphosyntax in pre-trained representations.

Conclusion: The proposed probing method effectively identifies and locates linguistic attributes in representations, with implications for understanding cross-lingual generalization.

Abstract: The success of pre-trained contextualized representations has prompted researchers to analyze them for the presence of linguistic information. Indeed, it is natural to assume that these pre-trained representations do encode some level of linguistic knowledge as they have brought about large empirical improvements on a wide variety of NLP tasks, which suggests they are learning true linguistic generalization. In this work, we focus on intrinsic probing, an analysis technique where the goal is not only to identify whether a representation encodes a linguistic attribute but also to pinpoint where this attribute is encoded. We propose a novel latent-variable formulation for constructing intrinsic probes and derive a tractable variational approximation to the log-likelihood. Our results show that our model is versatile and yields tighter mutual information estimates than two intrinsic probes previously proposed in the literature. Finally, we find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.

[45] Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A

Benjamin Plaut, Nguyen X. Khanh, Tu Trinh

Main category: cs.CL

TL;DR: The paper analyzes 15 chat-fine-tuned LLMs, finding their MSPs are miscalibrated but still useful for uncertainty. Correct answers have higher MSPs, and Q&A accuracy correlates with MSP correctness prediction, not calibration. Abstention based on MSP improves performance.

Details

Motivation: To investigate if MSPs in fine-tuned LLMs, despite miscalibration, can predict answer correctness and improve model performance via selective abstention.

Method: Studied 15 LLMs, tested MSP correlation with correctness, and evaluated abstention strategies using labeled data for threshold selection.

Result: Correct answers have higher MSPs; Q&A accuracy correlates with MSP correctness prediction. Abstention based on MSP improves performance.

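Conclusion: MSPs predict correctness even though they are miscalibrated; correctness prediction, but not calibration, is expected to improve as LLM capabilities progress. Selective abstention enhances LLM performance with minimal labeled data.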

Abstract: We study 15 large language models (LLMs) fine-tuned for chat and find that their maximum softmax probabilities (MSPs) are consistently miscalibrated on multiple-choice Q&A. However, those MSPs might still encode useful uncertainty information. Specifically, we hypothesized that wrong answers would be associated with smaller MSPs compared to correct answers. Via rigorous statistical testing, we show that this hypothesis holds for models which perform well on the underlying Q&A task. We also find a strong direct correlation between Q&A accuracy and MSP correctness prediction, while finding no correlation between Q&A accuracy and calibration error. This suggests that within the current fine-tuning paradigm, we can expect correctness prediction but not calibration to improve as LLM capabilities progress. To demonstrate the utility of correctness prediction, we show that when models have the option to abstain, performance can be improved by selectively abstaining based on the MSP of the initial model response, using only a small amount of labeled data to choose the MSP threshold.
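
The abstention idea in this abstract can be illustrated with a minimal sketch. It assumes each response comes with its MSP and that a small labeled set is available. The scoring rule (+1 correct, -1 wrong, 0 abstain), the function names, and the toy data are illustrative assumptions, not the paper's exact setup.

```python
def score(msps, correct, threshold, wrong_penalty=1.0):
    """Total score when the model abstains whenever its MSP is below threshold.

    Assumed scoring (hypothetical): +1 for a correct answer, -wrong_penalty
    for a wrong answer, 0 for an abstention.
    """
    total = 0.0
    for p, ok in zip(msps, correct):
        if p < threshold:
            continue  # abstain: contributes nothing
        total += 1.0 if ok else -wrong_penalty
    return total


def choose_threshold(msps, correct, wrong_penalty=1.0):
    """Pick the MSP threshold that maximizes the score on a small labeled set."""
    candidates = [0.0] + sorted(set(msps))
    return max(candidates, key=lambda t: score(msps, correct, t, wrong_penalty))


# Toy labeled data: MSP of the chosen option and whether it was correct.
dev_msps = [0.95, 0.90, 0.85, 0.60, 0.55, 0.40, 0.35]
dev_correct = [True, True, True, False, True, False, False]

threshold = choose_threshold(dev_msps, dev_correct)
baseline = score(dev_msps, dev_correct, 0.0)        # never abstain
selective = score(dev_msps, dev_correct, threshold)  # abstain below threshold
```

On this toy data the selected threshold drops the two lowest-confidence wrong answers, so the selective score exceeds the never-abstain baseline. With real data the threshold would be chosen on held-out labeled examples and evaluated on a separate test split.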

[46] Understanding Large Language Model Behaviors through Interactive Counterfactual Generation and Analysis

Furui Cheng, Vilém Zouhar, Robin Shing Moon Chan, Daniel Fürst, Hendrik Strobelt, Mennatallah El-Assady

Main category: cs.CL

TL;DR: LLM Analyzer is an interactive system for exploring large language model behaviors using counterfactual analysis, addressing inefficiencies and misalignment in existing XAI methods.

Details

Motivation: Existing XAI methods for LLMs are computationally inefficient, misaligned with human reasoning, and treat explanations as static outputs, ignoring their interactive nature.

Method: LLM Analyzer uses a novel algorithm to generate fluent counterfactuals via targeted operations, computes feature attribution scores, and integrates them in a table-based visualization for dynamic analysis.

Result: A user study and expert interviews confirmed the system’s usability and effectiveness, highlighting the value of human involvement in explanations.

Conclusion: Interactive and human-involved XAI methods, like LLM Analyzer, are crucial for better understanding and ensuring the safe use of LLMs.

Abstract: Understanding the behavior of large language models (LLMs) is crucial for ensuring their safe and reliable use. However, existing explainable AI (XAI) methods for LLMs primarily rely on word-level explanations, which are often computationally inefficient and misaligned with human reasoning processes. Moreover, these methods often treat explanation as a one-time output, overlooking its inherently interactive and iterative nature. In this paper, we present LLM Analyzer, an interactive visualization system that addresses these limitations by enabling intuitive and efficient exploration of LLM behaviors through counterfactual analysis. Our system features a novel algorithm that generates fluent and semantically meaningful counterfactuals via targeted removal and replacement operations at user-defined levels of granularity. These counterfactuals are used to compute feature attribution scores, which are then integrated with concrete examples in a table-based visualization, supporting dynamic analysis of model behavior. A user study with LLM practitioners and interviews with experts demonstrate the system’s usability and effectiveness, emphasizing the importance of involving humans in the explanation process as active participants rather than passive recipients.

Kai Yin, Bo Li, Chengkai Liu, Ali Mostafavi, Xia Hu

Main category: cs.CL

TL;DR: The paper introduces a method to enhance disaster-related social media text classification by fine-tuning a pre-trained LLM for multi-label classification, improving situational awareness during crises.

Details

Motivation: Current single-label classification models fail to capture the multifaceted nature of disaster-related social media data, limiting their utility for situational awareness.

Method: The study fine-tunes an open-source LLM using a comprehensive instruction dataset derived from disaster-related tweets, enabling multi-label classification.

Result: The fine-tuned model effectively classifies multiple aspects of disaster-related information, enhancing situational awareness and response strategies.

Conclusion: This approach advances disaster management tools by leveraging LLMs for more adaptable and robust real-time situational awareness.

Abstract: In the field of crisis/disaster informatics, social media is increasingly being used for improving situational awareness to inform response and relief efforts. Efficient and accurate text classification tools have been a focal area of investigation in crisis informatics. However, current methods mostly rely on single-label text classification models, which fail to capture different insights embedded in dynamic and multifaceted disaster-related social media data. This study introduces a novel approach to disaster text classification by enhancing a pre-trained Large Language Model (LLM) through instruction fine-tuning targeted for multi-label classification of disaster-related tweets. Our methodology involves creating a comprehensive instruction dataset from disaster-related tweets, which is then used to fine-tune an open-source LLM, thereby embedding it with disaster-specific knowledge. This fine-tuned model can classify multiple aspects of disaster-related information simultaneously, such as the type of event, informativeness, and involvement of human aid, significantly improving the utility of social media data for situational awareness in disasters. The results demonstrate that this approach enhances the categorization of critical information from social media posts, thereby facilitating a more effective deployment for situational awareness during emergencies. This research paves the way for more advanced, adaptable, and robust disaster management tools, leveraging the capabilities of LLMs to improve real-time situational awareness and response strategies in disaster scenarios.

[48] When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails

Manish Nagireddy, Inkit Padhi, Soumya Ghosh, Prasanna Sattigeri

Main category: cs.CL

TL;DR: The paper addresses the issue of harmful and biased outputs from large language models (LLMs) by proposing a synthetic data generation pipeline for developing guardrail models, achieving competitive performance with lower computational costs.

Details

Motivation: To mitigate undesirable outputs (e.g., harmful, biased text) from LLMs, the authors focus on improving guardrail models, inspired by challenges in social bias detection.

Method: A synthetic data generation pipeline using taxonomy-driven instructions creates labeled contrastive samples (300K+). The approach emphasizes the use-mention distinction for better performance.

Result: The method achieves competitive performance on open-source datasets while reducing computational costs, demonstrating effectiveness in developing guardrail models.

Conclusion: The proposed pipeline offers a scalable and efficient way to iteratively improve guardrail models, addressing biases and harmful outputs in LLMs.

Abstract: Large language models (LLMs) have convincing performance in a variety of downstream tasks. However, these systems are prone to generating undesirable outputs such as harmful and biased text. In order to remedy such generations, the development of guardrail (or detector) models has gained traction. Motivated by findings from developing a detector for social bias, we adopt the notion of a use-mention distinction - which we identified as the primary source of under-performance in the preliminary versions of our social bias detector. Armed with this information, we describe a fully extensible and reproducible synthetic data generation pipeline which leverages taxonomy-driven instructions to create targeted and labeled data. Using this pipeline, we generate over 300K unique contrastive samples and provide extensive experiments to systematically evaluate performance on a suite of open source datasets. We show that our method achieves competitive performance with a fraction of the cost in compute and offers insight into iteratively developing efficient and capable guardrail models. Warning: This paper contains examples of text which are toxic, biased, and potentially harmful.

[49] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze

Main category: cs.CL

TL;DR: CRAFT is a method for generating synthetic datasets using few-shot examples, retrieval from web corpora, and LLM augmentation, outperforming other methods in QA and summarization tasks.

Details

Motivation: Building high-quality datasets for specialized tasks is resource-intensive and requires domain expertise. CRAFT aims to simplify this by automating dataset generation.

Method: CRAFT uses few-shot examples to retrieve relevant documents from web corpora, then augments them into task-specific samples using instruction-tuned LLMs.

Result: CRAFT-based models outperform or match general LLMs on QA tasks and exceed models trained on human-curated summarization data by 46 preference points. CRAFT also beats other synthetic dataset generation methods such as Self- and Evol-Instruct.

Conclusion: CRAFT efficiently generates high-quality task-specific datasets, proving robust even with varying few-shot quality, and outperforms existing methods.

Abstract: Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given these examples, CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points. CRAFT outperforms other synthetic dataset generation methods such as Self- and Evol-Instruct, and remains robust even when the quality of the initial few-shots varies.
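
The retrieval step described in this abstract can be sketched as follows. This is a rough illustration only: it swaps in a bag-of-words cosine similarity for whatever embedding-based similarity the authors use, and the function names and toy corpus are hypothetical. The LLM augmentation step that turns retrieved documents into task samples is not shown.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; an assumed stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(few_shots, corpus, top_k=2):
    """Rank corpus documents by their best similarity to any few-shot example."""
    shots = [bow(s) for s in few_shots]
    scored = [(max(cosine(bow(doc), s) for s in shots), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


few_shots = ["which enzyme breaks down starch in saliva"]
corpus = [
    "amylase is the enzyme in saliva that breaks down starch",
    "the stock market closed higher on friday",
    "starch is a carbohydrate stored in plants",
]
retrieved = retrieve(few_shots, corpus, top_k=2)
```

In this toy run the two biology documents rank above the unrelated finance sentence. At web-corpus scale, an approximate nearest-neighbor index over dense embeddings would likely replace the exhaustive loop.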

[50] Medal Matters: Probing LLMs’ Failure Cases Through Olympic Rankings

Juhwan Choi, Seunguk Yu, JungMin Yun, YoungBin Kim

Main category: cs.CL

TL;DR: LLMs perform well in retrieving Olympic medal counts but struggle with ranking tasks, revealing gaps in their knowledge integration compared to human reasoning.

Details

Motivation: To understand the internal knowledge structures of LLMs by evaluating their performance on historical Olympic medal data.

Method: Assessed LLMs on two tasks: retrieving medal counts for specific teams and identifying team rankings.

Result: LLMs excel at recalling medal counts but perform poorly on ranking tasks.

Conclusion: The study highlights limitations in LLMs’ knowledge integration and suggests areas for improvement, with released resources for further research.

Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing tasks, yet their internal knowledge structures remain poorly understood. This study examines these structures through the lens of historical Olympic medal tallies, evaluating LLMs on two tasks: (1) retrieving medal counts for specific teams and (2) identifying rankings of each team. While state-of-the-art LLMs excel in recalling medal counts, they struggle with providing rankings, highlighting a key difference between their knowledge organization and human reasoning. These findings shed light on the limitations of LLMs’ internal knowledge integration and suggest directions for improvement. To facilitate further research, we release our code, dataset, and model outputs.

[51] WhisperNER: Unified Open Named Entity and Speech Recognition

Gil Ayache, Menachem Pirchi, Aviv Navon, Aviv Shamsian, Gill Hetz, Joseph Keshet

Main category: cs.CL

TL;DR: WhisperNER integrates NER with ASR for improved transcription accuracy, using synthetic data and open-type NER to outperform baselines.

Details

Motivation: Enhancing transcription accuracy and informativeness by combining NER and ASR.

Method: Introduces WhisperNER, a model trained on synthetic speech data with NER labels, optimized for joint transcription and entity recognition.

Result: Outperforms baselines in out-of-domain open-type NER and supervised finetuning.

Conclusion: WhisperNER effectively combines ASR and NER, demonstrating superior performance with synthetic data.

Abstract: Integrating named entity recognition (NER) with automatic speech recognition (ASR) can significantly enhance transcription accuracy and informativeness. In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. Building on recent advancements in open NER research, we augment a large synthetic dataset with synthetic speech samples. This allows us to train WhisperNER on a large number of examples with diverse NER tags. During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities. To evaluate WhisperNER, we generate synthetic speech for commonly used NER benchmarks and annotate existing ASR datasets with open NER tags. Our experiments demonstrate that WhisperNER outperforms natural baselines on both out-of-domain open type NER and supervised finetuning.

[52] MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models

Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry

Main category: cs.CL

TL;DR: The paper introduces MedHalu, a benchmark for evaluating hallucinations in LLM-generated healthcare responses, and MedHaluDetect, a framework for assessing LLMs’ hallucination detection. It highlights LLMs’ underperformance compared to humans and proposes an expert-in-the-loop solution.

Details

Motivation: LLMs are increasingly used for healthcare queries but prone to hallucinations, posing risks for laypeople. Existing evaluations focus on standardized tests, not real-world interactions.

Method: The study introduces MedHalu, a benchmark with annotated hallucination types, and MedHaluDetect for evaluating LLMs. It compares hallucination detection among medical experts, LLMs, and laypeople.

Result: LLMs underperform humans in detecting medical hallucinations. The expert-in-the-loop approach improves detection, e.g., a 6.3% macro-F1 boost for GPT-4.

Conclusion: The work underscores the need for better hallucination detection in LLMs for healthcare, proposing expert integration as a solution.

Abstract: Large language models (LLMs) are starting to complement traditional information seeking mechanisms such as web search. LLM-powered chatbots like ChatGPT are gaining prominence among the general public. AI chatbots are also increasingly producing content on social media platforms. However, LLMs are also prone to hallucinations, generating plausible yet factually incorrect or fabricated information. This becomes a critical problem when laypeople start seeking information about sensitive issues such as healthcare. Existing works in LLM hallucinations in the medical domain mainly focus on testing the medical knowledge of LLMs through standardized medical exam questions which are often well-defined and clear-cut with definitive answers. However, these approaches may not fully capture how these LLMs perform during real-world interactions with patients. This work conducts a pioneering study on hallucinations in LLM-generated responses to real-world healthcare queries from patients. We introduce MedHalu, a novel medical hallucination benchmark featuring diverse health-related topics and hallucinated responses from LLMs, with detailed annotation of the hallucination types and text spans. We also propose MedHaluDetect, a comprehensive framework for evaluating LLMs’ abilities to detect hallucinations. Furthermore, we study the vulnerability to medical hallucinations among three groups – medical experts, LLMs, and laypeople. Notably, LLMs significantly underperform human experts and, in some cases, even laypeople in detecting medical hallucinations. To improve hallucination detection, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, significantly improving hallucination detection for all LLMs, including a 6.3% macro-F1 improvement for GPT-4.

[53] From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, Xiaodong Gu

Main category: cs.CL

TL;DR: MGDebugger is a hierarchical code debugger that isolates and fixes bugs at multiple granularity levels, outperforming existing systems with a 97.6% repair success rate.

Details

Motivation: Existing LLM-based debugging systems treat code as monolithic, failing to address bugs at varying granularities, leading to low pass rates for generated code.

Method: MGDebugger decomposes code into a hierarchical tree of subfunctions, analyzes each level, and resolves bugs bottom-up using an LLM-simulated Python executor for precise error tracking.

Result: MGDebugger achieves an 18.9% accuracy improvement over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix.

Conclusion: MGDebugger is robust and effective, fixing bugs across categories and difficulty levels, significantly enhancing code generation pass rates.

Abstract: While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked on subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
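
The bottom-up order described in this abstract amounts to a post-order traversal of the subfunction tree. The sketch below shows only that traversal. The `SubFunction` class and the `fix` callback are hypothetical names, and `fix` stands in for the paper's LLM-simulated execution and repair step, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubFunction:
    """One node in the (assumed) hierarchical decomposition of a program."""
    name: str
    code: str
    children: List["SubFunction"] = field(default_factory=list)


def debug_bottom_up(node: SubFunction, fix: Callable[[SubFunction], str]) -> List[str]:
    """Repair children before their parent; return the order nodes were visited."""
    order: List[str] = []
    for child in node.children:
        order.extend(debug_bottom_up(child, fix))
    node.code = fix(node)  # hypothetical repair step for this subfunction
    order.append(node.name)
    return order


tree = SubFunction(
    "main",
    "call parse; call compute BUG",
    [
        SubFunction("parse", "BUG parse", [SubFunction("tokenize", "BUG tokenize")]),
        SubFunction("compute", "compute ok"),
    ],
)

# Toy repair: a real system would query an LLM and re-test the subfunction.
visit_order = debug_bottom_up(tree, lambda n: n.code.replace("BUG", "fixed"))
```

Visiting leaves first means each parent is repaired against children that have already been checked, which is the point of resolving bugs bottom-up.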

[54] BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom’s-Taxonomy-Inspired Prompts

Maria-Eleni Zoumpoulidi, Georgios Paraskevopoulos, Alexandros Potamianos

Main category: cs.CL

TL;DR: BloomWise is a cognitively-inspired prompting technique that enhances LLMs’ mathematical reasoning by mimicking human cognitive progression, improving performance and explainability.

Details

Motivation: Mathematical reasoning is challenging for LLMs, and human-like cognitive progression may improve their problem-solving abilities.

Method: BloomWise prompts LLMs to solve problems by progressing through cognitive levels (e.g., remembering to evaluating), halting early if answers converge.

Result: Effective across five math reasoning datasets, with ablation studies confirming the strengths of its components.

Conclusion: BloomWise improves LLMs’ mathematical reasoning by leveraging cognitive-inspired prompting, making solutions more explainable and robust.

Abstract: Despite the remarkable capabilities of large language models (LLMs) across a range of tasks, mathematical reasoning remains a challenging frontier. Motivated by the observation that humans learn more effectively when prompted not what to think but how to think, we introduce BloomWise, a cognitively-inspired prompting technique designed to enhance LLMs’ performance on mathematical problem solving while making their solutions more explainable. BloomWise encourages LLMs to generate solutions, in the form of explanations, by progressing through a sequence of cognitive operations, from basic (e.g., remembering) to more advanced reasoning skills (e.g., evaluating), mirroring how humans build understanding. The process iterates through these levels, halting early if a convergence criterion is met: specifically, if two or more consecutive levels yield the same answer, the solution from the earliest such level is output; otherwise, the process continues until all levels are completed. Through extensive experiments across five popular math reasoning datasets, we demonstrate the effectiveness of BloomWise. We also present comprehensive ablation studies to analyze the strengths of each component within our system.
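
The early-stopping rule in this abstract can be sketched as a short loop. Several parts are assumptions made for illustration: the level names, the `solve_at_level` callback (a stand-in for an LLM call), and the fallback to the last level's solution when no two consecutive levels agree, which the abstract does not specify.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed level names, following Bloom's taxonomy up to "evaluating".
LEVELS = ("remembering", "understanding", "applying", "analyzing", "evaluating")

Solution = Tuple[str, str, str]  # (level, explanation, answer)


def bloomwise_loop(
    problem: str,
    solve_at_level: Callable[[str, str], Tuple[str, str]],
    levels: Sequence[str] = LEVELS,
) -> Optional[Solution]:
    """Iterate through levels, stopping once two consecutive answers agree.

    solve_at_level(problem, level) is hypothetical and returns
    (explanation, answer).
    """
    previous: Optional[Solution] = None
    for level in levels:
        explanation, answer = solve_at_level(problem, level)
        if previous is not None and previous[2] == answer:
            return previous  # earliest of the agreeing consecutive levels
        previous = (level, explanation, answer)
    # Assumption: with no agreement, fall back to the last level's solution.
    return previous


canned = {
    "remembering": "12",
    "understanding": "15",
    "applying": "15",
    "analyzing": "15",
    "evaluating": "15",
}
calls = []


def fake_solver(problem, level):
    """Toy stand-in for an LLM call that records which levels were queried."""
    calls.append(level)
    return f"explanation at {level}", canned[level]


result = bloomwise_loop("toy problem", fake_solver)
```

In this toy run the answers at "understanding" and "applying" agree, so the loop returns the "understanding" solution and never queries the two remaining levels.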

[55] Scaling Laws For Mixed Quantization

Zeyu Cao, Boyang Gu, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Xitong Gao, Yiren Zhao

Main category: cs.CL

TL;DR: The paper explores how much high-precision computation is needed in post-training quantization (PTQ) of LLMs to maintain target accuracy, introducing metrics like quantization ratio ($Q_r$) and block size ($Q_b$). It proposes a scaling law predicting loss degeneration and finds larger models tolerate higher $Q_r$ but don’t require small $Q_b$.

Details

Motivation: To understand the trade-offs between high-precision computation and low-precision quantization in LLMs, aiming to optimize memory and computational efficiency for inference.

Method: Introduces $Q_r$ and $Q_b$ metrics, conducts controlled experiments across models and quantization methods, and derives a scaling law for PTQ.

Result: Larger models support higher $Q_r$, but small $Q_b$ is unnecessary and complicates hardware design.

Conclusion: The study provides a unified scaling law for PTQ, suggesting mixed quantization is viable for larger models while simplifying hardware requirements by avoiding overly small $Q_b$.

Abstract: Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the memory and computational requirements for inference. In this study, we focus on a straightforward question: When aiming for a target accuracy or perplexity with low-precision quantization, how much high-precision computation needs to be preserved, and how fine-grained this quantization would need to be as we scale LLMs to larger sizes? We first introduce two critical metrics, named the quantization ratio ($Q_r$) and quantization block size ($Q_b$). The former measures the number of parameters quantized to low-precision arithmetic normalized by the total parameter count, whereas the latter defines the number of values within a block that share a scaling factor, akin to the block size concept introduced in the FP4 format in NVIDIA’s Blackwell architecture. Through extensive and carefully controlled experiments across different models and quantization methods, we propose a unified scaling law on post-training quantization (PTQ) that can predict loss degeneration for varying $Q_r$ and $Q_b$. For $Q_r$, our scaling law implies that parameter scaling and ratio scaling have a multiplicative relationship. Consequently, larger models are more amenable to a higher quantization ratio $Q_r$, thus supporting an increase in the adoption of mixed quantization for inference. Regarding $Q_b$, our findings indicate that a small block size, similar to that used in Blackwell, is not essential for large models. Employing a small $Q_b$ can instead unnecessarily complicate the design of the hardware circuit.
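
The two metrics defined in this abstract can be made concrete with a small sketch. $Q_r$ follows the stated definition directly. For $Q_b$, the symmetric absmax integer scheme below is an assumed, simplified stand-in chosen for illustration; it is not the specific low-precision format studied in the paper, and the function names are hypothetical.

```python
def quantization_ratio(param_counts, is_low_precision):
    """Q_r: parameters quantized to low precision divided by all parameters."""
    low = sum(n for n, q in zip(param_counts, is_low_precision) if q)
    return low / sum(param_counts)


def blockwise_quantize(values, block_size, levels=7):
    """Quantize values in blocks of block_size (Q_b) that share one scale.

    Assumed scheme: symmetric absmax scaling to integers in [-levels, levels].
    Returns (scales, integer codes).
    """
    scales, codes = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / levels or 1.0  # avoid a zero scale
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return scales, codes


# Three layers; the two larger ones are quantized to low precision.
q_r = quantization_ratio([100, 300, 600], [False, True, True])

weights = [0.875, -0.125, 0.5, 7.0]
scales_small, codes_small = blockwise_quantize(weights, block_size=2)
scales_large, codes_large = blockwise_quantize(weights, block_size=4)
```

With a block size of 2, the small weights get their own scale and keep their relative magnitudes. With a block size of 4, the single outlier 7.0 sets the scale for every value and the small weights collapse toward zero. This is the granularity trade-off that $Q_b$ controls.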

[56] Data Processing for the OpenGPT-X Model Family

Nicolo’ Brandizzi, Hammam Abdelwahab, Anirban Bhowmick, Lennard Helmer, Benny Jörg Stein, Pavel Denisov, Qasid Saleem, Michael Fromm, Mehdi Ali, Richard Rutmann, Farzad Naderi, Mohamad Saif Agy, Alexander Schwirjow, Fabian Küch, Luzian Hahn, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Dennis Wegener, Nicolas Flores-Herr, Joachim Köhler, Johannes Leveling

Main category: cs.CL

TL;DR: The paper outlines the data preparation pipeline for OpenGPT-X, focusing on multilingual LLMs for European languages, detailing data selection, processing, and compliance with EU regulations.

Details

Motivation: To create open, high-performance multilingual LLMs for European languages, ensuring transparency and regulatory compliance.

Method: Distinct pipelines for curated and web data, with specialized filtering and deduplication for web data, and minimal processing for curated data.

Result: A transparent, compliant data pipeline for multilingual LLMs, with insights into challenges and recommendations for future projects.

Conclusion: The project successfully developed a robust data preparation framework, highlighting the importance of tailored processing for different data types and regulatory alignment.

Abstract: This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, starting with the data selection and requirement definition to the preparation of the final filtered data. We distinguish between curated data and web data, as each of these categories is handled by distinct pipelines, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.

[57] Large Language Models Still Exhibit Bias in Long Text

Wonje Jeung, Dongjae Jeon, Ashkan Yousefpour, Jonghyun Choi

Main category: cs.CL

TL;DR: The paper introduces LTF-TEST, a framework to evaluate biases in LLMs for long-text generation, uncovering subtle biases in five models. It proposes FT-REGARD, a finetuning method to mitigate biases.

Details

Motivation: Existing fairness benchmarks for LLMs focus on simple tasks, missing biases in complex scenarios like long-text generation.

Method: LTF-TEST evaluates biases using essay-style prompts across 14 topics and 10 demographic axes. FT-REGARD finetunes models with neutral responses to biased prompts.

Result: LTF-TEST reveals biases favoring certain groups and excessive sensitivity toward disadvantaged groups. FT-REGARD reduces gender bias by 34.6% and improves performance.

Conclusion: LTF-TEST effectively uncovers biases in long-text generation, and FT-REGARD offers a viable solution to mitigate them.

Abstract: Existing fairness benchmarks for large language models (LLMs) primarily focus on simple tasks, such as multiple-choice questions, overlooking biases that may arise in more complex scenarios like long-text generation. To address this gap, we introduce the Long Text Fairness Test (LTF-TEST), a framework that evaluates biases in LLMs through essay-style prompts. LTF-TEST covers 14 topics and 10 demographic axes, including gender and race, resulting in 11,948 samples. By assessing both model responses and the reasoning behind them, LTF-TEST uncovers subtle biases that are difficult to detect in simple responses. In our evaluation of five recent LLMs, including GPT-4o and LLaMa3, we identify two key patterns of bias. First, these models frequently favor certain demographic groups in their responses. Second, they show excessive sensitivity toward traditionally disadvantaged groups, often providing overly protective responses while neglecting others. To mitigate these biases, we propose FT-REGARD, a finetuning approach that pairs biased prompts with neutral responses. FT-REGARD reduces gender bias by 34.6% and improves performance by 1.4 percentage points on the BBQ benchmark, offering a promising approach to addressing biases in long-text generation tasks.

[58] GuARD: Effective Anomaly Detection through a Text-Rich and Graph-Informed Language Model

Yunhe Pang, Bo Chen, Fanjin Zhang, Yanghui Rao, Evgeny Kharlamov, Jie Tang

Main category: cs.CL

TL;DR: GuARD combines graph structure and text semantics for anomaly detection on text-rich graphs, outperforming existing methods with faster training and inference.

Details

Motivation: Existing methods either ignore structural bias in graphs or incur high costs when using large language models (LLMs) for anomaly detection.

Method: GuARD integrates structural features from graphs with semantic attributes from small language models, optimized via multi-modal instruction tuning.

Result: GuARD outperforms graph-based and LLM-based methods, with up to 5x speedup in both training and inference over vanilla long-context LLMs on the large-scale WhoIsWho dataset.

Conclusion: GuARD effectively balances text and structural information for efficient and accurate anomaly detection.

Abstract: Anomaly detection on text-rich graphs is widely prevalent in real life, such as detecting academic papers incorrectly assigned to authors and detecting bots in social networks. The remarkable capabilities of large language models (LLMs) pave a new avenue by utilizing rich-text information for effective anomaly detection. However, simply introducing rich texts into LLMs can obscure essential detection cues and introduce high fine-tuning costs. Moreover, LLMs often overlook the intrinsic structural bias of graphs, which is vital for distinguishing normal from abnormal node patterns. To this end, this paper introduces GuARD, a text-rich and graph-informed language model that combines key structural features from graph-based methods with fine-grained semantic attributes extracted via small language models for effective anomaly detection on text-rich graphs. GuARD is optimized with the progressive multi-modal multi-turn instruction tuning framework in the task-guided instruction tuning regime tailored to incorporate both rich-text and structural modalities. Extensive experiments on four datasets reveal that GuARD outperforms graph-based and LLM-based anomaly detection methods, while offering up to 5$\times$ speedup in training and 5$\times$ speedup in inference over vanilla long-context LLMs on the large-scale WhoIsWho dataset.

[59] Efficient Knowledge Injection in LLMs via Self-Distillation

Kalle Kujanpää, Pekka Marttinen, Harri Valpola, Alexander Ilin

Main category: cs.CL

TL;DR: Prompt distillation outperforms fine-tuning and rivals RAG for knowledge injection in LLMs without needing large teachers or structured data.

Details

Motivation: Efficiently internalize new factual knowledge in LLMs beyond pre-training data, addressing limitations of fine-tuning and RAG.

Method: Uses prompt distillation, a self-distillation method, to internalize knowledge from free-form documents without structured formats or large teacher models.

Result: Outperforms standard fine-tuning and can surpass RAG across various LLM sizes and families.

Conclusion: Prompt distillation is a scalable and effective method for knowledge injection in LLMs, with potential to outperform existing approaches.

Abstract: In many practical applications, large language models (LLMs) need to acquire new knowledge not present in their pre-training data. Efficiently leveraging this knowledge usually relies on supervised fine-tuning or retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. This paper proposes utilizing prompt distillation, a self-distillation-based method previously explored primarily for style alignment and instruction tuning, to internalize new factual knowledge from free-form documents. Unlike prior methods, our approach requires neither larger teacher models nor structured knowledge formats. Across multiple LLM sizes and model families, we show that prompt distillation outperforms standard supervised fine-tuning and can even surpass RAG. We analyze the key factors contributing to prompt distillation’s effectiveness and examine how it scales.

[60] Rationale-guided Prompting for Knowledge-based Visual Question Answering

Zhongjian Hu, Peng Yang, Bing Li, Fengyuan Liu

Main category: cs.CL

TL;DR: PLRH framework improves VQA by prompting LLMs with rationale heuristics (CoT) for better intermediate reasoning, outperforming baselines.

Details

Motivation: Prior methods neglect intermediate thought processes, underutilizing LLMs' potential.

Method: PLRH uses Chain of Thought (CoT) to generate rationale heuristics, inspiring LLMs for answer prediction.

Result: Outperforms existing baselines by more than 2.2 and 2.1 on OK-VQA and A-OKVQA, respectively.

Conclusion: PLRH effectively activates LLMs’ capacities for knowledge-based VQA.

Abstract: Recently, Large Language Models (LLMs) have been used for knowledge-based Visual Question Answering (VQA). Despite the encouraging results of previous studies, prior methods prompt LLMs to predict answers directly, neglecting intermediate thought processes. We argue that prior methods do not sufficiently activate the capacities of LLMs. We propose a framework called PLRH that Prompts LLMs with Rationale Heuristics for knowledge-based VQA. The PLRH prompts LLMs with Chain of Thought (CoT) to generate rationale heuristics, i.e., intermediate thought processes, and then leverages the rationale heuristics to inspire LLMs to predict answers. Experiments show that our approach outperforms the existing baselines by more than 2.2 and 2.1 on OK-VQA and A-OKVQA, respectively.

[61] Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering

Zhongjian Hu, Peng Yang, Bing Li, Zhenqi Wang

Main category: cs.CL

TL;DR: The paper proposes a multi-agent voting framework for LLMs in VQA, addressing challenges like tool autonomy and teamwork by simulating hierarchical agents with assigned tools and voting for final answers.

Details

Motivation: Existing LLM-based VQA methods lack autonomy in using external tools and teamwork capabilities, unlike humans who adaptively use tools and collaborate.

Method: A multi-agent framework with three LLM-based agents simulating hierarchical team roles, each assigned tools, and voting to determine the final answer.

Result: Outperforms baselines on OK-VQA and A-OKVQA by 2.2 and 1.0 points, respectively.

Conclusion: The multi-agent voting framework effectively enhances LLM performance in VQA by mimicking human-like tool use and collaboration.

Abstract: Large Language Models (LLMs) have achieved impressive results in knowledge-based Visual Question Answering (VQA). However, existing methods still face two challenges: the inability to use external tools autonomously, and the inability to work in teams. Humans tend to know whether they need to use external tools when they encounter a new question, e.g., they tend to be able to give a direct answer to a familiar question, whereas they tend to use tools such as search engines when they encounter an unfamiliar question. In addition, humans also tend to collaborate and discuss with others to get better answers. Inspired by this, we propose the multi-agent voting framework. We design three LLM-based agents that simulate different levels of staff in a team, and assign the available tools according to the levels. Each agent provides its own answer, and the final answer is then chosen by a vote over all the agents’ answers. Experiments on OK-VQA and A-OKVQA show that our approach outperforms other baselines by 2.2 and 1.0, respectively.
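
The final voting step can be sketched as a plain majority vote. This is an illustration under assumptions: the abstract does not say how answers are normalized or how ties are broken, so the lowercase normalization and the "earliest answer wins" tie rule below are hypothetical choices, as is the function name.

```python
from collections import Counter
from typing import List


def vote(answers: List[str]) -> str:
    """Return the most frequent normalized answer; ties go to the earliest answer."""
    if not answers:
        raise ValueError("at least one answer is required")
    # Assumed normalization: trim whitespace and ignore case.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Walk the answers in agent order so that ties resolve deterministically.
    for answer in normalized:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


# Three agents answer the same question; two of them agree.
final_answer = vote(["Paris", "paris ", "Lyon"])
```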

[62] Can open source large language models be used for tumor documentation in Germany? – An evaluation on urological doctors’ notes

Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Meike Ressing, Torsten Panholzer

Main category: cs.CL

TL;DR: Open-source LLMs (7-12B parameters) show promise for automating tumor documentation in Germany, with models like Llama 3.1 8B and Mistral 7B performing well. Few-shot prompting and cross-domain examples improve results.

Details

Motivation: Manual tumor documentation is inefficient. LLMs could enhance this process by improving efficiency and reliability.

Method: Evaluated 11 open-source LLMs (1-70B parameters) on tasks like tumor diagnosis identification, ICD-10 coding, and date extraction using annotated urology notes. Tested few-shot prompting and cross-domain examples.

Result: Models with 7-12B parameters (e.g., Llama 3.1 8B, Mistral 7B) performed best. Larger models didn’t improve performance. Cross-domain examples boosted few-shot results.

Conclusion: Open-source LLMs (7-12B parameters) are promising for clinical documentation. Tailored fine-tuning and prompting could make them valuable tools.

Abstract: Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1 to 70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors’ notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.

[63] RLTHF: Targeted Human Feedback for LLM Alignment

Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra

Main category: cs.CL

TL;DR: RLTHF is a hybrid human-AI framework that minimizes human annotation effort in LLM alignment by combining LLM-based initial alignment with selective human corrections, achieving full-human annotation-level alignment with only 6-7% of the effort.

Details

Motivation: High costs and limitations of human annotations in RLHF and AI Feedback necessitate a more efficient alignment method.

Method: RLTHF uses LLM-based initial alignment, identifies mislabeled samples via reward distribution, and iteratively integrates human corrections.

Result: Achieves full-human annotation alignment with 6-7% effort; downstream models outperform those trained on fully human-annotated datasets.

Conclusion: RLTHF effectively reduces human effort while maintaining or improving alignment quality.

Abstract: Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model’s reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM’s correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF’s curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
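
The abstract does not spell out the selection rule, so the following is only a sketch of one plausible reading: assume each LLM-labeled preference pair gets a reward margin (reward of the LLM-preferred response minus reward of the other), and route the lowest-margin pairs to human annotators within a fixed budget. The margin definition, the budget parameter, and the function name are all assumptions, not the paper's procedure.

```python
from typing import List


def select_for_human_review(margins: List[float], budget_fraction: float) -> List[int]:
    """Return indices of the lowest-margin samples, limited to a fraction of the dataset."""
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must lie in [0, 1]")
    k = int(len(margins) * budget_fraction)
    # Low or negative margins suggest the reward model disagrees with the LLM label.
    ranked = sorted(range(len(margins)), key=lambda i: margins[i])
    return sorted(ranked[:k])


# Hypothetical reward margins for ten LLM-labeled preference pairs.
margins = [2.1, -0.4, 0.05, 1.3, -1.2, 0.9, 3.0, 0.2, 1.8, 2.5]
to_review = select_for_human_review(margins, budget_fraction=0.2)
```

In an iterative setup, the corrected labels would be merged back, the reward model retrained, and the selection repeated; that loop is omitted here.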

[64] Which Questions Improve Learning the Most? Utility Estimation of Questions with LM-based Simulations

Dong-Ho Lee, Hyundong Cho, Jonathan May, Jay Pujara

Main category: cs.CL

TL;DR: QUEST introduces a framework to evaluate and generate high-utility questions by simulating learners and measuring direct impact on exam performance, outperforming baselines by 20%.

Details

Motivation: Prior work evaluates questions indirectly (e.g., salience), lacking direct correlation with learning outcomes. QUEST aims to bridge this gap by focusing on measurable improvements in exam performance.

Method: QUEST uses language models to simulate learners studying textbook chapters and taking exams. It estimates question utility by their direct effect on exam scores and fine-tunes question generators via rejection sampling.

Result: QUEST-trained models improve simulated test scores by over 20% compared to baselines. Utility is weakly correlated with salience, indicating unique benefits for learning.

Conclusion: QUEST provides an outcome-driven approach for question evaluation and generation, emphasizing measurable learning improvements over indirect metrics.

Abstract: Asking good questions is critical for comprehension and learning, yet evaluating and generating such questions remains a challenging problem. Prior work on inquisitive questions focuses on learner-generated, curiosity-driven queries and evaluates them using indirect metrics, such as salience or information gain, that do not directly capture a question’s impact on actual learning outcomes. We introduce QUEST (Question Utility Estimation with Simulated Tests), a framework that uses language models to simulate learners and directly quantify the utility of a question - its contribution to exam performance. QUEST simulates a learner who asks questions and receives answers while studying a textbook chapter, then uses them to take an end-of-chapter exam. Through this simulation, the utility of each question is estimated by its direct effect on exam performance, rather than inferred indirectly based on the underlying content. To support this evaluation, we curate TEXTBOOK-EXAM, a benchmark that aligns textbook sections with end-of-section exam questions across five academic disciplines. Using QUEST, we filter for high-utility questions and fine-tune question generators via rejection sampling. Experiments show that questions generated by QUEST-trained models improve simulated test scores by over 20% compared to strong baselines that are fine-tuned using indirect metrics or leverage prompting methods. Furthermore, utility is only weakly correlated with salience and similarity to exam questions, suggesting that it captures unique signal that benefits downstream performance. QUEST offers a new outcome-driven paradigm for question evaluation and generation - one that moves beyond question-answer content toward measurable improvements in learning outcomes.
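
The utility estimate and the rejection-sampling filter can be sketched as follows. In QUEST the exam score comes from a language model simulating a learner; here it is replaced by a toy stand-in, and the subtraction of a no-question baseline, the threshold, and all names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

ExamScore = Callable[[Sequence[str]], float]


def question_utility(exam_score: ExamScore, question: str) -> float:
    """Gain in exam score from studying with this question versus with none (assumed baseline)."""
    return exam_score([question]) - exam_score([])


def rejection_sample(candidates: List[str], exam_score: ExamScore, threshold: float) -> List[str]:
    """Keep only the candidate questions whose estimated utility exceeds the threshold."""
    return [q for q in candidates if question_utility(exam_score, q) > threshold]


# Toy stand-in for the simulated learner: the score is the share of exam topics
# that the asked questions mention. A real setup would query a language model.
EXAM_TOPICS = {"osmosis", "mitosis"}


def toy_exam_score(questions: Sequence[str]) -> float:
    covered = {t for t in EXAM_TOPICS for q in questions if t in q.lower()}
    return len(covered) / len(EXAM_TOPICS)


candidates = ["What drives osmosis?", "Who wrote the textbook?"]
kept = rejection_sample(candidates, toy_exam_score, threshold=0.0)
```

The kept questions would then serve as fine-tuning data for a question generator.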

[65] The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory

Robin Schmucker, Steven Moore

Main category: cs.CL

TL;DR: The study explores the predictive validity of Item-Writing Flaw (IWF) rubrics for estimating IRT parameters, revealing significant links between IWFs and item difficulty/discrimination in STEM subjects.

Details

Motivation: Traditional validation methods for test items are resource-intensive, and the predictive validity of IWF rubrics for IRT parameters is underexplored.

Method: Analyzed 7,126 multiple-choice questions using a 19-criteria IWF rubric and studied relationships to IRT parameters.

Result: Found statistically significant links between IWFs and IRT parameters, with specific flaws impacting item quality differently.

Conclusion: Automated IWF analysis is a valuable supplement to traditional validation, but further research is needed for domain-general rubrics and domain-specific algorithms.

Abstract: High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a domain-general approach for evaluating test items based on textual features. This method offers a scalable, pre-deployment evaluation without requiring student data, but its predictive validity concerning empirical IRT parameters is underexplored. To address this gap, we conducted a study involving 7,126 multiple-choice questions across various STEM subjects (physical science, mathematics, and life/earth sciences). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in life/earth and physical science domains. We further observed how specific IWF criteria can impact item quality more and less severely (e.g., negative wording vs. implausible distractors) and how they might make a question more or less challenging. Overall, our findings establish automated IWF analysis as a valuable supplement to traditional validation, providing an efficient method for initial item screening, particularly for flagging low-difficulty MCQs. Our findings show the need for further research on domain-general evaluation rubrics and algorithms that understand domain-specific content for robust item validation.
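
For readers unfamiliar with the two IRT parameters, here is a small sketch that assumes the standard two-parameter logistic (2PL) model; the abstract does not state which IRT model the study fits, and the function name is hypothetical.

```python
import math


def p_correct(theta: float, difficulty: float, discrimination: float) -> float:
    """2PL probability that a learner with ability `theta` answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# A learner whose ability equals the item's difficulty succeeds half the time.
at_difficulty = p_correct(theta=0.5, difficulty=0.5, discrimination=1.2)

# Higher discrimination separates learners below and above the difficulty more sharply.
gap_low = p_correct(1.0, 0.5, 0.5) - p_correct(0.0, 0.5, 0.5)
gap_high = p_correct(1.0, 0.5, 2.0) - p_correct(0.0, 0.5, 2.0)
```

Under this model, an item-writing flaw that gives away the answer would show up as lower estimated difficulty, and one that confuses strong and weak students alike as lower discrimination.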

[66] Language Model Uncertainty Quantification with Attention Chain

Yinghao Li, Rushi Qiang, Lama Moukheiber, Chao Zhang

Main category: cs.CL

TL;DR: UQAC is a method for efficiently quantifying uncertainty in LLMs by narrowing the reasoning space via an attention chain, improving reliability and computational efficiency.

Details

Motivation: Existing uncertainty quantification methods struggle with complex reasoning steps in LLM responses, leading to overconfidence.

Method: UQAC constructs an attention chain of semantically crucial tokens via backtracking, refining it with similarity filtering and probability thresholding.

Result: Validated on reasoning benchmarks, UQAC provides reliable uncertainty estimates efficiently.

Conclusion: UQAC addresses the challenge of uncertainty quantification in complex reasoning tasks, offering a practical solution.

Abstract: Accurately quantifying a large language model’s (LLM) predictive uncertainty is crucial for judging the reliability of its answers. While most existing research focuses on short, directly answerable questions with closed-form outputs (e.g., multiple-choice), involving intermediate reasoning steps in LLM responses is increasingly important. This added complexity complicates uncertainty quantification (UQ) because the probabilities assigned to answer tokens are conditioned on a vast space of preceding reasoning tokens. Direct marginalization is infeasible, and the dependency inflates probability estimates, causing overconfidence in UQ. To address this, we propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. UQAC iteratively constructs an “attention chain” of tokens deemed “semantically crucial” to the final answer via a backtracking procedure. Starting from the answer tokens, it uses attention weights to identify the most influential predecessors, then iterates this process until reaching the input tokens. The resulting chain is further refined with similarity filtering and probability thresholding, which reduce the reasoning space, facilitating the approximation of the marginal answer token probabilities. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs, demonstrating that it consistently delivers reliable UQ estimates with high computational efficiency.
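
The backtracking step can be sketched on a toy attention matrix. This is a simplified illustration, not UQAC itself: it assumes a single aggregated attention matrix, keeps a fixed number of top predecessors per step, and omits the similarity filtering and probability thresholding. All names and numbers are hypothetical.

```python
from typing import List, Set


def attention_chain(attn: List[List[float]], answer_idx: List[int],
                    num_input: int, top_k: int = 1) -> List[int]:
    """Backtrack from answer tokens through top-attended predecessors.

    attn[i][j] is the attention weight from token i to an earlier token j (j < i).
    Tokens with index below `num_input` are input tokens, where backtracking stops.
    Returns the sorted indices of the reasoning tokens on the chain.
    """
    chain: Set[int] = set()
    frontier = list(answer_idx)
    while frontier:
        next_frontier = []
        for i in frontier:
            # The most influential predecessors of token i.
            preds = sorted(range(i), key=lambda j: attn[i][j], reverse=True)[:top_k]
            for j in preds:
                if j >= num_input and j not in chain and j not in answer_idx:
                    chain.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return sorted(chain)


# Tokens 0-1 are input, 2-4 are reasoning, 5 is the answer.
attn = [
    [],
    [1.0],
    [0.9, 0.1],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.7, 0.1],
    [0.05, 0.05, 0.1, 0.1, 0.7],
]
chain = attention_chain(attn, answer_idx=[5], num_input=2)
```

Here the answer attends mostly to token 4, which attends mostly to token 2, which attends mostly to an input token, so token 3 drops out of the reasoning space.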

[67] You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation

Gergely Flamich, David Vilar, Jan-Thorsten Peter, Markus Freitag

Main category: cs.CL

TL;DR: The paper argues that single-score evaluations in machine translation fail to capture the tradeoff between semantic accuracy and naturalness, advocating for a two-dimensional accuracy-naturalness plane instead.

Details

Motivation: Current machine translation evaluations use a single score to measure both semantic accuracy and naturalness, which oversimplifies system performance.

Method: The authors use information theory to mathematically prove the tradeoff and empirically demonstrate it using WMT24 shared task submissions.

Result: They show that optimizing a system for an accuracy metric (e.g., BLEU) initially improves naturalness, but overfitting to the metric can significantly degrade it, highlighting the need for a two-dimensional evaluation.

Conclusion: The paper recommends evaluating translations on an accuracy-naturalness plane rather than relying on a single score.

Abstract: The goal of translation, be it by human or by machine, is, given some text in a source language, to produce text in a target language that simultaneously 1) preserves the meaning of the source text and 2) achieves natural expression in the target language. However, researchers in the machine translation community usually assess translations using a single score intended to capture semantic accuracy and the naturalness of the output simultaneously. In this paper, we build on recent advances in information theory to mathematically prove and empirically demonstrate that such single-score summaries do not and cannot give the complete picture of a system’s true performance. Concretely, we prove that a tradeoff exists between accuracy and naturalness and demonstrate it by evaluating the submissions to the WMT24 shared task. Our findings help explain well-known empirical phenomena, such as the observation that optimizing translation systems for a specific accuracy metric (like BLEU) initially improves the system’s naturalness, while “overfitting” the system to the metric can significantly degrade its naturalness. Thus, we advocate for a change in how translations are evaluated: rather than comparing systems using a single number, they should be compared on an accuracy-naturalness plane.
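
One way to compare systems on such a plane is to keep only those that no other system beats on both axes. The Pareto-front framing, the scores, and the names below are illustrative assumptions; the paper's own analysis is information-theoretic and is not reproduced here.

```python
from typing import Dict, List, Tuple


def pareto_front(systems: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of systems not dominated on (accuracy, naturalness); higher is better on both."""
    front = []
    for name, (acc, nat) in systems.items():
        dominated = any(
            a >= acc and n >= nat and (a > acc or n > nat)
            for other, (a, n) in systems.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (accuracy, naturalness) scores for four translation systems.
systems = {
    "A": (0.9, 0.6),  # most accurate, less natural
    "B": (0.7, 0.9),  # most natural, less accurate
    "C": (0.6, 0.5),  # worse than the others on both axes
    "D": (0.8, 0.8),  # balanced
}
front = pareto_front(systems)
```

A single combined score would have to rank A, B, and D against each other, which hides that each represents a different point on the tradeoff.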

[68] PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, Maarten Sap

Main category: cs.CL

TL;DR: POLYGUARD introduces a multilingual safety model and datasets to improve moderation for LLMs, outperforming existing classifiers by 5.5%.

Details

Motivation: Current safety moderation for LLMs is limited to few languages and narrow safety definitions, creating gaps in capabilities.

Method: Developed POLYGUARD using POLYGUARDMIX (1.91M samples in 17 languages) and POLYGUARDPROMPTS (29K samples) for training and evaluation.

Result: POLYGUARD outperforms existing safety classifiers by 5.5% in evaluations.

Conclusion: The work advances safer multilingual LLMs for global users.

Abstract: Truly multilingual safety moderation efforts for Large Language Models (LLMs) have been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) as well as a limited scope of safety definition, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, and the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high quality multilingual benchmark with 29K samples for the evaluation of safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions and human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs with labels of prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.

[69] DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding

Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

Main category: cs.CL

TL;DR: DEL (Dynamic Exit Layer) is a plug-and-play method for Speculative Decoding (SD) that dynamically selects exit layers and speculation lengths, achieving significant speedups over auto-regressive decoding and outperforming static SD methods.

Details

Motivation: Current SD methods use static hyperparameters for exit layers and speculation lengths, which are task-specific and context-dependent, limiting performance. DEL addresses this by dynamically adapting these parameters during inference.

Method: DEL dynamically tracks token acceptance rates for each layer and heuristically selects optimal exit layers and speculation lengths, improving efficiency without compromising quality.

Result: DEL achieves speedups of 2.16×∼2.62× over auto-regressive decoding and outperforms state-of-the-art SD methods by up to 0.19×.

Conclusion: DEL provides a flexible and efficient solution for accelerating LLM inference, demonstrating superior performance across diverse tasks and models.

Abstract: Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL (Dynamic Exit Layer), a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate that would result if the tokens were drafted at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times \sim 2.62\times$ over vanilla auto-regressive decoding and improves upon state-of-the-art SD methods, which peak at $2.43\times$, by up to $0.19\times$. The code is available at https://github.com/hoenza/DEL.
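
The selection idea can be sketched with a simple cost model. This is not DEL's actual heuristic: the sketch assumes each drafted token is accepted independently with the tracked per-layer rate, that drafting one token at exit layer l costs l / L of a full forward pass, and that verification costs one full pass. The function names and the acceptance rates are hypothetical.

```python
from typing import Dict, Tuple


def expected_tokens(acceptance: float, draft_len: int) -> float:
    """Expected tokens produced per round: 1 + a + a^2 + ... + a^d."""
    return sum(acceptance ** k for k in range(draft_len + 1))


def choose_exit(acceptance_by_layer: Dict[int, float], num_layers: int,
                max_draft_len: int) -> Tuple[int, int]:
    """Pick the (exit layer, speculation length) with the most expected tokens per unit cost."""
    best_score = -1.0
    best_choice = (0, 0)
    for layer, acceptance in acceptance_by_layer.items():
        for draft_len in range(1, max_draft_len + 1):
            # Assumed cost: partial passes for drafting plus one full verification pass.
            cost = draft_len * layer / num_layers + 1.0
            score = expected_tokens(acceptance, draft_len) / cost
            if score > best_score:
                best_score = score
                best_choice = (layer, draft_len)
    return best_choice


# Hypothetical running acceptance rates: deeper exits draft better but cost more.
tracked = {4: 0.5, 8: 0.8, 16: 0.95}
layer, draft_len = choose_exit(tracked, num_layers=32, max_draft_len=6)
```

Because the tracked rates change with the sequence context, rerunning the selection after each round would let the choice adapt during generation, which is the behavior the abstract attributes to DEL.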

[70] Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students’ (Mis)Understanding Is Hinted

Machi Shimmei, Masaki Uto, Yuichiroh Matsubayashi, Kentaro Inui, Aditi Mallavarapu, Noboru Matsuda

Main category: cs.CL

TL;DR: AnaQuest is a prompting technique for generating MCQs using a pre-trained language model, validated via IRT and expert ratings.

Details

Motivation: To develop an innovative MCQ generation method integrating formative and summative assessments.

Method: AnaQuest generates MCQs from students' free-text responses; the resulting items are compared with ChatGPT-generated and human-crafted items using IRT.

Result: AnaQuest MCQs, especially foils, closely resemble human-crafted items in difficulty and discrimination.

Conclusion: AnaQuest outperforms ChatGPT in generating human-like MCQs, validated by experts and IRT.

Abstract: The primary goal of this study is to develop and evaluate an innovative prompting technique, AnaQuest, for generating multiple-choice questions (MCQs) using a pre-trained large language model. In AnaQuest, the choice items are sentence-level assertions about complex concepts. The technique integrates formative and summative assessments. In the formative phase, students answer open-ended questions for target concepts in free text. For summative assessment, AnaQuest analyzes these responses to generate both correct and incorrect assertions. To evaluate the validity of the generated MCQs, Item Response Theory (IRT) was applied to compare item characteristics between MCQs generated by AnaQuest, a baseline ChatGPT prompt, and human-crafted items. An empirical study found that expert instructors rated MCQs generated by both AI models to be as valid as those created by human instructors. However, IRT-based analysis revealed that AnaQuest-generated questions - particularly those with incorrect assertions (foils) - more closely resembled human-crafted items in terms of difficulty and discrimination than those produced by ChatGPT.
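For readers unfamiliar with the two item characteristics mentioned above, a minimal sketch of a two-parameter logistic (2PL) item response function follows. The abstract does not state which IRT model the authors fit, so the 2PL form and the function name p_correct are assumptions chosen because 2PL is a standard model with exactly a difficulty and a discrimination parameter.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function (assumed model).

    theta: student ability, a: item discrimination, b: item difficulty.
    Returns the probability of a correct answer.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

A student whose ability equals the item difficulty has a 50% chance of answering correctly, and a larger discrimination value makes the probability change more sharply around that point.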

[71] Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

Main category: cs.CL

TL;DR: Flex-Judge is a reasoning-guided multimodal judge model that uses minimal textual reasoning data to generalize across modalities, outperforming traditional methods.

Details

Motivation: Human-generated reward signals are costly, and existing LLM evaluators lack generalization across multimodal tasks.

Method: Flex-Judge leverages structured textual reasoning explanations to transfer decision-making patterns to multimodal judgments.

Result: Flex-Judge achieves competitive or superior performance with fewer training data compared to commercial APIs and multimodal evaluators.

Conclusion: Reasoning-based text supervision is a cost-effective alternative to annotation-intensive methods, advancing scalable multimodal evaluation.

Abstract: Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly fewer text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities like molecule, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

[72] Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

Uttej Kallakurik, Edward Humes, Rithvik Jonna, Xiaomin Lin, Tinoosh Mohsenin

Main category: cs.CL

TL;DR: A novel medical assistant system optimizes LLMs for edge devices via neuron pruning and quantization, achieving real-time, energy-efficient inference.

Details

Motivation: LLMs are too large for real-time, resource-constrained healthcare environments like edge devices.

Method: Neuron saliency-based pruning and post-training quantization to compress LLMs.

Result: 50% compressed Gemma and 67% compressed LLaMA3 models achieve real-time inference on edge devices.

Conclusion: The method enables efficient deployment of LLMs in specialized domains like healthcare.

Abstract: Large Language Models (LLMs) have a significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.
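The prune-then-quantize pipeline can be sketched in a few lines. This is a toy illustration, not the authors' framework: the saliency measure (mean absolute activation on domain samples), the symmetric per-tensor int8 scheme, and all function names are assumptions, since the abstract does not give these details.

```python
def neuron_saliency(activations):
    """Mean absolute activation per neuron over domain samples (assumed measure).

    activations: list of rows, one row of per-neuron activations per sample.
    """
    n = len(activations)
    return [sum(abs(row[j]) for row in activations) / n
            for j in range(len(activations[0]))]


def keep_indices(saliency, keep_ratio):
    """Indices of the most salient neurons, returned in their original order."""
    k = max(1, round(len(saliency) * keep_ratio))
    top = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)[:k]
    return sorted(top)


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization; returns (integers, scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale
```

Multiplying each stored integer by the returned scale gives back an approximation of the original weight, which is where the memory saving over floating-point storage comes from.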

[73] Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation

Xiangyan Chen, Yujian Gan, Yimeng Gu, Matthew Purver

Main category: cs.CL

TL;DR: Proposes two graph-based frameworks (TG-DRG and GA-DRG) to reduce hallucinations in LLM-generated dialogue responses, introducing a new factuality metric and showing improvements over baselines.

Details

Motivation: Address the issue of LLMs hallucinating in dialogue tasks, which leads to factually incorrect responses.

Method: Introduces TG-DRG and GA-DRG, combining reasoning-guided reformulation, knowledge selection, and graph-enhanced generation. Also proposes a dialogue fact score for evaluation.

Result: Achieves 3.47% and 3.12% improvements in factuality on OpendialKG and HybriDialogue datasets, respectively.

Conclusion: The frameworks effectively enhance factual consistency in dialogue responses, outperforming existing methods.

Abstract: Large Language Models (LLMs) succeed in many natural language processing tasks. However, their tendency to hallucinate - generate plausible but inconsistent or factually incorrect text - can cause significant problems in certain tasks, including response generation in dialogue. To mitigate this issue, we propose two novel graph knowledge-augmented frameworks, Dialogue Response Generation via Textualised Graphs (TG-DRG) and Graph-Aware Dialogue Response Generation (GA-DRG), which combine reasoning-guided dialogue reformulation, dialogue sense knowledge selection, and graph-enhanced response generation to improve the factuality of dialogue responses. To evaluate the factuality of generated responses, we propose a dialogue fact score that addresses the limitations of existing fact-score methods in dialogue settings, providing a more reliable assessment of factual consistency. We evaluate our methods using different baselines on the OpendialKG and HybriDialogue datasets. Our methods noticeably improve factuality compared to other graph knowledge-augmentation baselines, including the state-of-the-art G-retriever, achieving improvements of 3.47% on OpendialKG and 3.12% on HybriDialogue in terms of dialogue fact score. The code will be released on GitHub.

[74] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

Natapong Nitarach, Warit Sirichotedumrong, Panop Pitchayarthorn, Pittawat Taveekitworachai, Potsawee Manakul, Kunat Pipatanakul

Main category: cs.CL

TL;DR: FinCoT is a structured CoT prompting framework for financial NLP, improving model accuracy and interpretability by embedding expert financial reasoning blueprints.

Details

Motivation: Address the gap in structured CoT prompting for financial NLP, which lacks domain expertise and remains underexplored.

Method: Evaluate three prompting styles (standard, unstructured CoT, structured CoT) and introduce FinCoT, incorporating expert blueprints.

Result: FinCoT boosts accuracy (e.g., Qwen3-8B-Base from 63.2% to 80.5%) and reduces output length, especially for models without financial post-training.

Conclusion: FinCoT enhances performance, reduces costs, and provides interpretable, expert-aligned reasoning in financial NLP.

Abstract: This paper presents FinCoT, a structured chain-of-thought (CoT) prompting framework that embeds domain-specific expert financial reasoning blueprints to guide large language models’ behaviors. We identify three main prompting styles in financial NLP (FinNLP): (1) standard prompting (zero-shot), (2) unstructured CoT (free-form reasoning), and (3) structured CoT (with explicitly structured reasoning steps). Prior work has mainly focused on the first two, while structured CoT remains underexplored and lacks domain expertise incorporation. Therefore, we evaluate all three prompting approaches across ten CFA-style financial domains and introduce FinCoT as the first structured finance-specific prompting approach incorporating blueprints from domain experts. FinCoT improves the accuracy of a general-purpose model, Qwen3-8B-Base, from 63.2% to 80.5%, and boosts Fin-R1 (7B), a finance-specific model, from 65.7% to 75.7%, while reducing output length by up to 8.9x and 1.16x compared to structured CoT methods, respectively. We find that FinCoT proves most effective for models lacking financial post-training. Our findings show that FinCoT not only improves performance and reduces inference costs but also yields more interpretable and expert-aligned reasoning traces.

[75] Can Vision Language Models Understand Mimed Actions?

Hyundong Cho, Spencer Lin, Tejas Srinivasan, Michael Saxon, Deuksin Kwon, Natali T. Chavez, Jonathan May

Main category: cs.CL

TL;DR: The paper introduces MIME, a benchmark for evaluating vision-language models on mimed actions, highlighting their poor performance compared to humans.

Details

Motivation: Studying nonverbal communication (NVC) is challenging due to its broad scope and variance. Mime, a subset of NVC, offers a more controlled way to study embodied actions with lower interpretation variance.

Method: Proposes MIME, a video-based QA benchmark with 86 mimed actions, constructed using motion capture data and variations for robustness testing.

Result: Vision-language models perform significantly worse than humans on MIME, indicating a gap in understanding human gestures.

Conclusion: More research is needed to improve models’ robust understanding of human gestures, with MIME serving as a benchmark.

Abstract: Nonverbal communication (NVC) plays an integral role in human language, but studying NVC in general is challenging because of its broad scope and high variance in interpretation among individuals and cultures. However, mime – the theatrical technique of suggesting intent using only gesture, expression, and movement – is a subset of NVC that consists of explicit and embodied actions with much lower human interpretation variance. We argue that a solid understanding of mimed actions is a crucial prerequisite for vision-language models capable of interpreting and commanding more subtle aspects of NVC. Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question answering benchmark comprising 86 mimed actions. Constructed with motion capture data, MIME consists of variations of each action with perturbations applied to the character, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans on MIME, motivating the need for increased research into instilling a more robust understanding of human gestures.

[76] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

Tian Lan, Xiangdong Su, Xu Liu, Ruirui Wang, Ke Chang, Jiang Li, Guanglai Gao

Main category: cs.CL

TL;DR: A new benchmark (McBE) is introduced to evaluate biases in Chinese LLMs, covering diverse categories and tasks, revealing biases in popular models.

Details

Motivation: Existing bias evaluation datasets are limited to English/North American culture and lack multi-task evaluation, necessitating a Chinese-focused benchmark.

Method: Developed McBE with 4,077 instances, 12 bias categories, 82 subcategories, and 5 evaluation tasks to assess LLMs comprehensively.

Result: Popular LLMs exhibited varying degrees of bias, analyzed in-depth for novel insights.

Conclusion: McBE addresses gaps in bias evaluation for Chinese LLMs, providing a comprehensive tool for ethical AI development.

Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually being revealed. Therefore, measuring biases in LLMs is crucial to mitigate their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories and 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measurement comprehensiveness. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.

[77] Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

Main category: cs.CL

TL;DR: PAPO, a novel policy gradient algorithm, improves multimodal reasoning in RLVR by integrating perception learning with reasoning, achieving significant performance gains and reducing perception errors.

Details

Motivation: Current RLVR methods are suboptimal for multimodal tasks due to poor visual input perception, prompting the need for a perception-aware approach.

Method: PAPO introduces Implicit Perception Loss (KL divergence) and Double Entropy Loss, enhancing perception without extra data or models.

Result: PAPO improves performance by 4.4%-17.5% on benchmarks and reduces perception errors by 30.5%, with higher gains in vision-dependent tasks.

Conclusion: PAPO advances RLVR by integrating perception into learning, enabling visually grounded reasoning and setting a foundation for future RL frameworks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
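The abstract states only that the Implicit Perception Loss takes the form of a KL divergence term; it does not say which two distributions are compared. The sketch below therefore shows just the generic computation of KL(p || q) between two discrete next-token distributions, with p and q as hypothetical placeholders rather than PAPO's actual inputs.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

The value is zero when the two distributions are identical and grows as they diverge, which is what makes it usable as a plug-in term alongside a policy-gradient objective.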

[78] Efficient Attention Mechanisms for Large Language Models: A Survey

Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

Main category: cs.CL

TL;DR: A survey on efficient attention mechanisms in Transformer-based models, addressing quadratic complexity issues in self-attention for long-context modeling.

Details

Motivation: The quadratic time and memory complexity of self-attention in Transformers hinders efficient long-context modeling, prompting research into scalable solutions.

Method: Two main approaches are explored: linear attention (kernel approximations, recurrent formulations, fast-weight dynamics) and sparse attention (fixed patterns, block-wise routing, clustering).

Result: The survey integrates algorithmic innovations and hardware considerations, analyzing their application in large-scale pre-trained language models.

Conclusion: This work serves as a foundational reference for designing scalable and efficient language models by aligning theory with practical deployment.

Abstract: Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into large-scale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
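To make the kernel-approximation idea concrete, here is a minimal pure-Python sketch of non-causal linear attention. It is one possible instance, not the survey's reference implementation: the feature map elu(x) + 1 is a commonly used choice assumed here, and the names phi and linear_attention are hypothetical.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise (assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Non-causal linear attention.

    Keys and values are summarized once into a d x d_v matrix and a length-d
    normalizer, so the cost grows linearly with sequence length instead of
    quadratically.
    """
    d, dv = len(keys[0]), len(values[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum over j of phi(k_j) v_j^T
    z = [0.0] * d                        # sum over j of phi(k_j)
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out
```

The result matches attention with weights phi(q_i) · phi(k_j) computed pairwise; the saving comes purely from reordering the matrix products so that no n × n weight matrix is ever formed.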

[79] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers

Brittney Exline, Melanie Duffin, Brittany Harbison, Chrissa da Gomez, David Joyner

Main category: cs.CL

TL;DR: The paper explores how native vs. non-native English speaker status impacts peer feedback in online U.S. CS courses, revealing differences in sentiment and ratings.

Details

Motivation: To understand the role of language background in peer feedback experiences, given the growing enrollment of international students in U.S. CS programs.

Method: Analyzed sentiment of peer reviews using Twitter-roBERTa on a sample of 500 students, correlating sentiment scores and ratings with language background.

Result: Native speakers rate feedback less favorably; non-native speakers write more positively but receive less positive sentiment. Interactions with sex and age were significant.

Conclusion: Language background modestly but complexly influences peer feedback experiences, highlighting nuanced dynamics in online learning.

Abstract: Graduate-level CS programs in the U.S. increasingly enroll international students, with 60.2 percent of master’s degrees in 2023 awarded to non-U.S. students. Many of these students take online courses, where peer feedback is used to engage students and improve pedagogy in a scalable manner. Since these courses are conducted in English, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Using the Twitter-roBERTa-based model, we analyze the sentiment of peer reviews written by and to a random sample of 500 students. We then relate sentiment scores and peer feedback ratings to students’ language background. Results show that native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge, suggesting that language background plays a modest but complex role in shaping peer feedback experiences.

[80] TreeDiff: AST-Guided Code Generation with Diffusion LLMs

Yiming Zeng, Jinghan Cao, Zexin Li, Yiming Chen, Tao Ren, Dawei Xiang, Xidong Wu, Shangqian Gao, Tingting Yu

Main category: cs.CL

TL;DR: Proposes a syntax-aware diffusion framework for code generation, improving correctness and generalization by incorporating AST-derived structural priors.

Details

Motivation: Diffusion models struggle with structured domains like code due to strict syntactic rules. Standard token-level corruption ignores structure, hindering meaningful code representation.

Method: Introduces syntax-aware diffusion, selectively corrupting AST-derived subtrees instead of random tokens to respect grammatical boundaries.

Result: Syntax-aware corruption improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns.

Conclusion: Incorporating structural information into diffusion models is promising for advancing code generation tasks.

Abstract: Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model’s ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
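A rough sketch of subtree-level corruption, using Python's standard ast module, is shown below. It illustrates the general idea only: masking exactly one randomly chosen statement or expression subtree with a single mask token, and the names subtree_spans and mask_random_subtree, are assumptions of this sketch and not the paper's corruption schedule.

```python
import ast
import random


def subtree_spans(source: str):
    """Byte-offset (begin, end) spans of every statement/expression subtree."""
    data = source.encode("utf-8")
    # Offset of each line start; ast reports columns as UTF-8 byte offsets.
    starts = [0]
    for line in data.splitlines(keepends=True):
        starts.append(starts[-1] + len(line))
    spans = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.stmt, ast.expr)):
            begin = starts[node.lineno - 1] + node.col_offset
            end = starts[node.end_lineno - 1] + node.end_col_offset
            spans.append((begin, end))
    return spans


def mask_random_subtree(source: str, rng: random.Random, mask: str = "<MASK>") -> str:
    """Replace one randomly chosen subtree span with a single mask token."""
    data = source.encode("utf-8")
    begin, end = rng.choice(subtree_spans(source))
    return (data[:begin] + mask.encode("utf-8") + data[end:]).decode("utf-8")
```

Because every masked span corresponds to a complete AST node, the corrupted region always starts and ends on a grammatical boundary, unlike random token masking, which can cut through the middle of an expression.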

[81] CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions

Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim

Main category: cs.CL

TL;DR: CUPID benchmark evaluates LLMs’ ability to infer dynamic user preferences from multi-turn interactions, showing current models struggle with under 50% precision and 65% recall.

Details

Motivation: Humans have dynamic preferences that change with context, but LLMs often assume static preferences. This misalignment needs addressing.

Method: Introduces CUPID, a benchmark of 756 human-curated interaction sessions, to test LLMs’ ability to infer and apply contextual preferences.

Result: State-of-the-art LLMs perform poorly, with under 50% precision and 65% recall, failing to discern relevant context.

Conclusion: Highlights the need for improved LLM capabilities in contextual personalization and proposes CUPID as a tool for advancement.

Abstract: Personalization of Large Language Models (LLMs) often assumes users hold static preferences that reflect globally in all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies this preference. With CUPID, we evaluated 10 open and proprietary LLMs, revealing that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and fail to discern what previous context is relevant to a new request – under 50% precision and 65% recall. Our work highlights the need to advance LLM capabilities for more contextually personalized interactions and proposes CUPID as a resource to drive these improvements.

[82] The SMeL Test: A simple benchmark for media literacy in language models

Gustaf Ahdritz, Anat Kleiman

Main category: cs.CL

TL;DR: The paper introduces the Synthetic Media Literacy Test (SMeL Test) to evaluate LLMs’ ability to filter untrustworthy content, finding no model consistently succeeds, with even top models hallucinating up to 70% of the time.

Details

Motivation: The internet contains unreliable content, and it's unclear if LLMs can effectively filter it like humans.

Method: The SMeL Test benchmarks various instruction-tuned LLMs, including reasoning models, on filtering untrustworthy information.

Result: No model consistently succeeds; reasoning helps but top models still hallucinate frequently, and larger models don’t always outperform smaller ones.

Conclusion: The study highlights LLMs’ limitations in filtering unreliable content and calls for new methods to address this hallucination issue.

Abstract: The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently succeeds; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.

[83] VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu

Main category: cs.CL

TL;DR: VeOmni is a modular training framework for omni-modal LLMs, improving efficiency and scalability by decoupling communication from computation and supporting flexible modality integration.

Details

Motivation: Training omni-modal LLMs is challenging due to heterogeneous architectures and scalability issues in existing frameworks.

Method: VeOmni introduces model-centric distributed recipes, enabling 3D parallelism and flexible modality integration with minimal code changes.

Result: VeOmni achieves high throughput (2,800 tokens/sec/GPU) and scales to 160K context lengths on 128 GPUs for a 30B parameter MoE model.

Conclusion: VeOmni demonstrates superior efficiency and scalability for training large omni-modal LLMs.

Abstract: Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.

[84] LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking

Chünhung Wu, Jinliang Lu, Zixuan Ren, Gangqiang Hu, Zhi Wu, Dai Dai, Hua Wu

Main category: cs.CL

TL;DR: The paper investigates ‘Soft Thinking’ in LLMs, revealing their reliance on dominant soft inputs, limiting reasoning path exploration. Introducing randomness via sampling strategies improves performance.

Details

Motivation: To address the limitation of discrete token generation in LLMs by enabling abstract reasoning in continuous concept spaces.

Method: Probing LLMs’ internal behavior and exploring sampling strategies like Dirichlet resampling and the Gumbel-Softmax trick to introduce randomness.

Result: Randomness alleviates limitations, with the Gumbel-Softmax trick showing superior performance across reasoning benchmarks.

Conclusion: Incorporating randomness enhances Soft Thinking, unlocking its potential for diverse reasoning paths.

Abstract: Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the ‘Soft Thinking’ capabilities of various LLMs by examining the models’ internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce randomness, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.
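
As a rough illustration of the Gumbel-Softmax trick mentioned above, the sketch below perturbs next-token logits with Gumbel(0, 1) noise, applies a temperature-scaled softmax, and mixes token embeddings with the resulting weights to form a soft token. The function names, the toy embeddings, and the way the soft token is formed are assumptions made for illustration; the abstract does not specify the paper's exact procedure.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Add Gumbel(0, 1) noise to each logit, then apply the softmax.

    Lower temperatures push the weights toward a one-hot sample;
    higher temperatures give smoother mixtures.
    """
    noisy = []
    for x in logits:
        # Clamp away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))
    return softmax(noisy, temperature)


def soft_token(weights, embeddings):
    """Hypothetical soft token: the weighted average of token embeddings."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 1.0, 0.1]                          # toy next-token logits
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embedding table
    weights = gumbel_softmax(logits, temperature=0.5, rng=rng)
    print(weights, soft_token(weights, embeddings))
```

Because the noise is resampled at every step, repeated runs can follow different dominant tokens, which is the kind of randomness the abstract argues vanilla Soft Thinking lacks.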

[85] An Entity Linking Agent for Question Answering

Yajie Luo, Yihong Wu, Muzhi Li, Fengran Mo, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie

Main category: cs.CL

TL;DR: A QA-focused entity linking agent using a Large Language Model improves performance on short, ambiguous questions by simulating human workflows.

Details

Motivation: Existing EL methods underperform on short, ambiguous QA questions, necessitating a better approach.

Method: Proposes an agent using a Large Language Model to actively identify mentions, retrieve candidates, and decide.

Result: Experiments show robustness and effectiveness in tool-based EL and QA tasks.

Conclusion: The agent effectively addresses EL challenges in QA, outperforming traditional methods.

Abstract: Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

[86] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu

Main category: cs.CL

TL;DR: GM-PRM transforms PRMs into active collaborators for multimodal math reasoning, offering fine-grained error analysis and corrections, improving solution quality and diversity.

Details

Motivation: Existing multimodal PRMs are limited to binary verification, lacking corrective and explanatory power for complex math reasoning.

Method: Introduces GM-PRM, which evaluates steps for intent, visual alignment, and logic, and generates corrections for errors. Uses Refined-BoN to guide policy models.

Result: Achieves state-of-the-art results on multimodal math benchmarks, boosting performance with high data efficiency (20K samples).

Conclusion: GM-PRM enhances reasoning by actively correcting errors, improving solution quality and diversity efficiently.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.
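
The Refined-BoN idea described above can be sketched as a small selection loop. This is only a schematic under stated assumptions: `review`, `continue_from`, and `score` are hypothetical callables standing in for the PRM's error analysis, the policy model's continuation, and the final ranking, and keeping the original candidates in the pool alongside the refined ones is a guess rather than something the abstract confirms.

```python
def refined_best_of_n(candidates, review, continue_from, score):
    """Pick the best solution from a pool enlarged with corrected continuations.

    candidates    : list of solutions, each a list of reasoning steps
    review        : hypothetical PRM call; returns (index, corrected_step)
                    for the first erroneous step, or None if no error is found
    continue_from : hypothetical policy call; completes a solution from a prefix
    score         : hypothetical scalar quality score for a full solution
    """
    pool = []
    for steps in candidates:
        pool.append(steps)
        verdict = review(steps)
        if verdict is not None:
            index, corrected_step = verdict
            # Keep the steps before the error, swap in the correction,
            # and let the policy model finish from there.
            prefix = steps[:index] + [corrected_step]
            pool.append(continue_from(prefix))
    return max(pool, key=score)


if __name__ == "__main__":
    toy_candidates = [["a", "bad", "c"], ["a", "b"]]
    best = refined_best_of_n(
        toy_candidates,
        review=lambda s: (s.index("bad"), "good") if "bad" in s else None,
        continue_from=lambda prefix: prefix + ["done"],
        score=lambda s: sum(1 for step in s if step != "bad"),
    )
    print(best)
```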

[87] Balancing Stylization and Truth via Disentangled Representation Steering

Chenglei Shen, Zhongxiang Sun, Teng Shi, Xiao Zhang, Jun Xu

Main category: cs.CL

TL;DR: StyliTruth addresses the trade-off between style and truthfulness in LLM responses by separating style and truth subspaces, enabling independent control without degrading answer correctness.

Details

Motivation: Existing representation editing methods degrade truthfulness when imposing styles, termed 'stylization-induced truthfulness collapse.'

Method: StyliTruth uses orthogonal deflation to separate style and truth subspaces, with adaptive token-level steering vectors for precise control.

Result: Experiments show StyliTruth reduces truthfulness collapse and outperforms existing methods in balancing style and truthfulness.

Conclusion: StyliTruth effectively preserves both stylistic fidelity and truthfulness in LLM responses.

Abstract: Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model’s core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model’s representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
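
To make the idea of separating style and truth directions concrete, the sketch below removes from a style vector its component inside a truth subspace using plain orthogonal projection. This is a generic linear-algebra illustration, not StyliTruth's actual deflation procedure: the function names and the toy vectors are assumptions, and the abstract does not say how the subspaces are estimated.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def deflate(vector, subspace):
    """Remove from `vector` its projection onto the span of `subspace`."""
    out = list(vector)
    for b in orthonormalize(subspace):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


if __name__ == "__main__":
    truth_directions = [[2.0, 0.0, 0.0]]   # toy truth subspace
    style_direction = [1.0, 1.0, 0.0]      # toy style steering vector
    cleaned = deflate(style_direction, truth_directions)
    print(cleaned, dot(cleaned, truth_directions[0]))
```

Steering along the deflated vector leaves the truth component of a representation unchanged in this toy setting, which is the intuition behind controlling each attribute in its own subspace.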

[88] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen

Main category: cs.CL

TL;DR: RLVR improves LLMs’ instruction-following but faces inefficiency and over-optimization. IFDecorator enhances RLVR with a robust pipeline, achieving high accuracy and reducing reward hacking.

Details

Motivation: Address training inefficiency and over-optimization in RLVR for better instruction-following in LLMs.

Method: IFDecorator framework with cooperative-adversarial data flywheel, IntentCheck, and trip wires.

Result: 87.43% accuracy on IFEval, outperforming GPT-4o; reduced reward hacking.

Conclusion: IFDecorator effectively improves RLVR, offering a robust and efficient solution for instruction-following tasks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

cs.CV

[89] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration

Mohab Kishawy, Ali Abdellatif Hussein, Jun Chen

Main category: cs.CV

TL;DR: RetinexDual, a Retinex theory-based framework, outperforms traditional methods in Ultra-High-Definition Image Restoration (UHD IR) by combining Scale-Attentive maMBA (SAMBA) and Frequency Illumination Adaptor (FIA) for superior artifact reduction and detail restoration.

Details

Motivation: Traditional methods like downsampling or frequency-domain transformations fail in UHD IR due to irreversible information loss or ineffective handling of spatially confined artifacts. RetinexDual addresses these limitations.

Method: RetinexDual uses two sub-networks: SAMBA for reflectance correction via coarse-to-fine mechanisms, and FIA for frequency-domain correction of color and illumination distortions.

Result: RetinexDual excels in deraining, deblurring, dehazing, and Low-Light Image Enhancement, surpassing recent methods in quality and metrics.

Conclusion: RetinexDual’s dual-branch design and component effectiveness validate its superiority in UHD IR tasks.

Abstract: Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.

[90] ACM Multimedia Grand Challenge on ENT Endoscopy Analysis

Trong-Thuan Nguyen, Viet-Tham Huynh, Thao Thi Phuong Dao, Ha Nguyen Thi, Tien To Vu Thuy, Uyen Hanh Tran, Tam V. Nguyen, Thanh Dinh Le, Minh-Triet Tran

Main category: cs.CV

TL;DR: ENTRep introduces a bilingual (Vietnamese/English) dataset and benchmarks for ENT endoscopy analysis, addressing gaps in classification and retrieval tasks.

Details

Motivation: Existing benchmarks lack support for fine-grained ENT endoscopy analysis, including classification and retrieval of similar cases.

Method: The dataset includes expert-annotated images with anatomical labels and bilingual descriptions, alongside standardized benchmark tasks and evaluation protocols.

Result: Top-performing teams’ results are reported, with insights into performance and challenges.

Conclusion: ENTRep advances ENT care by providing a comprehensive framework for automated endoscopy analysis.

Abstract: Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. Moreover, we report results from the top-performing teams and provide an insightful discussion.

[91] CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework

Sriram Mandalika, Lalitha V

Main category: cs.CV

TL;DR: CoMAD is a lightweight, parameter-free framework unifying knowledge from multiple self-supervised Vision Transformers into a compact student network, achieving state-of-the-art performance.

Details

Motivation: Current self-supervised learning methods are pretrained in isolation, missing complementary insights and creating impractical large models for resource-constrained deployment.

Method: CoMAD distills knowledge from three ViT-Base teachers (MAE, MoCo v3, iBOT) using asymmetric masking, linear adapters, and joint consensus gating, training the student with dual-level KL divergence.

Result: CoMAD’s ViT-Tiny achieves 75.4% Top-1 on ImageNet-1K, 47.3% mIoU on ADE20K, and 44.5%/40.5% AP on MS-COCO, setting new benchmarks.

Conclusion: CoMAD effectively unifies diverse self-supervised insights into a compact model, outperforming prior methods in efficiency and performance.

Abstract: Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student’s space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD’s ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art. In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.

[92] Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models

Mehrdad Moradi, Marco Grasso, Bianca Maria Colosimo, Kamran Paynabar

Main category: cs.CV

TL;DR: RADAR introduces a reconstruction-free, attention-based diffusion model for real-time anomaly detection, outperforming existing methods in accuracy and efficiency.

Details

Motivation: Current reconstruction-based diffusion models for anomaly detection are computationally expensive, prone to errors in complex patterns, and require prior knowledge of anomalies, limiting their practicality.

Method: RADAR directly generates anomaly maps from diffusion models without reconstruction, using attention mechanisms for improved detection.

Result: RADAR achieves higher accuracy, precision, recall, and F1 scores, with a 7% improvement on MVTec-AD and 13% on 3D-printed material datasets.

Conclusion: RADAR offers a more efficient and accurate alternative to reconstruction-based anomaly detection, suitable for real-time applications.

Abstract: Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: https://github.com/mehrdadmoradi124/RADAR

[93] A deep learning approach to track eye movements based on events

Chirag Seth, Divya Naiken, Keyan Lin

Main category: cs.CV

TL;DR: The paper proposes a cost-effective deep learning method using event cameras to track eye movements, achieving 81% accuracy with a CNN_LSTM model, and suggests future improvements with LRP for better interpretability.

Details

Motivation: Accurate eye tracking is challenging due to rapid eye movements and high costs of traditional methods. The goal is to improve VR/AR device comfort and user experience with an affordable solution.

Method: Utilizes event cameras and deep learning, specifically a CNN_LSTM model, to predict eye center positions.

Result: Achieved approximately 81% accuracy in eye tracking.

Conclusion: The CNN_LSTM model is effective, and future work will focus on LRP to enhance interpretability and performance.

Abstract: This research project addresses the challenge of accurately tracking eye movements during specific events by leveraging previous research. Given the rapid movements of human eyes, which can reach speeds of 300°/s, precise eye tracking typically requires expensive and high-speed cameras. Our primary objective is to locate the eye center position (x, y) using inputs from an event camera. Eye movement analysis has extensive applications in consumer electronics, especially in VR and AR product development. Therefore, our ultimate goal is to develop an interpretable and cost-effective algorithm using deep learning methods to predict human attention, thereby improving device comfort and enhancing overall user experience. To achieve this goal, we explored various approaches, with the CNN_LSTM model proving most effective, achieving approximately 81% accuracy. Additionally, we propose future work focusing on Layer-wise Relevance Propagation (LRP) to further enhance the model’s interpretability and predictive performance.

[94] From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization

Farah Wahida, M. A. P. Chamikara, Yashothara Shanmugarasa, Mohan Baruwal Chhetri, Thilina Ranbaduge, Ibrahim Khalil

Main category: cs.CV

TL;DR: TrueBiometric is a novel method to detect and correct backdoor-poisoned images in biometric systems using majority voting and corrective noise, achieving 100% accuracy without affecting clean data.

Details

Motivation: Backdoor attacks in biometric systems compromise security by manipulating training data, and existing defenses struggle to balance detection and data utility.

Method: TrueBiometric uses majority voting with multiple vision-language models to detect poisoned images and applies targeted corrective noise to fix them.

Result: The method achieves 100% accuracy in detecting and correcting poisoned images while maintaining accuracy on clean data.

Conclusion: TrueBiometric provides a practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.

Abstract: Biometric systems, such as face recognition systems powered by deep neural networks (DNNs), rely on large and highly sensitive datasets. Backdoor attacks can subvert these systems by manipulating the training process. By inserting a small trigger, such as a sticker, make-up, or patterned mask, into a few training images, an adversary can later present the same trigger during authentication to be falsely recognized as another individual, thereby gaining unauthorized access. Existing defense mechanisms against backdoor attacks still face challenges in precisely identifying and mitigating poisoned images without compromising data utility, which undermines the overall reliability of the system. We propose a novel and generalizable approach, TrueBiometric: Trustworthy Biometrics, which accurately detects poisoned images using a majority voting mechanism leveraging multiple state-of-the-art large vision language models. Once identified, poisoned samples are corrected using targeted and calibrated corrective noise. Our extensive empirical results demonstrate that TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images. Compared to existing state-of-the-art approaches, TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.
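
The majority-voting step can be pictured with a few lines of code. In this sketch each vision-language model is assumed to return a boolean "poisoned" flag for an image; the function name and the strict-majority rule (ties count as clean) are illustrative choices, not details given in the abstract.

```python
def is_poisoned(votes):
    """Return True when a strict majority of detector votes flag the image.

    votes : non-empty list of booleans, one per (hypothetical) detector model
    """
    if not votes:
        raise ValueError("at least one vote is required")
    flagged = sum(1 for v in votes if v)
    return flagged > len(votes) / 2


if __name__ == "__main__":
    print(is_poisoned([True, True, False]))   # two of three detectors flag it
    print(is_poisoned([True, False]))         # a tie is treated as clean
```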

[95] LuKAN: A Kolmogorov-Arnold Network Framework for 3D Human Motion Prediction

Md Zahidul Hasan, A. Ben Hamza, Nizar Bouguila

Main category: cs.CV

TL;DR: LuKAN is a 3D human motion prediction model using Kolmogorov-Arnold Networks with Lucas polynomial activations, balancing accuracy and efficiency.

Details

Motivation: Existing methods struggle to balance prediction accuracy and computational efficiency in 3D human motion prediction.

Method: LuKAN uses discrete wavelet transform for temporal encoding, spatial projection for joint dependencies, and a KAN layer with Lucas polynomials for efficient function approximation.

Result: LuKAN outperforms baselines on three benchmark datasets, offering both accuracy and computational efficiency.

Conclusion: LuKAN provides a compact, efficient, and accurate solution for 3D human motion prediction.

Abstract: The goal of 3D human motion prediction is to forecast future 3D poses of the human body based on historical motion data. Existing methods often face limitations in achieving a balance between prediction accuracy and computational efficiency. In this paper, we present LuKAN, an effective model based on Kolmogorov-Arnold Networks (KANs) with Lucas polynomial activations. Our model first applies the discrete wavelet transform to encode temporal information in the input motion sequence. Then, a spatial projection layer is used to capture inter-joint dependencies, ensuring structural consistency of the human body. At the core of LuKAN is the Temporal Dependency Learner, which employs a KAN layer parameterized by Lucas polynomials for efficient function approximation. These polynomials provide computational efficiency and an enhanced capability to handle oscillatory behaviors. Finally, the inverse discrete wavelet transform reconstructs motion sequences in the time domain, generating temporally coherent predictions. Extensive experiments on three benchmark datasets demonstrate the competitive performance of our model compared to strong baselines, as evidenced by both quantitative and qualitative evaluations. Moreover, its compact architecture coupled with the linear recurrence of Lucas polynomials, ensures computational efficiency.
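
The linear recurrence of Lucas polynomials referred to above is the standard one: L_0(x) = 2, L_1(x) = x, and L_n(x) = x * L_{n-1}(x) + L_{n-2}(x). The sketch below evaluates the polynomials with that recurrence and combines them into a learnable univariate function of the kind a KAN edge might use. The `lucas_activation` helper and its coefficient layout are assumptions for illustration; the abstract does not describe how LuKAN scales inputs or arranges the layer.

```python
def lucas_polynomials(x, degree):
    """Evaluate L_0(x) .. L_degree(x) using the two-term linear recurrence."""
    values = [2.0, float(x)]
    for _ in range(2, degree + 1):
        values.append(x * values[-1] + values[-2])
    return values[: degree + 1]


def lucas_activation(x, coefficients):
    """Hypothetical KAN-style edge function: sum_n c_n * L_n(x)."""
    basis = lucas_polynomials(x, len(coefficients) - 1)
    return sum(c * b for c, b in zip(coefficients, basis))


if __name__ == "__main__":
    # At x = 1 the polynomials reduce to the Lucas numbers.
    print(lucas_polynomials(1.0, 5))
    print(lucas_activation(2.0, [1.0, 0.0, 1.0]))
```

Each additional degree costs one multiplication and one addition, which is the sense in which the recurrence keeps evaluation cheap.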

[96] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence

Chenhui Qiang, Zhaoyang Wei, Xumeng Han, Zipeng Wang, Siyao Li, Xiangyuan Lan, Jianbin Jiao, Zhenjun Han

Main category: cs.CV

TL;DR: VER-Bench is introduced to evaluate MLLMs’ ability to identify subtle visual clues and integrate them with world knowledge for complex reasoning, revealing current models’ limitations.

Details

Motivation: Existing benchmarks lack focus on subtle, inconspicuous local details crucial for deep visual understanding and complex reasoning.

Method: VER-Bench includes 374 questions across various reasoning types, each with structured evidence (visual clues and reasoning).

Result: Current models struggle with extracting subtle visual evidence and constructing evidence-based arguments.

Conclusion: Enhancing models’ capabilities in fine-grained visual evidence extraction and reasoning is essential for genuine visual understanding.

Abstract: With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., “what is in the image?”), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs’ ability to: 1) identify fine-grained visual clues, occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models’ limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models’ capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.

[97] A Fast Text-Driven Approach for Generating Artistic Content

Marian Lupascu, Ryan Murdock, Ionut Mironica, Yijun Li

Main category: cs.CV

TL;DR: A flexible framework for generating diverse visual art with improved detail, style, and speed, including an artistic super-resolution module.

Details

Motivation: Overcome limitations of previous stylization methods that lack flexibility in style parameters and domain restrictions.

Method: Proposes a complete framework with no style or domain restrictions, includes an improved version for varied results and faster generation, and adds an artistic super-resolution module for enhanced details.

Result: Generates diverse visual art with varying detail, style, and structure, and improved speed.

Conclusion: The framework offers flexibility, speed, and enhanced artistic details, advancing visual art generation.

Abstract: In this work, we propose a complete framework that generates visual art. Unlike previous stylization methods that are not flexible with style parameters (i.e., they allow stylization with only one style image, a single stylization text or stylization of a content image from a certain domain), our method has no such restriction. In addition, we implement an improved version that can generate a wide range of results with varying degrees of detail, style and structure, with a boost in generation speed. To further enhance the results, we insert an artistic super-resolution module in the generative pipeline. This module will bring additional details such as patterns specific to painters, slight brush marks, and so on.

Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Main category: cs.CV

TL;DR: DAMM introduces multi-modal queries and dual-stream attention to improve object detection accuracy and efficiency, outperforming benchmarks.

Details

Motivation: Transformer-based detectors struggle with occlusions, localization, and inefficiency due to fixed queries and dense attention.

Method: DAMM uses appearance, positional, and random queries, along with dual-stream cross-attention for semantic and spatial refinement.

Result: Achieved state-of-the-art performance in AP and recall on four benchmarks.

Conclusion: Multi-modal query adaptation and dual-stream attention effectively enhance detection accuracy and efficiency.

Abstract: Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is available at https://github.com/DET-LIP/DAMM.

[99] RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma

Main category: cs.CV

TL;DR: RoboTron-Drive is a general large multimodal model for autonomous driving, excelling in diverse tasks and datasets, achieving state-of-the-art performance.

Details

Motivation: Current AD models focus narrowly on single datasets/tasks, lacking generalization. RoboTron-Drive aims to bridge this gap.

Method: Curriculum pre-training for visual comprehension, followed by dataset augmentation and fine-tuning for diverse AD tasks.

Result: Achieves top performance on six benchmarks and zero-shot transfer on three unseen datasets.

Conclusion: RoboTron-Drive is a promising, versatile solution for real-world autonomous driving.

Abstract: Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD datasets to finetune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks. We hope RoboTron-Drive can serve as a promising solution for AD in the real world. Project page with code: https://github.com/zhijian11/RoboTron-Drive.

[100] Revealing Temporal Label Noise in Multimodal Hateful Video Classification

Shuonan Yang, Tailin Chen, Rahul Singh, Jiangbei Yue, Jianbo Jiao, Zeyu Fu

Main category: cs.CV

TL;DR: The paper investigates the impact of label ambiguity in hateful video detection by analyzing fine-grained, timestamp-annotated segments, revealing how coarse annotations introduce noise and affect model performance.

Details

Motivation: The spread of hate speech in online multimedia is a critical issue, but existing detection methods often rely on coarse, video-level annotations, leading to label noise and reduced accuracy.

Method: The study trims hateful videos from HateMM and MultiHateClip datasets using annotated timestamps to isolate hateful segments, then analyzes their distribution and characteristics. Controlled experiments assess the impact of label noise.

Result: The analysis shows semantic overlap and confusion from coarse annotations, and experiments reveal that label noise alters model decision boundaries and reduces classification confidence.

Conclusion: The findings emphasize the need for temporally aware models and benchmarks to improve robustness and interpretability in hate speech detection.

Abstract: The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrate that timestamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
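
As a rough illustration of the trimming step described in the abstract, the sketch below splits a video that carries a single video-level "hateful" label into segment-level labels using annotated timestamps. The function names, the (start, end) format in seconds, and the example values are assumptions made for illustration; they are not the authors' code or the datasets' annotation format.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, label)


def split_by_timestamps(duration: float,
                        hateful: List[Tuple[float, float]]) -> List[Segment]:
    """Partition [0, duration] into hateful and non-hateful segments."""
    segments: List[Segment] = []
    cursor = 0.0
    for start, end in sorted(hateful):
        # Clip each span to the video and to what has already been covered.
        start, end = max(start, cursor), min(end, duration)
        if end <= start:
            continue  # empty or fully overlapped span
        if start > cursor:
            segments.append((cursor, start, "non-hateful"))
        segments.append((start, end, "hateful"))
        cursor = end
    if cursor < duration:
        segments.append((cursor, duration, "non-hateful"))
    return segments


def hateful_fraction(segments: List[Segment]) -> float:
    """Share of the total running time that is actually hateful."""
    total = sum(end - start for start, end, _ in segments)
    hate = sum(end - start for start, end, label in segments
               if label == "hateful")
    return hate / total if total else 0.0


# Hypothetical 100-second video with two annotated hateful spans.
segments = split_by_timestamps(100.0, [(10.0, 20.0), (50.0, 60.0)])
fraction = hateful_fraction(segments)
```

In this hypothetical example only a fifth of the running time is hateful, yet a video-level annotation would label all of it as hateful, which is the kind of label noise the paper studies.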

[101] ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos

Xilei Zhu, Huiyu Duan, Liu Yang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet

Main category: cs.CV

TL;DR: The paper introduces a new database (ESVQAD) and model (ESVQAnet) for assessing the quality of egocentric spatial videos, outperforming existing VQA models.

Details

Motivation: The rapid growth of XR technologies necessitates better tools for evaluating the immersive quality of egocentric spatial videos, a currently understudied area.

Method: The authors create the ESVQAD database with 600 videos and propose ESVQAnet, a model combining binocular spatial, motion, and semantic features for quality prediction.

Result: ESVQAnet surpasses 16 state-of-the-art VQA models in embodied perceptual quality assessment and shows strong generalization on traditional VQA tasks.

Conclusion: The study advances QoE assessment for egocentric spatial videos, offering a robust database and model for future research.

Abstract: With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced immersion and engagement for users, delivering more captivating and interactive experiences. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience. However, the corresponding research is still lacking. In this paper, we use the concept of embodied experience to highlight this more immersive experience and study the new problem, i.e., embodied perceptual quality assessment for egocentric spatial videos. Specifically, we introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos captured using the Apple Vision Pro and their corresponding mean opinion scores (MOSs). Furthermore, we propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the overall perceptual quality. Experimental results demonstrate that ESVQAnet significantly outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task, and exhibits strong generalization capability on traditional VQA tasks. The database and code are available at https://github.com/iamazxl/ESVQA.

[102] Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations

Zahidul Islam, Sujoy Paul, Mrigank Rochan

Main category: cs.CV

TL;DR: Highlight-TTA is a test-time adaptation framework for video highlight detection that dynamically adjusts to each test video’s unique characteristics, improving performance.

Details

Motivation: Existing highlight detection methods lack adaptability to diverse video content, styles, and qualities, limiting generalization.

Method: Highlight-TTA uses a meta-auxiliary training scheme with cross-modality hallucinations to adapt models during testing.

Result: Experiments show Highlight-TTA enhances performance of three state-of-the-art models across three datasets.

Conclusion: Highlight-TTA effectively addresses generalization issues in video highlight detection, improving adaptability and performance.

Abstract: Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.

[103] CountingFruit: Language-Guided 3D Fruit Counting with Semantic Gaussian Splatting

Fengze Li, Yangle Liu, Jieming Ma, Hai-Ning Liang, Yaochun Shen, Huangxiang Li, Zhijing Wu

Main category: cs.CV

TL;DR: FruitLangGS is a language-guided 3D fruit counting framework using adaptive-density Gaussian Splatting for scalable orchard reconstruction, outperforming existing methods with up to 99.7% recall.

Details

Motivation: Challenges in 3D fruit counting include occlusion, semantic ambiguity, and computational cost, which existing methods fail to address efficiently.

Method: Uses adaptive-density Gaussian Splatting with radius-aware pruning, tile-based rasterization, and CLIP-aligned semantic vectors filtered via dual-threshold cosine similarity for robust counting.

Result: Achieves up to 99.7% recall on orchard datasets, outperforming existing pipelines and handling occlusion effectively.

Conclusion: FruitLangGS demonstrates the potential of language-guided 3D perception for scalable agricultural tasks beyond fruit counting.

Abstract: Accurate 3D fruit counting in orchards is challenging due to heavy occlusion, semantic ambiguity between fruits and surrounding structures, and the high computational cost of volumetric reconstruction. Existing pipelines often rely on multi-view 2D segmentation and dense volumetric sampling, which lead to accumulated fusion errors and slow inference. We introduce FruitLangGS, a language-guided 3D fruit counting framework that reconstructs orchard-scale scenes using an adaptive-density Gaussian Splatting pipeline with radius-aware pruning and tile-based rasterization, enabling scalable 3D representation. During inference, compressed CLIP-aligned semantic vectors embedded in each Gaussian are filtered via a dual-threshold cosine similarity mechanism, retrieving Gaussians relevant to target prompts while suppressing common distractors (e.g., foliage), without requiring retraining or image-space masks. The selected Gaussians are then sampled into dense point clouds and clustered geometrically to estimate fruit instances, remaining robust under severe occlusion and viewpoint variation. Experiments on nine different orchard-scale datasets demonstrate that FruitLangGS consistently outperforms existing pipelines in instance counting recall, avoiding multi-view segmentation fusion errors and achieving up to 99.7% recall on the Pfuji-Size_Orch2018 orchard dataset. Ablation studies further confirm that language-conditioned semantic embedding and dual-threshold prompt filtering are essential for suppressing distractors and improving counting accuracy under heavy occlusion. Beyond fruit counting, the same framework enables prompt-driven 3D semantic retrieval without retraining, highlighting the potential of language-guided 3D perception for scalable agricultural scene understanding.
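
To make the dual-threshold filtering idea concrete, here is a minimal NumPy sketch. It assumes one plausible reading of the abstract: a Gaussian is kept when its semantic vector is similar enough to the target prompt embedding and dissimilar enough from a distractor prompt embedding. The function names, the threshold values `tau_pos` and `tau_neg`, and the toy 2-D vectors are hypothetical; the paper's exact rule and its compressed CLIP features may differ.

```python
import numpy as np


def cosine_sim(features: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `features` and `query`.

    Assumes no vector has zero norm.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return f @ q


def dual_threshold_filter(features: np.ndarray,
                          target: np.ndarray,
                          distractor: np.ndarray,
                          tau_pos: float = 0.8,
                          tau_neg: float = 0.5) -> np.ndarray:
    """Indices of rows close to `target` and far from `distractor`."""
    close_to_target = cosine_sim(features, target) >= tau_pos
    far_from_distractor = cosine_sim(features, distractor) <= tau_neg
    return np.nonzero(close_to_target & far_from_distractor)[0]


# Toy per-Gaussian semantic vectors (2-D only for readability).
features = np.array([[1.0, 0.0],   # clearly "fruit"
                     [0.0, 1.0],   # clearly "foliage"
                     [1.0, 1.0],   # ambiguous
                     [0.9, 0.1]])  # mostly "fruit"
fruit_prompt = np.array([1.0, 0.0])
foliage_prompt = np.array([0.0, 1.0])

kept = dual_threshold_filter(features, fruit_prompt, foliage_prompt)
```

With these toy values the ambiguous vector is rejected by the positive threshold, so only the first and last rows survive. In the described pipeline the surviving Gaussians would then be sampled into points and clustered to count instances.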

[104] RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation

Mehrdad Moradi, Kamran Paynabar

Main category: cs.CV

TL;DR: Proposes robust denoising diffusion models for unsupervised anomaly segmentation using contaminated data, outperforming existing methods.

Details

Motivation: Traditional diffusion models require normal data for training, limiting real-world applicability. This work addresses scenarios with only contaminated (mixed normal/anomalous) unlabeled data.

Method: Reinterprets denoising diffusion probabilistic models through robust regression, enabling maximum likelihood estimation of data without clean training samples.

Result: Achieves up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets compared to state-of-the-art diffusion models.

Conclusion: The robust framework enhances flexibility and performance in unsupervised anomaly segmentation with contaminated data, offering practical advantages.

Abstract: Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state-of-the-art diffusion models for unsupervised anomaly segmentation when only contaminated data is available. Our method outperforms existing diffusion-based approaches, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets. The implementation code is available at: https://github.com/mehrdadmoradi124/RDDPM
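
The regression view can be illustrated with a small sketch. Standard DDPM training is commonly written as a squared-error regression of the predicted noise onto the true noise; a robust variant swaps the squared loss for one that grows more slowly on large residuals. The abstract does not say which robust loss the authors use, so the Huber loss below, the `delta` parameter, and all function names are assumptions chosen only to show the idea.

```python
import numpy as np


def squared_loss(residual: np.ndarray) -> np.ndarray:
    """Per-element squared-error loss (the usual regression choice)."""
    return 0.5 * residual ** 2


def huber_loss(residual: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Per-element Huber loss: quadratic near zero, linear in the tails."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * residual ** 2,
                    delta * (abs_r - 0.5 * delta))


def noise_prediction_loss(eps_true: np.ndarray,
                          eps_pred: np.ndarray,
                          robust: bool = True,
                          delta: float = 1.0) -> float:
    """Mean loss between true and predicted noise for one sample."""
    residual = eps_pred - eps_true
    per_element = huber_loss(residual, delta) if robust else squared_loss(residual)
    return float(per_element.mean())


# Toy residuals: three well-fit pixels and one outlier "anomalous" pixel.
eps_true = np.zeros(4)
eps_pred = np.array([0.0, 0.0, 0.0, 10.0])

plain = noise_prediction_loss(eps_true, eps_pred, robust=False)
robust = noise_prediction_loss(eps_true, eps_pred, robust=True)
```

In this toy case the single outlier contributes far less to the robust loss than to the squared loss, which is the mechanism by which anomalous pixels in contaminated training data would have less influence on the fitted model.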

[105] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong

Main category: cs.CV

TL;DR: A method to adapt foundational monocular depth estimators (FMDEs) to fisheye images using calibration tokens, avoiding retraining and improving depth estimation accuracy.

Details

Motivation: FMDEs trained on perspective images perform poorly on fisheye images due to covariate shift from different camera calibrations.

Method: Introduces calibration tokens to align latent embeddings of fisheye images to perspective images, leveraging self-supervised training with recalibrated perspective images.

Result: Consistently outperforms state-of-the-art methods on both indoor and outdoor datasets using a single set of tokens.

Conclusion: The method enables effective reuse of FMDEs for fisheye cameras without retraining, addressing covariate shift issues.

Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.

[106] Toward Errorless Training ImageNet-1k

Bo Deng, Levi Heath

Main category: cs.CV

TL;DR: A feedforward neural network trained on ImageNet 2012 achieves 98.3% accuracy, with 99.69 Top-1 rate, using 322M parameters. The paper suggests double-labeling as a reason for not reaching 100% accuracy.

Details

Motivation: To demonstrate high accuracy in image classification using a feedforward neural network on the ImageNet dataset.

Method: Trains a feedforward artificial neural network on ImageNet 2012 with a new training method; the best model has 322,430,160 parameters stored at 4-decimal-place precision.

Result: 98.3% accuracy, 99.69 Top-1 rate, and an average of 285.9 perfectly classified labels over the 10 batch partitions of the dataset.

Conclusion: The model performs exceptionally well but is limited by dataset issues like double-labeling.

Abstract: In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.

[107] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models

Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

Main category: cs.CV

TL;DR: ProMIM enhances VLM adaptation by integrating masked image modeling (MIM) into prompt learning, improving generalization without extra computational cost.

Details

Motivation: Existing prompt learning techniques overfit to known classes, limiting generalization to unseen categories.

Method: ProMIM uses a masking strategy to generate instance-conditioned prompts, augmenting methods like CoOp and CoCoOp.

Result: ProMIM improves feature robustness and generalization in zero-shot and few-shot tasks.

Conclusion: ProMIM offers a lightweight, effective solution for vision-language applications.

Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to known classes, limiting generalization to unseen categories. We introduce ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. ProMIM leverages a simple yet effective masking strategy to generate robust, instance-conditioned prompts, seamlessly augmenting methods like CoOp and CoCoOp without altering their core architectures. By masking only visible image patches and using these representations to guide prompt generation, ProMIM improves feature robustness and mitigates overfitting, all while introducing negligible additional computational cost. Extensive experiments across zero-shot and few-shot classification tasks demonstrate that ProMIM consistently boosts generalization performance when plugged into existing approaches, providing a practical, lightweight solution for real-world vision-language applications.

[108] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content

Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong, Jun Chen, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: The paper introduces AU-IQA, a benchmark dataset for assessing perceptual quality of AI-enhanced UGC, evaluates existing models, and analyzes their performance.

Details

Motivation: The lack of specialized quality assessment models for AI-enhanced UGC limits user experience and hinders advancement in enhancement methods.

Method: Constructed AU-IQA dataset with 4,800 AI-UGC images from three enhancement types (super-resolution, low-light, denoising) and evaluated traditional and multimodal IQA models.

Result: Evaluated existing models on AU-IQA, providing insights into their effectiveness for AI-UGC quality assessment.

Conclusion: The study highlights the need for improved quality assessment models tailored to AI-UGC and provides a benchmark for future research.

Abstract: AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant limiting factor in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The AU-IQA dataset is available at https://github.com/WNNGGU/AU-IQA-Dataset.

[109] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Zhu Xu, Ting Lei, Zhimin Li, Guan Wang, Qingchao Chen, Yuxin Peng, Yang Liu

Main category: cs.CV

TL;DR: TRKT improves WS-DSGG by enhancing object detection in dynamic scenes using relation-aware knowledge mining and dual-stream fusion, achieving state-of-the-art results.

Details

Motivation: Existing WS-DSGG methods rely on external object detectors, which perform poorly in dynamic, relation-aware scenarios, leading to inaccurate localization and low-confidence proposals.

Method: TRKT uses relation-aware knowledge mining (with attention maps and inter-frame augmentation) and a dual-stream fusion module to refine object detection.

Result: TRKT achieves state-of-the-art performance on the Action Genome dataset.

Conclusion: TRKT effectively addresses the limitations of external detectors in WS-DSGG, improving accuracy and robustness in dynamic scenarios.

Abstract: Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.

[110] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks

Ruiyu Li, Changyuan Qiu, Hangrui Cao, Qihan Ren, Yuqing Qiu

Main category: cs.CV

TL;DR: The paper explores automatic image colorization using classification and adversarial learning, addressing the ill-posed nature of the task by leveraging scene semantics and texture cues.

Details

Motivation: Image colorization is challenging due to its ill-posed nature, but scene semantics and texture provide useful color cues. The goal is to improve colorization by moving beyond regression to classification and adversarial learning.

Method: The approach involves building models based on prior works, applying modifications for specific scenarios, and comparing results. Classification and adversarial learning are key techniques.

Result: Not explicitly stated in the abstract, but the focus is on improving colorization accuracy by leveraging multi-modal color prediction.

Conclusion: The paper aims to advance image colorization by combining classification and adversarial learning, building on existing methods and addressing their limitations.

Abstract: Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization was initially formulated as a regression task [5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons.

[111] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics

Stella Su, Marc Harary, Scott J. Rodig, William Lotter

Main category: cs.CV

TL;DR: AdvDINO is a domain-adversarial SSL framework that enhances domain-invariant feature learning, tested on lung cancer mIF images, improving robustness and biological relevance.

Details

Motivation: Standard SSL methods lack robustness to domain shifts, critical in biomedical imaging where batch effects obscure true signals.

Method: Integrates a gradient reversal layer into DINOv2 for domain-invariant features, applied to lung cancer mIF images.

Result: Mitigates slide-specific biases, uncovers phenotype clusters with prognostic significance, and improves survival prediction.

Conclusion: AdvDINO is broadly applicable to domains with domain shifts and limited annotations, enhancing generalization and interpretability.

Abstract: Self-supervised learning (SSL) has emerged as a powerful approach for learning visual representations without manual annotations. However, the robustness of standard SSL methods to domain shift – systematic differences across data sources – remains uncertain, posing an especially critical challenge in biomedical imaging where batch effects can obscure true biological signals. We present AdvDINO, a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Applied to a real-world cohort of six-channel multiplex immunofluorescence (mIF) whole slide images from non-small cell lung cancer patients, AdvDINO mitigates slide-specific biases to learn more robust and biologically meaningful representations than non-adversarial baselines. Across more than 5.46 million mIF image tiles, the model uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning. While demonstrated on mIF data, AdvDINO is broadly applicable to other imaging domains – including radiology, remote sensing, and autonomous driving – where domain shift and limited annotated data hinder model generalization and interpretability.
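
For readers unfamiliar with gradient reversal, the sketch below shows the general mechanism in plain NumPy with a hand-written backward pass: the layer is the identity in the forward direction and multiplies the incoming gradient by a negative factor in the backward direction, so the encoder is pushed to make the domain (here, slide) classifier's job harder. The class name, the `lam` scaling factor, and the toy linear domain head are assumptions for illustration; this is not AdvDINO's implementation, and a real model would rely on an autodiff framework.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam  # hypothetical strength of the adversarial signal

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output


# Toy setup: encoder features z feed a linear domain head, score = w . z.
z = np.array([0.5, -1.0, 2.0])   # features produced by the encoder
w = np.array([1.0, 2.0, -1.0])   # weights of the domain head

grl = GradientReversal(lam=0.5)
score = float(w @ grl.forward(z))     # forward pass is unchanged by the layer

grad_wrt_head_input = w               # d(score)/d(input of the head)
grad_wrt_features = grl.backward(grad_wrt_head_input)  # what the encoder receives
```

The domain head still sees the true gradient and learns to predict the domain, while the encoder receives the reversed gradient and is nudged toward features from which the domain is harder to predict.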

[112] F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery

Lumin Chen, Zhiying Wu, Tianye Lei, Xuexue Bai, Ming Feng, Yuxi Wang, Gaofeng Meng, Zhen Lei, Hongbin Liu

Main category: cs.CV

TL;DR: The paper introduces a new dataset (PAS) and a method (F2PASeg) for pituitary anatomy segmentation in surgery, addressing data scarcity and feature inconsistency challenges.

Details

Motivation: Pituitary tumors deform adjacent structures, risking surgery. Accurate segmentation can enhance safety, but annotated datasets are rare.

Method: The PAS dataset (7,845 images from 120 videos) is created with data augmentation. F2PASeg uses a Feature Fusion module to combine high-resolution and deep semantic features for robust segmentation.

Result: F2PASeg achieves consistent real-time segmentation of critical structures, improving intraoperative planning.

Conclusion: The proposed dataset and method provide a reliable solution for pituitary surgery, addressing data and feature challenges effectively.

Abstract: Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. We propose F2PASeg, which incorporates a Feature Fusion module to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.

[113] Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework

Peng Zhang, Songru Yang, Jinsheng Sun, Weiqing Li, Zhiyong Su

Main category: cs.CV

TL;DR: HOW-Seg is a human-in-the-loop framework for open-world point cloud semantic segmentation, leveraging sparse annotations and iterative feedback to achieve high-quality results for base and novel classes.

Details

Motivation: Existing methods for open-world point cloud semantic segmentation rely on resource-intensive offline learning or dense annotations, limiting practicality. HOW-Seg addresses these issues by incorporating human feedback.

Method: HOW-Seg constructs class prototypes directly on query data to avoid bias, uses sparse human annotations for guidance, refines prototypes with a hierarchical disambiguation mechanism, and employs a dense CRF for label optimization.

Result: HOW-Seg matches or surpasses state-of-the-art methods with sparse annotations (e.g., one-click per novel class) and achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2 with denser annotations.

Conclusion: HOW-Seg demonstrates the effectiveness of human-in-the-loop frameworks for open-world segmentation, achieving superior performance with minimal human input.

Abstract: Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. However, existing methods rely on resource-intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes. Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-novel-class-one-click), HOW-Seg matches or surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.
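
To make the prototype-based labeling and the mIoU metric concrete, here is a minimal sketch under simplifying assumptions: each prototype is a feature vector that already carries a class label (for example, from a human click), and every point takes the label of its nearest prototype. The function names, the 2D toy features, and the Euclidean distance are illustrative choices, not details of HOW-Seg; the hierarchical disambiguation and CRF steps are omitted.

```python
import math


def nearest_prototype_labels(points, prototypes):
    """Give each point feature the label of its closest prototype (Euclidean)."""
    labels = []
    for p in points:
        label, _ = min(prototypes, key=lambda item: math.dist(p, item[1]))
        labels.append(label)
    return labels


def mean_iou(pred, truth):
    """Mean intersection-over-union across all classes seen in pred or truth."""
    ious = []
    for c in sorted(set(pred) | set(truth)):
        inter = sum(1 for a, b in zip(pred, truth) if a == c and b == c)
        union = sum(1 for a, b in zip(pred, truth) if a == c or b == c)
        ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical prototypes labeled by sparse clicks: (class label, feature vector).
prototypes = [("floor", (0.0, 0.0)), ("chair", (1.0, 1.0))]
points = [(0.1, 0.0), (0.9, 1.2), (0.4, 0.3), (0.8, 0.7)]
truth = ["floor", "chair", "chair", "chair"]

pred = nearest_prototype_labels(points, prototypes)
miou = mean_iou(pred, truth)
print(pred, round(miou, 4))
```

In this toy case one point is mislabeled; adding a corrective click (a new labeled prototype near that point) is the kind of iterative feedback the abstract describes.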

[114] Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification

Samuel Räber, Till Aczel, Andreas Plesner, Roger Wattenhofer

Main category: cs.CV

TL;DR: Lossy image compression can defend against adversarial attacks, and the realism of the reconstructions determines how hard the defense is to attack. High-realism, high-fidelity compression models resist strong white-box and adaptive attacks, while low-realism ones can be broken. The robustness appears to stem from realism, not gradient masking.

Details

Motivation: To evaluate the effectiveness of lossy compression in defending against adversarial attacks and understand the role of image realism in robustness.

Method: Constructed strong white-box and adaptive attacks against various compression models, analyzing attack resistance based on reconstruction realism.

Result: High-realism compression models resist attacks, while low-realism ones are vulnerable. Realism, not gradient masking, provides inherent robustness.

Conclusion: Realistic reconstructions pose a challenge for adversarial attacks, suggesting realism as a key factor for future security evaluations.

Abstract: Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.

[115] UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS

Zhihao Guo, Peng Wang, Zidong Chen, Xiangyu Kong, Yan Lyu, Guanyu Gao, Liangxiu Han

Main category: cs.CV

TL;DR: The paper introduces adaptive weighting for 3D Gaussian Splatting (3DGS) to improve rendering quality in sparse-view scenarios by using learned uncertainties and soft differentiable dropout regularization.

Details

Motivation: Current 3DGS methods treat Gaussians equally, leading to overfitting, especially in sparse-view settings. The paper aims to enhance rendering quality by adaptively weighting Gaussians.

Method: The method uses learned uncertainties to update Gaussian opacity and applies soft differentiable dropout regularization to transform uncertainties into drop probabilities for rendering.

Result: The approach outperforms existing methods in sparse-view 3D synthesis, achieving higher quality with fewer Gaussians (e.g., 3.27% PSNR improvement over DropGaussian on MipNeRF 360).

Conclusion: Adaptive weighting with learned uncertainties and dropout regularization significantly improves 3DGS performance in sparse-view scenarios.

Abstract: 3D Gaussian Splatting (3DGS) has become a competitive approach for novel view synthesis (NVS) due to its advanced rendering efficiency through 3D Gaussian projection and blending. However, most 3DGS methods weight all Gaussians equally during rendering, making them prone to overfitting, particularly in sparse-view scenarios. To address this, we investigate how adaptive weighting of Gaussians, characterised by the proposed learned uncertainties, affects rendering quality. This learned uncertainty serves two key purposes: first, it guides the differentiable update of Gaussian opacity while preserving the 3DGS pipeline integrity; second, the uncertainty undergoes soft differentiable dropout regularisation, which strategically transforms the original uncertainty into continuous drop probabilities that govern the final Gaussian projection and blending process for rendering. Extensive experimental results over widely adopted datasets demonstrate that our method outperforms rivals in sparse-view 3D synthesis, achieving higher quality reconstruction with fewer Gaussians in most datasets compared to existing sparse-view approaches; e.g., compared to DropGaussian, our method achieves a 3.27% PSNR improvement on the MipNeRF 360 dataset.
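
The idea of turning a per-Gaussian uncertainty into a continuous drop probability can be sketched as follows. This is only an illustration: the sigmoid mapping, the `temperature` parameter, and the choice to scale opacity by the keep probability are assumptions made here for clarity, not the formulation used in UGOD.

```python
import math


def drop_probability(uncertainty, temperature=1.0):
    """Map an uncertainty score to a drop probability in (0, 1) (assumed sigmoid)."""
    return 1.0 / (1.0 + math.exp(-uncertainty / temperature))


def soft_dropout_opacity(opacities, uncertainties, temperature=1.0):
    """Scale each opacity by its keep probability instead of hard-dropping it."""
    return [
        o * (1.0 - drop_probability(u, temperature))
        for o, u in zip(opacities, uncertainties)
    ]


opacities = [0.8, 0.8, 0.8]
uncertainties = [-2.0, 0.0, 2.0]  # toy values: low, medium, high uncertainty

weighted = soft_dropout_opacity(opacities, uncertainties)
print([round(w, 3) for w in weighted])
```

Because the scaling is smooth, gradients can flow through the drop probabilities, which is the practical difference from a hard Bernoulli dropout mask.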

[116] CSRAP: Enhanced Canvas Attention Scheduling for Real-Time Mission Critical Perception

Md Iftekharul Islam Sakib, Yigong Hu, Tarek Abdelzaher

Main category: cs.CV

TL;DR: Extends canvas-based attention scheduling with variable-size canvas frames and selectable frame rates, improving quality/cost trade-offs for real-time object detection on edge platforms.

Details

Motivation: Addresses the challenge of executing high-resolution object detection under latency constraints on limited computing resources.

Method: Proposes variable-size canvas frames and selectable frame rates, evaluated using YOLOv11 on NVIDIA Jetson Orin Nano with Waymo Open Dataset.

Result: Improves mean average precision (mAP) and recall, offering better quality/cost trade-offs.

Conclusion: The extended canvas-based scheduling enhances real-time perception performance on resource-limited edge platforms.

Abstract: Real-time perception on edge platforms faces a core challenge: executing high-resolution object detection under stringent latency constraints on limited computing resources. Canvas-based attention scheduling was proposed in earlier work as a mechanism to reduce the resource demands of perception subsystems. It consolidates areas of interest in an input data frame onto a smaller area, called a canvas frame, that can be processed at the requisite frame rate. This paper extends prior canvas-based attention scheduling literature by (i) allowing for variable-size canvas frames and (ii) employing selectable canvas frame rates that may depart from the original data frame rate. We evaluate our solution by running YOLOv11, as the perception module, on an NVIDIA Jetson Orin Nano to inspect video frames from the Waymo Open Dataset. Our results show that the additional degrees of freedom improve the attainable quality/cost trade-offs, thereby allowing for a consistently higher mean average precision (mAP) and recall with respect to the state of the art.
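
A simple way to picture consolidating areas of interest onto a canvas frame is a shelf-packing routine: regions are sorted by height and placed left to right, starting a new row when the canvas width is exceeded. The sketch below is a generic packing heuristic written for illustration; the function `pack_regions` and its behavior are assumptions, not the scheduling algorithm from the paper. The resulting canvas height depends on the regions, loosely mirroring the idea of variable-size canvas frames.

```python
def pack_regions(regions, canvas_width):
    """Shelf-pack (width, height) regions; return placements and canvas height."""
    order = sorted(range(len(regions)), key=lambda i: regions[i][1], reverse=True)
    placements = {}
    x = y = shelf_height = 0
    for i in order:
        w, h = regions[i]
        if w > canvas_width:
            raise ValueError("region wider than canvas")
        if x + w > canvas_width:
            # Current shelf is full: start a new shelf below it.
            y += shelf_height
            x = 0
            shelf_height = 0
        placements[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return placements, y + shelf_height


# Toy areas of interest cropped from a much larger input frame.
regions = [(40, 30), (60, 50), (50, 20)]
placements, canvas_height = pack_regions(regions, canvas_width=100)
print(placements, canvas_height)
```

Here three regions fit on a 100x70 canvas, which a detector could process far more cheaply than the full-resolution frame.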

[117] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression

Zheng Chen, Mingde Zhou, Jinpei Guo, Jiale Yuan, Yifei Ji, Yulun Zhang

Main category: cs.CV

TL;DR: SODEC is a single-step diffusion image compression model addressing latency and fidelity issues in diffusion-based methods by using a pre-trained VAE and a fidelity guidance module, achieving faster decoding and better performance.

Details

Motivation: Diffusion-based image compression suffers from high decoding latency and poor fidelity due to multi-step sampling and reliance on generative priors.

Method: Leverages a pre-trained VAE for rich latent information, replaces iterative denoising with single-step decoding, and introduces a fidelity guidance module and rate annealing training.

Result: SODEC outperforms existing methods in rate-distortion-perception performance and improves decoding speed by over 20x.

Conclusion: SODEC effectively addresses the drawbacks of diffusion-based compression, offering faster and higher-fidelity image compression.

Abstract: Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20$\times$. Code is released at: https://github.com/zhengchen1999/SODEC.

[118] Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion

Shenglun Chen, Xinzhu Ma, Hong Zhang, Haojie Li, Zhihui Wang

Main category: cs.CV

TL;DR: A novel depth completion framework leverages depth foundation models for robustness without large-scale training, outperforming state-of-the-art methods in out-of-distribution scenarios.

Details

Motivation: Existing depth completion models degrade in out-of-distribution scenarios due to reliance on limited data. Foundation models offer robustness potential.

Method: Uses a depth foundation model to extract environmental cues from RGB images, dual-space propagation (3D and 2D), and a learnable correction module.

Result: Outperforms state-of-the-art methods on 16 datasets, showing remarkable robustness in OOD scenarios.

Conclusion: The framework effectively enhances depth completion robustness without extensive training, validated by extensive evaluations.

Abstract: Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released at https://github.com/shenglunch/PSD.

[119] Unified modality separation: A vision-language framework for unsupervised domain adaptation

Xinyao Li, Jingjing Li, Zhekai Du, Lei Zhu, Heng Tao Shen

Main category: cs.CV

TL;DR: The paper proposes a unified modality separation framework for unsupervised domain adaptation (UDA) using vision-language models (VLMs) to address the modality gap, improving performance and efficiency.

Details

Motivation: Direct UDA with VLMs transfers only modality-invariant knowledge due to the modality gap, leading to suboptimal performance. The goal is to leverage both modality-specific and modality-invariant components for better adaptation.

Method: A unified framework disentangles modality components from VLM features, handles them separately, and uses modality-adaptive ensemble weights at test time. A modality discrepancy metric categorizes samples to guide alignment and annotation.

Result: The method achieves up to 9% performance gain and 9x computational efficiency, validated across various backbones, datasets, and settings.

Conclusion: The proposed framework effectively addresses the modality gap in UDA, enhancing performance and efficiency by leveraging both modality-specific and invariant components.

Abstract: Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, a phenomenon known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to a 9% performance gain with 9 times the computational efficiency. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.

[120] Modeling Rapid Contextual Learning in the Visual Cortex with Fast-Weight Deep Autoencoder Networks

Yue Li, Weifan Wang, Tai Sing Lee

Main category: cs.CV

TL;DR: The study explores how familiarity training in a Vision Transformer (ViT)-based autoencoder induces global context sensitivity in early layers, using Low-Rank Adaptation (LoRA) for fast weights. Results show manifold transformation, alignment of latent representations, and broader self-attention scope, suggesting a hybrid fast-and-slow weight model for brain-like rapid learning.

Details

Motivation: To understand how familiarity training can induce sensitivity to global context in early layers of deep neural networks, inspired by neurophysiological findings in the early visual cortex.

Method: A ViT-based autoencoder with LoRA for fast weights is used to simulate familiarity training, analyzing self-attention circuits and latent representations.

Result: Familiarity training aligns early-layer representations with global context, broadens self-attention scope, and is enhanced by LoRA-based fast weights.

Conclusion: Familiarity training introduces global sensitivity in hierarchical networks, and a hybrid fast-and-slow weight architecture may model rapid context learning in the brain.

Abstract: Recent neurophysiological studies have revealed that the early visual cortex can rapidly learn global image context, as evidenced by a sparsification of population responses and a reduction in mean activity when exposed to familiar versus novel image contexts. This phenomenon has been attributed primarily to local recurrent interactions, rather than changes in feedforward or feedback pathways, supported by both empirical findings and circuit-level modeling. Recurrent neural circuits capable of simulating these effects have been shown to reshape the geometry of neural manifolds, enhancing robustness and invariance to irrelevant variations. In this study, we employ a Vision Transformer (ViT)-based autoencoder to investigate, from a functional perspective, how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. We hypothesize that rapid learning operates via fast weights, which encode transient or short-term memory traces, and we explore the use of Low-Rank Adaptation (LoRA) to implement such fast weights within each Transformer layer. Our results show that (1) the proposed ViT-based autoencoder’s self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect; (2) familiarity training aligns latent representations in early layers with those in the top layer that contains global context information; (3) familiarity training broadens the self-attention scope within the remembered image context; and (4) these effects are significantly amplified by LoRA-based fast weights. Together, these findings suggest that familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain.
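
The fast-and-slow weight idea can be illustrated with the standard LoRA decomposition, in which a frozen weight matrix W is combined with a low-rank update B·A. In the sketch below the "slow" weights stay fixed and only the small factors would be trained; the function names, the `scale` parameter, and the toy matrices are illustrative and not taken from the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(w_slow, lora_b, lora_a, scale=1.0):
    """Return W_slow + scale * (B @ A), the weight actually used by the layer."""
    delta = matmul(lora_b, lora_a)  # low-rank "fast" update
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_slow, delta)
    ]


w_slow = [[1.0, 0.0], [0.0, 1.0]]  # frozen "slow" weights (toy 2x2 layer)
lora_b = [[1.0], [2.0]]            # 2x1 factor
lora_a = [[0.5, -1.0]]             # 1x2 factor, so the update has rank 1

w_eff = effective_weight(w_slow, lora_b, lora_a)
print(w_eff)
```

Resetting or re-fitting only the small factors is what makes this kind of update a plausible stand-in for transient, quickly learned memory traces.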

[121] Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification

Rui Zhi, Zhen Yang, Haiyang Zhang

Main category: cs.CV

TL;DR: AG-ReID improves occluded Re-ID by leveraging fine-grained attributes from pre-trained models without extra data, outperforming existing methods.

Details

Motivation: Pre-trained vision-language models struggle with occluded Re-ID due to neglect of fine-grained attributes, especially for partially visible pedestrians or subtle differences.

Method: AG-ReID uses a two-stage process: generating attribute pseudo-labels and a dual-guidance mechanism combining holistic and fine-grained features.

Result: AG-ReID achieves state-of-the-art performance on Re-ID datasets, excelling in occlusion handling and subtle attribute distinctions.

Conclusion: AG-ReID effectively addresses occluded Re-ID challenges by integrating fine-grained attributes, demonstrating superior performance without additional annotations.

Abstract: Person re-identification (Re-ID) aims to match person images across different camera views, with occluded Re-ID addressing scenarios where pedestrians are partially visible. While pre-trained vision-language models have shown effectiveness in Re-ID tasks, they face significant challenges in occluded scenarios by focusing on holistic image semantics while neglecting fine-grained attribute information. This limitation becomes particularly evident when dealing with partially occluded pedestrians or when distinguishing between individuals with subtle appearance differences. To address this limitation, we propose Attribute-Guide ReID (AG-ReID), a novel framework that leverages pre-trained models’ inherent capabilities to extract fine-grained semantic attributes without additional data or annotations. Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences while maintaining competitive performance on standard Re-ID scenarios.

[122] CRAM: Large-scale Video Continual Learning with Bootstrapped Compression

Shivani Mall, Joao F. Henriques

Main category: cs.CV

TL;DR: The paper introduces CRAM, a method for video continual learning using compressed vision to reduce memory usage while maintaining performance.

Details

Motivation: To address the high memory requirements and challenges of practical video continual learning, especially with long videos and continual streams.

Method: Uses compressed vision (video codes instead of raw inputs) and a rehearsal-based approach with a rolling buffer. Proposes refreshing video codes to combat catastrophic forgetting.

Result: CRAM outperforms prior methods with a significantly reduced memory footprint, storing thousands of long videos in under 2 GB.

Conclusion: CRAM effectively enables video continual learning with reduced memory, demonstrating scalability and performance on large-scale benchmarks.

Abstract: Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.
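
The rolling buffer and the code-refresh step can be sketched with toy stand-ins for the compressor. In the example below, `CodeBuffer` is a hypothetical class, and the "encoders" and "decoder" are simple scaling functions used only to show the data flow: stored codes are decompressed with the previous compressor and recompressed with the new one, so the buffer stays readable after the compressor changes.

```python
from collections import deque


class CodeBuffer:
    """Fixed-size rolling buffer of (code, label) pairs; oldest entries drop out."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, code, label):
        self.items.append((code, label))

    def refresh(self, old_decoder, new_encoder):
        # Decompress with the previous compressor, recompress with the new one.
        self.items = deque(
            ((new_encoder(old_decoder(code)), label) for code, label in self.items),
            maxlen=self.items.maxlen,
        )


# Toy compressors (assumptions for illustration only).
def old_encode(x):
    return [v * 2 for v in x]


def old_decode(c):
    return [v / 2 for v in c]


def new_encode(x):
    return [v * 4 for v in x]


buffer = CodeBuffer(capacity=2)
for video, label in [([1.0, 2.0], "cook"), ([3.0, 4.0], "wash"), ([5.0, 6.0], "chop")]:
    buffer.add(old_encode(video), label)

buffer.refresh(old_decode, new_encode)
stored = list(buffer.items)
print(stored)
```

With a lossy learned compressor the decode and re-encode round trip would not be exact, which is why the abstract stresses that the refresh has to be done carefully.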

[123] Learned Single-Pass Multitasking Perceptual Graphics for Immersive Displays

Doğa Yılmaz, He Wang, Towaki Takikawa, Duygu Ceylan, Kaan Akşit

Main category: cs.CV

TL;DR: A lightweight multitasking perceptual graphics model is proposed for efficient resource use, performing text-described tasks in one step, avoiding issues of daisy-chaining or dedicated models.

Details

Motivation: Running multiple perceptual graphics methods on devices with limited resources is challenging.

Method: A learned multitasking model processes RGB images and text prompts to perform tasks in a single inference, using a dataset of source and enhanced images with prompts.

Result: The model delivers high-quality perceptual effects with reasonable compute, validated on desktop and embedded platforms via user study.

Conclusion: The flexible, text-guided method efficiently supports dynamic requirements and diverse perceptual tasks.

Abstract: Emerging immersive display technologies efficiently utilize resources with perceptual graphics methods such as foveated rendering and denoising. Running multiple perceptual graphics methods challenges devices with limited power and computational resources. We propose a computationally lightweight learned multitasking perceptual graphics model. Given RGB images and text prompts, our model performs text-described perceptual tasks in a single inference step. Simply daisy-chaining multiple models or training dedicated models can lead to model management issues and exhaust computational resources. In contrast, our flexible method unlocks consistent high-quality perceptual effects with reasonable compute, supporting various permutations at varied intensities using adjectives in text prompts (e.g. mildly, lightly). Text guidance provides ease of use for dynamic requirements such as creative processes. To train our model, we propose a dataset containing source and perceptually enhanced images with corresponding text prompts. We evaluate our model on desktop and embedded platforms and validate perceptual quality through a user study.

[124] Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

Xusheng Liang, Lihua Zhou, Nianxin Li, Miao Xu, Ziyang Song, Dong Yi, Jinlin Wu, Hongbin Liu, Jiebo Luo, Zhen Lei

Main category: cs.CV

TL;DR: MCDRL integrates causal inference with Vision-Language Models (VLMs) like CLIP to improve domain generalization in medical image segmentation by addressing domain shifts caused by confounders.

Details

Motivation: Medical imaging faces challenges due to domain shifts from equipment differences and artifacts, leading to poor generalization of VLMs like CLIP.

Method: MCDRL uses CLIP to identify lesion regions and build a confounder dictionary, then trains a causal intervention network to eliminate domain-specific variations while preserving anatomical structure.

Result: MCDRL outperforms other methods, achieving better segmentation accuracy and robust generalization.

Conclusion: MCDRL effectively addresses domain generalization in medical imaging by combining causal inference with VLMs, enhancing segmentation performance.

Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP’s cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.

[125] Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes

Sadia Kamal, Tim Oates, Joy Wan

Main category: cs.CV

TL;DR: Skin-SOAP is a weakly supervised multimodal framework for generating structured SOAP notes from lesion images and sparse clinical text, reducing manual effort and clinician burnout while matching GPT-4o and other models in performance.

Details

Motivation: Skin carcinoma is highly prevalent and costly, requiring early diagnosis and treatment. Manual SOAP note generation is labor-intensive and contributes to clinician burnout.

Method: Proposes skin-SOAP, a weakly supervised multimodal framework using lesion images and sparse clinical text to generate SOAP notes without heavy manual annotations.

Result: Achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro. Introduces MedConceptEval and CCS metrics for clinical relevance.

Conclusion: Skin-SOAP offers scalable, clinically grounded documentation, reducing clinician burden and reliance on large annotated datasets.

Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. Early diagnosis and accurate, timely treatment are critical to improving patient survival rates. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose skin-SOAP, a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate this clinical relevance, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.

[126] BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park

Main category: cs.CV

TL;DR: BUFFER-X is a zero-shot point cloud registration pipeline that addresses generalization issues through adaptive voxel size and search radius selection, farthest point sampling, and patch-wise scale normalization, achieving substantial generalization without retraining or manual tuning.

Details

Motivation: Current deep learning-based point cloud registration methods lack generalization, requiring retraining or manual tuning for new environments due to reliance on environment-specific parameters and poor out-of-domain robustness.

Method: BUFFER-X adaptively determines voxel size/search radii, uses farthest point sampling, and employs patch-wise scale normalization with multi-scale patch-based descriptors and hierarchical inlier search.

Result: BUFFER-X achieves substantial generalization across 11 diverse datasets without prior information or manual tuning, demonstrating robustness in varied indoor/outdoor scenarios and sensor modalities.

Conclusion: BUFFER-X effectively addresses generalization limitations in point cloud registration, offering a robust, zero-shot solution adaptable to diverse environments.

Abstract: Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.
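Farthest point sampling, which the abstract uses in place of a learned keypoint detector, can be illustrated with a short sketch. This is a generic textbook version in plain Python, not code from the BUFFER-X repository; the function name, the `start` parameter, and the toy point cloud are all assumptions made for illustration.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return k indices into `points`, greedily choosing the point that is
    farthest from everything selected so far."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next sample is the point with the largest nearest-sample distance.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Hypothetical toy cloud: two nearly coincident points plus three spread-out ones.
cloud = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (5.0, 0.0, 0.0),
    (0.0, 3.0, 0.0),
    (5.0, 4.0, 0.0),
]
picked = farthest_point_sampling(cloud, 3)
print(picked)
```

Because the selection depends only on geometry, it needs no training data, which is consistent with the zero-shot motivation stated in the abstract. The near-duplicate point at index 1 is never chosen in this toy run.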

[127] A Novel Image Similarity Metric for Scene Composition Structure

Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee

Main category: cs.CV

TL;DR: The paper introduces SCSSIM, a novel metric for evaluating image quality in generative AI by focusing on Scene Composition Structure (SCS) integrity, outperforming traditional and neural-network-based methods.

Details

Motivation: Existing image quality metrics fail to adequately assess the structural fidelity of generative AI outputs, particularly in preserving SCS, which is crucial for accurate scene composition.

Method: The authors propose SCSSIM, a training-free metric using statistical measures from Cuboidal hierarchical partitioning to quantify SCS preservation.

Result: SCSSIM shows high invariance to non-compositional distortions and a strong monotonic decrease for compositional distortions, accurately reflecting SCS changes.

Conclusion: SCSSIM is a superior tool for evaluating generative models, ensuring scene composition integrity where other metrics fall short.

Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image’s underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM’s high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.

[128] Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline

Joanne Lin, Crispian Morris, Ruirui Lin, Fan Zhang, David Bull, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: A new Degradation Estimation Network (DEN) is proposed to generate realistic sRGB noise for low-light images/videos without camera metadata, improving tasks like noise replication, video enhancement, and object detection.

Details

Motivation: Low-light conditions hinder annotation and research, with existing methods relying on unrealistic noise models or synthetic data.

Method: DEN estimates physics-informed noise distributions in a self-supervised, zero-shot manner to generate diverse, realistic noise.

Result: Improvements of up to 24% KLD, 21% LPIPS, and 62% AP$_{50-95}$ in noise replication, video enhancement, and object detection.

Conclusion: DEN effectively addresses the lack of realistic low-light data, enhancing performance in various low-light tasks.

Abstract: Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding for low-light images and (in particular) videos. A common approach is to apply annotations obtained from high quality datasets to synthetically created low light versions. In addition, these approaches are often limited through the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24% KLD, 21% LPIPS, and 62% AP$_{50-95}$, respectively.
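The abstract does not spell out which physics-informed noise distributions DEN estimates. As a rough illustration of the general idea only, the sketch below darkens pixel values and adds signal-dependent Gaussian noise whose variance grows linearly with intensity (a common approximation of shot noise plus read noise). The model, the parameter names `gain`, `shot`, and `read`, and their default values are assumptions, not the paper's formulation.

```python
import random


def add_low_light_noise(pixels, gain=0.2, shot=0.01, read=0.0005, seed=0):
    """Darken normalized pixel values and add heteroscedastic Gaussian noise.

    Assumed model (not from the paper): variance = shot * intensity + read.
    """
    rng = random.Random(seed)
    noisy = []
    for p in pixels:
        dark = p * gain                       # simulate reduced exposure
        sigma = (shot * dark + read) ** 0.5   # brighter pixels get more noise
        value = dark + rng.gauss(0.0, sigma)
        noisy.append(min(1.0, max(0.0, value)))  # clamp to the valid range
    return noisy


row = [0.0, 0.5, 1.0]
print(add_low_light_noise(row))
# With both noise terms set to zero, only the darkening remains.
print(add_low_light_noise(row, shot=0.0, read=0.0))
```

In a pipeline like the one described, the noise parameters would be predicted per input rather than fixed by hand as they are here.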

[129] HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID

Yiyang Su, Yunping Shi, Feng Liu, Xiaoming Liu

Main category: cs.CV

TL;DR: A novel framework (HAMoBE) for video-based person re-identification (ReID) improves performance by adaptively integrating multi-layer features and mimicking human perceptual mechanisms.

Details

Motivation: Existing video-based ReID methods often fail to identify and select the most discriminative features for effective matching, limiting robustness in dynamic environments.

Method: HAMoBE uses a pre-trained large model (e.g., CLIP) to extract multi-layer features, with specialized experts modeling appearance, body shape, and gait. A dual-input decision gating network dynamically adjusts expert contributions.

Result: HAMoBE achieves significant improvements, such as +13.0% Rank-1 accuracy on benchmarks like MEVID.

Conclusion: The proposed HAMoBE framework effectively addresses feature selection and integration challenges in video-based ReID, enhancing performance and robustness.

Abstract: Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, which are essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this issue, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-layer features from a pre-trained large model (e.g., CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features–appearance, static body shape, and dynamic gait–and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-layer representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (e.g., +13.0% Rank-1 accuracy).
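The gating idea can be sketched as a softmax-weighted combination of per-expert match scores. This is a minimal illustration under stated assumptions: the abstract does not describe the gating network's inputs or architecture, so the gate logits here are supplied by hand, and the function names and example scores are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_scores(expert_scores, gate_logits):
    """Combine per-expert similarity scores using softmax gate weights."""
    if len(expert_scores) != len(gate_logits):
        raise ValueError("one gate logit is required per expert")
    weights = softmax(gate_logits)
    fused = sum(w * s for w, s in zip(weights, expert_scores))
    return fused, weights


# Made-up similarity scores from appearance, body shape, and gait experts.
scores = [0.9, 0.4, 0.6]
fused_equal, weights_equal = fuse_expert_scores(scores, [0.0, 0.0, 0.0])
fused_appearance, _ = fuse_expert_scores(scores, [10.0, 0.0, 0.0])
print(fused_equal, fused_appearance)
```

With equal logits the fused score is the plain average of the experts; a large logit for one expert makes the fused score approach that expert's score, which is the behavior a relevance-based gate is meant to provide.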

[130] Finding Needles in Images: Can Multimodal LLMs Locate Fine Details?

Parth Thakkar, Ankush Agarwal, Prasad Kasu, Pulkit Bansal, Chaitanya Devaguptapu

Main category: cs.CV

TL;DR: The paper introduces NiM, a benchmark for evaluating Multi-modal Large Language Models (MLLMs) on fine-grained document understanding tasks, and proposes Spot-IT, a method to enhance MLLMs’ performance in such tasks.

Details

Motivation: Current MLLMs lack thorough evaluation and capability in locating and reasoning about fine-grained details in complex documents, like finding specific nutritional details in menus or disclaimers in articles.

Method: The authors introduce the NiM benchmark and propose Spot-IT, which uses intelligent patch selection and Gaussian attention to mimic human zoom-and-focus behavior.

Result: Spot-IT outperforms baseline methods, especially in tasks requiring precise detail extraction from complex layouts.

Conclusion: The study highlights MLLMs’ limitations in fine-grained document understanding and demonstrates the effectiveness of Spot-IT in addressing these challenges.

Abstract: While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: both are tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs’ capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs’ capability through intelligent patch selection and Gaussian attention, motivated by how humans zoom and focus when searching documents. Our extensive experiments reveal both the capabilities and limitations of current MLLMs in handling fine-grained document understanding tasks, while demonstrating the effectiveness of our approach. Spot-IT achieves significant improvements over baseline methods, particularly in scenarios requiring precise detail extraction from complex layouts.
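One way to picture Gaussian attention around a selected patch is a normalized 2D Gaussian over the patch grid, centered on the chosen patch. The sketch below is only an illustration of that idea; how Spot-IT selects the patch and applies the weights is not specified in the abstract, so the function name, the grid layout, and the `sigma` parameter are assumptions.

```python
import math


def gaussian_patch_weights(rows, cols, center, sigma=1.0):
    """Return a rows x cols grid of weights that peak at `center` and sum to 1."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    cr, cc = center
    raw = [
        [
            math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
            for c in range(cols)
        ]
        for r in range(rows)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


# Hypothetical 3x3 patch grid with the middle patch selected.
weights = gaussian_patch_weights(3, 3, center=(1, 1), sigma=1.0)
for row in weights:
    print([round(v, 3) for v in row])
```

A smaller `sigma` concentrates nearly all weight on the selected patch (a tight zoom), while a larger one keeps more of the surrounding context.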

[131] DualMat: PBR Material Estimation via Coherent Dual-Path Diffusion

Yifeng Huang, Zhang Chen, Yi Xu, Minh Hoai, Zhong Li

Main category: cs.CV

TL;DR: DualMat introduces a dual-path diffusion framework for PBR material estimation from single images under complex lighting, achieving state-of-the-art results.

Details

Motivation: To accurately estimate PBR materials from single images under complex lighting, addressing limitations in existing methods.

Method: Uses two latent spaces (albedo-optimized and material-specialized) with feature distillation and rectified flow for efficiency. Extends to high-resolution and multi-view inputs.

Result: Achieves 28% better albedo estimation and 39% reduction in metallic-roughness errors, outperforming existing methods.

Conclusion: DualMat is a robust framework for PBR material estimation, enhancing image-to-3D pipelines.

Abstract: We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors.

[132] Decoupling Continual Semantic Segmentation

Yifu Guo, Yuquan Lu, Wentao Zhang, Zishan Xu, Dexia Chen, Siyu Zhang, Yizhe Zhang, Ruixuan Wang

Main category: cs.CV

TL;DR: DecoupleCSS introduces a two-stage framework for Continual Semantic Segmentation (CSS), decoupling class-aware detection from class-agnostic segmentation to improve retention-plasticity balance.

Details

Motivation: Address catastrophic forgetting in CSS by overcoming interference between old and new class learning in single-stage architectures.

Method: Uses a two-stage approach: (1) class-aware detection with pre-trained encoders and LoRA adaptation, (2) class-agnostic segmentation with SAM.

Result: Achieves state-of-the-art performance by balancing retention and adaptability.

Conclusion: DecoupleCSS effectively preserves past knowledge while learning new classes, outperforming existing methods.

Abstract: Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling-Continual-Semantic-Segmentation.
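For readers unfamiliar with LoRA, which the first stage uses to adapt the pre-trained encoders, the core computation is a frozen weight matrix plus a scaled low-rank update. The sketch below shows that generic forward pass in plain Python; it is not taken from the DecoupleCSS code, and the matrices and input are toy values chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank.

    W stays frozen; only the small matrices A (r x d_in) and B (d_out x r)
    would be trained.
    """
    rank = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy example: identity base weights with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # shape 1 x 2
B = [[1.0], [0.0]]      # shape 2 x 1
x = [2.0, 3.0]
adapted = lora_forward(W, A, B, x)
untouched = lora_forward(W, A, [[0.0], [0.0]], x)
print(adapted, untouched)
```

When B is all zeros the adapter contributes nothing and the output equals the frozen model's output, which is why LoRA adapters are commonly initialized that way.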

[133] FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer

Jian Zhu, Shanyuan Liu, Liuzhuozheng Li, Yue Gong, He Wang, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin, Yang Xu

Main category: cs.CV

TL;DR: FLUX-Makeup is a novel makeup transfer framework that eliminates auxiliary face-control components, leveraging source-reference pairs for high-fidelity, identity-consistent results.

Details

Motivation: Existing GAN-based and diffusion-based methods rely on auxiliary components, introducing errors and suboptimal results. FLUX-Makeup aims to overcome these limitations.

Method: FLUX-Makeup uses FLUX-Kontext with the source image as conditional input and introduces RefLoRAInjector for efficient makeup feature extraction. A robust data generation pipeline enhances training supervision.

Result: FLUX-Makeup achieves state-of-the-art performance with strong robustness across diverse scenarios, and the paired makeup datasets produced by its data generation pipeline surpass existing datasets in quality.

Conclusion: FLUX-Makeup provides a superior, scalable solution for makeup transfer without auxiliary components, setting a new benchmark in the field.

Abstract: Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.

[134] AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models

Yuxiang Xiao, Yang Hu, Bin Li, Tianyang Zhang, Zexi Li, Huazhu Fu, Jens Rittscher, Kaixiang Yang

Main category: cs.CV

TL;DR: AdaFusion is a prompt-guided framework that dynamically integrates multiple pathology foundation models (PFMs) to improve performance and interpretability in downstream tasks.

Details

Motivation: PFMs have latent biases from diverse pretraining contexts, limiting generalisability and transparency. AdaFusion addresses this by combining knowledge from multiple PFMs.

Method: AdaFusion compresses and aligns tile-level features from diverse PFMs, using a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context.

Result: AdaFusion outperforms individual PFMs in tasks like treatment response prediction, tumour grading, and spatial gene expression inference, while providing interpretable insights.

Conclusion: AdaFusion successfully bridges heterogeneous PFMs, enhancing both performance and interpretability of model-specific biases.

Abstract: Pathology foundation models (PFMs) have demonstrated strong representational capabilities through self-supervised pre-training on large-scale, unannotated histopathology image datasets. However, their diverse yet opaque pretraining contexts, shaped by both data-related and structural/training factors, introduce latent biases that hinder generalisability and transparency in downstream applications. In this paper, we propose AdaFusion, a novel prompt-guided inference framework that, to our knowledge, is among the very first to dynamically integrate complementary knowledge from multiple PFMs. Our method compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. We evaluate AdaFusion on three real-world benchmarks spanning treatment response prediction, tumour grading, and spatial gene expression inference. Our approach consistently surpasses individual PFMs across both classification and regression tasks, while offering interpretable insights into each model’s biosemantic specialisation. These results highlight AdaFusion’s ability to bridge heterogeneous PFMs, achieving both enhanced performance and interpretability of model-specific inductive biases.

[135] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

Jingxuan He, Busheng Su, Finn Wong

Main category: cs.CV

TL;DR: PoseGen is a framework for generating long, identity-preserving videos from a single image and pose sequence, using in-context LoRA finetuning and interleaved segment generation.

Details

Motivation: Current diffusion models struggle with identity drift and short video lengths, limiting their practical use for long, controlled video generation.

Method: PoseGen uses in-context LoRA finetuning for identity preservation and pose conditioning, along with interleaved segment generation for unlimited video length.

Result: PoseGen outperforms state-of-the-art methods in identity fidelity and pose accuracy, and produces artifact-free videos of unlimited duration.

Conclusion: PoseGen addresses key challenges in video generation, offering a scalable solution for long, coherent videos with precise control.

Abstract: Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Although PoseGen is trained on a remarkably small 33-hour video dataset, extensive experiments show that it significantly outperforms state-of-the-art methods in identity fidelity and pose accuracy, and that it has a unique ability to produce coherent, artifact-free videos of unlimited duration.

[136] Sculpting Margin Penalty: Intra-Task Adapter Merging and Classifier Calibration for Few-Shot Class-Incremental Learning

Liang Bai, Hong Song, Jinfu Li, Yucong Lin, Jingfan Fan, Tianyu Fu, Danni Ai, Deqiang Xiao, Jian Yang

Main category: cs.CV

TL;DR: SMP (Sculpting Margin Penalty) is a novel FSCIL method that integrates margin penalties to balance base-class discriminability and new-class generalization, achieving state-of-the-art performance.

Details

Motivation: Addressing performance degradation in class-incremental learning due to data privacy constraints and high acquisition costs, and improving forward compatibility.

Method: Proposes SMP with Margin-aware Intra-task Adapter Merging (MIAM) for base task learning and Margin Penalty-based Classifier Calibration (MPCC) for incremental tasks.

Result: Achieves state-of-the-art performance on CIFAR100, ImageNet-R, and CUB200, balancing base and new classes.

Conclusion: SMP effectively addresses challenges in FSCIL by strategically integrating margin penalties, improving forward compatibility and decision boundaries.

Abstract: Real-world applications often face data privacy constraints and high acquisition costs, making the assumption of sufficient training data in incremental tasks unrealistic and leading to significant performance degradation in class-incremental learning. Forward-compatible learning, which prospectively prepares for future tasks during base task training, has emerged as a promising solution for Few-Shot Class-Incremental Learning (FSCIL). However, existing methods still struggle to balance base-class discriminability and new-class generalization. Moreover, limited access to original data during incremental tasks often results in ambiguous inter-class decision boundaries. To address these challenges, we propose SMP (Sculpting Margin Penalty), a novel FSCIL method that strategically integrates margin penalties at different stages within the parameter-efficient fine-tuning paradigm. Specifically, we introduce the Margin-aware Intra-task Adapter Merging (MIAM) mechanism for base task learning. MIAM trains two sets of low-rank adapters with distinct classification losses: one with a margin penalty to enhance base-class discriminability, and the other without margin constraints to promote generalization to future new classes. These adapters are then adaptively merged to improve forward compatibility. For incremental tasks, we propose a Margin Penalty-based Classifier Calibration (MPCC) strategy to refine decision boundaries by fine-tuning classifiers on all seen classes’ embeddings with a margin penalty. Extensive experiments on CIFAR100, ImageNet-R, and CUB200 demonstrate that SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes.
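The abstract does not give the exact form of the margin penalty. A common choice in margin-based classification, used here purely as an assumed stand-in, is to subtract a margin from the target-class logit before the softmax cross-entropy, which forces the model to separate the true class by a larger gap. The function name and the example logits are hypothetical.

```python
import math


def margin_cross_entropy(logits, target, margin=0.0, scale=1.0):
    """Cross-entropy with an additive margin subtracted from the target logit.

    Assumed formulation for illustration; margin=0 gives ordinary cross-entropy.
    """
    adjusted = [
        scale * (z - margin if i == target else z)
        for i, z in enumerate(logits)
    ]
    m = max(adjusted)
    log_sum_exp = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_sum_exp - adjusted[target]


logits = [0.8, 0.3, 0.1]   # made-up similarities to three class prototypes
plain_loss = margin_cross_entropy(logits, target=0)
margin_loss = margin_cross_entropy(logits, target=0, margin=0.3)
print(plain_loss, margin_loss)
```

The margin variant always reports a higher loss for the same logits, so training with it pushes base-class embeddings further apart. Under this reading, MIAM's two adapter sets correspond to training with a positive margin and with `margin=0.0`.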

[137] AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification

Jiuyang Dong, Jiahan Li, Junjun Jiang, Kui Jiang, Yongbing Zhang

Main category: cs.CV

TL;DR: AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework, reduces inference costs in pathological image classification by filtering irrelevant patches via a two-step training process, achieving faster and more accurate results.

Details

Motivation: High inference costs in multi-instance learning (MIL) for pathological image classification due to processing thousands of patches per gigapixel whole slide image (WSI).

Method: Proposes AHDMIL with Dynamic Multi-Instance Network (DMIN) and Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN). Uses self-distillation and asymmetric distillation to identify and filter irrelevant patches. Introduces a Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier.

Result: Outperforms state-of-the-art methods in accuracy (5.3% relative improvement on Camelyon16) and speed (average inference speedups of 1.2 to 2.1 times across datasets). Consistent gains in AUC, accuracy, F1 score, and Brier score.

Conclusion: AHDMIL effectively balances accuracy and efficiency in pathological image classification, demonstrating significant improvements over existing methods.

Abstract: Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to the need to process thousands of patches from each gigapixel whole slide image (WSI). To address this, we propose AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework that enables fast and accurate classification by eliminating irrelevant patches through a two-step training process. AHDMIL comprises two key components: the Dynamic Multi-Instance Network (DMIN), which operates on high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which analyzes corresponding low-resolution counterparts. In the first step, self-distillation (SD), DMIN is trained for WSI classification while generating per-instance attention scores to identify irrelevant patches. These scores guide the second step, asymmetric distillation (AD), where DB-LIPN learns to predict the relevance of each low-resolution patch. The relevant patches predicted by DB-LIPN have spatial correspondence with patches in high-resolution WSIs, which are used for fine-tuning and efficient inference of DMIN. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier in computational pathology, which improves classification performance through learnable activation layers. Extensive experiments on four public datasets demonstrate that AHDMIL consistently outperforms previous state-of-the-art methods in both classification performance and inference speed. For example, on the Camelyon16 dataset, it achieves a relative improvement of 5.3% in accuracy and accelerates inference by 1.2 times. Across all datasets, area under the curve (AUC), accuracy, F1 score, and Brier score show consistent gains, with average inference speedups ranging from 1.2 to 2.1 times. The code is available.
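A learnable activation built on Chebyshev polynomials can be sketched as a weighted sum of the basis functions T_0, T_1, ..., T_n evaluated at a squashed input. The recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x) is standard; everything else here is an assumption, including the use of tanh to map inputs into [-1, 1], the function names, and the example coefficients. The paper's CKA classifier may differ.

```python
import math


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using the three-term recurrence."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2 * x * values[-1] - values[-2])
    return values[: degree + 1]


def chebyshev_activation(x, coeffs):
    """Learnable activation: sum_k coeffs[k] * T_k(tanh(x)).

    tanh keeps the argument inside [-1, 1], where the polynomials are bounded.
    In a trained model, `coeffs` would be learned parameters.
    """
    if not coeffs:
        raise ValueError("at least one coefficient is required")
    t = math.tanh(x)
    basis = chebyshev_basis(t, len(coeffs) - 1)
    return sum(c * b for c, b in zip(coeffs, basis))


print(chebyshev_basis(0.5, 3))
print(chebyshev_activation(0.0, [0.2, 1.0, 0.5]))
```

Each edge of a Kolmogorov-Arnold layer would carry its own coefficient vector, so the shape of the nonlinearity is fitted to the data instead of being fixed in advance.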

[138] Latent Expression Generation for Referring Image Segmentation and Grounding

Seonghoon Yu, Joonbeom Hong, Joonseok Lee, Jeany Son

Main category: cs.CV

TL;DR: The paper proposes a visual grounding framework using multiple latent expressions from a single textual input to improve object localization by incorporating complementary visual details.

Details

Motivation: Existing methods rely on a single sparse textual input that captures only a fraction of the rich detail available in the image, and this mismatch can lead to misidentification of similar objects.

Method: Introduces subject distributor and visual concept injector modules, along with positive-margin contrastive learning, to embed shared-subject and distinct-attributes concepts.

Result: Outperforms state-of-the-art RIS and REC methods and excels in the GRES benchmark.

Conclusion: The framework effectively addresses the mismatch between visual and textual data, enhancing localization accuracy.

Abstract: Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.

Sachin Dudda Nagaraju, Ashkan Moradi, Bendik Skarre Abrahamsen, Mattijs Elschot

Main category: cs.CV

TL;DR: FedGIN is a Federated Learning framework for multimodal organ segmentation, addressing data scarcity and privacy issues with a GIN augmentation module, showing significant performance improvements.

Details

Motivation: To develop a unified model for accurate medical image segmentation across diverse modalities without sharing raw patient data, overcoming challenges like domain shift and data scarcity.

Method: Proposes FedGIN, integrating a GIN augmentation module to harmonize modality-specific intensity distributions, evaluated on imputed and complete datasets with MRI and CT data.

Result: FedGIN improved 3D Dice scores by 12-18% in limited-data scenarios and achieved near-centralized performance in complete datasets, with 30% and 10% improvements over MRI-only and CT-only baselines.

Conclusion: FedGIN effectively generalizes across modalities under privacy constraints, enhancing segmentation accuracy and clinical workflow efficiency.

Abstract: Medical image segmentation plays a crucial role in AI-assisted diagnostics, surgical planning, and treatment monitoring. Accurate and robust segmentation models are essential for enabling reliable, data-driven clinical decision making across diverse imaging modalities. Given the inherent variability in image characteristics across modalities, developing a unified model capable of generalizing effectively to multiple modalities would be highly beneficial. This model could streamline clinical workflows and reduce the need for modality-specific training. However, real-world deployment faces major challenges, including data scarcity, domain shift between modalities (e.g., CT vs. MRI), and privacy restrictions that prevent data sharing. To address these issues, we propose FedGIN, a Federated Learning (FL) framework that enables multimodal organ segmentation without sharing raw patient data. Our method integrates a lightweight Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training. We evaluated FedGIN using two types of datasets: an imputed dataset and a complete dataset. In the limited dataset scenario, the model was initially trained using only MRI data, and CT data was added to assess its performance improvements. In the complete dataset scenario, both MRI and CT data were fully utilized for training on all clients. In the limited-data scenario, FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN and consistently outperformed local baselines. In the complete dataset scenario, FedGIN demonstrated near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, highlighting its strong cross-modality generalization under privacy constraints.
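The results above are reported as 3D Dice scores. For readers unfamiliar with the metric, here is a minimal sketch of the standard Dice coefficient on flattened binary voxel masks. The function name and the convention for two empty masks are assumptions for this example; the abstract does not say how the authors compute or average the score.

```python
def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A 3D volume can be passed in by flattening it to a single list of 0/1 voxel labels per organ.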

[140] Deep Learning-based Animal Behavior Analysis: Insights from Mouse Chronic Pain Models

Yu-Hsi Chen, Wei-Hsin Chen, Chien-Yao Wang, Hong-Yuan Mark Liao, James C. Liao, Chien-Chang Chen

Main category: cs.CV

TL;DR: A framework for automatically discovering chronic pain-related features in mice, outperforming human experts and existing methods in classification tasks.

Details

Motivation: Existing methods for assessing chronic pain in mice rely on manual labeling, which is biased and lacks accuracy in capturing persistent behavioral changes.

Method: Uses a universal action space projector to automatically extract mouse action features from video, avoiding human labeling bias.

Result: Achieves 48.41% accuracy in 15-class pain classification (vs. 21.33% for humans, 30.52% for B-SOiD) and 73.1% in 3-class classification (vs. 48% for humans, 58.43% for B-SOiD). Also reveals drug efficacy differences in zero-shot testing.

Conclusion: The method shows clinical potential for pain research and drug development by providing unbiased, automated behavioral analysis.

Abstract: Assessing chronic pain behavior in mice is critical for preclinical studies. However, existing methods mostly rely on manual labeling of behavioral features, and humans lack a clear understanding of which behaviors best represent chronic pain. For this reason, existing methods struggle to accurately capture the insidious and persistent behavioral changes in chronic pain. This study proposes a framework to automatically discover features related to chronic pain without relying on human-defined action labels. Our method uses a universal action space projector to automatically extract mouse action features, and avoids the potential bias of human labeling by retaining the rich behavioral information in the original video. In this paper, we also collected a mouse pain behavior dataset that captures the disease progression of both neuropathic and inflammatory pain across multiple time points. Our method achieves 48.41% accuracy in a 15-class pain classification task, significantly outperforming human experts (21.33%) and the widely used method B-SOiD (30.52%). Furthermore, when the classification is simplified to only three categories, i.e., neuropathic pain, inflammatory pain, and no pain, our method achieves an accuracy of 73.1%, which is notably higher than that of human experts (48%) and B-SOiD (58.43%). Finally, our method revealed differences in drug efficacy for different types of pain in zero-shot Gabapentin drug testing, and the results were consistent with past drug efficacy literature. This study demonstrates the potential clinical application of our method, which can provide new insights into pain research and related drug development.

[141] Rotation Equivariant Arbitrary-scale Image Super-Resolution

Qi Xie, Jiahong Fu, Zongben Xu, Deyu Meng

Main category: cs.CV

TL;DR: The paper proposes a rotation-equivariant ASISR method to improve geometric pattern recovery in arbitrary-scale image super-resolution by redesigning INR and encoder modules.

Details

Motivation: Common geometric patterns in low-resolution images are often distorted, leading to artifacts in high-resolution recoveries. Rotation equivariance is needed to maintain structural integrity.

Method: The authors redesign the INR and encoder modules to incorporate rotation equivariance, enabling end-to-end equivariant ASISR. Theoretical analysis of equivariance error is provided.

Result: Experiments on simulated and real datasets show superior performance. The method can also enhance existing ASISR techniques.

Conclusion: The proposed rotation-equivariant ASISR method effectively maintains geometric integrity and can be integrated into current frameworks for improved performance.

Abstract: The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug & play manner to further enhance their performance.

[142] X-MoGen: Unified Motion Generation across Humans and Animals

Xuan Wang, Kai Ruan, Liyang Qian, Zhizhi Guo, Chang Su, Gaoang Wang

Main category: cs.CV

TL;DR: X-MoGen is a unified framework for cross-species text-driven motion generation, addressing morphological challenges with a two-stage architecture and a large dataset (UniMo4D).

Details

Motivation: Existing methods model human and animal motion separately, lacking a unified approach. Cross-species motion generation offers advantages like shared representation and better generalization.

Method: X-MoGen uses a two-stage process: (1) learning canonical T-pose priors and encoding motion into a shared latent space, and (2) generating motion embeddings from text with masked motion modeling. A morphological consistency module ensures skeletal plausibility.

Result: X-MoGen outperforms state-of-the-art methods on both seen and unseen species, validated on the UniMo4D dataset.

Conclusion: X-MoGen successfully addresses cross-species motion generation challenges, offering a unified and scalable solution.

Abstract: Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.

[143] PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems

Qi Guo, Xiaojun Jia, Shanmin Pang, Simeng Qin, Lin Wang, Ju Jia, Yang Liu, Qing Guo

Main category: cs.CV

TL;DR: PhysPatch is a novel adversarial patch framework designed for MLLM-based autonomous driving systems, optimizing patch location, shape, and content for improved attack effectiveness and real-world feasibility.

Details

Motivation: MLLMs are vulnerable to adversarial patch attacks, but existing methods are ineffective due to MLLMs' complex architectures. PhysPatch addresses this gap.

Method: PhysPatch jointly optimizes patch attributes, uses semantic-based mask initialization, SVD-based alignment loss, and potential field-based refinement.

Result: PhysPatch outperforms prior methods in steering MLLM-based systems and ensures physically feasible patch placement.

Conclusion: PhysPatch enhances adversarial attack effectiveness and real-world applicability for MLLM-based AD systems.

Abstract: Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks, particularly adversarial patch attacks, which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models and perform poorly when transferred to MLLM-based systems due to the latter’s complex architectures and reasoning abilities. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.

[144] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering

Zewei Wu, Longhao Wang, Cui Wang, César Teixeira, Wei Ke, Zhang Xiong

Main category: cs.CV

TL;DR: A tracklet-enhanced tracker (MTT) is proposed to address challenges in tracking unseen categories by integrating adaptive tracklet generation and multi-clue association.

Details

Motivation: Existing tracking methods struggle with unseen categories due to low-confidence detections, weak constraints, and occlusions.

Method: The MTT framework clusters detections into tracklets using spatio-temporal correlation and associates them using location and appearance clues.

Result: Experiments show the framework’s competitiveness in generic multiple object tracking.

Conclusion: MTT effectively mitigates error propagation and enhances tracking robustness for unseen categories.

Abstract: Tracking specific targets, such as pedestrians and vehicles, has been the focus of recent vision-based multitarget tracking studies. However, in some real-world scenarios, unseen categories often challenge existing methods due to low-confidence detections, weak motion and appearance constraints, and long-term occlusions. To address these issues, this article proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. This framework first adaptively clusters the detection results according to their short-term spatio-temporal correlation into robust tracklets and then estimates the best tracklet partitions using multiple clues, such as location and appearance over time to mitigate error propagation in long-term association. Finally, extensive experiments on the benchmark for generic multiple object tracking demonstrate the competitiveness of the proposed framework.
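The abstract does not spell out how MTT clusters detections or scores tracklet partitions. To make the first stage more concrete, the sketch below swaps in a much simpler technique: greedy linking of detections in consecutive frames by bounding-box overlap (IoU). The function names, the box format, and the `min_iou` threshold are all hypothetical, and this is not the adaptive clustering proposed in the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def build_tracklets(frames, min_iou=0.5):
    """Greedily link detections in consecutive frames into short tracklets.

    frames: list of per-frame lists of boxes. Returns a list of tracklets,
    each a list of (frame_index, box) pairs.
    """
    tracklets = []
    active = []  # tracklets that received a detection in the previous frame
    for t, boxes in enumerate(frames):
        next_active = []
        unmatched = list(boxes)
        for tracklet in active:
            last_box = tracklet[-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                tracklet.append((t, best))
                unmatched.remove(best)
                next_active.append(tracklet)
        for box in unmatched:  # leftover detections start new tracklets
            tracklet = [(t, box)]
            tracklets.append(tracklet)
            next_active.append(tracklet)
        active = next_active
    return tracklets
```

A tracklet that misses a single frame is closed here, which is exactly the kind of fragmentation that a later long-term association step, using cues such as location and appearance, would have to repair.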

[145] SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation

Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao

Main category: cs.CV

TL;DR: SPA++ is a graph spectral alignment framework for Domain Adaptation (DA) that addresses inter-domain transferability and intra-domain discriminability through coarse graph alignment, spectral regularization, and neighbor-aware propagation. It also incorporates data augmentation and consistency regularization for robustness.

Details

Motivation: Prior DA methods focus on inter-domain transferability but neglect intra-domain structures, leading to poor discriminability. SPA++ aims to balance these aspects.

Method: SPA++ uses graph primitives for coarse alignment, spectral regularization, and neighbor-aware propagation. It includes data augmentation and consistency regularization for adaptability.

Result: SPA++ outperforms state-of-the-art methods on benchmark datasets, showing robustness in various DA scenarios.

Conclusion: SPA++ effectively balances transferability and discriminability, offering superior performance and adaptability in DA tasks.

Abstract: Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1) by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer toward aligning the domain graphs in eigenspaces; (2) we further develop a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3) by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios including most DA settings and even challenging distribution scenarios. Furthermore, we also provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the role of spectral alignment and smoothing consistency. Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.

[146] SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images

Dongchen Si, Di Wang, Erzhong Gao, Xiaolei Qin, Liu Zhao, Jing Zhang, Minqiang Xu, Jianbo Zhan, Jianshe Wang, Lin Liu, Bo Du, Liangpei Zhang

Main category: cs.CV

TL;DR: SPEX is a multimodal LLM for land cover extraction in spectral remote sensing, outperforming state-of-the-art methods by leveraging spectral priors and textual attributes.

Details

Motivation: Spectral information is underutilized in vision-language models, leading to suboptimal performance in multispectral remote sensing.

Method: Constructs SPIE dataset with spectral priors, proposes SPEX with multiscale feature aggregation, token context condensation, and multispectral visual pre-training.

Result: Outperforms existing methods on five multispectral datasets, excels in extracting land cover categories, and provides textual explanations.

Conclusion: SPEX advances land cover extraction by integrating spectral and textual data, enhancing interpretability and performance.

Abstract: Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. Code will be released at: https://github.com/MiliLab/SPEX.
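The SPIE dataset is said to encode spectral priors as textual attributes derived from classical spectral index computations. A minimal sketch of what that could look like uses two well-known indices, NDVI and NDWI. The choice of these two indices, the thresholds, and the output wording are assumptions for this example; the abstract does not list the indices or the text templates the authors use.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom

def ndwi(green, nir):
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    denom = green + nir
    return 0.0 if denom == 0 else (green - nir) / denom

def spectral_attributes(green, red, nir, veg_threshold=0.3, water_threshold=0.0):
    """Turn mean band reflectances of a region into a short text attribute.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    v, w = ndvi(nir, red), ndwi(green, nir)
    if v > veg_threshold:
        label = "likely vegetation"
    elif w > water_threshold:
        label = "likely water"
    else:
        label = "no strong vegetation or water signal"
    return f"NDVI={v:.2f}, NDWI={w:.2f}: {label}"
```

A string like this could be attached to an instruction prompt so that a language model sees spectral evidence in a form it can read.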

[147] EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery

Bingyu Yang, Qingyao Tian, Yimeng Geng, Huai Liao, Xinyan Huang, Jiebo Luo, Hongbin Liu

Main category: cs.CV

TL;DR: EndoMatcher is a generalizable endoscopic image matcher using large-scale multi-domain pre-training and a two-branch Vision Transformer for robust feature matching in challenging conditions.

Details

Motivation: Dense feature matching in endoscopic images is essential for robot-assisted tasks but is hindered by difficult visual conditions and lack of annotated data.

Method: EndoMatcher uses a two-branch Vision Transformer with dual interaction blocks and is trained on Endo-Mix6, a multi-domain dataset of 1.2M image pairs. A progressive multi-objective training strategy addresses dataset challenges.

Result: EndoMatcher improves inlier matches by 140.69% and 201.43% on Hamlyn and Bladder datasets, and MDPA by 9.40% on Gastro-Matching, outperforming state-of-the-art methods.

Conclusion: EndoMatcher achieves accurate, dense matching in challenging endoscopic conditions, generalizing well to unseen domains and imaging conditions.

Abstract: Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.

[148] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang

Main category: cs.CV

TL;DR: VFlowOpt is a token pruning framework for Large Multimodal Models (LMMs) that reduces computational costs by pruning 90% of visual tokens while maintaining performance, achieving faster inference and lower memory usage.

Details

Motivation: Current token pruning methods in LMMs are simplistic and cause performance degradation. VFlowOpt aims to address this by optimizing pruning strategies and minimizing information loss.

Method: VFlowOpt introduces an importance map based on attention-derived relevance and information entropy, a progressive pruning module with recycling, and a visual information flow-guided method to optimize pruning hyperparameters.

Result: The framework prunes 90% of visual tokens with minimal performance loss, reducing KV-Cache memory by 89% and speeding up inference by 3.8 times.

Conclusion: VFlowOpt effectively balances computational efficiency and performance in LMMs, offering a superior pruning strategy for visual-language tasks.

Abstract: Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
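The pruning step described above (score tokens, keep the most important, fold the rest into recycled tokens) can be sketched roughly as follows. This is an illustration under stated assumptions: the linear mix of the two scores with a hypothetical weight `alpha`, the single recycled token, and the score-weighted average are choices made for this example, and the paper's progressive multi-stage pruning and hyperparameter optimization are not reproduced.

```python
def prune_with_recycling(tokens, relevance, entropy, keep_ratio=0.1, alpha=0.5):
    """Keep the top-scoring tokens and merge the rest into one recycled token.

    tokens: list of equal-length feature vectors (lists of floats).
    relevance, entropy: one non-negative score per token.
    alpha: hypothetical mixing weight between the two scores.
    """
    scores = [alpha * r + (1.0 - alpha) * e for r, e in zip(relevance, entropy)]
    n_keep = max(1, round(keep_ratio * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:n_keep])  # preserve the original token order
    pruned_idx = order[n_keep:]
    kept = [tokens[i] for i in kept_idx]
    if not pruned_idx:
        return kept
    # Score-weighted average of the pruned tokens forms the recycled token;
    # if every pruned score is zero the recycled token is all zeros.
    weight_sum = sum(scores[i] for i in pruned_idx) or 1.0
    dim = len(tokens[0])
    recycled = [
        sum(scores[i] * tokens[i][d] for i in pruned_idx) / weight_sum
        for d in range(dim)
    ]
    return kept + [recycled]
```

The default `keep_ratio=0.1` mirrors the 90% pruning rate quoted in the abstract; the recycled token is appended after the kept tokens so that downstream layers still see a summary of what was removed.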

[149] Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation

Jianming Liu, Wenlong Qiu, Haitao Wei

Main category: cs.CV

TL;DR: A source-free Cross-Domain Few-Shot Segmentation (CD-FSS) method is proposed, leveraging textual and visual information for target domain adaptation without needing source domain data, achieving significant accuracy improvements.

Details

Motivation: Address performance degradation in Few-Shot Segmentation due to domain discrepancies and privacy/data transfer concerns by developing a source-free CD-FSS approach.

Method: Uses Task-Specific Attention Adapters (TSAA) and alignment modules (VVEA and TVEA) to adapt features and align embeddings without source data.

Result: Achieves average accuracy improvements of 2.18% (1-shot) and 4.11% (5-shot) across four datasets, outperforming state-of-the-art methods.

Conclusion: The proposed method effectively addresses domain discrepancies and privacy concerns, demonstrating superior performance in cross-domain few-shot segmentation.

Abstract: Few-Shot Segmentation (FSS) aims to efficiently segment new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) is proposed to mitigate such performance degradation. Current CD-FSS methods have primarily sought to develop segmentation models on a source domain capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pretrained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.

[150] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan, Yunshi Lan, Botian Shi

Main category: cs.CV

TL;DR: The paper addresses the limitations of Multimodal Large Language Models (MLLMs) in low-resource languages, proposing a dual-source strategy to enhance linguistic capability and cultural groundedness, and introduces the MELLA dataset for improved performance.

Details

Motivation: Current multilingual enhancement methods for MLLMs are inadequate for low-resource languages, lacking multimodal informativeness and cultural awareness, which are crucial for effective user service.

Method: A dual-source strategy is proposed, using native web alt-text for cultural groundedness and MLLM-generated captions for linguistic capability, implemented via the MELLA dataset.

Result: Fine-tuning on MELLA improves performance across eight languages, with models producing richer “thick descriptions” due to enhanced cultural and linguistic capabilities.

Conclusion: The study demonstrates the importance of cultural awareness and linguistic capability in MLLMs for low-resource languages, validated by the success of the MELLA dataset.

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce “thin descriptions”, they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing “thick descriptions”. We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.

[151] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking

Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang

Main category: cs.CV

TL;DR: The paper introduces ReasoningTrack, a reasoning-based vision-language tracking framework using Qwen2.5-VL, optimized with SFT and GRPO, and validated on a new dataset TNLLT.

Details

Motivation: Existing vision-language tracking methods lack flexibility and reasoning insights, limiting performance. This work aims to improve by leveraging pre-trained models and reasoning-based strategies.

Method: Proposes ReasoningTrack, embedding updated language descriptions with vision features in a unified backbone, optimized via SFT and GRPO, and tested on TNLLT dataset.

Result: Extensive experiments confirm the effectiveness of the reasoning-based natural language generation strategy.

Conclusion: ReasoningTrack outperforms baselines, validated by experiments, and introduces a new benchmark dataset for future research.

Abstract: Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking; however, these works fail to provide insights into the model’s reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning (GRPO) are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. 20 baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack

[152] Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2

Semanur Küçük, Cosimo Della Santina, Angeliki Laskari

Main category: cs.CV

TL;DR: A fine-tuned Segment Anything Model (SAM v2.1) effectively segments irregular gas bubbles in multiphase flows using minimal annotated data.

Details

Motivation: Traditional and learning-based methods fail to segment deformed or coalesced bubbles, a common issue in industrial applications like air lubrication systems.

Method: The study uses transfer learning with SAM v2.1, fine-tuning it on 100 annotated images to segment non-convex bubble structures.

Result: The model accurately segments highly irregular bubble shapes, overcoming limitations of previous methods.

Conclusion: SAM v2.1, with fine-tuning, is a promising solution for segmenting complex bubble structures in industrial settings.

Abstract: Segmenting gas bubbles in multiphase flows is a critical yet unsolved challenge in numerous industrial settings, from metallurgical processing to maritime drag reduction. Traditional approaches, and most recent learning-based methods, assume near-spherical shapes, limiting their effectiveness in regimes where bubbles undergo deformation, coalescence, or breakup. This complexity is particularly evident in air lubrication systems, where coalesced bubbles form amorphous and topologically diverse patches. In this work, we revisit the problem through the lens of modern vision foundation models. We cast the task as a transfer learning problem and demonstrate, for the first time, that a fine-tuned Segment Anything Model (SAM v2.1) can accurately segment highly non-convex, irregular bubble structures using as few as 100 annotated images.

[153] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models

Yatong Lan, Jingfeng Chen, Yiru Wang, Lei He

Main category: cs.CV

TL;DR: Arbiviewgen is a diffusion-based framework for generating controllable camera images from arbitrary viewpoints, using FAVS for feature-aware stitching and CVC-SSL for self-supervised cross-view consistency.

Details

Motivation: The lack of ground-truth data for extrapolated views in autonomous driving hinders high-fidelity generative model training.

Method: Uses Feature-Aware Adaptive View Stitching (FAVS) for hierarchical matching and Cross-View Consistency Self-Supervised Learning (CVC-SSL) for self-supervised training.

Result: Generates controllable arbitrary-view camera images without needing additional sensors or depth maps.

Conclusion: Arbiviewgen is the first method enabling controllable arbitrary-view image generation in multiple vehicle configurations.

Abstract: Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose Arbiviewgen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm where the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, Arbiviewgen is the first method capable of controllable arbitrary view camera image generation in multiple vehicle configurations.

[154] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li

Main category: cs.CV

TL;DR: Uni-CoT is a unified Chain-of-Thought framework for multimodal reasoning, combining image understanding and generation to model visual state transitions efficiently.

Details

Motivation: Extending CoT reasoning to vision-language tasks is challenging due to difficulties in modeling visual state transitions and incoherent visual trajectories.

Method: Uni-CoT uses a two-level reasoning paradigm (Macro-Level for task planning, Micro-Level for subtask execution) and structured training with interleaved image-text supervision.

Result: Achieves state-of-the-art performance on benchmarks (WISE, RISE, KRIS) with efficient training on 8 A100 GPUs.

Conclusion: Uni-CoT is a scalable and coherent solution for multimodal reasoning, demonstrating strong generalization.

Abstract: Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity for modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on the reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/

[155] Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models

Zane Xu, Jason Sun

Main category: cs.CV

TL;DR: The report reviews eight papers on zero-shot adversarial robustness in vision-language models (VLMs), focusing on the trade-off between robustness and generalization. It compares Adversarial Fine-Tuning (AFT) and Training-Free/Test-Time Defenses, and traces the evolution of defense methods. Future directions include hybrid strategies and adversarial pre-training.

Details

Motivation: To address the challenge of balancing adversarial robustness with zero-shot generalization in VLMs like CLIP.

Method: Analyzes two defense paradigms: Adversarial Fine-Tuning (AFT) and Training-Free/Test-Time Defenses, reviewing methods like TeCoA, LAAT, TIMA, AOM, TTC, and CLIPure.

Result: Identifies the evolution of defense methods from alignment-preserving to embedding space re-engineering and latent-space purification.

Conclusion: Highlights key challenges and suggests future directions, such as hybrid defense strategies and adversarial pre-training.

Abstract: This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) like CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model’s zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions including hybrid defense strategies and adversarial pre-training.

[156] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen

Main category: cs.CV

TL;DR: GUI-RC and GUI-RCPO improve GUI grounding accuracy by leveraging spatial voting and test-time reinforcement learning, achieving 2-3% gains without additional training.

Details

Motivation: Existing GUI grounding methods rely on costly pixel-level annotations, limiting scalability and efficiency.

Method: GUI-RC uses spatial voting grids from multiple predictions to identify consensus regions. GUI-RCPO transforms consistency patterns into rewards for test-time reinforcement learning.

Result: GUI-RC improves accuracy by 2-3% on ScreenSpot benchmarks; GUI-RCPO further boosts Qwen2.5-VL-3B-Instruct to 85.14%.

Conclusion: The approach demonstrates the potential of test-time scaling and reinforcement learning for more robust, data-efficient GUI agents.

Abstract: Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
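The spatial voting idea behind GUI-RC can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `consensus_region`, the integer pixel-box format `(x1, y1, x2, y2)`, and the choice to return the bounding box of all maximum-vote cells (rather than, say, the largest contiguous region) are assumptions made for illustration only.

```python
def consensus_region(boxes, width, height):
    """Return (bbox, votes) for the area where sampled predictions agree most.

    boxes: list of (x1, y1, x2, y2) integer pixel boxes, x2/y2 exclusive.
    Hypothetical helper: the paper's exact region-extraction rule may differ.
    """
    # Voting grid: one counter per screen pixel.
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Each sampled prediction adds one vote to every pixel it covers.
        for y in range(max(0, y1), min(height, y2)):
            row = grid[y]
            for x in range(max(0, x1), min(width, x2)):
                row[x] += 1

    best = max(max(row) for row in grid)
    # Assumption: take the bounding box of all cells with the top vote count.
    xs = [x for row in grid for x, v in enumerate(row) if v == best]
    ys = [y for y, row in enumerate(grid) if best in row]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1), best


# Three roughly agreeing samples and one outlier.
samples = [(10, 10, 50, 40), (20, 15, 60, 45), (15, 5, 55, 35), (200, 200, 220, 220)]
region, votes = consensus_region(samples, width=300, height=300)
print(region, votes)
```

The outlier box receives only its own vote, so it does not influence the returned region; a consistency reward in the spirit of GUI-RCPO could then score each sample by how many votes its own box collects, though the paper's exact reward is not given in the abstract.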

[157] RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding

Tianchen Fang, Guiru Liu

Main category: cs.CV

TL;DR: RegionMed-CLIP is a region-aware multimodal contrastive learning framework for medical image understanding, addressing data scarcity and global feature limitations by integrating localized pathological signals.

Details

Motivation: The challenges of limited annotated medical data and reliance on global image features, which miss subtle pathological regions, motivate the development of RegionMed-CLIP.

Method: The framework uses a region-of-interest (ROI) processor to integrate fine-grained regional features with global context, supported by progressive training and hierarchical multimodal alignment. MedRegion-500k, a large annotated medical image-text corpus, is also introduced.

Result: RegionMed-CLIP outperforms state-of-the-art models in tasks like image-text retrieval, zero-shot classification, and visual question answering.

Conclusion: Region-aware contrastive pre-training is crucial, and RegionMed-CLIP serves as a robust foundation for advancing multimodal medical image understanding.

Abstract: Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.

[158] A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis

Basna Mohammed Salih Hasan, Ramadhan J. Mstafa

Main category: cs.CV

TL;DR: The paper reviews gender classification methods, focusing on facial and iris biometrics, and highlights gaps and future research directions.

Details

Motivation: Gender classification is useful in applications like surveillance and human-computer interaction, with soft biometrics like facial and iris features being key.

Method: The study reviews existing literature and methodologies for gender classification, emphasizing facial and iris-based approaches.

Result: The paper provides an analysis of current methods, identifies gaps, and suggests future improvements.

Conclusion: The study aids researchers by summarizing existing approaches, challenges, and potential advancements in gender classification.

Abstract: Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Individuals’ identities may be gleaned from information about their gender, which is a kind of soft biometric. Over the years, several methods for determining a person’s gender have been devised. Some of the most well-known ones are based on physical characteristics like face, fingerprint, palmprint, DNA, ears, gait, and iris. Facial features account for the vast majority of gender classification methods. The iris is also a significant biometric trait because, according to research, it remains basically constant during an individual’s life. Besides that, the iris is externally visible and is non-invasive to the user, which is important for practical applications. Furthermore, there are already high-quality methods for segmenting and encoding iris images, and the current methods facilitate selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender. Previous literature is briefly reviewed. Additionally, there are a variety of methodologies for different steps of gender classification. This study provides researchers with knowledge and analysis of the existing gender classification approaches. It will also assist researchers who are interested in this specific area, highlight the gaps and challenges in the field, and provide suggestions and future paths for improvement.

[159] CF3: Compact and Fast 3D Feature Fields

Hyunjoon Lee, Joonkyu Min, Jaesik Park

Main category: cs.CV

TL;DR: The paper introduces CF3, a top-down pipeline for creating compact and fast 3D Gaussian feature fields, reducing computational costs by leveraging multi-view 2D features and adaptive sparsification.

Details

Motivation: Current 3D Gaussian Splatting (3DGS) methods rely on bottom-up optimization of raw 2D features, which is computationally expensive. The goal is to develop a more efficient approach.

Method: Proposes CF3, which fuses multi-view 2D features with pre-trained Gaussians, trains a per-Gaussian autoencoder on lifted features, and uses adaptive sparsification to prune redundant Gaussians.

Result: Achieves a competitive 3D feature field using only 5% of the Gaussians compared to Feature-3DGS, preserving geometric details.

Conclusion: CF3 offers a computationally efficient and compact alternative to existing 3DGS methods, with improved feature alignment and reduced redundancy.

Abstract: 3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.
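The "weighted fusion of multi-view 2D features" step can be pictured as a per-Gaussian weighted average. The sketch below is only an illustration under stated assumptions: the observation format `(gaussian_id, weight, feature)`, the function name `fuse_features`, and the use of a generic non-negative weight (for example, a rendering contribution) are hypothetical, since the abstract does not specify how the weights are defined.

```python
def fuse_features(observations, dim):
    """Lift 2D features onto Gaussians by weighted averaging.

    observations: iterable of (gaussian_id, weight, feature) tuples, where
    feature is a list of `dim` floats sampled from some view's 2D feature map.
    Hypothetical format; CF3's actual weighting scheme is not reproduced here.
    """
    sums, totals = {}, {}
    for gid, weight, feature in observations:
        acc = sums.setdefault(gid, [0.0] * dim)
        for k in range(dim):
            acc[k] += weight * feature[k]
        totals[gid] = totals.get(gid, 0.0) + weight

    # Normalize by the accumulated weight; skip Gaussians that were never seen.
    return {
        gid: [value / totals[gid] for value in acc]
        for gid, acc in sums.items()
        if totals[gid] > 0
    }


obs = [
    (0, 0.75, [1.0, 0.0]),  # Gaussian 0 seen strongly in view A
    (0, 0.25, [0.0, 1.0]),  # Gaussian 0 seen weakly in view B
    (1, 0.5, [2.0, 2.0]),   # Gaussian 1 seen in a single view
]
print(fuse_features(obs, dim=2))
```

Because the fused features live on the Gaussians themselves, a compression model such as the per-Gaussian autoencoder mentioned in the abstract could be trained on them directly instead of on 2D feature maps.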

[160] Robust Tracking with Particle Filtering for Fluorescent Cardiac Imaging

Suresh Guttikonda, Maximilian Neidhart, Johanna Sprenger, Johannes Petersen, Christian Detter, Alexander Schlaefer

Main category: cs.CV

TL;DR: A particle filtering tracker with cyclic-consistency checks is proposed for robust real-time tracking of cardiac features during bypass surgery, outperforming deep learning and conventional methods.

Details

Motivation: Traditional tracking methods struggle with heart motion and image fluctuations during coronary bypass grafting surgery, necessitating a more robust solution.

Method: A particle filtering tracker based on cyclic-consistency checks is used to track target landmarks, enabling simultaneous tracking of 117 targets at 25.4 fps.

Result: The method achieves a low tracking error of 5.00 +/- 0.22 px, significantly better than deep learning (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px).

Conclusion: The proposed tracker is effective for real-time cardiac imaging during interventions, offering superior accuracy and robustness.

Abstract: Intraoperative fluorescent cardiac imaging enables quality control following coronary bypass grafting surgery. We can estimate local quantitative indicators, such as cardiac perfusion, by tracking local feature points. However, heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional tracking methods. We propose a particle filtering tracker based on cyclic-consistency checks to robustly track particles sampled to follow target landmarks. Our method tracks 117 targets simultaneously at 25.4 fps, allowing real-time estimates during interventions. It achieves a tracking error of (5.00 +/- 0.22 px) and outperforms other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px).
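One common way to read a cyclic-consistency check is as a forward-backward test: a particle is tracked to the next frame and back, and particles that do not return near their starting point are down-weighted. The sketch below follows that reading and is not the authors' tracker; the `forward` and `backward` matchers, the Gaussian weighting, and the `sigma` parameter are all assumptions introduced for illustration.

```python
import math


def cyclic_error(point, forward, backward):
    """Distance between a point and its forward-then-backward tracked position."""
    fx, fy = forward(point)
    bx, by = backward((fx, fy))
    return math.hypot(bx - point[0], by - point[1])


def weight_particles(particles, forward, backward, sigma=2.0):
    """Assign normalized weights; cyclically consistent particles score higher.

    Assumption: a Gaussian likelihood on the forward-backward error (in px).
    """
    weights = [
        math.exp(-cyclic_error(p, forward, backward) ** 2 / (2 * sigma ** 2))
        for p in particles
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Toy matchers (hypothetical): motion is +1 px in x, but the backward matcher
# fails for points with x > 10, returning a position 3 px off.
def forward(p):
    return (p[0] + 1, p[1])


def backward(p):
    return (p[0] - 4, p[1]) if p[0] > 10 else (p[0] - 1, p[1])


particles = [(0, 0), (20, 0)]
print(weight_particles(particles, forward, backward))
```

In a full particle filter these weights would drive resampling, so that particles landing on unstable image regions are gradually replaced by ones that track consistently.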

[161] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Xiaoyang Zhang, Zhen Hua, Yakun Ju, Wei Zhou, Jun Liu, Alex C. Kot

Main category: cs.CV

TL;DR: SGDFuse uses SAM-guided conditional diffusion for high-fidelity infrared and visible image fusion, outperforming existing methods.

Details

Motivation: Existing IVIF methods lack deep semantic understanding, causing artifacts and detail loss.

Method: Two-stage process: preliminary fusion followed by SAM-guided diffusion for semantic-aware refinement.

Result: State-of-the-art performance in subjective/objective evaluations and downstream task adaptability.

Conclusion: SGDFuse effectively addresses core challenges in image fusion with explicit semantic directionality.

Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model’s coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.

[162] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding

Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim

Main category: cs.CV

TL;DR: The paper introduces B4DL, a benchmark for 4D LiDAR understanding in MLLMs, addressing the lack of annotations and suitable architectures. It includes a data pipeline and a model for direct 4D LiDAR processing.

Details

Motivation: 4D LiDAR is underexplored in MLLMs due to missing annotations and architectures, despite its potential for dynamic outdoor scene understanding.

Method: Proposes B4DL benchmark, a scalable data generation pipeline, and an MLLM model for raw 4D LiDAR processing.

Result: The model and benchmark offer a unified solution for spatio-temporal reasoning in dynamic environments.

Conclusion: The work advances 4D LiDAR understanding in MLLMs with practical tools and datasets.

Abstract: Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/

[163] Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection

Xiaoyang Zhang, Guodong Fan, Guang-Yong Chen, Zhen Hua, Jinjiang Li, Min Gan, C. L. Philip Chen

Main category: cs.CV

TL;DR: The paper proposes Wavelet-Guided Dual-Frequency Encoding (WGDF) for remote sensing change detection, leveraging frequency-domain features to enhance edge and fine-grained change detection, outperforming existing methods.

Details

Motivation: Existing deep learning methods for change detection rely on spatial-domain modeling, limiting feature diversity and hindering detection of subtle changes. Frequency-domain features, especially in the wavelet domain, can amplify fine-grained differences.

Method: WGDF uses Discrete Wavelet Transform (DWT) to decompose images into high- and low-frequency components. High-frequency features are enhanced with a Dual-Frequency Feature Enhancement (DFFE) module and a Frequency-Domain Interactive Difference (FDID) module. Low-frequency features use Transformers and a Progressive Contextual Difference Module (PCDM) for global semantic refinement.

Result: WGDF significantly reduces edge ambiguity and achieves superior accuracy and robustness on multiple remote sensing datasets.

Conclusion: WGDF effectively unifies local sensitivity and global discriminability, outperforming state-of-the-art methods in change detection.

Abstract: Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.
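The DWT decomposition into low- and high-frequency components can be made concrete with a single-level 2D Haar transform. The abstract does not say which wavelet WGDF uses, so Haar is an assumption here, and the function name `haar_dwt2` and the band names are illustrative only.

```python
def haar_dwt2(img):
    """Single-level 2D orthonormal Haar transform of a 2D list of numbers.

    Returns (low, (detail_rows, detail_cols, detail_diag)), each half-size.
    Assumption: Haar wavelet; height and width must be even.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    half_h, half_w = h // 2, w // 2

    def zeros():
        return [[0.0] * half_w for _ in range(half_h)]

    low, d_rows, d_cols, d_diag = zeros(), zeros(), zeros(), zeros()
    for i in range(half_h):
        for j in range(half_w):
            # The 2x2 block: a b on the top row, c d on the bottom row.
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            low[i][j] = (a + b + c + d) / 2     # smooth, global structure
            d_rows[i][j] = (a + b - c - d) / 2  # top row vs bottom row
            d_cols[i][j] = (a - b + c - d) / 2  # left column vs right column
            d_diag[i][j] = (a - b - c + d) / 2  # diagonal difference
    return low, (d_rows, d_cols, d_diag)


# A 2x2 patch with a vertical edge: only the column-difference band responds.
low, (d_rows, d_cols, d_diag) = haar_dwt2([[0, 1], [0, 1]])
print(low, d_rows, d_cols, d_diag)
```

For change detection, the same transform would be applied to both images of a pair; the detail bands emphasize edges, which is the property the abstract relies on for the high-frequency branch.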

[164] VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test

Meiqi Wu, Yaxuan Kang, Xuchen Li, Shiyu Hu, Xiaotang Chen, Yunfeng Kang, Weiqiang Wang, Kaiqi Huang

Main category: cs.CV

TL;DR: The paper proposes an automated method (VS-LLM) to analyze PPAT sketches for depression assessment, improving accuracy by 17.6% over traditional psychologist evaluations.

Details

Motivation: The manual interpretation of PPAT sketches in art therapy is labor-intensive and subjective, necessitating an automated solution.

Method: The VS-LLM method combines visual-semantic analysis with LLM for automated depression assessment from PPAT sketches, addressing challenges like low drawing accuracy.

Result: The method outperforms psychologist assessments by 17.6%, demonstrating effectiveness in large-scale DPT.

Conclusion: The work advances mental state assessment via automated PPAT sketch analysis, with datasets and code publicly available.

Abstract: The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants’ mental states through their sketches. Specifically, through sketches with the theme of “a person picking an apple from a tree (PPAT)”, it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists’ understanding of an individual’s mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we propose the following efforts: (1) Providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) Offering a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) Experimental results demonstrate that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to the research in mental state assessment based on PPAT sketches’ elements recognition. Our datasets and code are available at https://github.com/wmeiqi/VS-LLM.

[165] CoCAViT: Compact Vision Transformer with Robust Global Coordination

Xuyang Wang, Lingjuan Miao, Zhiqiang Zhou

Main category: cs.CV

TL;DR: CoCAViT addresses the generalization gap in smaller models by introducing Coordinator-patch Cross Attention (CoCA) for robust real-time visual representation, achieving strong performance on OOD benchmarks.

Details

Motivation: Smaller models show a larger performance drop on out-of-distribution (OOD) data, highlighting a generalization deficiency in existing efficient architectures.

Method: Identifies architectural bottlenecks and introduces CoCA, a dynamic, domain-aware global token mechanism, to enhance local-global feature modeling.

Result: CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on OOD benchmarks, and performs well on COCO and ADE20K tasks.

Conclusion: CoCAViT effectively bridges the generalization gap for smaller models, offering robust performance with low computational overhead.

Abstract: In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged that have performance comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, retaining robustness for smaller models. To restore the global field of pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. Extensive experiments empirically validate our design. At a resolution of 224×224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks, compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIoU on ADE20K semantic segmentation, while maintaining low latency.

[166] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering

Xu Yuan, Liangbo Ning, Wenqi Fan, Qing Li

Main category: cs.CV

TL;DR: The paper introduces mKG-RAG, a multimodal knowledge-augmented generation framework for VQA, using structured multimodal KGs to improve accuracy and reliability over vanilla RAG methods.

Details

Motivation: Vanilla RAG-based VQA methods often introduce irrelevant content due to unstructured knowledge, reducing accuracy. Structured multimodal KGs can enhance generation.

Method: Proposes mKG-RAG, leveraging MLLMs for keyword extraction and vision-text matching to build multimodal KGs. Uses a dual-stage retrieval strategy for efficient and precise knowledge retrieval.

Result: Outperforms existing methods, achieving state-of-the-art performance in knowledge-based VQA.

Conclusion: mKG-RAG effectively integrates structured multimodal knowledge, improving VQA accuracy and reliability.

Abstract: Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.

[167] Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting

Frank Ruis, Gertjan Burghouts, Hugo Kuijf

Main category: cs.CV

TL;DR: The paper proposes a Textual Inversion (TI) method for open-vocabulary object detection in VLMs, enabling vocabulary extension with minimal data while retaining zero-shot capabilities.

Details

Motivation: To address the loss of natural language querying and zero-shot capabilities in VLMs during fine-tuning, inspired by TI's success in text-to-image models.

Method: Extends VLM vocabulary by learning new tokens from few examples, keeping original weights frozen to retain performance and capabilities.

Result: The method maintains original benchmark performance, enables zero-shot transfer, and requires less compute than full fine-tuning.

Conclusion: TI for VLMs outperforms baselines prone to forgetting, offering efficient vocabulary extension without sacrificing existing capabilities.

Abstract: Recent progress in large pre-trained vision language models (VLMs) has reached state-of-the-art performance on several object detection benchmarks and boasts strong zero-shot capabilities, but for optimal performance on specific targets some form of finetuning is still necessary. While the initial VLM weights allow for great few-shot transfer learning, this usually involves the loss of the original natural language querying and zero-shot capabilities. Inspired by the success of Textual Inversion (TI) in personalizing text-to-image diffusion models, we propose a similar formulation for open-vocabulary object detection. TI allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as little as three examples. The learned tokens are completely compatible with the original VLM weights while keeping them frozen, retaining the original model’s benchmark performance, and leveraging its existing capabilities such as zero-shot domain transfer (e.g., detecting a sketch of an object after training only on real photos). The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluated whether the method matches or outperforms the baseline methods that suffer from forgetting in a wide variety of quantitative and qualitative experiments.

[168] 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering

Junyu Zhou, Yuyang Huang, Wenrui Dai, Junni Zou, Ziyang Zheng, Nuowen Kan, Chenglin Li, Hongkai Xiong

Main category: cs.CV

TL;DR: 3DGabSplat improves 3DGS by using 3D Gabor-based primitives for better high-frequency detail capture and efficiency.

Details

Motivation: 3DGS struggles with high-frequency details, redundant primitives, and inefficiency.

Method: Proposes 3D Gabor-based primitives with multi-directional frequency responses and a CUDA-based rasterizer.

Result: Outperforms 3DGS with 1.35 dB PSNR gain, fewer primitives, and lower memory use.

Conclusion: 3DGabSplat enhances rendering quality and efficiency, scalable for integration into existing 3DGS frameworks.

Abstract: Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.
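
To make the contrast between a Gaussian and a Gabor primitive concrete, here is a minimal sketch of a 3D Gabor kernel as a Gaussian envelope multiplied by a cosine carrier, plus a weighted bank of such kernels. The isotropic envelope, the cosine-only carrier, and the names `gabor3d`, `gabor_bank`, `freqs`, and `weights` are illustrative assumptions; the paper's actual parameterization is not given in the abstract.

```python
import math


def gaussian3d(x, sigma):
    """Isotropic Gaussian envelope at 3D offset x (isotropy is an assumption)."""
    r2 = sum(v * v for v in x)
    return math.exp(-0.5 * r2 / sigma ** 2)


def gabor3d(x, sigma, freq):
    """Gaussian envelope modulated by a cosine wave along the 3D vector freq."""
    phase = 2.0 * math.pi * sum(f * v for f, v in zip(freq, x))
    return gaussian3d(x, sigma) * math.cos(phase)


def gabor_bank(x, sigma, freqs, weights):
    """Weighted sum of Gabor kernels sharing one envelope (hypothetical bank)."""
    return sum(w * gabor3d(x, sigma, f) for f, w in zip(freqs, weights))


# Zero frequency recovers the plain (low-pass) Gaussian primitive.
x = (0.3, -0.2, 0.1)
print(gabor3d(x, 1.0, (0.0, 0.0, 0.0)) == gaussian3d(x, 1.0))

# Two directional frequencies combined in one primitive.
print(gabor_bank(x, 1.0, [(2.0, 0.0, 0.0), (0.0, 3.0, 0.0)], [0.5, 0.5]))
```

With a zero frequency vector the kernel reduces to the Gaussian, which is why a Gabor-style primitive can be viewed as a generalization of the 3DGS primitive and not a replacement for it. Nonzero frequencies add oscillation inside the same spatial footprint.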

[169] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation

Kang Liu, Zhuoqi Ma, Zikang Fang, Yunan Li, Kun Xie, Qiguang Miao

Main category: cs.CV

TL;DR: PriorRG improves chest X-ray report generation by integrating patient-specific prior knowledge, outperforming existing methods with significant BLEU and F1 score gains.

Details

Motivation: Existing methods neglect patient-specific prior knowledge, leading to inaccurate reports. PriorRG aims to emulate clinical workflows for better diagnostic reasoning.

Method: A two-stage framework: 1) Prior-guided contrastive pre-training for spatiotemporal feature extraction, and 2) Prior-aware coarse-to-fine decoding to integrate prior knowledge with vision encoder outputs.

Result: PriorRG achieves 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and 5.9% BLEU-1 gain on MIMIC-ABN.

Conclusion: PriorRG enhances clinical accuracy and fluency in report generation by leveraging prior knowledge, setting a new state-of-the-art.

Abstract: Chest X-ray report generation aims to reduce radiologists’ workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge – including clinical context (e.g., symptoms, medical history) and the most recent prior image – which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder’s hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. Code and checkpoints will be released upon acceptance.

[170] Cross-View Localization via Redundant Sliced Observations and A-Contrario Validation

Yongjun Zhang, Mingtao Xiong, Yi Wan, Gui-Song Xia

Main category: cs.CV

TL;DR: Slice-Loc is a two-stage method for cross-view localization (CVL) that improves reliability and accuracy by dividing ground-level images into sub-images, validating poses, and quantifying meaningfulness.

Details

Motivation: Current CVL methods lack redundant observations for reliability assessment, making it hard to validate localization accuracy.

Method: Slice-Loc divides query images into sub-images, estimates 3-DoF poses for each, filters errors using geometric rigidity, and merges inliers for the final pose. It also quantifies meaningfulness via false alarm estimation.

Result: After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3%. In cross-city tests on the DReSS dataset, it cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°.

Conclusion: Slice-Loc enhances CVL accuracy and reliability, outperforming state-of-the-art methods.

Abstract: Cross-view localization (CVL) matches ground-level images with aerial references to determine the geo-position of a camera, enabling smart vehicles to self-localize offline in GNSS-denied environments. However, most CVL methods output only a single observation, the camera pose, and lack the redundant observations required by surveying principles, making it challenging to assess localization reliability through the mutual validation of observational data. To tackle this, we introduce Slice-Loc, a two-stage method featuring an a-contrario reliability validation for CVL. Instead of using the query image as a single input, Slice-Loc divides it into sub-images and estimates the 3-DoF pose for each slice, creating redundant and independent observations. Then, a geometric rigidity formula is proposed to filter out the erroneous 3-DoF poses, and the inliers are merged to generate the final camera pose. Furthermore, we propose a model that quantifies the meaningfulness of localization by estimating the number of false alarms (NFA), according to the distribution of the locations of the sliced images. By eliminating gross errors, Slice-Loc boosts localization accuracy and effectively detects failures. After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3%. In cross-city tests on the DReSS dataset, Slice-Loc cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°, outperforming state-of-the-art methods. Code and dataset will be available at: https://github.com/bnothing/Slice-Loc.
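
The a-contrario idea mentioned in the abstract can be sketched with the textbook number-of-false-alarms formula: the number of tested hypotheses times the probability that at least k of n observations agree by chance. The binomial background model and the parameter names (`n_tests`, `n_slices`, `n_agree`, `p_chance`) are assumptions for illustration; the abstract does not spell out the exact NFA model used by Slice-Loc.

```python
from math import comb


def binomial_tail(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def nfa(n_tests, n_slices, n_agree, p_chance):
    """Number of false alarms under a binomial background model (assumption).

    n_tests:  how many candidate poses are tested
    n_slices: how many sliced sub-images produced a pose
    n_agree:  how many of those poses are mutually consistent
    p_chance: probability that one slice agrees purely by chance
    """
    return n_tests * binomial_tail(n_slices, n_agree, p_chance)


def is_meaningful(n_tests, n_slices, n_agree, p_chance, eps=1.0):
    """Accept a localization when fewer than eps false alarms are expected."""
    return nfa(n_tests, n_slices, n_agree, p_chance) <= eps


# Hypothetical numbers: 8 slices, 5% chance agreement, 100 tested poses.
print(is_meaningful(100, 8, 7, 0.05))  # strong agreement among slices
print(is_meaningful(100, 8, 1, 0.05))  # agreement no better than chance
```

The useful property is that the acceptance threshold has a direct interpretation: with `eps=1.0`, at most one accidental detection is expected across all tested hypotheses.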

[171] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation

Hamza Kalisch, Fabian Hörst, Jens Kleesiek, Ken Herrmann, Constantin Seibold

Main category: cs.CV

TL;DR: CT-GRAPH is a hierarchical graph attention network that improves radiology report generation by modeling fine-grained organ relationships and integrating them with global features.

Details

Motivation: Automating radiology report generation to assist radiologists by capturing fine-grained organ relationships often missed by current methods.

Method: Uses a hierarchical graph attention network with pretrained 3D medical feature encoders and anatomical masks to refine organ-level features, integrated into a large language model for report generation.

Result: Achieves a 7.9% absolute improvement in F1 score over state-of-the-art methods on the CT-RATE dataset.

Conclusion: CT-GRAPH effectively models radiological knowledge and significantly enhances report generation accuracy.

Abstract: As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial improvement of absolute 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.

[172] Deformable Attention Graph Representation Learning for Histopathology Whole Slide Image Analysis

Mingxi Fu, Xitong Ling, Yuxuan Chen, Jiawen Li, Fanglei Fu, Huaitian Yuan, Tian Guan, Yonghong He, Lianghui Zhu

Main category: cs.CV

TL;DR: A novel GNN framework with deformable attention improves pathology image classification by dynamically capturing spatial dependencies among tissue patches.

Details

Motivation: Existing methods like MIL and static GNNs fail to adequately model spatial relationships and lack specificity in attention mechanisms for pathology images.

Method: Proposes a dynamic weighted directed GNN with deformable attention, incorporating learnable spatial offsets based on patch coordinates to adaptively focus on relevant regions.

Result: Achieves state-of-the-art performance on four benchmark datasets, demonstrating superior ability to capture complex spatial structures.

Conclusion: The deformable attention mechanism enhances contextual understanding while preserving spatial specificity, proving effective for pathology image analysis.

Abstract: Accurate classification of Whole Slide Images (WSIs) and Regions of Interest (ROIs) is a fundamental challenge in computational pathology. While mainstream approaches often adopt Multiple Instance Learning (MIL), they struggle to capture the spatial dependencies among tissue structures. Graph Neural Networks (GNNs) have emerged as a solution to model inter-instance relationships, yet most rely on static graph topologies and overlook the physical spatial positions of tissue patches. Moreover, conventional attention mechanisms lack specificity, limiting their ability to focus on structurally relevant regions. In this work, we propose a novel GNN framework with deformable attention for pathology image analysis. We construct a dynamic weighted directed graph based on patch features, where each node aggregates contextual information from its neighbors via attention-weighted edges. Specifically, we incorporate learnable spatial offsets informed by the real coordinates of each patch, enabling the model to adaptively attend to morphologically relevant regions across the slide. This design significantly enhances the contextual field while preserving spatial specificity. Our framework achieves state-of-the-art performance on four benchmark datasets (TCGA-COAD, BRACS, gastric intestinal metaplasia grading, and intestinal ROI classification), demonstrating the power of deformable attention in capturing complex spatial structures in WSIs and ROIs.

[173] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation

Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho

Main category: cs.CV

TL;DR: UNCAGE is a training-free method using contrastive attention guidance to improve compositional fidelity in Masked Generative Transformers for text-to-image generation.

Details

Motivation: Addressing the challenge of accurate attribute binding and text-image alignment in compositional T2I generation, which even state-of-the-art Diffusion Models struggle with.

Method: Proposes UNCAGE, leveraging attention maps to prioritize unmasking tokens representing individual objects, enhancing compositional fidelity without additional training.

Result: UNCAGE improves performance in quantitative and qualitative evaluations across benchmarks with minimal inference overhead.

Conclusion: UNCAGE effectively enhances compositional T2I generation for Masked Generative Transformers, offering a practical solution with negligible overhead.

Abstract: Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.

[174] Physical Adversarial Camouflage through Gradient Calibration and Regularization

Jiawei Liang, Siyuan Liang, Jianjie Huang, Chenxi Si, Ming Zhang, Xiaochun Cao

Main category: cs.CV

TL;DR: A novel adversarial camouflage framework improves attack success rates on deep object detectors by addressing gradient optimization challenges in variable physical environments.

Details

Motivation: Physical adversarial camouflage poses security risks in safety-critical fields like autonomous driving, but existing methods struggle with inconsistent sampling and conflicting gradients.

Method: Proposes gradient calibration for consistent updates and gradient decorrelation to prioritize and orthogonalize gradients, enhancing stability and effectiveness.

Result: Achieves average ASR increases of 13.46% across distances and 11.03% across angles, outperforming state-of-the-art methods.

Conclusion: The framework demonstrates superior performance, highlighting the need for more robust system designs in real-world applications.

Abstract: The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely sampled to unsampled texture points. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles and distances show that our method significantly exceeds the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Furthermore, empirical evaluation in real-world scenarios highlights the need for more robust system design.
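
One plausible reading of "prioritizes and orthogonalizes gradients based on loss values" is a Gram-Schmidt pass over per-angle gradients, ordered by loss. The sketch below follows that reading. Ordering from highest to lowest loss, dropping near-zero residuals, and the names `decorrelate` and `combined_update` are assumptions, not the authors' published procedure.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decorrelate(grads, losses, eps=1e-12):
    """Orthogonalize per-angle gradients in order of decreasing loss.

    grads:  list of gradient vectors (one per viewing angle)
    losses: matching list of loss values used as priorities (assumption:
            higher loss means higher priority)
    Returns mutually orthogonal gradients; redundant ones are dropped.
    """
    order = sorted(range(len(grads)), key=lambda i: losses[i], reverse=True)
    kept = []
    for i in order:
        g = list(grads[i])
        for b in kept:
            # Remove the component of g that lies along a higher-priority gradient.
            coef = dot(g, b) / dot(b, b)
            g = [gi - coef * bi for gi, bi in zip(g, b)]
        if dot(g, g) > eps:
            kept.append(g)
    return kept


def combined_update(grads, losses):
    """Sum of the decorrelated gradients, used as a single texture update."""
    kept = decorrelate(grads, losses)
    return [sum(col) for col in zip(*kept)] if kept else []


# Two angles whose gradients partly overlap; the first has the higher loss.
print(combined_update([[1.0, 0.0], [1.0, 1.0]], [2.0, 1.0]))
```

Because each lower-priority gradient only contributes the part that is orthogonal to the higher-priority ones, overlapping directions are not double-counted and directly opposing components cannot cancel the prioritized update.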

[175] Smoothing Slot Attention Iterations and Recurrences

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: SmoothSA improves Slot Attention (SA) for Object-Centric Learning (OCL) by preheating cold-start queries and differentiating transforms for first and non-first video frames, enhancing precision and performance.

Details

Motivation: Cold-start queries in SA lack sample-specific cues, hindering precise aggregation on the first frame of images or videos. Non-first frames' queries require different transforms than the first frame.

Method: SmoothSA introduces (1) a preheating module to enrich cold-start queries with input features and (2) differentiated transforms for first (full iterations) and non-first (single iteration) video frames.

Result: Experiments show SmoothSA’s effectiveness in object discovery, recognition, and downstream tasks, with intuitive insights into its smoothing mechanism.

Conclusion: SmoothSA addresses SA’s limitations, improving aggregation precision and performance in OCL for both images and videos.

Abstract: Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors, by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame while transitioned from the previous frame’s slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or video’s first frame; also, non-first frames’ queries are already sample-specific and thus require transforms different from the first frame’s aggregation. We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or video’s first frame, we preheat the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method’s effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our code is available in the supplement.

[176] Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions

Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek

Main category: cs.CV

TL;DR: FIxLIP introduces a game-theory-based method for explaining vision-language models, outperforming first-order methods.

Details

Motivation: Current saliency maps lack the ability to capture complex cross-modal interactions in vision-language models.

Method: FIxLIP uses the weighted Banzhaf interaction index for efficient second-order interaction analysis.

Result: FIxLIP outperforms first-order methods on benchmarks like MS COCO and ImageNet-1k.

Conclusion: FIxLIP provides high-quality explanations and is useful for model comparison.

Abstract: Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model’s similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
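
For readers unfamiliar with the index named in the abstract, the sketch below computes a pairwise weighted Banzhaf interaction by brute force on a toy game. It uses a commonly cited definition, assumed here: each other player joins the coalition independently with probability `p`, and the discrete second-order difference of the value function is averaged over those coalitions. The toy `value` function and the player labels are hypothetical. FIxLIP itself is described as more efficient than exhaustive enumeration, which this sketch does not attempt to reproduce.

```python
from itertools import combinations


def weighted_banzhaf_interaction(value, players, i, j, p=0.5):
    """Brute-force pairwise interaction of players i and j.

    value:   function mapping a frozenset of players to a number
    players: all players (e.g. image patches and text tokens)
    p:       inclusion probability of each remaining player;
             p = 0.5 gives the classic Banzhaf interaction index
    """
    rest = [k for k in players if k not in (i, j)]
    m = len(rest)
    total = 0.0
    for size in range(m + 1):
        weight = p ** size * (1 - p) ** (m - size)
        for subset in combinations(rest, size):
            s = frozenset(subset)
            # Second-order difference: joint effect minus the individual effects.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def value(coalition):
    """Toy similarity score: additive part plus a synergy between 0 and 1."""
    synergy = 3.0 if {0, 1} <= coalition else 0.0
    return synergy + len(coalition)


players = [0, 1, 2]  # hypothetical: 0 = image patch, 1 = matching word, 2 = other
print(weighted_banzhaf_interaction(value, players, 0, 1))
print(weighted_banzhaf_interaction(value, players, 0, 2))
```

The additive part of the toy score cancels in the second-order difference, so only the synergy between players 0 and 1 survives. This is the kind of cross-modal effect that a first-order saliency map cannot express.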

[177] How and Why: Taming Flow Matching for Unsupervised Anomaly Detection and Localization

Liangwei Li, Lin Liu, Juanxiu Liu, Jing Zhang, Ruqian Hao, Xiaohui Du

Main category: cs.CV

TL;DR: A new unsupervised anomaly detection method using Flow Matching (FM) is proposed, addressing limitations of conventional flow-based methods. It introduces time-reversed FM (rFM) and Worst Transport (WT) interpolation, achieving state-of-the-art results on MVTec.

Details

Motivation: To overcome expressivity limitations in conventional flow-based methods for anomaly detection and localization.

Method: Uses Flow Matching (FM) with time-reversed FM (rFM) and Worst Transport (WT) displacement interpolation to transform data distributions and control sample trajectories.

Result: Achieves state-of-the-art performance on the MVTec dataset for unsupervised anomaly detection.

Conclusion: The proposed WT-Flow method provides a theoretically grounded and scalable solution for anomaly detection, with reproducible code to be released.

Abstract: We propose a new paradigm for unsupervised anomaly detection and localization using Flow Matching (FM), which fundamentally addresses the model expressivity limitations of conventional flow-based methods. To this end, we formalize the concept of time-reversed Flow Matching (rFM) as a vector field regression along a predefined probability path to transform unknown data distributions into standard Gaussian. We bring two core observations that reshape our understanding of FM. First, we rigorously prove that FM with linear interpolation probability paths is inherently non-invertible. Second, our analysis reveals that employing reversed Gaussian probability paths in high-dimensional spaces can lead to trivial vector fields. This issue arises due to the manifold-related constraints. Building on the second observation, we propose Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path. The proposed WT-Flow enhances dynamical control over sample trajectories, constructing “degenerate potential wells” for anomaly-free samples while allowing anomalous samples to escape. This novel unsupervised paradigm offers a theoretically grounded separation mechanism for anomalous samples. Notably, FM provides a computationally tractable framework that scales to complex data. We present the first successful application of FM for the unsupervised anomaly detection task, achieving state-of-the-art performance at a single scale on the MVTec dataset. The reproducible code for training will be released upon camera-ready submission.
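For orientation, the sketch below shows the generic flow matching regression objective with a linear interpolation path running from a data sample to a Gaussian sample, which is the conventional setting the abstract analyzes. It is not WT-Flow: the Worst Transport interpolation is not specified in the abstract, and the function names and the pure-Python vector representation are assumptions made for illustration.

```python
import random

def linear_path_sample(x0, x1, t):
    """Point on the straight line from x0 (t=0) to x1 (t=1) and its velocity."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # d/dt of the linear path
    return xt, velocity

def fm_regression_loss(vector_field, data_batch, rng):
    """Mean squared error between a vector field and the path velocity.

    Direction assumed here: data at t=0, standard Gaussian at t=1.
    """
    total = 0.0
    for x0 in data_batch:
        x1 = [rng.gauss(0.0, 1.0) for _ in x0]  # Gaussian endpoint
        t = rng.random()                        # time sampled uniformly in [0, 1)
        xt, target = linear_path_sample(x0, x1, t)
        pred = vector_field(xt, t)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / len(data_batch)

# Usage with a hypothetical vector field that always predicts zero velocity.
zero_field = lambda x, t: [0.0] * len(x)
loss = fm_regression_loss(zero_field, [[1.0, 2.0], [0.5, -0.5]], random.Random(0))
print(loss)
```

In practice the vector field would be a trained network; the zero field here only exercises the loss computation.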

[178] SMOL-MapSeg: Show Me One Label

Yunshuang Yuan, Frank Thiemann, Thorsten Dahms, Monika Sester

Main category: cs.CV

TL;DR: The paper proposes OND knowledge-based prompting to improve semantic segmentation of historical maps using foundation models, outperforming UNet.

Details

Motivation: Historical maps lack consistency in patterns and styles, making it difficult for pre-trained foundation models to perform well.

Method: Introduces OND prompting to guide models on pattern-concept mapping, replacing SAM’s prompt encoder and fine-tuning on historical maps.

Result: SMOL-MapSeg accurately segments classes defined by OND knowledge and adapts to unseen classes with few-shot fine-tuning, outperforming UNet.

Conclusion: OND prompting effectively addresses the challenges of historical map segmentation, offering a flexible and accurate solution.

Abstract: Historical maps are valuable for studying changes to the Earth’s surface. With the rise of deep learning, models like UNet have been used to extract information from these maps through semantic segmentation. Recently, pre-trained foundation models have shown strong performance across domains such as autonomous driving, medical imaging, and industrial inspection. However, they struggle with historical maps. These models are trained on modern or domain-specific images, where patterns can be tied to predefined concepts through common sense or expert knowledge. Historical maps lack such consistency – similar concepts can appear in vastly different shapes and styles. To address this, we propose On-Need Declarative (OND) knowledge-based prompting, which introduces explicit prompts to guide the model on what patterns correspond to which concepts. This allows users to specify the target concept and pattern during inference (on-need inference). We implement this by replacing the prompt encoder of the foundation model SAM with our OND prompting mechanism and fine-tune it on historical maps. The resulting model is called SMOL-MapSeg (Show Me One Label). Experiments show that SMOL-MapSeg can accurately segment classes defined by OND knowledge. It can also adapt to unseen classes through few-shot fine-tuning. Additionally, it outperforms a UNet-based baseline in average segmentation performance.

[179] AutoIAD: Manager-Driven Multi-Agent Collaboration for Automated Industrial Anomaly Detection

Dongwei Ji, Bingzhang Hu, Yi Zhou

Main category: cs.CV

TL;DR: AutoIAD is a multi-agent framework for automated industrial anomaly detection, outperforming existing methods in task completion and model performance.

Details

Motivation: Manual effort in industrial anomaly detection is inefficient; AutoIAD aims to automate the process for better quality control.

Method: Uses a Manager-Driven central agent to orchestrate specialized sub-agents and a domain-specific knowledge base for end-to-end automation.

Result: Outperforms general-purpose agentic collaboration and AutoML frameworks in AUROC and task completion, with reduced hallucination issues.

Conclusion: AutoIAD is effective for industrial anomaly detection, with the Manager agent and knowledge base being key to its success.

Abstract: Industrial anomaly detection (IAD) is critical for manufacturing quality control, but conventionally requires significant manual effort for various application scenarios. This paper introduces AutoIAD, a multi-agent collaboration framework, specifically designed for end-to-end automated development of industrial visual anomaly detection. AutoIAD leverages a Manager-Driven central agent to orchestrate specialized sub-agents (including Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain-specific knowledge base, which intelligently handles the entire pipeline using raw industrial image data to develop a trained anomaly detection model. We construct a comprehensive benchmark using MVTec AD datasets to evaluate AutoIAD across various LLM backends. Extensive experiments demonstrate that AutoIAD significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks in task completion rate and model performance (AUROC), while effectively mitigating issues like hallucination through iterative refinement. Ablation studies further confirm the crucial roles of the Manager central agent and the domain knowledge base module in producing robust and high-quality IAD solutions.

[180] Symmetry Understanding of 3D Shapes via Chirality Disentanglement

Weikang Wang, Tobias Weißberg, Nafie El Amrani, Florian Bernard

Main category: cs.CV

TL;DR: The paper introduces an unsupervised chirality feature extraction pipeline for shape analysis, addressing the gap in chirality-aware descriptors for point clouds and meshes.

Details

Motivation: Chirality is crucial in shape analysis but underdeveloped compared to image domains. Existing shape descriptors lack chirality disambiguation.

Method: Proposes a pipeline using the Diff3F framework to extract chirality features from 2D foundation models, applied to shape vertices.

Result: Evaluated on tasks like left-right disentanglement, shape matching, and part segmentation, showing effectiveness.

Conclusion: The proposed chirality features are practical and enhance shape analysis tasks.

Abstract: Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. We evaluated the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results from downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: https://wei-kang-wang.github.io/chirality/

[181] MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips

Shibo Wang, Haonan He, Maria Parelli, Christoph Gebhardt, Zicong Fan, Jie Song

Main category: cs.CV

TL;DR: MagicHOI improves hand-object reconstruction from monocular videos by using novel view synthesis diffusion models to regularize unseen object regions, outperforming existing methods.

Details

Motivation: Existing methods rely on object templates or assume full visibility, leading to implausible reconstructions in real-world scenarios with limited viewpoints.

Method: MagicHOI integrates a novel view synthesis model and visible contact constraints to align hands and objects, leveraging diffusion models for supervision.

Result: MagicHOI outperforms state-of-the-art methods and effectively regularizes unseen object regions.

Conclusion: Novel view synthesis diffusion priors enhance 3D hand-object reconstruction, addressing limitations of current approaches.

Abstract: Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align hand to object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.

[182] Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events

Lin Zhu, Ruonan Liu, Xiao Wang, Lizhi Wang, Hua Huang

Main category: cs.CV

TL;DR: A self-supervised pre-training framework for event cameras enhances feature extraction from sparse, noisy data, outperforming state-of-the-art methods in tasks like object recognition and optical flow estimation.

Details

Motivation: Event cameras provide high temporal resolution and dynamic range but produce sparse, noisy data, complicating feature extraction.

Method: The framework includes Difference-guided Masked Modeling, Backbone-fixed Feature Transition, and Focus-aimed Contrastive Learning to extract and refine features.

Result: The framework outperforms existing methods in downstream tasks like object recognition and semantic segmentation.

Conclusion: The proposed self-supervised approach effectively leverages event data, demonstrating robustness and superior performance across tasks.

Abstract: The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone to preserve representations learned from masked modeling and stabilize their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.

[183] Head Anchor Enhanced Detection and Association for Crowded Pedestrian Tracking

Zewei Wu, César Teixeira, Wei Ke, Zhang Xiong

Main category: cs.CV

TL;DR: An enhanced pedestrian tracking framework addresses occlusion challenges by integrating richer feature representations, head keypoint detection, and an iterative Kalman filtering approach for robust multi-object tracking in crowded scenes.

Details

Motivation: Real-world pedestrian tracking faces occlusion issues, especially in crowded environments, where traditional methods relying on full-body features and linear motion assumptions fail.

Method: The proposed method uses detection features from regression and classification branches, head keypoint detection (less prone to occlusion), and iterative Kalman filtering with 3D priors for motion modeling.

Result: The framework improves tracking robustness in occlusion-heavy scenarios by combining advanced appearance and motion modeling.

Conclusion: The method provides a more reliable solution for multi-object tracking in crowded, occlusion-prone environments.

Abstract: Visual pedestrian tracking represents a promising research field, with extensive applications in intelligent surveillance, behavior analysis, and human-computer interaction. However, real-world applications face significant occlusion challenges. When multiple pedestrians interact or overlap, the loss of target features severely compromises the tracker’s ability to maintain stable trajectories. Traditional tracking methods, which typically rely on full-body bounding box features extracted from Re-ID models and linear constant-velocity motion assumptions, often struggle in severe occlusion scenarios. To address these limitations, this work proposes an enhanced tracking framework that leverages richer feature representations and a more robust motion model. Specifically, the proposed method incorporates detection features from both the regression and classification branches of an object detector, embedding spatial and positional information directly into the feature representations. To further mitigate occlusion challenges, a head keypoint detection model is introduced, as the head is less prone to occlusion compared to the full body. In terms of motion modeling, we propose an iterative Kalman filtering approach designed to align with modern detector assumptions, integrating 3D priors to better complete motion trajectories in complex scenes. By combining these advancements in appearance and motion modeling, the proposed method offers a more robust solution for multi-object tracking in crowded environments where occlusions are prevalent.
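The "linear constant-velocity" baseline that the abstract contrasts itself with can be sketched as a small Kalman filter over one image coordinate. This is the traditional motion model, not the paper's iterative variant with 3D priors, whose details the abstract does not give. The function names, noise values, and the simplified diagonal process noise are assumptions made for illustration.

```python
def predict(state, cov, dt, q):
    """Propagate [position, velocity] with a constant-velocity model."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    a = p00 + dt * p10
    b = p01 + dt * p11
    # F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I (simplified).
    new_cov = [[a + dt * b + q, b], [p10 + dt * p11, p11 + q]]
    return [p + dt * v, v], new_cov

def update(state, cov, z, r):
    """Correct the state with a position-only measurement z (H = [1, 0])."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    innovation = z - p
    s = p00 + r                      # innovation variance
    k0, k1 = p00 / s, p10 / s        # Kalman gain
    new_state = [p + k0 * innovation, v + k1 * innovation]
    new_cov = [[(1 - k0) * p00, (1 - k0) * p01],
               [p10 - k1 * p00, p11 - k1 * p01]]
    return new_state, new_cov

def track(measurements, dt=1.0, q=0.01, r=1.0):
    """Run predict/update over a sequence of position measurements."""
    state, cov = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
    for z in measurements:
        state, cov = predict(state, cov, dt, q)
        state, cov = update(state, cov, z, r)
    return state

# A target moving one unit per frame; the velocity estimate should approach 1.
position, velocity = track([float(k) for k in range(1, 31)])
print(position, velocity)
```

During an occlusion only `predict` would run, so the estimate keeps drifting along the last velocity. That is the failure mode in crowded scenes that motivates a richer motion model.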

[184] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment

Ekaterina Shumitskaya, Dmitriy Vatolin, Anastasia Antsiferova

Main category: cs.CV

TL;DR: A novel certified defense method for IQA models using randomized smoothing in feature space, preserving image quality while ensuring robustness.

Details

Motivation: Prior methods degrade image quality by injecting noise in input space; this work aims to maintain fidelity while providing robustness.

Method: Randomized smoothing in feature space, analyzing Jacobian’s maximum singular value to link feature and input-space noise. Supports FR and NR IQA models without architecture changes.

Result: Reduces inference time by 99.5% (no certification) and 20.6% (with certification). Improves correlation with subjective scores by up to 30.9%.

Conclusion: The method is efficient, preserves image quality, and outperforms existing certified defenses in robustness and performance.

Abstract: We propose a novel certified defense method for Image Quality Assessment (IQA) models based on randomized smoothing with noise applied in the feature space rather than the input space. Unlike prior approaches that inject Gaussian noise directly into input images, often degrading visual quality, our method preserves image fidelity while providing robustness guarantees. To formally connect noise levels in the feature space with corresponding input-space perturbations, we analyze the maximum singular value of the backbone network’s Jacobian. Our approach supports both full-reference (FR) and no-reference (NR) IQA models without requiring any architectural modifications, suitable for various scenarios. It is also computationally efficient, requiring a single backbone forward pass per image. Compared to previous methods, it reduces inference time by 99.5% without certification and by 20.6% when certification is applied. We validate our method with extensive experiments on two benchmark datasets, involving six widely-used FR and NR IQA models and comparisons against five state-of-the-art certified defenses. Our results demonstrate consistent improvements in correlation with subjective quality scores by up to 30.9%.
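A minimal sketch of the idea, under stated assumptions: the backbone features are computed once, Gaussian noise is added to the features rather than to the image, and the quality head is averaged over the noisy copies. The mean aggregation, the function names, and the radius conversion are illustrative assumptions; the abstract does not specify the paper's smoothing statistic or its certificate. The conversion uses the standard bound that if the Jacobian's largest singular value is at most L over the region of interest, an input perturbation of norm r moves the features by at most L * r.

```python
import random

def smoothed_score(head, features, sigma, n_samples, rng):
    """Average a quality head over Gaussian perturbations of fixed features."""
    total = 0.0
    for _ in range(n_samples):
        noisy = [f + rng.gauss(0.0, sigma) for f in features]
        total += head(noisy)
    return total / n_samples

def input_radius(feature_radius, jacobian_sigma_max):
    """Input-space radius implied by a feature-space radius.

    Assumes jacobian_sigma_max bounds the backbone Jacobian's spectral norm
    everywhere on the segment between the clean and perturbed inputs.
    """
    return feature_radius / jacobian_sigma_max

# Usage with a hypothetical linear head and precomputed features.
head = lambda feats: sum(feats) / len(feats)
features = [0.2, 0.8, 0.5]          # stands in for one backbone forward pass
rng = random.Random(0)
print(smoothed_score(head, features, sigma=0.1, n_samples=1000, rng=rng))
print(input_radius(feature_radius=0.5, jacobian_sigma_max=2.0))
```

Because the noise is applied after the backbone, the expensive forward pass is not repeated per noise sample, which is consistent with the single-pass efficiency claim in the abstract.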

[185] Leveraging AI to Accelerate Clinical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods

Matthew Purri, Amit Patel, Erik Deurrell

Main category: cs.CV

TL;DR: Octozi, an AI-assisted platform, boosts clinical data cleaning efficiency by 6.03-fold and reduces errors by 6.44-fold, demonstrating transformative potential in drug development.

Details

Motivation: Manual clinical trial data cleaning is inefficient due to rising data volumes and complexity, creating a bottleneck in drug development.

Method: Octozi combines large language models with domain-specific heuristics to assist clinical reviewers in data cleaning.

Result: AI assistance increased throughput by 6.03-fold, reduced errors from 54.67% to 8.48%, and cut false positives by 15.48-fold.

Conclusion: AI-assisted approaches can revolutionize clinical trial workflows, improving efficiency, reducing costs, and maintaining compliance.

Abstract: Clinical trial data cleaning represents a critical bottleneck in drug development, with manual review processes struggling to manage exponentially increasing data volumes and complexity. This paper presents Octozi, an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform clinical data review. In a controlled experimental study with experienced clinical reviewers (n=10), we demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement). Crucially, the system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden. These improvements were consistent across reviewers regardless of experience level, suggesting broad applicability. Our findings indicate that AI-assisted approaches can address fundamental inefficiencies in clinical trial operations, potentially accelerating drug development timelines and reducing costs while maintaining regulatory compliance. This work establishes a framework for integrating AI into safety-critical clinical workflows and demonstrates the transformative potential of human-AI collaboration in pharmaceutical clinical trials.

[186] Optimal Brain Connection: Towards Efficient Structural Pruning

Shaowu Chen, Wei Ma, Binhua Huang, Qingyuan Wang, Guoxin Wang, Weize Sun, Lei Huang, Deepu John

Main category: cs.CV

TL;DR: The paper introduces Optimal Brain Connection, a structural pruning framework that evaluates parameter saliency using the Jacobian Criterion and mitigates performance loss with Equivalent Pruning.

Details

Motivation: Existing structural pruning methods ignore parameter interconnections, limiting their effectiveness.

Method: Proposes the Jacobian Criterion for parameter saliency and Equivalent Pruning with autoencoders to retain contributions of pruned connections.

Result: Jacobian Criterion outperforms other metrics; Equivalent Pruning reduces performance degradation.

Conclusion: The framework improves pruning by considering parameter interconnections and preserving model performance.

Abstract: Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections, including pruned ones, during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: https://github.com/ShaowuChen/Optimal_Brain_Connection

[187] When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework

Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han, Siqiang Luo

Main category: cs.CV

TL;DR: SSTGNN is a lightweight graph neural network for detecting AI-generated videos, outperforming state-of-the-art models with fewer parameters.

Details

Motivation: Detecting AI-generated videos is challenging due to diverse manipulation types and reliance on isolated information in existing methods.

Method: SSTGNN uses a graph-based framework with spatial-spectral-temporal reasoning, spectral filters, and temporal differential modeling.

Result: SSTGNN achieves superior performance in in-domain and cross-domain settings, with 42.4x fewer parameters than top models.

Conclusion: SSTGNN is effective, robust, and scalable for real-world deployment in detecting manipulated videos.

Abstract: The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4× fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.

[188] AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety

Adi Levi, Or Levi, Sardhendu Mishra, Jonathan Morra

Main category: cs.CV

TL;DR: The paper explores using Multimodal Large Language Models (MLLMs) for brand safety classification in video content moderation, introducing a new dataset and evaluating MLLMs’ performance against human reviewers.

Details

Motivation: The exponential growth of online video content has made manual moderation impractical, necessitating automated solutions that understand multimodal cues.

Method: The authors introduce a novel multimodal and multilingual dataset, labeled by professionals, and benchmark MLLMs (Gemini, GPT, Llama) for brand safety classification.

Result: MLLMs show effectiveness in brand safety classification but have limitations, as highlighted by failure cases. The dataset is released for future research.

Conclusion: MLLMs offer a promising but imperfect solution for multimodal content moderation, with room for improvement in accuracy and cost efficiency.

Abstract: As the volume of video content online grows exponentially, the demand for moderation of unsafe videos has surpassed human capabilities, posing both operational and mental health challenges. While recent studies demonstrated the merits of Multimodal Large Language Models (MLLMs) in various video understanding tasks, their application to multimodal content moderation, a domain that requires nuanced understanding of both visual and textual cues, remains relatively underexplored. In this work, we benchmark the capabilities of MLLMs in brand safety classification, a critical subset of content moderation for safe-guarding advertising integrity. To this end, we introduce a novel, multimodal and multilingual dataset, meticulously labeled by professional reviewers in a multitude of risk categories. Through a detailed comparative analysis, we demonstrate the effectiveness of MLLMs such as Gemini, GPT, and Llama in multimodal brand safety, and evaluate their accuracy and cost efficiency compared to professional human reviewers. Furthermore, we present an in-depth discussion shedding light on limitations of MLLMs and failure cases. We are releasing our dataset alongside this paper to facilitate future research on effective and responsible brand safety and content moderation.

[189] Looking into the Unknown: Exploring Action Discovery for Segmentation of Known and Unknown Actions

Federico Spurio, Emad Bahrami, Olga Zatsarynna, Yazan Abu Farha, Gianpiero Francesca, Juergen Gall

Main category: cs.CV

TL;DR: Action Discovery addresses ambiguous actions and incomplete annotations in partially labeled datasets by identifying known and unknown actions using granularity-guided segmentation and semantic assignment.

Details

Motivation: The challenge lies in defining and annotating ambiguous or overlooked actions in partially labeled datasets, common in neuroscience and other domains.

Method: A two-step approach: Granularity-Guided Segmentation Module (GGSM) for temporal intervals and Unknown Action Segment Assignment (UASA) for semantic classification.

Result: The method outperforms baselines on Breakfast, 50Salads, and Desktop Assembly datasets.

Conclusion: Action Discovery effectively handles partial annotations and improves action segmentation in ambiguous scenarios.

Abstract: We introduce Action Discovery, a novel setup within Temporal Action Segmentation that addresses the challenge of defining and annotating ambiguous actions and incomplete annotations in partially labeled datasets. In this setup, only a subset of actions - referred to as known actions - is annotated in the training data, while other unknown actions remain unlabeled. This scenario is particularly relevant in domains like neuroscience, where well-defined behaviors (e.g., walking, eating) coexist with subtle or infrequent actions that are often overlooked, as well as in applications where datasets are inherently partially annotated due to ambiguous or missing labels. To address this problem, we propose a two-step approach that leverages the known annotations to guide both the temporal and semantic granularity of unknown action segments. First, we introduce the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions by mimicking the granularity of annotated actions. Second, we propose the Unknown Action Segment Assignment (UASA), which identifies semantically meaningful classes within the unknown actions, based on learned embedding similarities. We systematically explore the proposed setting of Action Discovery on three challenging datasets - Breakfast, 50Salads, and Desktop Assembly - demonstrating that our method considerably improves upon existing baselines.

[190] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis

Kunyu Feng, Yue Ma, Xinhua Zhang, Boshi Liu, Yikuang Yuluo, Yinhan Zhang, Runtao Liu, Hongyu Liu, Zhiyuan Qin, Shanhui Mo, Qifeng Chen, Zeyu Wang

Main category: cs.CV

TL;DR: Proposes Follow-Your-Instruction, an MLLM-driven framework for synthesizing high-quality 2D, 3D, and 4D data to address the scarcity of real-world data for AI-generated content.

Details

Motivation: High-quality, diverse, and scalable data is crucial for AIGC, but real-world data collection is costly and time-consuming. Existing methods lack scalability and accuracy.

Method: Uses MLLM-Collector for asset collection, MLLM-Generator and MLLM-Optimizer for semantic refinement, and MLLM-Planner for temporal coherence in 4D tasks.

Result: Synthetic data significantly improves baseline model performance in 2D, 3D, and 4D generative tasks.

Conclusion: Follow-Your-Instruction is a scalable and effective data engine for generative intelligence.

Abstract: With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts, and leverages Vision-Language Models (VLMs) for semantic refinement through multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on the 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction’s potential as a scalable and effective data engine for generative intelligence.

Haijing Liu, Tao Pu, Hefeng Wu, Keze Wang, Liang Lin

Main category: cs.CV

TL;DR: DART framework improves open-vocabulary multi-label recognition by combining adaptive intra-class refinement and inter-class transfer using LLM-derived knowledge.

Details

Motivation: Existing VLP models lack fine-grained localization and fail to leverage structured relational knowledge, limiting performance for unseen classes.

Method: DART uses an Adaptive Refinement Module (ARM) for intra-class refinement and an Adaptive Transfer Module (ATM) with a Class Relationship Graph (CRG) for inter-class transfer.

Result: DART achieves state-of-the-art performance on benchmarks.

Conclusion: DART effectively integrates LLM-derived knowledge and adaptive refinement for superior OV-MLR performance.

Abstract: Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.

[192] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Ying Zhang, Chen Li, Yali Wang

Main category: cs.CV

TL;DR: WeTok tokenizer improves vision generation with Group-wise Quantization and Generative Decoding, outperforming existing methods in fidelity and compression.

Details

Motivation: Addressing the trade-off between compression ratios and reconstruction fidelity in visual tokenizers.

Method: Introduces Group-wise lookup-free Quantization (GQ) and Generative Decoding (GD) for efficient and high-fidelity reconstruction.

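Result: Achieves a record-low zero-shot rFID of 0.12 on the ImageNet 50k validation set; the highest-compression model reaches a zero-shot rFID of 3.49 at a compression ratio of 768, versus 4.57 for Cosmos at a ratio of 384.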

Conclusion: WeTok sets a new standard for visual tokenizers, offering scalable and high-performance solutions.

Abstract: The visual tokenizer is a critical component of vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest compression model achieves a zero-shot rFID of 3.49 with a compression ratio of 768, outperforming Cosmos, which reaches 4.57 at a compression ratio of 384, half of ours. Code and models are available: https://github.com/zhuangshaobin/WeTok.
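As a rough illustration of the group-wise lookup-free quantization idea, the sketch below splits a latent vector into groups, binarizes each channel by sign, and reads each group's bits as an integer token. Sign binarization is an assumption borrowed from generic lookup-free quantization; the abstract does not spell out WeTok's exact quantizer, and the function name `groupwise_lfq` and its parameters are hypothetical.

```python
from typing import List, Tuple


def groupwise_lfq(latent: List[float], group_size: int) -> Tuple[List[float], List[int]]:
    """Quantize each channel to -1/+1 by sign and emit one integer code per group.

    Hypothetical sketch: no codebook is stored, so a group of k channels
    has an implicit codebook of 2**k entries.
    """
    if group_size <= 0 or len(latent) % group_size != 0:
        raise ValueError("latent length must be a positive multiple of group_size")

    quantized: List[float] = []
    codes: List[int] = []
    for start in range(0, len(latent), group_size):
        group = latent[start:start + group_size]
        bits = [1 if value > 0 else 0 for value in group]
        quantized.extend(1.0 if bit else -1.0 for bit in bits)
        # Read the bits as a binary number, first channel = most significant bit.
        code = 0
        for bit in bits:
            code = (code << 1) | bit
        codes.append(code)
    return quantized, codes


latent = [0.3, -1.2, 0.7, 0.1, -0.4, -0.9, 2.0, -0.1]
quantized, codes = groupwise_lfq(latent, group_size=4)
print(quantized)  # [1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0]
print(codes)      # [11, 2]
```

In this sketch, quantizing per group keeps each token's implicit vocabulary at 2**group_size even when the total number of latent channels grows, which is one plausible reading of why grouping eases memory and computation limits.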

[193] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model

Tao Sun, Oliver Liu, JinJin Li, Lan Ma

Main category: cs.CV

TL;DR: LLaVA-RE is a framework using Multimodal Large Language Models (MLLMs) for binary image-text relevancy evaluation, addressing challenges in diverse text formats and relevancy definitions.

Details

Motivation: Evaluating image-text relevancy is crucial for multimodal generative AI, but it's challenging due to diverse text formats and varying relevancy definitions. MLLMs are ideal for this task due to their flexibility.

Method: LLaVA-RE, based on the LLaVA architecture, uses detailed task instructions and multimodal in-context samples. A novel binary relevancy dataset is also introduced.

Result: Experiments confirm the effectiveness of LLaVA-RE in evaluating image-text relevancy.

Conclusion: LLaVA-RE demonstrates the potential of MLLMs for binary relevancy evaluation, offering a flexible and effective solution.

Abstract: Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., “Relevant” vs. “Not Relevant”, is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt for binary image-text relevancy evaluation with MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.

[194] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity

Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, Ziwei Liu

Main category: cs.CV

TL;DR: Hi3DEval is a hierarchical framework for evaluating 3D generative content, combining object-level and part-level analysis with material realism assessment. It outperforms image-based metrics and aligns better with human preferences.

Details

Motivation: Existing methods for 3D content quality assessment rely on image-based metrics and lack spatial coherence, material authenticity, and local detail evaluation.

Method: Hi3DEval uses a hierarchical approach with object-level and part-level evaluation, material realism assessment, and leverages Hi3DBench (a dataset with annotations) and a 3D-aware automated scoring system.

Result: The framework outperforms image-based metrics in capturing 3D characteristics and aligns better with human preferences.

Conclusion: Hi3DEval provides a scalable and effective alternative to manual evaluations for 3D generative content.

Abstract: Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.

[195] MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai

Main category: cs.CV

TL;DR: MOSEv2 is a challenging dataset for video object segmentation (VOS) designed to address real-world complexities, showing significant performance drops in current methods.

Details

Motivation: Existing VOS datasets lack real-world complexity, limiting generalization. MOSEv2 aims to bridge this gap.

Method: MOSEv2 introduces 5,024 videos with 701,976 masks, featuring complex scenarios like occlusions, adverse weather, and camouflaged objects.

Result: Benchmarking shows performance drops (e.g., SAM2 from 76.4% to 50.9%), highlighting current methods’ limitations.

Conclusion: MOSEv2 advances VOS research by exposing gaps in handling real-world challenges, encouraging method improvements.

Abstract: Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
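To put the reported numbers in perspective, the short calculation below derives the absolute and relative drop for SAM2 and the average annotation density per video. It uses only figures quoted in the abstract; the mask count is stated there as "over 701,976", so the per-video average is a lower bound, and the helper name is hypothetical.

```python
from typing import Tuple


def absolute_and_relative_drop(before: float, after: float) -> Tuple[float, float]:
    """Return the drop in percentage points and as a percentage of the starting score."""
    absolute = before - after
    relative = absolute / before * 100.0
    return absolute, relative


# SAM2 score on MOSEv1 vs. MOSEv2, as quoted in the abstract.
abs_drop, rel_drop = absolute_and_relative_drop(76.4, 50.9)
print(f"{abs_drop:.1f} points, {rel_drop:.1f}% relative")  # 25.5 points, 33.4% relative

# Average annotation density from the dataset statistics in the abstract.
masks_per_video = 701_976 / 5_024
objects_per_video = 10_074 / 5_024
print(f"{masks_per_video:.1f} masks/video, {objects_per_video:.1f} objects/video")  # 139.7 masks/video, 2.0 objects/video
```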

[196] GAP: Gaussianize Any Point Clouds with Text Guidance

Weiqi Zhang, Junsheng Zhou, Haotian Geng, Wenyuan Zhang, Yu-Shen Liu

Main category: cs.CV

TL;DR: GAP introduces a method to convert colorless 3D point clouds into high-fidelity 3D Gaussians using text guidance, multi-view optimization, and surface-anchoring for geometric accuracy.

Details

Motivation: Bridging the gap between point clouds and Gaussians, especially for colorless point clouds, is a challenge. GAP aims to solve this by leveraging text guidance and multi-view optimization.

Method: GAP uses a multi-view optimization framework with a depth-aware image diffusion model, surface-anchoring for geometric constraints, and diffuse-based inpainting for hard-to-observe regions.

Result: GAP successfully generates high-fidelity 3D Gaussians from colorless point clouds, tested on synthetic and real-world scans, including large-scale scenes.

Conclusion: GAP provides an effective solution for converting point clouds to Gaussians, addressing key challenges like geometric accuracy and appearance consistency.

Abstract: 3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert the colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets the completion of hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: https://weiqi-zhang.github.io/GAP.

[197] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing

Mohammed Talha Alam, Fahad Shamshad, Fakhri Karray, Karthik Nandakumar

Main category: cs.CV

TL;DR: FaceAnonyMixer is a cancelable face generation framework that synthesizes privacy-preserving face images by mixing latent codes, ensuring revocability, unlinkability, and irreversibility while maintaining recognition utility.

Details

Motivation: Address privacy concerns in face recognition by developing a method that protects identity while meeting biometric template protection requirements.

Method: Leverages a pre-trained generative model’s latent space to mix real face latent codes with synthetic codes from revocable keys, refined via multi-objective loss.

Result: Generates high-quality cancelable faces that existing FR systems can match without modification, with superior recognition accuracy and over an 11% privacy-protection gain on a commercial API compared to recent cancelable biometric methods.

Conclusion: FaceAnonyMixer effectively balances privacy and utility, outperforming recent cancelable biometric methods.

Abstract: Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on commercial API compared to recent cancelable biometric methods. Code is available at: https://github.com/talha-alam/faceanonymixer.

[198] M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

Xiaowei Chi, Junbo Qi, Rongyu Zhang, Shanghang Zhang, Qifeng Liu, Yike Guo

Main category: cs.CV

TL;DR: M2Chat introduces a unified multimodal LLM framework with M3Adapter for efficient text-image alignment and a two-stage fine-tuning strategy, outperforming state-of-the-art models in diverse tasks.

Details

Motivation: Current LLM chatbots lack efficient alignment methods for high-fidelity performance in multimodal tasks.

Method: Proposes M3Adapter for integrating visual and semantic features and a two-stage M3FT fine-tuning strategy.

Result: M2Chat surpasses state-of-the-art models in benchmarks for interleaved generation, storytelling, and dialogue.

Conclusion: M2Chat demonstrates superior performance in multimodal tasks, with potential for broader applications.

Abstract: While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generations, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose $M^{2}Chat$, a novel unified multimodal LLM framework for generating interleaved text-image conversation across various scenarios. Specifically, we propose an $M^{3}Adapter$ that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. Upon the well-aligned fused feature, $M^{3}Adapter$ tailors a learnable gating strategy to balance the model creativity and consistency across various tasks adaptively. Moreover, to further enhance the effectiveness of $M^{3}Adapter$ while preserving the coherence of semantic context comprehension, we introduce a two-stage $M^{3}FT$ fine-tuning strategy. This strategy optimizes disjoint groups of parameters for image-text alignment and visual-instruction respectively. Extensive experiments demonstrate our $M^{2}Chat$ surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaving generation, storytelling, and multimodal dialogue systems. The demo and code are available at https://mattie-e.github.io/M2Chat.github.io.

[199] Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi

Main category: cs.CV

TL;DR: A generative model synthesizes photorealistic images from coarsely edited inputs, preserving details and adapting to new layouts using video supervision.

Details

Motivation: To create realistic image edits by leveraging video data for supervision, capturing changes in viewpoint, lighting, and interactions.

Method: Uses paired video frames to train a diffusion model, warping source frames to mimic user edits and supervising the model to generate target frames.

Result: Produces photorealistic edits that follow user layouts while harmonizing lighting and interactions.

Conclusion: Video supervision enables realistic image synthesis from coarse edits, addressing second-order effects effectively.

Abstract: We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts, yet adapts them to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user’s input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects.
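The dataset construction step, picking source and target frames from the same video at random time intervals, might be sketched as follows. This covers only the pair sampling; the two motion models used for warping are not described in enough detail in the abstract to sketch. The function name and the `max_gap` and `seed` parameters are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def sample_frame_pairs(num_frames: int, num_pairs: int, max_gap: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Sample (source_index, target_index) pairs from one video.

    Hypothetical sketch: each pair is separated by a random interval of
    1..max_gap frames, with the source always preceding the target.
    """
    if num_frames < 2 or max_gap < 1:
        raise ValueError("need at least two frames and a positive max_gap")

    rng = random.Random(seed)  # seeded so the sampled dataset is reproducible
    pairs: List[Tuple[int, int]] = []
    for _ in range(num_pairs):
        gap = rng.randint(1, min(max_gap, num_frames - 1))
        source = rng.randint(0, num_frames - 1 - gap)
        pairs.append((source, source + gap))
    return pairs


for source, target in sample_frame_pairs(num_frames=120, num_pairs=4, max_gap=30):
    print(f"source frame {source} -> target frame {target}")
```

Larger values of `max_gap` would expose a model to bigger changes in viewpoint and lighting between the two frames, at the cost of harder correspondences.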

[200] Verbalized Representation Learning for Interpretable Few-Shot Generalization

Cheng-Fu Yang, Da Yin, Wenbo Hu, Heng Ji, Nanyun Peng, Bolei Zhou, Kai-Wei Chang

Main category: cs.CV

TL;DR: VRL improves few-shot object recognition by using natural language features from a Vision-Language Model, outperforming prior methods with less data.

Details

Motivation: Humans excel at recognizing objects with few examples due to language understanding; replicating this can enhance model generalization in low-data settings.

Method: VRL employs a Vision-Language Model to extract verbalized features (inter-class differences and intra-class commonalities) and maps them to numeric vectors for downstream tasks.

Result: VRL achieves a 24% improvement over state-of-the-art methods with 95% less data and a 20% gain over human-labeled attributes.

Conclusion: VRL demonstrates the effectiveness of verbalized representations for few-shot learning, offering interpretability and superior performance.

Abstract: Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main.

[201] PerSense: Training-Free Personalized Instance Segmentation in Dense Images

Muhammad Ibraheem Siddiqui, Muhammad Umer Sheikh, Hassan Abid, Muhammad Haris Khan

Main category: cs.CV

TL;DR: PerSense is a training-free, model-agnostic one-shot framework for personalized instance segmentation in dense images, addressing challenges like occlusions and clutter. It introduces novel modules (IDM, PPSM) and a feedback mechanism, outperforming SOTA methods.

Details

Motivation: Challenges in dense scenarios (occlusions, scale variations, clutter) hinder precise instance segmentation, motivating the need for an advanced, adaptable solution.

Method: PerSense uses an Instance Detection Module (IDM) to generate point prompts from density maps (DMs), a Point Prompt Selection Module (PPSM) to filter false positives, and a feedback mechanism that improves DM accuracy by automating exemplar selection. It is model-agnostic and training-free.

Result: PerSense outperforms state-of-the-art methods in dense image segmentation, validated by extensive experiments.

Conclusion: PerSense advances dense instance segmentation with its innovative modules and feedback mechanism, supported by the new PerSense-D benchmark.

Abstract: The emergence of foundational models has significantly advanced segmentation approaches. However, challenges remain in dense scenarios, where occlusions, scale variations, and clutter impede precise instance delineation. To address this, we propose PerSense, an end-to-end, training-free, and model-agnostic one-shot framework for Personalized instance Segmentation in dense images. We start by developing a new baseline that automatically generates instance-level point prompts through a novel Instance Detection Module (IDM) that leverages density maps (DMs), which encapsulate the spatial distribution of objects in an image. To reduce false positives, we design the Point Prompt Selection Module (PPSM), which refines the output of IDM based on adaptive threshold and spatial gating. Both IDM and PPSM seamlessly integrate into our model-agnostic framework. Furthermore, we introduce a feedback mechanism that enables PerSense to improve the accuracy of DMs by automating the exemplar selection process for DM generation. Finally, to advance research in this relatively underexplored area, we introduce PerSense-D, an evaluation benchmark for instance segmentation in dense images. Our extensive experiments establish PerSense’s superiority over SOTA in dense settings.
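One generic way to turn a density map into candidate point prompts is thresholded local-maximum picking, sketched below. This is not the paper's IDM or PPSM, whose details the abstract does not give; the threshold here is simply scaled to the map's peak value, and the function name and `rel_threshold` parameter are hypothetical.

```python
from typing import List, Tuple


def point_prompts_from_density(dm: List[List[float]], rel_threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) of strict local maxima at or above rel_threshold * global peak."""
    rows, cols = len(dm), len(dm[0])
    cutoff = rel_threshold * max(max(row) for row in dm)

    points: List[Tuple[int, int]] = []
    for r in range(rows):
        for c in range(cols):
            value = dm[r][c]
            if value <= 0 or value < cutoff:
                continue  # too weak to be a confident instance location
            # Compare against the 8-connected neighbourhood, clipped at the borders.
            neighbours = [
                dm[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                points.append((r, c))
    return points


density_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.8, 0.1],
    [0.0, 0.0, 0.0, 0.1, 0.0],
]
print(point_prompts_from_density(density_map))  # [(1, 1), (3, 3)]
```

The weak bump at (2, 3) is rejected by the cutoff, which mirrors in a simplified way the role a selection stage plays in suppressing false positives before the prompts reach a segmentation model.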

[202] GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Hsin-Ying Lee

Main category: cs.CV

TL;DR: A novel method for 3D mesh reconstruction from multi-view images improves upon LRM by enhancing architecture, introducing differentiable mesh extraction, and refining textures, achieving state-of-the-art results.

Details

Motivation: To address shortcomings in existing 3D reconstruction models like LRM and improve reconstruction quality, especially for complex textures.

Method: Modifies LRM architecture for better multi-view representation, extracts meshes from NeRF fields, and introduces a lightweight texture refinement procedure.

Result: Achieves PSNR of 28.67 on GSO dataset, improves to 29.79 after refinement, and faithfully reconstructs complex textures like text.

Conclusion: The method advances 3D reconstruction quality and enables applications like text- or image-to-3D generation; per-instance refinement handles the complex textures that the feed-forward model alone struggles with.

Abstract: We propose a novel approach for 3D mesh reconstruction from multi-view images. Our method takes inspiration from large reconstruction models like LRM that use a transformer-based triplane generator and a Neural Radiance Field (NeRF) model trained on multi-view images. However, in our method, we introduce several important modifications that allow us to significantly enhance 3D reconstruction quality. First of all, we examine the original LRM architecture and find several shortcomings. Subsequently, we introduce respective modifications to the LRM architecture, which lead to improved multi-view image representation and more computationally efficient training. Second, in order to improve geometry reconstruction and enable supervision at full image resolution, we extract meshes from the NeRF field in a differentiable manner and fine-tune the NeRF model through mesh rendering. These modifications allow us to achieve state-of-the-art performance on both 2D and 3D evaluation metrics, such as a PSNR of 28.67 on Google Scanned Objects (GSO) dataset. Despite these superior results, our feed-forward model still struggles to reconstruct complex textures, such as text and portraits on assets. To address this, we introduce a lightweight per-instance texture refinement procedure. This procedure fine-tunes the triplane representation and the NeRF color estimation model on the mesh surface using the input multi-view images in just 4 seconds. This refinement improves the PSNR to 29.79 and achieves faithful reconstruction of complex textures, such as text. Additionally, our approach enables various downstream applications, including text- or image-to-3D generation.
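For readers unfamiliar with the metric, the sketch below computes PSNR under the standard definition, 10 * log10(MAX^2 / MSE), and converts the reported gain from 28.67 to 29.79 into the implied reduction in mean squared error. The abstract does not state the pixel range or how scores are averaged across views, so values in [0, 1] and a single flat pixel list are assumptions here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(psnr_before: float, psnr_after: float) -> float:
    """Ratio MSE_after / MSE_before implied by a PSNR change at a fixed max_value."""
    return 10.0 ** (-(psnr_after - psnr_before) / 10.0)


print(f"{psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9]):.2f} dB")  # 21.76 dB
print(f"{mse_ratio_for_gain(28.67, 29.79):.3f}")           # 0.773
```

Under these assumptions, the 1.12 dB improvement from texture refinement corresponds to roughly 23% lower mean squared error.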

[203] Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan

Main category: cs.CV

TL;DR: VisFactor benchmark reveals MLLMs struggle with basic visual reasoning tasks, scoring only 25.19/100, highlighting a gap in low-level visual cognition compared to humans.

Details

Motivation: To investigate why MLLMs fail at simple visual reasoning tasks humans solve effortlessly, despite progress on multimodal benchmarks.

Method: Introduced VisFactor, a benchmark with 20 vision-centric subtests from cognitive psychology, evaluating 20 MLLMs across four visual cognition domains.

Result: Best model scored 25.19/100, failing tasks like mental rotation and spatial relations, regardless of model size or prompting.

Conclusion: Current MLLMs lack human-like low-level visual cognition, challenging the idea that large-scale pretraining inherently develops perceptual capabilities.

Abstract: Despite significant progress on popular multimodal benchmarks, state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle with basic visual reasoning tasks that are trivially solved by humans, such as recognizing spatial relationships. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. These subtests span four core domains of human visual cognition: (1) Visualization and Spatial Processing, (2) Perceptual and Closure, (3) Memory, and (4) Reasoning. We evaluate 20 frontier MLLMs from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that current MLLM performance gains on high-level benchmarks do not reflect human-like low-level visual cognition, challenging the assumption that large-scale pretraining naturally induces gestalt-like perceptual capabilities. The dataset and evaluation toolkit are publicly available at: https://github.com/CUHK-ARISE/VisFactor.

[204] Interior Object Geometry via Fitted Frames

Stephen M. Pizer, Zhiyuan Liu, Junjie Zhao, Nicholas Tapp-Hughes, James Damon, Miaomiao Zhang, JS Marron, Mohsen Taheri, Jared Vicory

Main category: cs.CV

TL;DR: The paper introduces an alignment-free method for computing geometric features from fitted frames on object boundaries and interiors, enabling local correspondence across populations. It uses a skeletal representation and diffeomorphic deformation for modeling, showing improved classification performance for hippocampi shape analysis.

Details

Motivation: To develop a method for producing geometric features that ensure strong locational correspondence across object populations, particularly for anatomic objects, enhancing statistical analysis and classification.

Method: The approach involves fitting frames on object boundaries and interiors, using a skeletal representation and diffeomorphic deformation of an ellipsoid’s interior closure. The object is initially provided as a boundary mesh.

Result: The proposed method, called evolutionary s-rep, outperforms two state-of-the-art methods in classifying hippocampi shapes between individuals with a disorder and others.

Conclusion: The evolutionary s-rep provides a powerful representation for anatomic objects, improving geometric correspondence and classification performance, with potential applications in statistical shape analysis.

Abstract: We propose a means of computing fitted frames on the boundary and in the interior of objects and using them to provide the basis for producing geometric features from them that are not only alignment-free but most importantly can be made to correspond locally across a population of objects. We describe a representation targeted for anatomic objects which is designed to enable this strong locational correspondence within object populations and thus to provide powerful object statistics. It accomplishes this by understanding an object as the diffeomorphic deformation of the closure of the interior of an ellipsoid and by using a skeletal representation fitted throughout the deformation to produce a model of the target object, where the object is provided initially in the form of a boundary mesh. Via classification performance on hippocampi shape between individuals with a disorder vs. others, we compare our method to two state-of-the-art methods for producing object representations that are intended to capture geometric correspondence across a population of objects and to yield geometric features useful for statistics, and we show notably improved classification performance by this new representation, which we call the evolutionary s-rep. The geometric features that are derived from each of the representations, especially via fitted frames, are discussed.

[205] StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation

Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Main category: cs.CV

TL;DR: StitchFusion is a simple yet effective modal fusion framework for multimodal semantic segmentation, leveraging pre-trained models and a MultiAdapter module for cross-modal feature fusion, achieving state-of-the-art results with minimal added parameters.

Details

Motivation: Current methods for multimodal semantic segmentation lack input flexibility and increase training parameters due to specialized fusion modules. StitchFusion aims to address these limitations.

Method: The framework integrates pre-trained models as encoders and uses a MultiAdapter module for cross-modal information transfer, enabling multi-scale feature fusion.

Result: State-of-the-art performance on four datasets with minimal additional parameters. MultiAdapter complements existing Feature Fusion Modules (FFMs).

Conclusion: StitchFusion offers a flexible, efficient solution for multimodal semantic segmentation, enhancing accuracy without excessive parameter growth.

Abstract: Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. By leveraging MultiAdapter to propagate multi-scale information across pre-trained encoders during the encoding process, StitchFusion achieves multi-modal visual information integration during encoding. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MultiAdapter with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our code is available at StitchFusion_repo.

[206] ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert

Main category: cs.CV

TL;DR: REM is a framework for segmenting diverse video concepts using natural language, leveraging video diffusion models fine-tuned on small datasets. It shifts the model’s objective to predict mask latents, enabling accurate segmentation of rare and unseen objects, and generalizing to dynamic concepts like smoke. REM matches state-of-the-art in-domain and excels out-of-domain, benefiting from generative pre-training.

Details

Motivation: To enable segmentation of a wide range of video concepts described via natural language, including rare and unseen objects, and dynamic non-object concepts like smoke or raindrops.

Method: Fine-tunes video diffusion models on small-scale Referring Object Segmentation datasets, shifting their objective from noise prediction to mask latent prediction while preserving the generative architecture.

Result: Accurate segmentation of rare and unseen objects, generalization to dynamic concepts, and outperformance of state-of-the-art by up to 12 IoU points out-of-domain.

Conclusion: REM demonstrates the effectiveness of generative pre-training for video segmentation, showing that advancements in video generation directly enhance segmentation performance.

Abstract: We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method leverages the universal visual-language mapping learned by video diffusion models on Internet-scale data by fine-tuning them on small-scale Referring Object Segmentation datasets. Our key insight is to preserve the entirety of the generative model’s architecture by shifting its objective from predicting noise to predicting mask latents. The resulting model can accurately segment rare and unseen objects, despite only being trained on a limited set of categories. Additionally, it can effortlessly generalize to non-object dynamic concepts, such as smoke or raindrops, as demonstrated in our new benchmark for Referring Video Process Segmentation (Ref-VPS). REM performs on par with the state-of-the-art on in-domain datasets, like Ref-DAVIS, while outperforming them by up to 12 IoU points out-of-domain, leveraging the power of generative pre-training. We also show that advancements in video generation directly improve segmentation.

[207] DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu

Main category: cs.CV

TL;DR: DisCoRD bridges discrete and continuous motion generation by using rectified flow to decode tokens, achieving smoother, more natural motion without losing conditioning fidelity.

Details

Motivation: Addressing the limitations of discrete (noisy, less expressive) and continuous (struggles with conditioning) motion generation methods.

Method: Leverages rectified flow to decode discrete tokens into continuous motion, framing it as a conditional generation task.

Result: Achieves state-of-the-art performance with FID of 0.032 on HumanML3D and 0.169 on KIT-ML.

Conclusion: DisCoRD effectively combines discrete efficiency with continuous realism, offering a robust solution for motion generation.

Abstract: Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this ‘discord’ between discrete and continuous representations we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals on diverse settings. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Project website: https://whwjdqls.github.io/discord-motion/
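The decoding idea in this entry can be illustrated with a minimal sketch of rectified flow, assuming the standard formulation in which training pairs are joined by a straight path and sampling integrates a velocity field from t=0 to t=1. The token-to-pose table and the oracle velocity field below are hypothetical stand-ins for a learned, token-conditioned network; they are not taken from the paper.

```python
import random

# Hypothetical lookup standing in for a learned token-conditioned decoder:
# each discrete motion token maps to a 2-D "pose" in continuous space.
TOKEN_TO_POSE = {0: [0.0, 1.0], 1: [2.0, -1.0]}


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field on a straight path."""
    return [b - a for a, b in zip(x0, x1)]


def oracle_velocity(x, t, token):
    """Toy velocity field (assumption): points from x toward the token's pose."""
    target = TOKEN_TO_POSE[token]
    return [(p - xi) / (1.0 - t) for p, xi in zip(target, x)]


def euler_decode(velocity_fn, x0, token, steps=10):
    """Integrate dx/dt = v(x, t, token) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so the oracle never divides by zero
        v = velocity_fn(x, t, token)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
pose = euler_decode(oracle_velocity, noise, token=1, steps=10)
```

In a real system the oracle would be replaced by a network trained to regress `velocity_target` at points produced by `interpolate`, conditioned on the discrete tokens.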

[208] Viewpoint Consistency in 3D Generation via Attention and CLIP Guidance

Qing Zhang, Zehao Chen, Jinguang Tong, Jing Zhang, Jie Hong, Xuesong Li

Main category: cs.CV

TL;DR: The paper addresses the Janus Problem in text-to-3D generation by proposing ACG, a tuning-free method that improves viewpoint consistency and reduces geometric errors.

Details

Motivation: Current text-to-3D methods suffer from the Janus Problem due to viewpoint bias in diffusion models, leading to inconsistent 3D outputs.

Method: The proposed ACG mechanism adaptively controls cross-attention maps, uses CLIP-based filtering for viewpoints, and employs a coarse-to-fine optimization strategy with staged prompts.

Result: ACG significantly reduces the Janus Problem while maintaining generation speed, proving effective as a plug-and-play solution.

Conclusion: ACG offers an efficient and adaptable solution to viewpoint bias in text-to-3D generation, enhancing geometric consistency without tuning.

Abstract: Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.
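The CLIP-based filtering step can be sketched as follows, assuming that a rendered view is rejected when its image embedding is most similar to the text embedding of a view other than the expected one. The three-dimensional embeddings are toy values and the decision rule is an illustrative assumption; the abstract does not specify the exact criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def view_matches(image_emb, text_embs, expected_view):
    """Return (keep, sims): keep is True when the expected view scores highest."""
    sims = {view: cosine(image_emb, emb) for view, emb in text_embs.items()}
    best = max(sims, key=sims.get)
    return best == expected_view, sims


# Toy embeddings (hypothetical); real ones would come from a CLIP encoder.
text_embs = {
    "front view": [1.0, 0.0, 0.0],
    "side view": [0.0, 1.0, 0.0],
    "back view": [0.0, 0.0, 1.0],
}
rendered = [0.2, 0.1, 0.9]  # looks like a back view

keep, sims = view_matches(rendered, text_embs, expected_view="front view")
# keep is False here, so this render would be filtered out as an erroneous viewpoint.
```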

[209] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu

Main category: cs.CV

TL;DR: TokenFlow introduces a dual-codebook architecture to unify multimodal understanding and generation, outperforming existing methods in both tasks.

Details

Motivation: Bridging the gap between multimodal understanding and generation, which require different visual granularities, without compromising performance.

Method: Uses a dual-codebook architecture to decouple semantic and pixel-level feature learning, aligned via a shared mapping mechanism.

Result: Achieves 7.2% improvement in understanding over LLaVA-1.5 13B, strong FID score of 0.63 for reconstruction, and GenEval score of 0.55 for generation.

Conclusion: TokenFlow successfully unifies understanding and generation, setting new benchmarks in both domains.

Abstract: We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder for unifying these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access to both high-level semantic representations crucial for understanding tasks and fine-grained visual features essential for generation through shared indices. Our extensive experiments demonstrate TokenFlow’s superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving comparable results to SDXL.
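One way to picture a dual codebook with shared indices is the sketch below, which assumes the index is chosen by minimizing a weighted sum of the distances to a semantic codebook and a pixel-level codebook. The weighting `w`, the function names, and the toy codebooks are illustrative assumptions rather than details confirmed by the abstract.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def shared_index(sem_feat, pix_feat, sem_codebook, pix_codebook, w=1.0):
    """Pick one index for both codebooks by minimizing the joint distance."""
    costs = [
        sq_dist(sem_feat, s) + w * sq_dist(pix_feat, p)
        for s, p in zip(sem_codebook, pix_codebook)
    ]
    return min(range(len(costs)), key=costs.__getitem__)


# Toy codebooks (hypothetical): entries 1 and 2 share a semantic code
# but differ in their pixel-level code.
sem_codebook = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
pix_codebook = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]]

idx = shared_index([0.9, 0.0], [4.8, 5.1], sem_codebook, pix_codebook)
semantic_token = sem_codebook[idx]  # coarse feature for understanding
pixel_token = pix_codebook[idx]     # fine feature for generation
```

In this toy case the semantic feature alone cannot separate entries 1 and 2; the pixel-level distance breaks the tie, and the single index then retrieves both granularities.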

Shidan He, Lei Liu, Xiujun Shu, Bo Wang, Yuanhao Feng, Shen Zhao

Main category: cs.CV

TL;DR: AnomalyControl is a novel framework for anomaly synthesis that uses cross-modal semantic features to improve realism and generalization in generating abnormal samples.

Details

Motivation: Existing methods lack fine-grained descriptors for realistic anomalies, limiting realism and generalization.

Method: AnomalyControl uses cross-modal semantic modeling (CSM) and anomaly-semantic enhanced attention (ASEA) to guide synthesis with text-image prompts.

Result: Achieves state-of-the-art results in anomaly synthesis and downstream tasks.

Conclusion: AnomalyControl enhances realism and contextual relevance in anomaly synthesis, outperforming existing methods.

Abstract: Anomaly synthesis is a crucial approach to augment abnormal data for advancing anomaly inspection. Based on the knowledge from the large-scale pre-training, existing text-to-image anomaly synthesis methods predominantly focus on textual information or coarse-aligned visual features to guide the entire generation process. However, these methods often lack sufficient descriptors to capture the complicated characteristics of realistic anomalies (e.g., the fine-grained visual pattern of anomalies), limiting the realism and generalization of the generation process. To this end, we propose a novel anomaly synthesis framework called AnomalyControl to learn cross-modal semantic features as guidance signals, which could encode the generalized anomaly cues from text-image reference prompts and improve the realism of synthesized abnormal samples. Specifically, AnomalyControl adopts a flexible and non-matching prompt pair (i.e., a text-image reference prompt and a targeted text prompt), where a Cross-modal Semantic Modeling (CSM) module is designed to extract cross-modal semantic features from the textual and visual descriptors. Then, an Anomaly-Semantic Enhanced Attention (ASEA) mechanism is formulated to allow CSM to focus on the specific visual patterns of the anomaly, thus enhancing the realism and contextual relevance of the generated anomaly features. Treating cross-modal semantic features as the prior, a Semantic Guided Adapter (SGA) is designed to encode effective guidance signals for the adequate and controllable synthesis process. Extensive experiments indicate that AnomalyControl can achieve state-of-the-art results in anomaly synthesis compared with existing methods while exhibiting superior performance for downstream tasks.

[211] A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization

Chi-Jui Ho, Yash Belhe, Steve Rotenberg, Ravi Ramamoorthi, Tzu-Mao Li, Nicholas Antipa

Main category: cs.CV

TL;DR: The paper introduces a differentiable optics simulator for end-to-end optimization of computational imaging systems, addressing the challenge of modeling both aberration and diffraction in compound optics.

Details

Motivation: Existing methods compromise physical accuracy by neglecting wave optics effects or off-axis aberrations, raising concerns about design robustness.

Method: The authors propose a differentiable optics simulator that efficiently models aberration and diffraction, enabling joint optimization of optics and algorithms.

Result: Experiments show that lenses and algorithms adapt differently when wave optics is modeled, and systems optimized without wave optics degrade in performance when tested with wave optics effects.

Conclusion: Accurate wave optics modeling is crucial for robust, high-performance computational imaging system design.

Abstract: End-to-end optimization, which simultaneously optimizes optics and algorithms, has emerged as a powerful data-driven method for computational imaging system design. This method achieves joint optimization through backpropagation by incorporating differentiable optics simulators to generate measurements and algorithms to extract information from measurements. However, due to high computational costs, it is challenging to model both aberration and diffraction in light transport for end-to-end optimization of compound optics. Therefore, most existing methods compromise physical accuracy by neglecting wave optics effects or off-axis aberrations, which raises concerns about the robustness of the resulting designs. In this paper, we propose a differentiable optics simulator that efficiently models both aberration and diffraction for compound optics. Using the simulator, we conduct end-to-end optimization on scene reconstruction and classification. Experimental results demonstrate that both lenses and algorithms adopt different configurations depending on whether wave optics is modeled. We also show that systems optimized without wave optics suffer from performance degradation when wave optics effects are introduced during testing. These findings underscore the importance of accurate wave optics modeling in optimizing imaging systems for robust, high-performance applications.

[212] PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask

Jeongho Kim, Hoiyeong Jin, Sunghyun Park, Jaegul Choo

Main category: cs.CV

TL;DR: The paper introduces PromptDresser, a text-editable virtual try-on model that uses text prompts to modify clothing styles while preserving the original appearance, leveraging large multimodal models for detailed descriptions and adaptive inpainting masks.

Details

Motivation: To enhance virtual try-on by integrating text prompts for clothing style editing, addressing conflicts in existing methods, and improving image quality through detailed descriptions.

Method: Proposes PromptDresser, which uses large multimodal models (LMMs) for generating detailed text descriptions and adaptively adjusts inpainting masks based on text prompts.

Result: PromptDresser outperforms baselines, offering superior text-driven control and versatile clothing manipulation.

Conclusion: The approach effectively combines text and image data for high-quality virtual try-on, demonstrating the potential of LMMs in this domain.

Abstract: Recent virtual try-on approaches have advanced by finetuning pre-trained text-to-image diffusion models to leverage their powerful generative ability. However, the use of text prompts in virtual try-on remains underexplored. This paper tackles a text-editable virtual try-on task that modifies the clothing based on the provided clothing image while editing the wearing style (e.g., tucking style, fit) according to the text descriptions. In the text-editable virtual try-on, three key aspects exist: (i) designing rich text descriptions for paired person-clothing data to train the model, (ii) addressing the conflicts where textual information of the existing person’s clothing interferes with the generation of the new clothing, and (iii) adaptively adjusting the inpainting mask aligned with the text descriptions, ensuring proper editing areas while preserving the original person’s appearance irrelevant to the new clothing. To address these aspects, we propose PromptDresser, a text-editable virtual try-on model that leverages large multimodal model (LMM) assistance to enable high-quality and versatile manipulation based on generative text prompts. Our approach utilizes LMMs via in-context learning to generate detailed text descriptions for person and clothing images independently, including pose details and editing attributes using minimal human cost. Moreover, to ensure the editing areas, we adjust the inpainting mask depending on the text prompts adaptively. Our approach enhances text editability while effectively conveying clothing details that are difficult to capture through images alone, leading to improved image quality. Experiments show that PromptDresser significantly outperforms baselines, demonstrating superior text-driven control and versatile clothing manipulation. Our code is available at https://github.com/rlawjdghek/PromptDresser.

[213] MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies

Long Yang, Lianqing Zheng, Wenjin Ai, Minghao Liu, Sen Li, Qunshu Lin, Shengyu Yan, Jie Bai, Zhixiong Ma, Tao Huang, Xichan Zhu

Main category: cs.CV

TL;DR: MetaOcc is a multi-modal framework for 3D occupancy prediction using radar and images, featuring innovative modules for feature extraction and fusion, and a semi-supervised approach to reduce annotation costs.

Details

Motivation: Robust 3D occupancy prediction is crucial for autonomous driving, especially in adverse weather, but existing methods struggle with sensor fusion and annotation costs.

Method: MetaOcc uses a Radar Height Self-Attention module for radar data and a Hierarchical Multi-scale Multi-modal Fusion strategy for adaptive sensor fusion. A pseudo-label pipeline enables semi-supervised training.

Result: MetaOcc achieves state-of-the-art performance, improving metrics by +0.47 SC IoU and +4.02 mIoU on OmniHD-Scenes, and +1.16 SC IoU and +1.24 mIoU on SurroundOcc-nuScenes.

Conclusion: MetaOcc is scalable and robust, offering a practical solution for real-world autonomous systems with reduced annotation costs.

Abstract: Robust 3D occupancy prediction is essential for autonomous driving, particularly under adverse weather conditions where traditional vision-only systems struggle. While the fusion of surround-view 4D radar and cameras offers a promising low-cost solution, effectively extracting and integrating features from these heterogeneous sensors remains challenging. This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction that leverages both multi-view 4D radar and images. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module that enhances vertical spatial reasoning and feature extraction. Additionally, a Hierarchical Multi-scale Multi-modal Fusion strategy is developed to perform adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignments and enriching fused feature representations. To reduce reliance on expensive point cloud annotations, we further propose a pseudo-label generation pipeline based on an open-set segmentor. This enables a semi-supervised strategy that achieves 90% of the fully supervised performance using only 50% of the ground truth labels, offering an effective trade-off between annotation cost and accuracy. Extensive experiments demonstrate that MetaOcc under full supervision achieves state-of-the-art performance, outperforming previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset, and by +1.16 SC IoU and +1.24 mIoU on the SurroundOcc-nuScenes dataset. These results demonstrate the scalability and robustness of MetaOcc across sensor domains and training conditions, paving the way for practical deployment in real-world autonomous systems. Code and data are available at https://github.com/LucasYang567/MetaOcc.

[214] FullTransNet: Full Transformer with Local-Global Attention for Video Summarization

Libin Lan, Lu Jiang, Tianshu Yu, Xiaojuan Liu, Zhongshi He

Main category: cs.CV

TL;DR: The paper introduces FullTransNet, a transformer-like architecture for video summarization, addressing limitations in parallelism, long-range dependencies, and generative capabilities. It achieves superior performance on benchmark datasets with lower computational costs.

Details

Motivation: Existing video summarization methods using recurrent or convolutional neural networks, or encoder-only transformers, face issues with parallelism, long-range dependencies, and generative capabilities.

Method: Proposes FullTransNet, a full transformer with encoder-decoder structure, using local-global sparse attention at the encoder side to reduce computational costs while capturing long-range dependencies.

Result: Achieves F-scores of 54.4% (SumMe) and 63.9% (TVSum), outperforming second-best methods by 0.1% and 0.3%, respectively, with lower computational requirements.

Conclusion: FullTransNet is effective and efficient for video summarization, surpassing existing methods in performance and computational efficiency.

Abstract: Video summarization aims to generate a compact, informative, and representative synopsis of raw videos, which is crucial for browsing, analyzing, and understanding video content. Dominant approaches in video summarization primarily rely on recurrent or convolutional neural networks, and more recently on encoder-only transformer architectures. However, these methods typically suffer from several limitations in parallelism, modeling long-range dependencies, and providing explicit generative capabilities. To address these issues, we propose a transformer-like architecture named FullTransNet with two-fold ideas. First, it uses a full transformer with an encoder-decoder structure as an alternative architecture for video summarization. As the full transformer is specifically designed for sequence transduction tasks, its direct application to video summarization is both intuitive and effective. Second, it replaces the standard full attention mechanism with a combination of local and global sparse attention, enabling the model to capture long-range dependencies while significantly reducing computational costs. This local-global sparse attention is applied exclusively at the encoder side, where the majority of computations occur, further enhancing efficiency. Extensive experiments on two widely used benchmark datasets, SumMe and TVSum, demonstrate that our model achieves F-scores of 54.4% and 63.9%, respectively, while maintaining relatively low computational and memory requirements. These results surpass the second-best performing methods by 0.1% and 0.3%, respectively, verifying the effectiveness and efficiency of FullTransNet.
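A local-global sparse attention pattern of the kind described here can be sketched as a boolean mask, assuming a sliding window for local attention plus a few designated global positions that attend to, and are attended by, every frame. The window size and the choice of global positions below are illustrative assumptions; the abstract does not give the exact pattern.

```python
def local_global_mask(n, window, global_positions):
    """mask[i][j] is True when query i may attend to key j."""
    g = set(global_positions)
    return [
        [abs(i - j) <= window or i in g or j in g for j in range(n)]
        for i in range(n)
    ]


def attended_pairs(mask):
    """Number of query-key pairs that are actually computed."""
    return sum(sum(1 for allowed in row if allowed) for row in mask)


n_frames = 8
mask = local_global_mask(n_frames, window=1, global_positions=[0])
sparse_cost = attended_pairs(mask)  # pairs kept by the sparse pattern
full_cost = n_frames * n_frames     # pairs in standard full attention
```

With a fixed window and a fixed number of global positions, the number of kept pairs grows linearly in the sequence length, whereas full attention grows quadratically.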

[215] RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, Jun Zhu

Main category: cs.CV

TL;DR: RIFLEx improves long video generation by reducing the intrinsic frequency in positional embeddings, enabling 2x extrapolation without training and 3x extrapolation with minimal fine-tuning.

Details

Motivation: Generating longer videos with temporal coherence is challenging due to repetition or motion deceleration in existing methods.

Method: Analyzes frequency components in positional embeddings, identifies intrinsic frequency, and proposes RIFLEx to reduce it for better extrapolation.

Result: Achieves high-quality 2x extrapolation training-free and 3x with minimal fine-tuning, enhancing video quality.

Conclusion: RIFLEx effectively addresses long video generation challenges by leveraging frequency insights, offering a simple yet powerful solution.

Abstract: Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch: achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3x extrapolation with minimal fine-tuning without long videos. Project page and codes: https://riflex-video.github.io/.
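The frequency reduction can be sketched for rotary-style positional embeddings under two loudly stated assumptions: the intrinsic component is taken to be the one whose period is closest to the training length, and its frequency is lowered just enough that a single period covers the extrapolated length. Both rules, and all function names, are illustrative and may differ from the authors' procedure.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard rotary frequencies: base ** (-2j / dim) for each pair j."""
    return [base ** (-2.0 * j / dim) for j in range(dim // 2)]


def intrinsic_index(freqs, train_len):
    """Assumption: the intrinsic component has the period closest to train_len."""
    periods = [2.0 * math.pi / f for f in freqs]
    return min(range(len(freqs)), key=lambda j: abs(periods[j] - train_len))


def lower_intrinsic_frequency(freqs, train_len, target_len):
    """Lower one frequency so its period spans at least target_len positions."""
    k = intrinsic_index(freqs, train_len)
    adjusted = list(freqs)
    adjusted[k] = min(adjusted[k], 2.0 * math.pi / target_len)
    return adjusted, k


freqs = rope_frequencies(dim=16)
adjusted, k = lower_intrinsic_frequency(freqs, train_len=50, target_len=100)
new_period = 2.0 * math.pi / adjusted[k]
```

Only one component changes; all other frequencies are left as they were, which is consistent with the abstract's claim that no additional modifications are required.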

[216] Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving

Enming Zhang, Peizhe Gong, Xingyuan Dai, Min Huang, Yisheng Lv, Qinghai Miao

Main category: cs.CV

TL;DR: SCD-Bench is a new framework for evaluating safety cognition in vision-language models (VLMs) for autonomous driving, featuring a semi-automated labeling system (ADA) and an automated assessment pipeline. The SCD-Training dataset improves model performance on safety and general benchmarks.

Details

Motivation: Existing research lacks focus on safety-critical evaluation for VLMs in autonomous driving, necessitating a dedicated framework.

Method: Introduces SCD-Bench for safety evaluation, ADA for scalable data annotation, and an automated assessment pipeline using large language models. Also creates SCD-Training, a large-scale dataset.

Result: Models trained on SCD-Training show significant improvements on SCD-Bench and other benchmarks, with the assessment pipeline achieving 98% agreement with human experts.

Conclusion: SCD-Bench and SCD-Training provide a robust approach to enhance safety cognition in VLMs for autonomous driving, improving both safety and general performance.

Abstract: Ensuring the safety of vision-language models (VLMs) in autonomous driving systems is of paramount importance, yet existing research has largely focused on conventional benchmarks rather than safety-critical evaluation. In this work, we present SCD-Bench (Safety Cognition Driving Benchmark), a novel framework specifically designed to assess the safety cognition capabilities of VLMs within interactive driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving Annotation), a semi-automated labeling system, further refined through expert review by professionals with domain-specific knowledge in autonomous driving. To facilitate scalable and consistent evaluation, we also propose an automated assessment pipeline leveraging large language models, which demonstrates over 98% agreement with human expert judgments. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task, comprising 324.35K high-quality samples. Through extensive experiments, we show that models trained on SCD-Training exhibit marked improvements not only on SCD-Bench, but also on general and domain-specific benchmarks, offering a new perspective on enhancing safety-aware interactions in vision-language systems for autonomous driving.

[217] Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection

Haotian Ma, Lin Gu, Siyi Wu, Yingying Zhu

Main category: cs.CV

TL;DR: The paper addresses privacy leakage in 3D point clouds, proposing PointFlowGMM, a framework for privacy-preserving classification and segmentation without accessing original data.

Details

Motivation: Privacy concerns in 3D point clouds, unlike 2D images, are understudied and require unique solutions due to their texture-less, geometry-focused nature.

Method: Uses a flow-based generative model to project point clouds into a latent Gaussian mixture subspace, employs angular similarity loss for obfuscation, and reduces model size. Random orthogonal rotation further protects geometry while preserving class relationships.

Result: Achieved comparable recognition performance on encrypted point clouds vs. original ones, with model size reduced from 767MB to 120MB.

Conclusion: PointFlowGMM effectively protects 3D point cloud privacy while maintaining utility for downstream tasks.

Abstract: 3D point clouds have been widely used in applications such as self-driving cars, robotics, and CAD models. These applications raise the issue of privacy leakage in 3D point clouds, which, to the best of our knowledge, has not been well studied. Unlike 2D image privacy, which is related to texture and 2D geometric structure, the 3D point cloud is texture-less and only relevant to 3D geometric structure. In this work, we define the 3D point cloud privacy problem and propose an efficient privacy-preserving framework named PointFlowGMM that can support downstream classification and segmentation tasks without seeing the original data. Using a flow-based generative model, the point cloud is projected into a latent Gaussian mixture distributed subspace. We further design a novel angular similarity loss to obfuscate the original geometric structure and reduce the model size from 767MB to 120MB without a decrease in recognition performance. The projected point cloud in the latent space is orthogonally rotated at random to further protect the original geometric structure. Because the class-to-class relationship is preserved after rotation, the protected point cloud can still support the recognition task. We evaluated our model on multiple datasets and achieved comparable recognition results on encrypted point clouds compared to the original point clouds.
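
To illustrate why a random orthogonal rotation can hide latent coordinates while keeping angular relationships intact, here is a minimal sketch. It is not the paper's code: the helper names, the Gram-Schmidt construction, and the 4-dimensional toy latents are all assumptions made for illustration.

```python
import math
import random

def random_orthogonal(dim, rng):
    """Build a random orthogonal matrix (orthonormal rows) via Gram-Schmidt."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along the rows accepted so far.
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip (vanishingly unlikely) degenerate draws
            rows.append([a / norm for a in v])
    return rows

def rotate(matrix, vec):
    """Apply the matrix to a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

rng = random.Random(0)
Q = random_orthogonal(4, rng)       # hypothetical secret rotation
z1 = [1.0, 2.0, 0.5, -1.0]          # toy latent vectors (assumed values)
z2 = [0.3, -0.7, 2.0, 1.1]
before = cosine(z1, z2)
after = cosine(rotate(Q, z1), rotate(Q, z2))
```

The individual coordinates change under `Q`, but `before` and `after` agree up to floating-point error, which is the property that lets a downstream classifier keep working on rotated latents.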

[218] Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness

Yu Feng, Dingxin Zhang, Runkai Zhao, Yong Xia, Heng Huang, Weidong Cai

Main category: cs.CV

TL;DR: The paper introduces SPBA, a patch-wise backdoor attack for 3D point clouds, improving stealthiness and efficiency over existing methods.

Details

Motivation: Existing 3D point cloud backdoor attacks lack imperceptibility and are computationally expensive. SPBA addresses these issues.

Method: SPBA decomposes point clouds into patches, uses curvature-based scores for trigger injection, and optimizes a unified patch-wise trigger.

Result: SPBA outperforms prior attacks in effectiveness and resistance to defenses, as shown on ModelNet40 and ShapeNetPart.

Conclusion: SPBA is a highly effective and stealthy backdoor attack framework for 3D point clouds.

Abstract: Backdoor attacks pose a severe threat to deep neural networks (DNNs) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Existing 3D point cloud backdoor attacks primarily rely on sample-wise global modifications, which suffer from low imperceptibility. Although optimization can improve stealthiness, optimizing sample-wise triggers significantly increases computational cost. To address these limitations, we propose the Stealthy Patch-Wise Backdoor Attack (SPBA), the first patch-wise backdoor attack framework for 3D point clouds. Specifically, SPBA decomposes point clouds into local patches and employs a curvature-based imperceptibility score to guide trigger injection into visually less sensitive patches. By optimizing a unified patch-wise trigger that perturbs spectral features of selected patches, SPBA significantly enhances optimization efficiency while maintaining high stealthiness. Extensive experiments on ModelNet40 and ShapeNetPart further demonstrate that SPBA surpasses prior state-of-the-art backdoor attacks in both attack effectiveness and resistance to defense methods.

[219] CM-Diff: A Single Generative Network for Bidirectional Cross-Modality Translation Diffusion Model Between Infrared and Visible Images

Bin Hu, Chenqiang Gao, Shurui Liu, Junjie Guo, Fang Chen, Fangcen Liu, Junwei Han

Main category: cs.CV

TL;DR: A bidirectional cross-modality translation diffusion model (CM-Diff) is proposed for infrared and visible image translation, outperforming existing methods.

Details

Motivation: Existing methods for infrared and visible image translation are either unidirectional or rely on cycle consistency, leading to suboptimal performance.

Method: CM-Diff combines translation direction labels and cross-modality feature control, using Bidirectional Diffusion Training (BDT) and Statistical Constraint Inference (SCI).

Result: CM-Diff outperforms state-of-the-art methods, demonstrating superior performance in generating dual-modality datasets.

Conclusion: The proposed CM-Diff model effectively addresses bidirectional translation challenges, offering potential for enhanced dataset generation.

Abstract: Image translation is one of the crucial approaches for mitigating information deficiencies in the infrared and visible modalities, while also facilitating the enhancement of modality-specific datasets. However, existing methods for infrared and visible image translation either achieve unidirectional modality translation or rely on cycle consistency for bidirectional modality translation, which may result in suboptimal performance. In this work, we present the bidirectional cross-modality translation diffusion model (CM-Diff) for simultaneously modeling data distributions in both the infrared and visible modalities. We address this challenge by combining translation direction labels for guidance during training with cross-modality feature control. Specifically, we view the establishment of the mapping relationship between the two modalities as the process of learning data distributions and understanding modality differences, achieved through a novel Bidirectional Diffusion Training (BDT). Additionally, we propose a Statistical Constraint Inference (SCI) to ensure the generated image closely adheres to the data distribution of the target modality. Experimental results demonstrate the superiority of our CM-Diff over state-of-the-art methods, highlighting its potential for generating dual-modality datasets.

[220] TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs

Yunxiao Wang, Meng Liu, Wenqi Liu, Xuemeng Song, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Guorui Zhou, Liqiang Nie

Main category: cs.CV

TL;DR: The paper introduces a method to improve temporal understanding in video large language models (LLMs) through a dedicated instruction dataset and multi-task prompt fine-tuning, avoiding costly annotations. A new benchmark is also developed for accurate evaluation.

Details

Motivation: Current video LLMs lack optimal temporal understanding, limiting their performance in tasks like video question answering.

Method: A dedicated instruction fine-tuning dataset is curated, and a multi-task prompt fine-tuning approach is introduced to integrate temporal-sensitive tasks without additional annotations. A novel benchmark is developed for comprehensive evaluation.

Result: The approach significantly enhances temporal understanding in video LLMs and avoids reliance on shortcuts.

Conclusion: The proposed method effectively improves temporal comprehension in video LLMs while minimizing annotation costs and ensuring robust evaluation.

Abstract: Video large language models have achieved remarkable performance in tasks such as video question answering; however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.

[221] Learning Disease State from Noisy Ordinal Disease Progression Labels

Gustav Schmidt, Holger Heidrich, Philipp Berens, Sarah Müller

Main category: cs.CV

TL;DR: The paper explores using noisy ordinal labels (better, worse, stable) to learn a disease representation for nAMD, achieving strong few-shot performance for activity classification.

Details

Motivation: To address the challenge of learning from noisy ordinal labels in medical imaging, particularly for modeling disease progression in nAMD.

Method: Proposes a classification task with ordinal ranks, using independent image encoding, antisymmetric logit space equivariance, ordinal scale awareness, and uncertainty-based loss re-weighting.

Result: Learns an interpretable disease representation enabling strong few-shot performance for nAMD activity classification.

Conclusion: The approach effectively leverages noisy ordinal labels to model disease progression and generalize to related tasks.

Abstract: Learning from noisy ordinal labels is a key challenge in medical imaging. In this work, we ask whether ordinal disease progression labels (better, worse, or stable) can be used to learn a representation allowing to classify disease state. For neovascular age-related macular degeneration (nAMD), we cast the problem of modeling disease progression between medical visits as a classification task with ordinal ranks. To enhance generalization, we tailor our model to the problem setting by (1) independent image encoding, (2) antisymmetric logit space equivariance, and (3) ordinal scale awareness. In addition, we address label noise by learning an uncertainty estimate for loss re-weighting. Our approach learns an interpretable disease representation enabling strong few-shot performance for the related task of nAMD activity classification from single images, despite being trained only on image pairs with ordinal disease progression labels.
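
One way to picture how independent image encoding, antisymmetry, and ordinal scale awareness fit together is the sketch below. It assumes each image is mapped on its own to a scalar severity score, takes the score difference as the pair logit, and uses an ordered-logit link with symmetric thresholds. The function names, the threshold value, and the sigmoid link are illustrative assumptions, not the authors' architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def progression_probs(score_first, score_second, threshold=1.0):
    """Ordinal probabilities for (better, stable, worse) between two visits.

    Each score is assumed to come from encoding one image independently.
    The pair logit d is a score difference, so swapping the visits negates it.
    """
    d = score_second - score_first
    p_better = sigmoid(-threshold - d)   # second visit clearly less severe
    p_worse = sigmoid(d - threshold)     # second visit clearly more severe
    p_stable = 1.0 - p_better - p_worse  # remaining mass between the thresholds
    return {"better": p_better, "stable": p_stable, "worse": p_worse}

forward = progression_probs(0.2, 1.9)   # toy scores (assumed values)
backward = progression_probs(1.9, 0.2)  # same pair, visit order swapped
```

Swapping the visit order exchanges the "better" and "worse" probabilities and leaves "stable" unchanged, so the model cannot contradict itself on a reversed pair. The per-image score is also what a single-image classifier could reuse afterwards.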

[222] MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang

Main category: cs.CV

TL;DR: MotionStreamer introduces a framework for text-conditioned streaming motion generation, overcoming limitations of existing methods like diffusion models and GPT-based approaches by using a continuous causal latent space.

Details

Motivation: Existing methods for text-conditioned motion generation face issues like pre-defined motion lengths, delayed responses, and error accumulation. MotionStreamer aims to solve these problems.

Method: The proposed framework, MotionStreamer, integrates a continuous causal latent space into a probabilistic autoregressive model to reduce information loss and error accumulation.

Result: Experiments demonstrate MotionStreamer outperforms existing methods, enabling applications like multi-round generation, long-term generation, and dynamic motion composition.

Conclusion: MotionStreamer effectively addresses the challenges of streaming motion generation, offering superior performance and broader applicability.

Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/

[223] Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Tiange Xiang, Kai Li, Chengjiang Long, Christian Häne, Peihong Guo, Scott Delp, Ehsan Adeli, Li Fei-Fei

Main category: cs.CV

TL;DR: The paper proposes using pre-trained 2D diffusion models for 3D object generation by introducing Gaussian Atlas and a dataset, GaussianVerse, to bridge the gap between 2D and 3D modeling.

Details

Motivation: The scarcity of high-quality 3D data hinders 3D diffusion models' performance compared to 2D models.

Method: Repurposes pre-trained 2D diffusion models with Gaussian Atlas, a dense 2D grid representation, and uses the GaussianVerse dataset for training.

Result: Successful transfer learning from 2D to 3D, enabling effective 3D content generation.

Conclusion: Text-to-image diffusion models can be effectively adapted for 3D content generation, helping bridge the gap between 2D and 3D modeling.

Abstract: Recent advances in text-to-image diffusion models have been driven by the increasing availability of paired 2D data. However, the development of 3D diffusion models has been hindered by the scarcity of high-quality 3D data, resulting in less competitive performance compared to their 2D counterparts. To address this challenge, we propose repurposing pre-trained 2D diffusion models for 3D object generation. We introduce Gaussian Atlas, a novel representation that utilizes dense 2D grids, enabling the fine-tuning of 2D diffusion models to generate 3D Gaussians. Our approach demonstrates successful transfer learning from a pre-trained 2D diffusion model to a 2D manifold flattened from 3D structures. To support model training, we compile GaussianVerse, a large-scale dataset comprising 205K high-quality 3D Gaussian fittings of various 3D objects. Our experimental results show that text-to-image diffusion models can be effectively adapted for 3D content generation, bridging the gap between 2D and 3D modeling.

[224] Follow-Your-Color: Multi-Instance Sketch Colorization

Yinhan Zhang, Yue Ma, Bingyuan Wang, Qifeng Chen, Zeyu Wang

Main category: cs.CV

TL;DR: Follow-Your-Color is a diffusion-based framework for multi-instance sketch colorization, automating the process with high precision and eliminating manual adjustments.

Details

Motivation: Current methods are inefficient and inaccurate for multi-instance colorization, and generative methods struggle due to data collection challenges.

Method: Uses self-play training, an instance guider, and fine-grained color matching with edge loss to achieve precise colorization in a single forward pass.

Result: Outperforms existing methods in chromatic precision and automates colorization with zero manual adjustments.

Conclusion: The framework enables novice users to produce consistent, high-quality artwork efficiently.

Abstract: We present Follow-Your-Color, a diffusion-based framework for multi-instance sketch colorization. The production of multi-instance 2D line art colorization adheres to an industry-standard workflow, which consists of three crucial stages: the design of line art characters, the coloring of individual objects, and the refinement process. The artists are required to repeat the process of coloring each instance one by one, which is inaccurate and inefficient. Meanwhile, current generative methods fail to solve this task due to the challenge of multi-instance pair data collection. To tackle these challenges, we incorporate three technical designs to ensure precise character detail transcription and achieve multi-instance sketch colorization in a single forward pass. Specifically, we first propose the self-play training strategy to address the lack of training data. Then we introduce an instance guider to feed the color of the instance. To achieve accurate color matching, we present fine-grained color matching with edge loss to enhance visual quality. Equipped with the proposed modules, Follow-Your-Color enables automatically transforming sketches into vividly-colored images with accurate consistency and multi-instance control. Experiments on our collected datasets show that our model outperforms existing methods regarding chromatic precision. Specifically, our model critically automates the colorization process with zero manual adjustments, so novice users can produce stylistically consistent artwork by providing reference instances and the original line art. Our code and additional details are available at https://yinhan-zhang.github.io/color.

[225] GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

Christophe Bolduc, Yannick Hold-Geoffroy, Zhixin Shu, Jean-François Lalonde

Main category: cs.CV

TL;DR: GaSLight generates spatially-varying lighting from regular images using HDR Gaussian Splats, achieving state-of-the-art results in HDR estimation and virtual illumination.

Details

Motivation: To enable regular images to serve as light sources in 3D rendering, addressing the lack of methods for spatially-varying lighting from such images.

Method: A two-stage process: 1) Enhancing dynamic range of images using diffusion models, 2) Modeling 3D lighting with Gaussian Splats for spatial variation.

Result: State-of-the-art performance in HDR estimation and virtual illumination, evaluated on the new dataset combined with an existing dataset from the literature.

Conclusion: GaSLight successfully bridges the gap between regular images and 3D lighting, introducing a novel dataset for benchmarking.

Abstract: We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially variant lighting. Our approach yields state-of-the-art results on HDR estimations and their applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR to evaluate images as light sources. We assess our method using a combination of this novel dataset and an existing dataset from the literature. Project page: https://lvsn.github.io/gaslight/

[226] Cross-Image Contrastive Decoding: Precise, Lossless Suppression of Language Priors in Large Vision-Language Models

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng

Main category: cs.CV

TL;DR: The paper introduces Cross-Image Contrastive Decoding (CICD), a training-free method to reduce hallucinations in Large Vision-Language Models by using unrelated images as contrastive inputs and dynamically suppressing language priors.

Details

Motivation: Over-reliance on language priors in LVLMs causes visually inconsistent outputs. Existing contrastive decoding methods distort distributions and suppress priors excessively.

Method: Proposes CICD, using unrelated images for contrastive inputs and a dynamic selection mechanism to balance language prior suppression.

Result: CICD effectively reduces hallucinations without degrading model performance, validated across benchmarks and LVLMs.

Conclusion: CICD is a simple, effective solution for mitigating hallucinations in LVLMs, particularly in tasks like image captioning.

Abstract: Over-reliance on language priors is a major cause of hallucinations in Large Vision-Language Models (LVLMs), often leading to outputs that are linguistically plausible but visually inconsistent. Recent studies have explored contrastive decoding as a training-free solution. However, these methods typically construct contrastive visual inputs by perturbing the original image, resulting in distorted contrastive distributions, incomplete contrastive signals, and excessive suppression of language priors. Motivated by the observation that language priors tend to remain consistent across different images, we propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method that uses unrelated images as contrastive visual inputs. To address the issue of over-suppressing language priors, which can negatively affect the quality of generated responses, we further introduce a dynamic selection mechanism based on the cross-image differences in model behavior. By selectively suppressing language priors, our method reduces hallucinations without compromising the model’s performance. Extensive experiments across multiple benchmarks and LVLMs confirm the effectiveness and generalizability of CICD, particularly in image captioning, where language priors are especially dominant.
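
The core contrast step can be sketched as follows, using the generic contrastive-decoding form in which logits from a contrastive input are subtracted from the target logits. The formula, the `alpha` weight, and the toy vocabulary are assumptions for illustration. The paper's dynamic selection mechanism is not reproduced here because the summary does not specify it.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(logits_target, logits_unrelated, alpha=1.0):
    """Amplify what the target image supports; damp what any image would produce."""
    return [(1.0 + alpha) * t - alpha * u
            for t, u in zip(logits_target, logits_unrelated)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

vocab = ["dog", "frisbee", "grass"]   # toy vocabulary (assumed)
logits_target = [2.0, 1.5, 2.2]       # next-token logits given the real image
logits_unrelated = [0.5, 0.2, 2.1]    # same prompt, unrelated image
adjusted = contrastive_logits(logits_target, logits_unrelated)
probs = softmax(adjusted)

plain_choice = vocab[argmax(logits_target)]
contrast_choice = vocab[argmax(adjusted)]
```

In this toy case "grass" scores almost as high with an unrelated image as with the real one, which suggests it is driven by the language prior. After the contrast step, the visually supported token "dog" wins instead.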

[227] EarthSynth: Generating Informative Earth Observation with Diffusion Models

Jiancheng Pan, Shiye Lei, Yuqian Fu, Jiahao Li, Yanxing Liu, Yuze Sun, Xiao He, Long Peng, Xiaomeng Huang, Bo Zhao

Main category: cs.CV

TL;DR: EarthSynth is a diffusion-based generative model for synthesizing labeled remote sensing images to address data scarcity in RSI interpretation tasks.

Details

Motivation: The scarcity of labeled data in remote sensing image interpretation limits performance, prompting the need for a generative solution.

Method: EarthSynth uses a diffusion-based approach, trained on EarthSynth-180K, with Counterfactual Composition and R-Filter for data diversity and quality.

Result: Significant improvements in open-vocabulary tasks like scene classification, object detection, and semantic segmentation.

Conclusion: EarthSynth offers a practical solution for advancing RSI interpretation by addressing data scarcity and improving generalization.

Abstract: Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion-based generative foundation model that enables synthesizing multi-category, cross-satellite labeled Earth observation for downstream RSI interpretation tasks. To the best of our knowledge, EarthSynth is the first to explore multi-task generation for remote sensing, tackling the challenge of limited generalization in task-oriented synthesis for RSI interpretation. EarthSynth, trained on the EarthSynth-180K dataset, employs the Counterfactual Composition training strategy with a three-dimensional batch-sample selection mechanism to improve training data diversity and enhance category control. Furthermore, a rule-based method of R-Filter is proposed to filter more informative synthetic data for downstream tasks. We evaluate our EarthSynth on scene classification, object detection, and semantic segmentation in open-world scenarios. There are significant improvements in open-vocabulary understanding tasks, offering a practical solution for advancing RSI interpretation.

[228] WeatherEdit: Controllable Weather Editing with 4D Gaussian Field

Chenghao Qian, Wenjing Li, Yuhu Guo, Gustav Markkula

Main category: cs.CV

TL;DR: WeatherEdit is a pipeline for generating realistic, controllable weather effects in 3D scenes, combining background editing and particle construction.

Details

Motivation: To enable flexible and realistic weather simulation for applications like autonomous driving.

Method: Uses a pretrained diffusion model with an all-in-one adapter for 2D background editing and a 4D Gaussian field for 3D particle construction.

Result: Generates diverse, controllable weather effects with realistic dynamics.

Conclusion: WeatherEdit is effective for autonomous driving simulation in adverse weather.

Abstract: In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pretrained diffusion model, enabling the generation of diverse weather effects in 2D image backgrounds. During inference, we design a Temporal-View (TV-) attention mechanism that follows a specific order to aggregate temporal and spatial information, ensuring consistent editing across multi-frame and multi-view images. To construct the weather particles, we first reconstruct a 3D scene using the edited images and then introduce a dynamic 4D Gaussian field to generate snowflakes, raindrops and fog in the scene. The attributes and dynamics of these particles are precisely controlled through physical-based modelling and simulation, ensuring realistic weather representation and flexible severity adjustments. Finally, we integrate the 4D Gaussian field with the 3D scene to render consistent and highly realistic weather effects. Experiments on multiple driving datasets demonstrate that WeatherEdit can generate diverse weather effects with controllable condition severity, highlighting its potential for autonomous driving simulation in adverse weather. See project page: https://jumponthemoon.github.io/w-edit

[229] DSOcc: Leveraging Depth Awareness and Semantic Aid to Boost Camera-Based 3D Semantic Occupancy Prediction

Naiyu Fang, Zheyuan Zhou, Kang Wang, Ruibo Li, Lemiao Qiu, Shuyou Zhang, Zhe Wang, Guosheng Lin

Main category: cs.CV

TL;DR: DSOcc improves camera-based 3D semantic occupancy prediction by integrating depth awareness and semantic aid, achieving state-of-the-art results among camera-based methods on SemanticKITTI.

Details

Motivation: Existing methods suffer from incorrect feature assignments and limited learning due to insufficient samples, prompting the need for a more robust solution.

Method: DSOcc jointly infers occupancy state and class using soft occupancy confidence (non-learning) and fuses semantic segmentation from multiple frames to aid inference.

Result: DSOcc achieves state-of-the-art performance among camera-based methods on the SemanticKITTI dataset.

Conclusion: The proposed method effectively addresses challenges in 3D semantic occupancy prediction, demonstrating superior performance.

Abstract: Camera-based 3D semantic occupancy prediction offers an efficient and cost-effective solution for perceiving surrounding scenes in autonomous driving. However, existing works rely on explicit occupancy state inference, leading to numerous incorrect feature assignments, and insufficient samples restrict the learning of occupancy class inference. To address these challenges, we propose leveraging Depth awareness and Semantic aid to boost camera-based 3D semantic Occupancy prediction (DSOcc). We jointly perform occupancy state and occupancy class inference, where soft occupancy confidence is calculated by a non-learning method and multiplied with image features to make voxels aware of depth, enabling adaptive implicit occupancy state inference. Instead of enhancing feature learning, we directly utilize well-trained image semantic segmentation and fuse multiple frames with their occupancy probabilities to aid occupancy class inference, thereby enhancing robustness. Experimental results demonstrate that DSOcc achieves state-of-the-art performance on the SemanticKITTI dataset among camera-based methods.
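
The idea of multiplying image features by a soft, non-learned occupancy confidence can be pictured with the sketch below. The abstract does not say how the confidence is computed, so the Gaussian weighting around an estimated depth, the `sigma` parameter, and all function names are assumptions chosen purely for illustration.

```python
import math

def soft_occupancy_confidence(voxel_depth, estimated_depth, sigma=1.0):
    """Hypothetical non-learned confidence: high when the voxel lies near the
    estimated surface depth along the camera ray, decaying with distance."""
    return math.exp(-((voxel_depth - estimated_depth) ** 2) / (2.0 * sigma ** 2))

def depth_aware_features(pixel_feature, voxel_depths, estimated_depth, sigma=1.0):
    """Copy one pixel's feature to every voxel on its ray, scaled by confidence."""
    out = []
    for d in voxel_depths:
        w = soft_occupancy_confidence(d, estimated_depth, sigma)
        out.append([w * f for f in pixel_feature])
    return out

# One pixel feature lifted to three voxels at depths 2 m, 5 m, and 8 m,
# with an assumed depth estimate of 5 m for that pixel.
features = depth_aware_features([1.0, -2.0], [2.0, 5.0, 8.0], estimated_depth=5.0)
```

The voxel at the estimated depth keeps the full feature, while voxels far in front of or behind it receive a heavily attenuated copy. This is a soft alternative to a hard occupied/empty assignment.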

[230] MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation

Dongjie Fu, Tengjiao Sun, Pengcheng Fang, Xiaohao Cai, Hansung Kim

Main category: cs.CV

TL;DR: MOGO introduces an efficient autoregressive framework for real-time 3D motion generation, combining MoSA-VQ for compact motion representation and RQHC-Transformer for low-latency token generation, achieving generation quality competitive with or superior to state-of-the-art transformer-based methods while improving real-time performance.

Details

Motivation: Addressing the challenge of achieving high fidelity, real-time responsiveness, and scalability in transformer-based text-to-motion generation.

Method: Proposes MOGO with MoSA-VQ for hierarchical motion discretization and RQHC-Transformer for efficient token generation, plus a text condition alignment mechanism.

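Result: Achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods on HumanML3D, KIT-ML, and CMP, with substantial improvements in real-time performance, streaming generation, and zero-shot generalization.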

Conclusion: MOGO successfully balances quality and efficiency, advancing real-time motion generation capabilities.

Abstract: Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.
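
For readers unfamiliar with residual vector quantization, the sketch below shows the plain technique that modules like MoSA-VQ build on: each layer quantizes whatever the previous layers left unexplained. The codebooks and helper names are toy assumptions. The learnable scaling described in the abstract is not reproduced.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy two-layer codebooks (assumed values): a coarse layer and a fine layer.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
tokens, leftover = rvq_encode([1.2, 0.8], [coarse, fine])
decoded = rvq_decode(tokens, [coarse, fine])
```

Each motion frame thus becomes one token per layer. A model that emits all layers' tokens in a single forward pass avoids decoding the layers one after another, which is where the abstract's latency reduction comes from.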

[231] SteerPose: Simultaneous Extrinsic Camera Calibration and Matching from Articulation

Sang-Eun Lee, Ko Nishino, Shohei Nobuhara

Main category: cs.CV

TL;DR: SteerPose is a neural network that performs 2D pose rotation for multi-camera calibration and correspondence search, validated on diverse datasets.

Details

Motivation: Humans can mentally align 2D poses across views, inspiring a method to automate this for multi-camera systems.

Method: SteerPose integrates differentiable matching and a geometric consistency loss for camera calibration and correspondence.

Result: Effective and robust performance on in-the-wild datasets, enabling 3D pose reconstruction of novel animals.

Conclusion: SteerPose offers a unified framework for calibration and correspondence, with potential for broader applications.

Abstract: Can freely moving humans or animals themselves serve as calibration targets for multi-camera systems while simultaneously estimating their correspondences across views? We humans can solve this problem by mentally rotating the observed 2D poses and aligning them with those in the target views. Inspired by this cognitive ability, we propose SteerPose, a neural network that performs this rotation of 2D poses into another view. By integrating differentiable matching, SteerPose simultaneously performs extrinsic camera calibration and correspondence search within a single unified framework. We also introduce a novel geometric consistency loss that explicitly ensures that the estimated rotation and correspondences result in a valid translation estimation. Experimental results on diverse in-the-wild datasets of humans and animals validate the effectiveness and robustness of the proposed method. Furthermore, we demonstrate that our method can reconstruct the 3D poses of novel animals in multi-camera setups by leveraging off-the-shelf 2D pose estimators and our class-agnostic model.

[232] CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

Byeongchan Lee, John Won, Seunghyun Lee, Jinwoo Shin

Main category: cs.CV

TL;DR: CLIPFUSION combines discriminative and generative models for anomaly detection, outperforming baselines on benchmark datasets.

Details

Motivation: Anomaly detection is challenging due to ambiguous definitions, diverse anomaly types, and limited data. A comprehensive model is needed to capture both low-level and high-level features.

Method: CLIPFUSION leverages CLIP (discriminative) and diffusion (generative) models, using cross-attention and feature maps for anomaly detection.

Result: Outperforms baseline methods on MVTec-AD and VisA datasets in anomaly segmentation and classification.

Conclusion: CLIPFUSION demonstrates the effectiveness of multi-modal and multi-model fusion for scalable anomaly detection in real-world applications.

Abstract: Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defect), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Specifically, the CLIP-based discriminative model excels at capturing global features, while the diffusion-based generative model effectively captures local details, creating a synergistic and complementary approach. Notably, we introduce a methodology for utilizing cross-attention maps and feature maps extracted from diffusion models specifically for anomaly detection. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods, achieving outstanding performance in both anomaly segmentation and classification. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.

[233] HydroChronos: Forecasting Decades of Surface Water Change

Daniele Rege Cambrin, Eleonora Poeta, Eliana Pastor, Isaac Corley, Tania Cerquitelli, Elena Baralis, Paolo Garza

Main category: cs.CV

TL;DR: HydroChronos is a large-scale dataset for surface water dynamics forecasting, paired with a novel model (AquaClimaTempo UNet) that outperforms baselines and includes Explainable AI insights.

Details

Motivation: Addressing the lack of comprehensive datasets and standardized benchmarks in forecasting surface water dynamics for water resource management and climate change adaptation.

Method: Introduces HydroChronos dataset with multi-modal data (Landsat 5, Sentinel-2, climate data, DEMs) and proposes AquaClimaTempo UNet, a spatiotemporal model with a climate data branch.

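Result: The model outperforms a Persistence baseline by +14% and +11% F1 on the change detection and direction-of-change classification tasks, and by 0.1 MAE on magnitude-of-change regression. An Explainable AI analysis identifies the key climate variables and input channels.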

Conclusion: HydroChronos and AquaClimaTempo UNet provide a robust benchmark and insights for future modeling in surface water dynamics forecasting.

Abstract: Forecasting surface water dynamics is crucial for water resource management and climate change adaptation. However, the field lacks comprehensive datasets and standardized benchmarks. In this paper, we introduce HydroChronos, a large-scale, multi-modal spatiotemporal dataset for surface water dynamics forecasting designed to address this gap. We couple the dataset with three forecasting tasks. The dataset includes over three decades of aligned Landsat 5 and Sentinel-2 imagery, climate data, and Digital Elevation Models for diverse lakes and rivers across Europe, North America, and South America. We also propose AquaClimaTempo UNet, a novel spatiotemporal architecture with a dedicated climate data branch, as a strong benchmark baseline. Our model significantly outperforms a Persistence baseline for forecasting future water dynamics by +14% and +11% F1 across change detection and direction of change classification tasks, and by +0.1 MAE on the magnitude of change regression. Finally, we conduct an Explainable AI analysis to identify the key climate variables and input channels that influence surface water change, providing insights to inform and guide future modeling efforts.
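
The Persistence baseline mentioned above can be sketched as follows. This is a minimal illustration, assuming binary water masks stored as NumPy arrays; the function names are hypothetical and are not taken from the HydroChronos code, and the paper's exact task definitions may differ.

```python
import numpy as np

def persistence_forecast(last_mask: np.ndarray) -> np.ndarray:
    """Persistence: predict that the future mask equals the last observed mask."""
    return last_mask.copy()

def f1_binary(pred: np.ndarray, target: np.ndarray) -> float:
    """F1 score for binary masks (1 = water, 0 = no water)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.sum(pred & target)    # water predicted and observed
    fp = np.sum(pred & ~target)   # water predicted, not observed
    fn = np.sum(~pred & target)   # water missed
    denom = 2 * tp + fp + fn
    # Assumed convention: both masks empty counts as a perfect score.
    return float(2 * tp / denom) if denom > 0 else 1.0

def mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error, e.g. for a magnitude-of-change regression target."""
    return float(np.mean(np.abs(pred.astype(float) - target.astype(float))))
```

Any learned forecaster would be scored with the same metrics, so the reported gains are differences against the scores this trivial predictor obtains.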

[234] Sign Spotting Disambiguation using Large Language Models

JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: A training-free framework using LLMs improves sign spotting by combining spatio-temporal and hand shape features with dictionary-based matching and context-aware disambiguation.

Details

Motivation: To address data scarcity and vocabulary inflexibility in sign language translation by enhancing sign spotting accuracy and flexibility.

Method: Extracts global spatio-temporal and hand shape features, matches them to a sign dictionary using dynamic time warping and cosine similarity, and employs an LLM for context-aware gloss disambiguation.

Result: Superior accuracy and sentence fluency compared to traditional methods, demonstrated on synthetic and real-world datasets.

Conclusion: LLMs can significantly advance sign spotting by improving flexibility and accuracy without requiring retraining.

Abstract: Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method’s superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
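
The dictionary-matching step can be illustrated with a small sketch. It assumes each sign is a sequence of per-frame feature vectors and uses cosine distance as the local cost inside dynamic time warping; how the paper actually combines the two is not stated in the abstract, and `dtw_cost` and `best_match` are hypothetical names.

```python
import numpy as np

def cosine_distance_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between rows of a (n, d) and b (m, d)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return 1.0 - a_n @ b_n.T

def dtw_cost(query: np.ndarray, template: np.ndarray) -> float:
    """Length-normalised DTW alignment cost between two feature sequences."""
    d = cosine_distance_matrix(query, template)
    n, m = d.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor cells.
            acc[i, j] = d[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m] / (n + m))

def best_match(query: np.ndarray, dictionary: dict) -> str:
    """Return the gloss whose template aligns to the query at lowest cost."""
    costs = {gloss: dtw_cost(query, tmpl) for gloss, tmpl in dictionary.items()}
    return min(costs, key=costs.get)
```

Because matching is done against a dictionary of templates, adding a new sign only requires adding an entry, which is the vocabulary flexibility the abstract refers to. In the described pipeline, the top candidates from such a matcher would then be passed to the LLM for disambiguation.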

[235] Part Segmentation and Motion Estimation for Articulated Objects with Dynamic 3D Gaussians

Jun-Jee Chao, Qingyuan Jiang, Volkan Isler

Main category: cs.CV

TL;DR: A method for joint part segmentation and motion estimation from point clouds of articulated objects, robust to occlusions and missing data, using 3D Gaussians for representation.

Details

Motivation: Addressing challenges in articulated object motion analysis where point clouds vary due to occlusions or asynchronous sensor data, making point correspondence tracking ineffective.

Method: Represents the object as 3D Gaussians with time-dependent transformations (rotations, translations, scales). Part segmentation and motion estimation are achieved by linking observed points to Gaussians.

Result: Outperforms point-correspondence-based methods, especially under occlusions, with a 13% improvement in part segmentation accuracy.

Conclusion: The proposed Gaussian-based representation is effective for joint part segmentation and motion estimation, offering robustness to occlusions and missing data.

Abstract: Part segmentation and motion estimation are two fundamental problems for articulated object motion analysis. In this paper, we present a method to solve these two problems jointly from a sequence of observed point clouds of a single articulated object. The main challenge in our problem setting is that the point clouds are not assumed to be generated by a fixed set of moving points. Instead, each point cloud in the sequence could be an arbitrary sampling of the object surface at that particular time step. Such scenarios occur when the object undergoes major occlusions, or if the dataset is collected using measurements from multiple sensors asynchronously. In these scenarios, methods that rely on tracking point correspondences are not appropriate. We present an alternative approach based on a compact but effective representation where we represent the object as a collection of simple building blocks modeled as 3D Gaussians. We parameterize the Gaussians with time-dependent rotations, translations, and scales that are shared across all time steps. With our representation, part segmentation can be achieved by building correspondences between the observed points and the Gaussians. Moreover, the transformation of each point across time can be obtained by following the poses of the assigned Gaussian (even when the point is not observed). Experiments show that our method outperforms existing methods that solely rely on finding point correspondences. Additionally, we extend existing datasets to emulate real-world scenarios by considering viewpoint occlusions. We further demonstrate that our method is more robust to missing points as compared to existing approaches on these challenging datasets, even when some parts are completely occluded in some time-steps. Notably, our part segmentation performance outperforms the state-of-the-art method by 13% on point clouds with occlusions.
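
The idea of linking points to Gaussians and then following the Gaussian's pose can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes hard assignment by smallest Mahalanobis distance and omits the learned scales; all function names are hypothetical.

```python
import numpy as np

def assign_points_to_gaussians(points, means, covs):
    """Label each point (N, 3) with the index of its nearest Gaussian.

    means: (K, 3), covs: (K, 3, 3). Nearest is measured by squared
    Mahalanobis distance, so elongated parts attract points along their axis.
    """
    diffs = points[:, None, :] - means[None, :, :]             # (N, K, 3)
    inv_covs = np.linalg.inv(covs)                              # (K, 3, 3)
    d2 = np.einsum("nki,kij,nkj->nk", diffs, inv_covs, diffs)   # (N, K)
    return np.argmin(d2, axis=1)

def transfer_points(points, labels, rot_src, trans_src, rot_dst, trans_dst):
    """Move points from a source time step to a destination time step.

    rot_*: (K, 3, 3) rotations and trans_*: (K, 3) translations of each
    Gaussian at the two time steps. Each point follows its assigned Gaussian.
    """
    # Express each point in the local frame of its Gaussian: R_src^T (p - t_src).
    local = np.einsum("nji,nj->ni", rot_src[labels], points - trans_src[labels])
    # Re-pose the local coordinates at the destination time: R_dst x + t_dst.
    return np.einsum("nij,nj->ni", rot_dst[labels], local) + trans_dst[labels]
```

Since the transform depends only on the assigned Gaussian, a point's position can be predicted at time steps where it was occluded, which is the property the abstract highlights.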

[236] DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

Yue-Jiang Dong, Wang Zhao, Jiale Xu, Ying Shan, Song-Hai Zhang

Main category: cs.CV

TL;DR: DepthSync is a training-free framework using diffusion guidance to achieve consistent depth predictions for long videos by addressing scale discrepancies and geometric inconsistencies.

Details

Motivation: Existing methods for video depth estimation struggle with scale discrepancies and geometric inconsistencies in long videos due to reliance on 2D diffusion priors and sliding window approaches.

Method: DepthSync introduces scale guidance to synchronize depth scales across windows and geometry guidance to enforce 3D geometric alignment within windows, leveraging diffusion guidance.

Result: Experiments show DepthSync improves scale and geometry consistency in depth predictions, especially for long videos.

Conclusion: DepthSync effectively addresses challenges in long-video depth estimation by combining scale and geometry guidance, enhancing consistency without training.

Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
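
To make the cross-window scale discrepancy concrete, the sketch below aligns consecutive sliding windows with a closed-form least-squares scale on their overlapping frames. This is a post-hoc illustration of the problem only; it is not the paper's diffusion-guidance mechanism, which steers the denoising process itself. The function names and the fixed overlap length are assumptions.

```python
import numpy as np

def scale_to_match(prev_overlap: np.ndarray, next_overlap: np.ndarray) -> float:
    """Scale s minimising ||s * next_overlap - prev_overlap||^2."""
    num = float(np.sum(prev_overlap * next_overlap))
    den = float(np.sum(next_overlap * next_overlap))
    return num / den if den > 0 else 1.0

def stitch_windows(windows: list, overlap: int) -> np.ndarray:
    """Rescale each window to its predecessor and concatenate the frames.

    windows: list of arrays shaped (T, H, W); consecutive windows share
    `overlap` frames. Each window is aligned to the already rescaled
    previous window, so any alignment error propagates forward.
    """
    aligned = [windows[0]]
    for w in windows[1:]:
        s = scale_to_match(aligned[-1][-overlap:], w[:overlap])
        aligned.append(s * w)
    # Keep the first window whole and drop the duplicated overlap frames.
    return np.concatenate([aligned[0]] + [w[overlap:] for w in aligned[1:]], axis=0)
```

The chained alignment is one way to see why discrepancies accumulate as the number of windows grows, which motivates synchronising scales during generation rather than afterwards.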

[237] CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

Elena Bueno-Benito, Mariella Dimiccoli

Main category: cs.CV

TL;DR: CLOT introduces a multi-level cyclic feature learning mechanism for unsupervised action segmentation, improving feedback between frames and action representations by solving three OT problems.

Details

Motivation: ASOT, a prior optimal transport (OT)-based method, lacks segment-level supervision, which limits the effectiveness of feedback between frames and action representations. CLOT addresses this by integrating cyclical learning.

Method: CLOT uses an encoder-decoder architecture to solve two OT problems for pseudo-labels and embeddings, then refines them via cross-attention with a third OT problem.

Result: Experiments on four datasets show CLOT’s cyclical learning enhances unsupervised action segmentation.

Conclusion: CLOT’s multi-level cyclic learning improves unsupervised action segmentation by better integrating frame and segment embeddings.

Abstract: Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions about action ordering and can decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, limiting the effectiveness of feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework with a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, by integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
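
As a rough illustration of how an OT problem yields pseudo-labels from a frame-to-action cost matrix, the sketch below uses balanced entropic OT solved with Sinkhorn iterations. This is a deliberate simplification: the uniform marginals, the cosine cost, and the function names are assumptions, and neither ASOT's nor CLOT's actual formulation is reproduced here.

```python
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropic OT plan between uniform marginals for a (frames, actions) cost."""
    n, k = cost.shape
    a = np.full(n, 1.0 / n)   # assumed uniform mass per frame
    b = np.full(k, 1.0 / k)   # assumed uniform mass per action
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows towards marginal a
        v = b / (kernel.T @ u)    # rescale columns towards marginal b
    return u[:, None] * kernel * v[None, :]

def pseudo_labels(frame_emb: np.ndarray, action_emb: np.ndarray) -> np.ndarray:
    """Assign each frame the action receiving most of its transported mass."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    c = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    cost = 1.0 - f @ c.T          # cosine distance as transport cost
    return sinkhorn(cost).argmax(axis=1)
```

In a closed-loop scheme like the one described, labels produced this way would feed back into the embeddings that define the cost, and the process would repeat.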

Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma

Main category: cs.CV

TL;DR: EchoMimicV3 is an efficient framework for multi-task and multi-modal human animation, addressing slow inference and high computational costs of traditional methods.

Details

Motivation: Overcome limitations of slow inference, high computational demands, and inefficiency in multi-task scenarios in human animation.

Method: Uses Soup-of-Tasks and Soup-of-Modals paradigms, along with novel training strategies like Negative Direct Preference Optimization and Phase-aware Negative CFG.

Result: Achieves competitive performance with a minimal model size of 1.3B parameters.

Conclusion: EchoMimicV3 efficiently unifies multi-task and multi-modal human animation, offering practical advantages over traditional methods.

Abstract: Recent work on human animation usually incorporates large-scale video models, thereby achieving more vivid performance. However, the practical use of such methods is hindered by the slow inference speed and high computational demands. Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. At the core of EchoMimicV3 lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks leverages multi-task mask inputs and a counter-intuitive task allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals introduces a Coupled-Decoupled Multi-Modal Cross Attention module to inject multi-modal conditions, complemented by a Multi-Modal Timestep Phase-aware Dynamical Allocation mechanism to modulate multi-modal mixtures. Besides, we propose Negative Direct Preference Optimization, Phase-aware Negative Classifier-Free Guidance (CFG), and Long Video CFG, which ensure stable training and inference. Extensive experiments and analyses demonstrate that EchoMimicV3, with a minimal model size of 1.3 billion parameters, achieves competitive performance in both quantitative and qualitative evaluations.

[239] NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement

Yang Yang, Dongni Mao, Hiroaki Santo, Yasuyuki Matsushita, Fumio Okura

Main category: cs.CV

TL;DR: NeuraLeaf is a neural parametric model for 3D leaves, addressing challenges in plant modeling by disentangling 2D base shapes and 3D deformations, leveraging 2D datasets and a novel skeleton-free skinning model.

Details

Motivation: Plant leaves' diverse shapes and flexible deformation pose unique challenges, unlike humans or animals, necessitating a specialized model for accurate 3D reconstruction.

Method: NeuraLeaf separates leaf geometry into 2D base shapes and 3D deformations, uses a skeleton-free skinning model, and employs the DeformLeaf dataset for training.

Result: NeuraLeaf generates diverse leaf shapes with deformation, achieving accurate fitting to 3D observations like depth maps and point clouds.

Conclusion: NeuraLeaf effectively models 3D leaves, leveraging 2D data and novel deformation techniques, with potential applications in agriculture and computer graphics.

Abstract: We develop a neural parametric model for 3D leaves for plant modeling and reconstruction, which are essential for agriculture and computer graphics. While neural parametric models are actively studied for humans and animals, plant leaves present unique challenges due to their diverse shapes and flexible deformation. To address this problem, we introduce a neural parametric model for leaves, NeuraLeaf. Capitalizing on the fact that flattened leaf shapes can be approximated as a 2D plane, NeuraLeaf disentangles the leaves’ geometry into their 2D base shapes and 3D deformations. This representation allows learning from rich sources of 2D leaf image datasets for the base shapes, and also has the advantage of simultaneously learning textures aligned with the geometry. To model the 3D deformation, we propose a novel skeleton-free skinning model and create a newly captured 3D leaf dataset called DeformLeaf. We show that NeuraLeaf successfully generates a wide range of leaf shapes with deformation, resulting in accurate model fitting to 3D observations like depth maps and point clouds. Our implementation and dataset are available at https://neuraleaf-yang.github.io/.

Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong

Main category: cs.CV

TL;DR: The paper proposes methods to align LiDAR and camera features in BEV representation for autonomous vehicles, addressing misalignment issues using 2D object priors and achieving SOTA results.

Details

Motivation: To resolve spatial misalignment between LiDAR and camera features caused by projection errors, leveraging predictable error locations at object-background boundaries.

Method: Introduces PGDC for local misalignment, DAGF for global misalignment, and SGDM for feature fusion, using 2D priors and gated attention.

Result: Achieves 71.5% mAP and 73.6% NDS on nuScenes validation dataset.

Conclusion: The proposed methods effectively align cross-modal features, enhancing 3D perception for autonomous vehicles.

Abstract: Integrating LiDAR and camera inputs into a unified Bird’s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to alleviate misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to suppress residual noise from PGDC and explicitly enhance sharp depth transitions at object-background boundaries, yielding a structurally aware representation. To effectively utilize these aligned representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our method achieves SOTA performance on the nuScenes validation dataset, with its mAP and NDS reaching 71.5% and 73.6%, respectively.

[241] GPSMamba: A Global Phase and Spectral Prompt-guided Mamba for Infrared Image Super-Resolution

Yongsong Huang, Tomo Miyazaki, Shinichiro Omachi

Main category: cs.CV

TL;DR: GPSMamba introduces a framework combining adaptive semantic-frequency prompts and non-causal supervision to enhance infrared image super-resolution, outperforming existing methods.

Details

Motivation: Infrared Image Super-Resolution (IRSR) faces challenges like low contrast and sparse textures, requiring robust long-range modeling for global coherence.

Method: Proposes GPSMamba with an Adaptive Semantic-Frequency State Space Module (ASF-SSM) and Thermal-Spectral Attention with Phase Consistency Loss for non-causal supervision.

Result: GPSMamba achieves state-of-the-art performance in infrared image restoration.

Conclusion: The framework effectively mitigates causal modeling limitations, offering a powerful paradigm for IRSR.

Abstract: Infrared Image Super-Resolution (IRSR) is challenged by the low contrast and sparse textures of infrared data, requiring robust long-range modeling to maintain global coherence. While State-Space Models like Mamba offer proficiency in modeling long-range dependencies for this task, their inherent 1D causal scanning mechanism fragments the global context of 2D images, hindering fine-detail restoration. To address this, we propose Global Phase and Spectral Prompt-guided Mamba (GPSMamba), a framework that synergizes architectural guidance with non-causal supervision. First, our Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt directly into the Mamba block, integrating non-local context to guide reconstruction. Then, a novel Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision to enforce global structural and spectral fidelity. By combining these two innovations, our work presents a systematic strategy to mitigate the limitations of causal modeling. Extensive experiments demonstrate that GPSMamba achieves state-of-the-art performance, validating our approach as a powerful new paradigm for infrared image restoration. Code is available at https://github.com/yongsongH/GPSMamba.

[242] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, Yu Qiao

Main category: cs.CV

TL;DR: RRVF framework enables MLLMs to learn complex visual reasoning from raw images, reducing reliance on curated image-text supervision and outperforming existing models.

Details

Motivation: Address the bottleneck of MLLMs' heavy reliance on curated image-text supervision for deep visual reasoning.

Method: Introduces RRVF, a framework using Reinforcement Learning (RL) with a closed-loop process of reasoning, rendering, and visual feedback, optimized via GRPO algorithm.

Result: RRVF-trained model outperforms existing MLLMs and supervised fine-tuning baselines, showing superior generalization.

Conclusion: RRVF effectively reduces dependency on supervised data and enhances MLLMs’ visual reasoning capabilities.

Abstract: Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework, “Reasoning-Rendering-Visual-Feedback” (RRVF), that enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the “Asymmetry of Verification” principle, i.e., verifying the rendered output against the source image is substantially easier than performing deep visual reasoning to generate a faithful, structured representation such as code. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), thereby reducing reliance on image-text supervision. RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform complex reasoning, including self-correction through multi-turn interactions. This process is optimized end-to-end using the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing similarly sized open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization. Notably, the model outperforms the more advanced MLLM used to generate visual feedback during training. Code is available at https://github.com/L-O-I/RRVF.

[243] Exploring the Feasibility of Deep Learning Techniques for Accurate Gender Classification from Eye Images

Basna Mohammed Salih Hasan, Ramadhan J. Mstafa

Main category: cs.CV

TL;DR: The paper proposes a CNN model for gender classification using the periocular region, achieving high accuracy (99% and 96%) on two datasets.

Details

Motivation: Gender classification is important in security, surveillance, and advertising, but accuracy is affected by cosmetics and disguise. The study focuses on the periocular region for reliable classification.

Method: A sophisticated CNN model is developed, tested on CVBL and (Female and Male) datasets, with performance evaluated using various metrics.

Result: The model achieved 99% accuracy on CVBL and 96% on (Female and Male) datasets, outperforming other state-of-the-art methods.

Conclusion: The model is effective for gender classification using the periocular region, with potential applications in security and surveillance.

Abstract: Gender classification has emerged as a crucial aspect in various fields, including security, human-machine interaction, surveillance, and advertising. Nonetheless, the accuracy of this classification can be influenced by factors such as cosmetics and disguise. Consequently, our study is dedicated to addressing this concern by concentrating on gender classification using color images of the periocular region. The periocular region refers to the area surrounding the eye, including the eyelids, eyebrows, and the region between them. It contains valuable visual cues that can be used to extract key features for gender classification. This paper introduces a sophisticated Convolutional Neural Network (CNN) model that utilizes color image databases to evaluate the effectiveness of the periocular region for gender classification. To validate the model’s performance, we conducted tests on two eye datasets, namely CVBL and (Female and Male). The recommended architecture achieved an outstanding accuracy of 99% on the previously unused CVBL dataset while attaining a commendable accuracy of 96% with a small number of learnable parameters (7,235,089) on the (Female and Male) dataset. To ascertain the effectiveness of our proposed model for gender classification using the periocular region, we evaluated its performance through an extensive range of metrics and compared it with other state-of-the-art approaches. The results unequivocally demonstrate the efficacy of our model, thereby suggesting its potential for practical application in domains such as security and surveillance.

[244] Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards

Aybora Koksal, A. Aydin Alatan

Main category: cs.CV

TL;DR: The paper introduces a few-shot reinforcement learning framework (RLVR) for satellite imagery, eliminating the need for annotated data by using rule-based rewards. It shows strong performance with minimal examples, matching or exceeding models trained on large datasets.

Details

Motivation: Specialized domains like remote sensing lack annotated data, making traditional models impractical. The goal is to enable efficient, data-scarce training for vision-language tasks.

Method: Uses policy-gradient optimization with rule-based binary or IoU rewards, adapting the “1-shot RLVR” paradigm from language models to vision-language models. Requires as few as one curated example.

Result: Achieves substantial improvements with one example; scaling to 128 examples matches or exceeds models trained on thousands. Shows robust generalization but mild overfitting in extreme one-shot cases.

Conclusion: The RLVR framework offers a cost-effective, data-efficient solution for domain-specialist models, emphasizing minimal curated examples and rule-based rewards.

Abstract: Recent advances in large language and vision-language models have enabled strong reasoning capabilities, yet they remain impractical for specialized domains like remote sensing, where annotated data is scarce and expensive. We present the first few-shot reinforcement learning with verifiable reward (RLVR) framework for satellite imagery that eliminates the need for caption supervision, relying solely on lightweight, rule-based binary or IoU-based rewards. Adapting the “1-shot RLVR” paradigm from language models to vision-language models, we employ policy-gradient optimization with as few as one curated example to align model outputs for satellite reasoning tasks. Comprehensive experiments across multiple remote sensing benchmarks (including classification, visual question answering, and grounding) show that even a single example yields substantial improvements over the base model. Scaling to 128 examples matches or exceeds models trained on thousands of annotated samples. While the extreme one-shot setting can induce mild, task-specific overfitting, our approach consistently demonstrates robust generalization and efficiency across diverse tasks. Further, we find that prompt design and loss weighting significantly influence training stability and final accuracy. Our method enables cost-effective and data-efficient development of domain-specialist vision-language reasoning models, offering a pragmatic recipe for data-scarce fields: start from a compact VLM, curate a handful of reward-checkable cases, and train via RLVR.
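
The rule-based rewards described above are simple to state in code. The following is a minimal sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form and exact-match answers after whitespace and case normalisation; the function names and the normalisation rule are illustrative assumptions rather than the paper's reward definitions.

```python
def binary_reward(prediction: str, reference: str) -> float:
    """1.0 if the answer matches the reference after normalisation, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gold_box) -> float:
    """Use IoU directly as a dense reward for grounding outputs."""
    return iou(pred_box, gold_box)
```

Both rewards can be checked automatically from the model output and a single labelled target, which is why no caption supervision is needed in this setup.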

[245] Personalized Safety Alignment for Text-to-Image Diffusion Models

Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng, Kaidong Yu

Main category: cs.CV

TL;DR: The paper introduces Personalized Safety Alignment (PSA), a framework for adapting text-to-image diffusion models to individual user safety preferences, outperforming uniform safety standards.

Details

Motivation: Current text-to-image diffusion models use uniform safety standards, ignoring diverse user preferences shaped by factors like age and mental health.

Method: PSA integrates personalized user profiles into the diffusion process using a cross-attention mechanism and a new dataset, Sage.

Result: PSA outperforms existing methods in harmful content suppression and aligns better with user constraints, achieving higher Win Rate and Pass Rate scores.

Conclusion: PSA effectively personalizes safety behaviors in generative models while maintaining image quality, with publicly available code, data, and models.

Abstract: Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model’s behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores. Our code, data, and models are publicly available at https://m-e-agi-lab.github.io/PSAlign/.

[246] Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights

Junhao Zheng, Jiahao Sun, Chenhao Lin, Zhengyu Zhao, Chen Ma, Chong Zhang, Cong Wang, Qian Wang, Chao Shen

Main category: cs.CV

TL;DR: The paper introduces a unified benchmark for evaluating defenses against patch attacks on object detectors, revealing key insights and improving existing defenses.

Details

Motivation: Existing defense evaluations lack a unified framework, leading to inconsistent assessments of patch attack defenses.

Method: The study revisits 11 defenses, creates a benchmark with 2 attack goals, 13 patch attacks, 11 detectors, and 4 metrics, and builds a large-scale dataset of 94 patch types and 94,000 images.

Result: Key findings include the importance of data distribution in defense difficulty, the relevance of attacked object precision over patch detection accuracy, and the robustness of complex/stochastic defenses.

Conclusion: The benchmark and insights aim to guide proper evaluation and design of patch attack defenses, with ongoing updates to the dataset and code.

Abstract: Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This leads to the large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at https://github.com/Gandolfczjh/APDE, where we will keep integrating new attacks/defenses.

[247] VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi

Main category: cs.CV

TL;DR: VLM4D is a new benchmark for evaluating spatiotemporal reasoning in vision language models (VLMs), revealing their limitations compared to humans and suggesting improvements.

Details

Motivation: Current VLMs lack dynamic spatiotemporal reasoning abilities, which humans excel at, limiting their real-world applicability.

Method: Introduces VLM4D, a benchmark with diverse videos and QA pairs, evaluating VLMs on motion, perspective, and temporal coherence.

Result: VLMs show significant gaps in performance compared to humans, struggling with visual cue integration and temporal coherence.

Conclusion: Targeted improvements like 4D feature fields and fine-tuning show promise, encouraging further research for better dynamic visual intelligence.

Abstract: Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities that are essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs’ spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.

[248] Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu

Main category: cs.CV

TL;DR: Patho-AgenticRAG is a multimodal RAG framework for pathology, addressing hallucinations in VLMs by leveraging joint text-image search from authoritative textbooks, improving diagnostic accuracy.

Details

Motivation: Pathology VLMs face challenges like hallucinations due to ultra-high resolution and complex semantics, limiting trust. Existing RAG methods rely on text-only knowledge, missing visual cues.

Method: Proposes Patho-AgenticRAG, a multimodal RAG framework with page-level embeddings from pathology textbooks, enabling joint text-image search and reasoning.

Result: Outperforms existing models in tasks like multiple-choice diagnosis and visual question answering.

Conclusion: Patho-AgenticRAG enhances diagnostic accuracy by integrating visual and textual information, addressing limitations of current VLMs in pathology.

Abstract: Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.

[249] EgoPrompt: Prompt Learning for Egocentric Action Recognition

Huaihai Lyu, Chaofan Chen, Yuheng Ji, Changsheng Xu

Main category: cs.CV

TL;DR: EgoPrompt, a prompt learning-based framework, enhances egocentric action recognition by unifying verb and noun component representations through a prompt pool and attention-based fusion, achieving state-of-the-art results.

Details

Motivation: Existing approaches treat verb and noun components independently, ignoring their semantic relationships, leading to fragmented representations and poor generalization.

Method: EgoPrompt uses a Unified Prompt Pool and attention-based fusion to integrate verb and noun representations, with Diverse Pool Criteria for training.

Result: EgoPrompt outperforms benchmarks on Ego4D, EPIC-Kitchens, and EGTEA datasets in within-dataset, cross-dataset, and generalization tasks.

Conclusion: EgoPrompt effectively captures cross-component relationships, improving egocentric action recognition performance.

Abstract: Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., verb component) and identifying the objects being acted upon (i.e., noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, to conduct the egocentric action recognition task. Building on the existing prompting strategy to capture the component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns with the prompt pair form. Then, these pattern-level representations are fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, Diverse Pool Criteria. This objective realizes our goals from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.

[250] Modular Transformer Architecture for Precision Agriculture Imaging

Brian Gopalan, Nathalia Nascimento, Vishal Monga

Main category: cs.CV

TL;DR: A modular deep-learning framework for weed segmentation in drone videos, dynamically routing degraded images to specialized transformers for improved accuracy and efficiency.

Details

Motivation: Addresses the need for efficient and accurate weed segmentation in precision agriculture, tackling common image degradation issues like blur and noise.

Method: Uses Mean Absolute Deviation and Laplacian to detect degradation, then routes images to one of three transformer models (baseline, noise-reducing, or blur-correcting) for processing.

Result: Outperforms CNN-based methods in segmentation quality and computational efficiency.

Conclusion: Demonstrates a significant advancement in deep-learning applications for agriculture by dynamically handling image degradation.

Abstract: This paper addresses the critical need for efficient and accurate weed segmentation from drone video in precision agriculture. A quality-aware modular deep-learning framework is proposed that addresses common image degradation by analyzing quality conditions (such as blur and noise) and routing inputs through specialized pre-processing and transformer models optimized for each degradation type. The system first analyzes drone images for noise and blur using Mean Absolute Deviation and the Laplacian. Data is then dynamically routed to one of three vision transformer models: a baseline for clean images, a modified transformer with Fisher Vector encoding for noise reduction, or another with an unrolled Lucy-Richardson decoder to correct blur. This novel routing strategy allows the system to outperform existing CNN-based methods in both segmentation quality and computational efficiency, demonstrating a significant advancement in deep-learning applications for agriculture.
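
A minimal sketch of the quality-aware routing idea follows. The summary does not say how the two statistics are computed or thresholded, so everything here is an assumption: blur is scored as the variance of a 4-neighbour Laplacian response, noise as the mean absolute deviation of a high-pass residual, and the thresholds and route labels are arbitrary placeholders.

```python
def interior(img):
    """Coordinates of all pixels that have a full 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    return [(y, x) for y in range(1, h - 1) for x in range(1, w - 1)]


def local_mean(img, y, x):
    """Mean of the 3x3 neighbourhood centred on (y, x)."""
    return sum(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0


def blur_score(img):
    """Variance of the 4-neighbour Laplacian response; low values suggest blur."""
    resp = [img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
            for y, x in interior(img)]
    mu = sum(resp) / len(resp)
    return sum((r - mu) ** 2 for r in resp) / len(resp)


def noise_score(img):
    """Mean absolute deviation of the residual (pixel minus its local mean)."""
    resid = [img[y][x] - local_mean(img, y, x) for y, x in interior(img)]
    mu = sum(resid) / len(resid)
    return sum(abs(r - mu) for r in resid) / len(resid)


def route(img, noise_threshold=30.0, blur_threshold=100.0):
    """Pick a branch label; both thresholds are hypothetical and would need tuning."""
    if noise_score(img) > noise_threshold:
        return "noise"   # would go to the noise-reducing model
    if blur_score(img) < blur_threshold:
        return "blur"    # would go to the blur-correcting model
    return "clean"       # would go to the baseline model


if __name__ == "__main__":
    step_edge = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
    print(route(step_edge))
```

Images are plain nested lists of grayscale values here so the sketch runs without third-party packages; a real pipeline would more likely use array operations.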

[251] S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu

Main category: cs.CV

TL;DR: S$^2$Q-VDiT is a post-training quantization framework for video diffusion models (V-DMs) that reduces computational costs while maintaining performance, using salient data selection and sparse token distillation.

Details

Motivation: The high computational cost of video diffusion models due to large parameter sizes and long token sequences motivates the need for efficient quantization methods.

Method: Proposes Hessian-aware Salient Data Selection for calibration and Attention-guided Sparse Token Distillation to address learning challenges.

Result: Achieves lossless performance under W4A6 quantization, with 3.9× model compression and 1.3× inference acceleration.

Conclusion: S$^2$Q-VDiT effectively balances performance and efficiency for video diffusion models.

Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S$^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model’s output. Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering 3.9× model compression and 1.3× inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
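
To make the W4A6 setting concrete, the sketch below applies plain uniform symmetric quantization at two bit-widths, reading W4A6 as 4-bit weights and 6-bit activations. This is a generic textbook scheme written under that assumption; it is not the paper's calibration or distillation procedure, and the helper names are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes with one shared scale (per-tensor)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.25, 1.0]
    activations = [0.1, -0.4, 0.8, 0.3]
    w_codes, w_scale = quantize_symmetric(weights, bits=4)      # "W4"
    a_codes, a_scale = quantize_symmetric(activations, bits=6)  # "A6"
    print(w_codes, dequantize(w_codes, w_scale))
    print(a_codes, dequantize(a_codes, a_scale))
```

If the unquantized weights were stored in 16 bits (an assumption; the summary does not state the baseline), 4-bit codes would bound weight compression at 4×, which is in the same range as the reported 3.9×.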

[252] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang

Main category: cs.CV

TL;DR: MLLMSeg is a lightweight framework for Referring Expression Segmentation (RES) that leverages MLLMs’ visual and semantic features without extra encoders, outperforming SAM-based and SAM-free methods.

Details

Motivation: Address the trade-off between performance and cost in RES by avoiding heavy models like SAM while maintaining accuracy.

Method: Proposes MLLMSeg, utilizing MLLM’s visual encoder and a DSFF module for feature fusion, plus a lightweight mask decoder (34M parameters).

Result: Outperforms SAM-based and SAM-free methods, balancing performance and cost effectively.

Conclusion: MLLMSeg offers a cost-efficient, high-performance solution for RES by integrating MLLM features without additional encoders.

Abstract: Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large language models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.

[253] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Jinglin Xu, Hao Sun

Main category: cs.CV

TL;DR: TSPO improves MLLMs’ long-video understanding via reinforcement learning for event-aware frame sampling and joint optimization.

Details

Motivation: MLLMs struggle with long videos due to context limits and sparse frame sampling inefficiencies. Existing methods miss critical events or rely on pre-trained models.

Method: Proposes TSPO: a trainable event-aware agent for keyframe selection and reinforcement learning for joint optimization, with rule-based rewards and a training data pipeline.

Result: TSPO achieves state-of-the-art performance on long-video benchmarks and shows transferability across Video-MLLMs.

Conclusion: TSPO effectively addresses long-video challenges in MLLMs through optimized temporal sampling and reinforcement learning.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs’ context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models’ event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs’ long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, for TSPO training, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO
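
Two ingredients named in this abstract, probabilistic keyframe selection and group relative optimization, can be sketched in a few lines. This is an illustrative assumption of how such pieces commonly look (score-proportional sampling without replacement, and rewards normalized within a group of rollouts); the function names and details are hypothetical and are not taken from the TSPO code.

```python
import random
from statistics import mean, pstdev


def sample_keyframes(scores, k, rng):
    """Sample k distinct frame indices with probability proportional to score.

    Scores must be strictly positive; they stand in for event-query relevance.
    """
    indices = list(range(len(scores)))
    weights = list(scores)
    chosen = []
    for _ in range(min(k, len(indices))):
        pick = rng.choices(range(len(indices)), weights=weights, k=1)[0]
        chosen.append(indices.pop(pick))
        weights.pop(pick)
    return sorted(chosen)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    frame_scores = [0.1, 0.9, 0.5, 0.2, 0.7]
    # Each rollout samples its own keyframes; rewards would come from rule-based checks.
    rollouts = [sample_keyframes(frame_scores, 3, rng) for _ in range(4)]
    rewards = [1.0, 0.0, 1.0, 0.0]  # placeholder answer-accuracy rewards
    print(rollouts)
    print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to other rollouts for the same video and query, no learned value model is needed in this sketch.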

[254] ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models

Yansheng Gao, Yufei Zheng, Jinghan Qu, Zixi Zhu, Yukuan Zhang, Shengsheng Wang

Main category: cs.CV

TL;DR: ANPrompt is a prompt tuning framework for vision-language models that enhances robustness to weak semantic perturbations by integrating noise prompts and noise-resistant visual prototypes, outperforming existing methods.

Details

Motivation: Existing prompt-tuned VLMs are vulnerable to subtle semantic noise, degrading their generalization to unseen classes.

Method: ANPrompt constructs noise prompts from perturbed text embeddings, integrates them with learnable prompts, and computes noise-resistant visual prototypes. It uses alignment, robustness, and anti-noise objectives.

Result: ANPrompt outperforms existing methods on 11 benchmarks, showing superior robustness and generalization.

Conclusion: ANPrompt effectively addresses the vulnerability of prompt-tuned VLMs to semantic noise, improving their robustness and generalization.

Abstract: Prompt tuning has emerged as an efficient and effective technique for adapting vision-language models (VLMs) with low computational overhead. However, existing methods often overlook the vulnerability of prompt-tuned VLMs to weak semantic perturbations (such as subtle image or text noise) that degrade their generalization to unseen classes. To address this limitation, we propose ANPrompt, a novel prompt tuning framework designed to enhance robustness under such perturbations. ANPrompt first constructs weak noise text features by fusing original and noise-perturbed text embeddings, which are then clustered to form noise prompts. These noise prompts are integrated with learnable prompt tokens to generate anti-noise prompts, which are injected into the deeper layers of both image and text encoders. To further capture the noise-aware visual semantics, ANPrompt computes the Noise-Resistant Visual Prompt Prototype (NRVPP) by averaging the output prompt tokens from the vision encoder. Finally, ANPrompt introduces alignment, robustness, and anti-noise objectives by computing a Weak semantic noise Alignment Loss (WALoss) alongside the standard cross-entropy and sim loss. Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories.

cs.AI

[255] Prescriptive Agents based on Rag for Automated Maintenance (PARAM)

Chitranshu Harbola, Anupam Purwar

Main category: cs.AI

TL;DR: An LLM-based system for prescriptive maintenance combines vibration analysis with multi-agentic generation to provide actionable recommendations, improving industrial machinery upkeep.

Details

Motivation: To enhance industrial maintenance by moving beyond anomaly detection to offer intelligent, actionable recommendations, bridging the gap between monitoring and planning.

Method: Integrates vibration frequency analysis (BPFO, BPFI, BSF, FTF) with LLM processing, multi-agentic knowledge retrieval, and structured recommendation generation using the Gemini model.

Result: Demonstrates effective anomaly detection and contextually relevant maintenance guidance, validated on bearing vibration datasets.

Conclusion: Advances LLM applications in industrial maintenance, offering a scalable framework for prescriptive maintenance across sectors.

Abstract: Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi-agentic generation for intelligent maintenance planning. Our approach serializes bearing vibration data (BPFO, BPFI, BSF, FTF frequencies) into natural language for LLM processing, enabling few-shot anomaly detection with high accuracy. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations that include immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation on bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance. The system successfully bridges the gap between condition monitoring and actionable maintenance planning, providing industrial practitioners with intelligent decision support. This work advances the application of LLMs in industrial maintenance, offering a scalable framework for prescriptive maintenance across machinery components and industrial sectors.
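
The four bearing frequencies named above have standard textbook definitions for a rolling-element bearing with a stationary outer race. The sketch below computes them and serializes the result into a sentence, as a rough picture of what "serializing vibration data into natural language" could mean. The geometry values, function names, and sentence template are assumptions for illustration and are not the paper's prompt format.

```python
import math


def bearing_fault_frequencies(shaft_hz, n_elements, element_d, pitch_d, contact_angle_deg=0.0):
    """Characteristic frequencies (Hz), assuming a stationary outer race."""
    ratio = (element_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        "FTF": 0.5 * shaft_hz * (1 - ratio),                                # cage
        "BPFO": 0.5 * n_elements * shaft_hz * (1 - ratio),                  # outer race
        "BPFI": 0.5 * n_elements * shaft_hz * (1 + ratio),                  # inner race
        "BSF": 0.5 * (pitch_d / element_d) * shaft_hz * (1 - ratio ** 2),   # ball/roller spin
    }


def serialize(freqs):
    """Hypothetical text template turning the numbers into an LLM-readable sentence."""
    parts = [f"{name} is {value:.1f} Hz" for name, value in freqs.items()]
    return "Bearing characteristic frequencies: " + ", ".join(parts) + "."


if __name__ == "__main__":
    # Illustrative geometry only: 30 Hz shaft, 8 balls, 10 mm balls, 50 mm pitch diameter.
    freqs = bearing_fault_frequencies(30.0, 8, 10.0, 50.0)
    print(serialize(freqs))
```

A useful sanity check on these formulas is that BPFO and BPFI always sum to the number of rolling elements times the shaft frequency.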

[256] GeoFlow: Agentic Workflow Automation for Geospatial Tasks

Amulya Bhattaram, Justin Chung, Stanley Chung, Ranit Gupta, Janani Ramamoorthy, Kartikeya Gullapalli, Diana Marculescu, Dimitrios Stamoulis

Main category: cs.AI

TL;DR: GeoFlow automates agentic workflows for geospatial tasks, improving success rates and reducing token usage.

Details

Motivation: Prior work lacks explicit guidance for API selection in geospatial tasks, limiting agent performance.

Method: GeoFlow provides agents with detailed tool-calling objectives for runtime geospatial API invocation.

Result: Increases agentic success by 6.8% and reduces token usage by up to fourfold.

Conclusion: GeoFlow outperforms state-of-the-art methods in efficiency and effectiveness for geospatial tasks.

Abstract: We present GeoFlow, a method that automatically generates agentic workflows for geospatial tasks. Unlike prior work that focuses on reasoning decomposition and leaves API selection implicit, our method provides each agent with detailed tool-calling objectives to guide geospatial API invocation at runtime. GeoFlow increases agentic success by 6.8% and reduces token usage by up to fourfold across major LLM families compared to state-of-the-art approaches.

[257] Who is a Better Player: LLM against LLM

Yingjie Zhou, Jiezhang Cao, Farong Wen, Li Xu, Yanwei Jiang, Jun Jia, Ronghui Li, Xiaohong Liu, Yu Zhou, Xiongkuo Min, Jie Guo, Zicheng Zhang, Guangtao Zhai

Main category: cs.AI

TL;DR: The paper proposes an adversarial benchmarking framework using board games to evaluate LLMs, addressing limitations of Q&A benchmarks. It introduces Qi Town, a platform with 5 games and 20 LLM players, using Elo and PLG for quantitative evaluation and PSS for mental fitness. Results show LLMs’ optimism and adaptability but reveal instability in skill play.

Details

Motivation: To assess LLMs' comprehensive performance beyond Q&A benchmarks, leveraging adversarial board games for strategic reasoning evaluation.

Method: Developed Qi Town, a platform with 5 games and 20 LLM players, using Elo ratings, PLG, and PSS in a round-robin tournament.

Result: LLMs exhibit optimism and adaptability in adversarial environments but show instability in skill play, as revealed by PLG analysis.

Conclusion: The framework effectively evaluates LLMs’ strategic capabilities, highlighting their strengths and areas needing further exploration.

Abstract: Adversarial board games, as a paradigmatic domain of strategic reasoning and intelligence, have long served as both a popular competitive activity and a benchmark for evaluating artificial intelligence (AI) systems. Building on this foundation, we propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board game competition, compensating for the data dependency of mainstream Question-and-Answer (Q&A) based benchmarks. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players. The platform employs both the Elo rating system and a novel Performance Loop Graph (PLG) to quantitatively evaluate the technical capabilities of LLMs, while also capturing Positive Sentiment Score (PSS) throughout gameplay to assess mental fitness. The evaluation is structured as a round-robin tournament, enabling systematic comparison across players. Experimental results indicate that, despite technical differences, most LLMs remain optimistic about winning and losing, demonstrating greater adaptability to high-stress adversarial environments than humans. On the other hand, the complex relationship between cyclic wins and losses in PLGs exposes the instability of LLMs’ skill play during games, warranting further explanation and exploration.
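
For readers unfamiliar with Elo ratings in a round-robin setting, here is a minimal sketch of the standard update rule. The K-factor of 32, the starting rating of 1500, and the player names are assumptions; the summary does not state which settings Qi Town uses.

```python
from itertools import combinations


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that player A scores against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return both new ratings; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def round_robin_pairs(players):
    """Every unordered pairing of players, as in a single round-robin."""
    return list(combinations(players, 2))


if __name__ == "__main__":
    players = [f"llm_{i}" for i in range(20)]  # hypothetical player names
    ratings = {p: 1500.0 for p in players}
    pairs = round_robin_pairs(players)
    print(len(pairs), "games per round-robin")
    a, b = pairs[0]
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a=1.0)
    print(ratings[a], ratings[b])
```

With 20 players a single round-robin has 190 pairings per game type, and the update is zero-sum: whatever one player gains the other loses.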

[258] ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis

Huiya Zhao, Yinghao Zhu, Zixiang Wang, Yasha Wang, Junyi Gao, Liantao Ma

Main category: cs.AI

TL;DR: HealthFlow is a self-evolving AI agent for healthcare that improves strategic planning through meta-level evolution, outperforming existing frameworks.

Details

Motivation: Static AI strategies limit effectiveness in healthcare; HealthFlow aims to enable autonomous strategic learning.

Method: HealthFlow uses meta-level evolution to refine problem-solving policies, tested with the EHRFlowBench benchmark.

Result: HealthFlow significantly outperforms state-of-the-art AI frameworks in healthcare tasks.

Conclusion: The work advances AI from tool-users to self-evolving task-managers, enhancing autonomy in scientific discovery.

Abstract: The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow’s self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.

[259] Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS)

Mahdi Nazari Ashani, Ali Asghar Alesheikh, Saba Kazemi, Kimya Kheirkhah, Yasin Mohammadi, Fatemeh Rezaie, Amir Mahdi Manafi, Hedieh Zarkesh

Main category: cs.AI

TL;DR: The paper compares three methods for autonomous web-based GIS (AWebGIS), favoring a client-side small language model (SLM) for accuracy, privacy, and scalability.

Details

Motivation: Current AWebGIS solutions rely on cloud-based LLMs, posing privacy and scalability issues due to internet dependency and centralized processing.

Method: Three approaches were tested: (1) cloud-based LLMs, (2) offline classical ML classifiers, and (3) client-side fine-tuned SLM (T5-small).

Result: The client-side SLM method achieved the highest accuracy (0.93 exact matching, 0.99 Levenshtein similarity, 0.98 for both ROUGE-1 and ROUGE-L) and reduced server load.

Conclusion: Client-side SLMs are feasible for AWebGIS, offering accuracy, privacy, and scalability without server-based inference.

Abstract: Autonomous web-based geographical information systems (AWebGIS) aim to perform geospatial operations from natural language input, providing intuitive, intelligent, and hands-free interaction. However, most current solutions rely on cloud-based large language models (LLMs), which require continuous internet access and raise privacy and scalability concerns for users due to centralized server processing. This study compares three approaches to enabling AWebGIS: (1) a fully-automated online method using cloud-based LLMs (e.g., Cohere); (2) a semi-automated offline method using classical machine learning classifiers such as support vector machine and random forest; and (3) a fully autonomous offline (client-side) method based on a fine-tuned small language model (SLM), specifically the T5-small model, executed in the client’s web browser. The third approach, which leverages SLMs, achieved the highest accuracy among all methods, with an exact matching accuracy of 0.93, Levenshtein similarity of 0.99, and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores of 0.98 for both ROUGE-1 and ROUGE-L. Crucially, this client-side computation strategy reduces the load on backend servers by offloading processing to the user’s device, eliminating the need for server-based inference. These results highlight the feasibility of browser-executable models for AWebGIS solutions.
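
To make the reported metrics concrete, here is a minimal sketch of exact-match accuracy and a normalized Levenshtein similarity. The normalization (one minus edit distance divided by the longer string's length) is an assumption, since the abstract does not give the paper's exact formula, and the example command strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and gold commands for a web GIS.
preds = ["zoom_in(2)", "add_layer('roads')"]
golds = ["zoom_in(2)", "add_layer('road')"]
accuracy = exact_match_accuracy(preds, golds)
similarity = levenshtein_similarity(preds[1], golds[1])
```

The gap between the two metrics in this toy case mirrors the abstract's numbers: a near-miss output scores zero on exact match but stays close to 1 on edit-distance similarity.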

[260] Cognitive Duality for Adaptive Web Agents

Jiarun Liu, Chunhong Zhang, Zheng Hu

Main category: cs.AI

TL;DR: The paper introduces CogniWeb, a modular web agent architecture inspired by human dual-process theory, combining offline and online learning for efficient and effective web navigation.

Details

Motivation: Web navigation is a challenging AGI task requiring complex decision-making. Current methods lack integration of offline and online learning paradigms.

Method: The paper proposes a dual-process framework (System 1 for fast intuitive actions, System 2 for deliberate planning) implemented in CogniWeb, which adapts based on task complexity.

Result: CogniWeb achieves a 43.96% success rate on WebArena with a 75% reduction in token usage, demonstrating competitive performance and efficiency.

Conclusion: The dual-process approach effectively bridges offline and online learning, offering a scalable and efficient solution for web navigation tasks.

Abstract: Web navigation represents a critical and challenging domain for evaluating artificial general intelligence (AGI), demanding complex decision-making within high-entropy, dynamic environments with combinatorially explosive action spaces. Current approaches to building autonomous web agents either focus on offline imitation learning or online exploration, but rarely integrate both paradigms effectively. Inspired by the dual-process theory of human cognition, we derive a principled decomposition into fast System 1 and slow System 2 cognitive processes. This decomposition provides a unifying perspective on existing web agent methodologies, bridging the gap between offline learning of intuitive reactive behaviors and online acquisition of deliberative planning capabilities. We implement this framework in CogniWeb, a modular agent architecture that adaptively toggles between fast intuitive processing and deliberate reasoning based on task complexity. Our evaluation on WebArena demonstrates that CogniWeb achieves competitive performance (43.96% success rate) while maintaining significantly higher efficiency (75% reduction in token usage).
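
The adaptive toggling described above can be pictured with a small sketch. Everything here is hypothetical: the class name, the word-count complexity heuristic, the 0.5 threshold, and the stub policies are illustrative assumptions, not CogniWeb's actual components.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class DualProcessAgent:
    """Routes a task to a fast reactive policy or a slow deliberate planner."""
    fast_policy: Callable[[str], str]    # "System 1": cheap, intuitive action
    slow_planner: Callable[[str], str]   # "System 2": costly, multi-step plan
    complexity: Callable[[str], float]   # assumed to return a score in [0, 1]
    threshold: float = 0.5               # assumed switching point

    def act(self, task: str) -> Tuple[str, str]:
        """Return (system_used, action) for the given task description."""
        if self.complexity(task) < self.threshold:
            return "system1", self.fast_policy(task)
        return "system2", self.slow_planner(task)


def word_count_complexity(task: str) -> float:
    """Toy heuristic: longer instructions are treated as more complex."""
    return min(1.0, len(task.split()) / 20.0)


agent = DualProcessAgent(
    fast_policy=lambda task: "click(best_matching_element)",
    slow_planner=lambda task: "plan([search, filter, compare, submit])",
    complexity=word_count_complexity,
)

simple = agent.act("Open the cart")
hard = agent.act(
    "Find the cheapest laptop with 16 GB of memory and add it to the cart"
)
```

Routing only the harder tasks to the expensive planner is one plausible way a token-usage reduction like the one reported could arise, though the abstract does not describe the switching rule.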

[261] Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning

Chang Tian, Matthew B. Blaschko, Mingzhe Xing, Xiuxing Li, Yinliang Yue, Marie-Francine Moens

Main category: cs.AI

TL;DR: RL fine-tuning improves LLM reasoning in ideal settings but fails in non-ideal scenarios like summary inference, noise suppression, and contextual filtering, revealing critical limitations.

Details

Motivation: To address the gap in evaluating LLM reasoning under realistic, non-ideal scenarios, inspired by human reasoning reliability under imperfect inputs.

Method: Fine-tuned three LLMs and an LVLM using RL with a policy-gradient algorithm, tested on eight datasets across three non-ideal scenarios.

Result: RL fine-tuning boosts performance in ideal settings, but performance declines significantly in non-ideal scenarios, and the proposed remediation method leaves the deficits largely unresolved.

Conclusion: Current LLM reasoning capabilities are overstated; non-ideal scenario evaluation is crucial for realistic assessment.

Abstract: Reinforcement learning (RL) has become a key technique for enhancing the reasoning abilities of large language models (LLMs), with policy-gradient algorithms dominating the post-training stage because of their efficiency and effectiveness. However, most existing benchmarks evaluate large-language-model reasoning under idealized settings, overlooking performance in realistic, non-ideal scenarios. We identify three representative non-ideal scenarios with practical relevance: summary inference, fine-grained noise suppression, and contextual filtering. We introduce a new research direction guided by brain-science findings that human reasoning remains reliable under imperfect inputs. We formally define and evaluate these challenging scenarios. We fine-tune three LLMs and a state-of-the-art large vision-language model (LVLM) using RL with a representative policy-gradient algorithm and then test their performance on eight public datasets. Our results reveal that while RL fine-tuning improves baseline reasoning under idealized settings, performance declines significantly across all three non-ideal scenarios, exposing critical limitations in advanced reasoning capabilities. Although we propose a scenario-specific remediation method, our results suggest current methods leave these reasoning deficits largely unresolved. This work highlights that the reasoning abilities of large models are often overstated and underscores the importance of evaluating models under non-ideal scenarios. The code and data will be released at XXXX.

[262] Beyond Automation: Socratic AI, Epistemic Agency, and the Implications of the Emergence of Orchestrated Multi-Agent Learning Architectures

Peer-Benedikt Degen, Igor Asanov

Main category: cs.AI

TL;DR: The paper evaluates a Socratic AI Tutor’s impact on student learning, showing it enhances critical thinking compared to uninstructed AI. It proposes a shift to orchestrated multi-agent systems (MAS) for education, discussing implications and scalability.

Details

Motivation: To address concerns about generative AI de-skilling students and explore its potential to enhance metacognitive engagement in education.

Method: A controlled experiment with 65 pre-service teachers in Germany, comparing the Socratic AI Tutor to an uninstructed AI chatbot.

Result: Students using the Socratic Tutor reported greater support for critical, independent, and reflective thinking.

Conclusion: The study advocates for pedagogically aligned multi-agent systems in education, offering empirical support and a conceptual framework for human-AI co-agency.

Abstract: Generative AI is no longer a peripheral tool in higher education. It is rapidly evolving into a general-purpose infrastructure that reshapes how knowledge is generated, mediated, and validated. This paper presents findings from a controlled experiment evaluating a Socratic AI Tutor, a large language model designed to scaffold student research question development through structured dialogue grounded in constructivist theory. Conducted with 65 pre-service teacher students in Germany, the study compares interaction with the Socratic Tutor to engagement with an uninstructed AI chatbot. Students using the Socratic Tutor reported significantly greater support for critical, independent, and reflective thinking, suggesting that dialogic AI can stimulate metacognitive engagement, and challenging recent narratives of de-skilling due to generative AI usage. These findings serve as a proof of concept for a broader pedagogical shift: the use of multi-agent systems (MAS) composed of specialised AI agents. To conceptualise this, we introduce the notion of orchestrated MAS: modular, pedagogically aligned agent constellations, curated by educators, that support diverse learning trajectories through differentiated roles and coordinated interaction. To anchor this shift, we propose an adapted offer-and-use model, in which students appropriate instructional offers from these agents. Beyond technical feasibility, we examine system-level implications for higher education institutions and students, including funding necessities, changes to faculty roles, curricula, competencies and assessment practices. We conclude with a comparative cost-effectiveness analysis highlighting the scalability of such systems. In sum, this study contributes both empirical evidence and a conceptual roadmap for hybrid learning ecosystems that embed human-AI co-agency and pedagogical alignment.

[263] The Docking Game: Loop Self-Play for Fast, Dynamic, and Accurate Prediction of Flexible Protein–Ligand Binding

Youzhi Zhang, Yufei Li, Gaofeng Meng, Hongbin Liu, Jiebo Luo

Main category: cs.AI

TL;DR: A game-theoretic framework (Docking Game) with LoopPlay algorithm improves molecular docking accuracy by 10% over state-of-the-art methods.

Details

Motivation: Current multi-task learning models underperform in ligand docking due to structural complexities.

Method: Proposes a two-player game (ligand vs. protein) solved by LoopPlay, alternating training with mutual adaptation.

Result: 10% improvement in binding mode prediction accuracy on benchmark datasets.

Conclusion: LoopPlay enhances molecular docking accuracy, aiding drug discovery.

Abstract: Molecular docking is a crucial aspect of drug discovery, as it predicts the binding interactions between small-molecule ligands and protein pockets. However, current multi-task learning models for docking often show inferior performance in ligand docking compared to protein pocket docking. This disparity arises largely due to the distinct structural complexities of ligands and proteins. To address this issue, we propose a novel game-theoretic framework that models the protein-ligand interaction as a two-player game called the Docking Game, with the ligand docking module acting as the ligand player and the protein pocket docking module as the protein player. To solve this game, we develop a novel Loop Self-Play (LoopPlay) algorithm, which alternately trains these players through a two-level loop. In the outer loop, the players exchange predicted poses, allowing each to incorporate the other’s structural predictions, which fosters mutual adaptation over multiple iterations. In the inner loop, each player dynamically refines its predictions by incorporating its own predicted ligand or pocket poses back into its model. We theoretically show the convergence of LoopPlay, ensuring stable optimization. Extensive experiments conducted on public benchmark datasets demonstrate that LoopPlay achieves approximately a 10% improvement in predicting accurate binding modes compared to previous state-of-the-art methods. This highlights its potential to enhance the accuracy of molecular docking in drug discovery.
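
The two-level loop structure described in this abstract can be sketched as plain control flow. This is only an illustration under strong assumptions: poses are reduced to single floats, the step functions are toy contractions rather than learned docking modules, and no training takes place. The function and parameter names are hypothetical.

```python
def loop_play(ligand_step, pocket_step, ligand_pose, pocket_pose,
              outer_iters=5, inner_iters=3):
    """Control-flow sketch of a two-level alternating loop.

    Outer loop: the two players exchange their latest predicted poses.
    Inner loop: each player repeatedly feeds its own prediction back in,
    conditioned on the pose it received from the other player.
    """
    for _ in range(outer_iters):
        # Exchange step: each player sees a frozen copy of the other's pose.
        seen_pocket, seen_ligand = pocket_pose, ligand_pose
        for _ in range(inner_iters):
            ligand_pose = ligand_step(ligand_pose, seen_pocket)
        for _ in range(inner_iters):
            pocket_pose = pocket_step(pocket_pose, seen_ligand)
    return ligand_pose, pocket_pose


def toy_step(own_pose: float, other_pose: float) -> float:
    """Toy refinement: move halfway toward the other player's pose."""
    return own_pose + 0.5 * (other_pose - own_pose)


# Hypothetical starting poses that disagree by 10 units.
ligand, pocket = loop_play(toy_step, toy_step, 0.0, 10.0, outer_iters=30)
```

With these toy steps the disagreement between the two poses shrinks by a constant factor per outer iteration, which gives an intuition for mutual adaptation; it says nothing about the convergence proof in the paper.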

[264] Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses

Bin Han, Robert Wolfe, Anat Caspi, Bill Howe

Main category: cs.AI

TL;DR: LLMs show promise for integrating urban spatial data but struggle with macro-scale spatial reasoning. A review-and-refine method improves accuracy, positioning LLMs as a flexible alternative to rule-based approaches.

Details

Motivation: Traditional rule-based and machine learning methods for spatial data integration are limited. LLMs offer a potential solution but need evaluation for spatial reasoning and practical application.

Method: Investigated LLMs’ spatial reasoning and adapted a review-and-refine method to correct errors.

Result: LLMs perform well with relevant features but struggle with macro-scale reasoning. The review-and-refine method effectively improves accuracy.

Conclusion: LLMs are a promising alternative for spatial data integration, with future research needed for post-training and multi-modal methods.

Abstract: We explore the application of large language models (LLMs) to empower domain experts in integrating large, heterogeneous, and noisy urban spatial datasets. Traditional rule-based integration methods are unable to cover all edge cases, requiring manual verification and repair. Machine learning approaches require collecting and labeling of large numbers of task-specific samples. In this study, we investigate the potential of LLMs for spatial data integration. Our analysis first considers how LLMs reason about environmental spatial relationships mediated by human experience, such as between roads and sidewalks. We show that while LLMs exhibit spatial reasoning capabilities, they struggle to connect the macro-scale environment with the relevant computational geometry tasks, often producing logically incoherent responses. But when provided relevant features, thereby reducing dependence on spatial reasoning, LLMs are able to generate high-performing results. We then adapt a review-and-refine method, which proves remarkably effective in correcting erroneous initial responses while preserving accurate responses. We discuss practical implications of employing LLMs for spatial data integration in real-world contexts and outline future research directions, including post-training, multi-modal integration methods, and support for diverse data formats. Our findings position LLMs as a promising and flexible alternative to traditional rule-based heuristics, advancing the capabilities of adaptive spatial data integration.

[265] MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models

Dexuan Xu, Jieyi Wang, Zhongyan Chai, Yongzhi Cao, Hanpin Wang, Huamin Zhang, Yu Huang

Main category: cs.AI

TL;DR: MedMKEB is the first benchmark for evaluating multimodal medical knowledge editing in MLLMs, addressing gaps in reliability, generality, and robustness.

Details

Motivation: Existing MLLMs lack systematic benchmarks for updating medical knowledge involving both image and text modalities, necessitating MedMKEB.

Method: MedMKEB is built on a medical visual QA dataset, enriched with tasks like counterfactual correction and adversarial robustness, validated by experts.

Result: Experiments reveal limitations in current knowledge editing approaches, underscoring the need for specialized strategies in medicine.

Conclusion: MedMKEB aims to advance trustworthy and efficient medical knowledge editing algorithms as a standard benchmark.

Abstract: Recent advances in multimodal large language models (MLLMs) have significantly improved medical AI, enabling it to unify the understanding of visual and textual information. However, as medical knowledge continues to evolve, it is critical to allow these models to efficiently update outdated or incorrect information without retraining from scratch. Although textual knowledge editing has been widely studied, there is still a lack of systematic benchmarks for multimodal medical knowledge editing involving image and text modalities. To fill this gap, we present MedMKEB, the first comprehensive benchmark designed to evaluate the reliability, generality, locality, portability, and robustness of knowledge editing in medical multimodal large language models. MedMKEB is built on a high-quality medical visual question-answering dataset and enriched with carefully constructed editing tasks, including counterfactual correction, semantic generalization, knowledge transfer, and adversarial robustness. We incorporate human expert validation to ensure the accuracy and reliability of the benchmark. Extensive single editing and sequential editing experiments on state-of-the-art general and medical MLLMs demonstrate the limitations of existing knowledge-based editing approaches in medicine, highlighting the need to develop specialized editing strategies. MedMKEB will serve as a standard benchmark to promote the development of trustworthy and efficient medical knowledge editing algorithms.

[266] EasySize: Elastic Analog Circuit Sizing via LLM-Guided Heuristic Search

Xinyue Wu, Fan Hu, Shaik Jani Babu, Yi Zhao, Xinfei Guo

Main category: cs.AI

TL;DR: EasySize is a lightweight gate sizing framework using a finetuned Qwen3-8B model, achieving universal applicability and outperforming existing methods with reduced computational resources.

Details

Motivation: Analog circuit design is time-consuming and experience-driven, with existing AI methods lacking portability and efficiency across technology nodes.

Method: Combines a finetuned Qwen3-8B model with dynamic task-specific loss functions and heuristic search (DE and PSO) in a feedback-enhanced flow.

Result: Achieves strong performance across multiple technology nodes without additional training, outperforming AutoCkt with significant resource savings.

Conclusion: EasySize reduces reliance on human expertise and computational resources, simplifying analog circuit design.

Abstract: Analog circuit design is a time-consuming, experience-driven task in chip development. Despite advances in AI, developing universal, fast, and stable gate sizing methods for analog circuits remains a significant challenge. Recent approaches combine Large Language Models (LLMs) with heuristic search techniques to enhance generalizability, but they often depend on large model sizes and lack portability across different technology nodes. To overcome these limitations, we propose EasySize, the first lightweight gate sizing framework based on a finetuned Qwen3-8B model, designed for universal applicability across process nodes, design specifications, and circuit topologies. EasySize exploits the varying Ease of Attainability (EOA) of performance metrics to dynamically construct task-specific loss functions, enabling efficient heuristic search through global Differential Evolution (DE) and local Particle Swarm Optimization (PSO) within a feedback-enhanced flow. Although finetuned solely on 350nm node data, EasySize achieves strong performance on 5 operational amplifier (Op-Amp) netlists across 180nm, 45nm, and 22nm technology nodes without additional targeted training, and outperforms AutoCkt, a widely-used Reinforcement Learning based sizing framework, on 86.67% of tasks while reducing simulation resources by more than 96.67%. We argue that EasySize can significantly reduce the reliance on human expertise and computational resources in gate sizing, thereby accelerating and simplifying the analog circuit design process. EasySize will be open-sourced at a later date.

[267] Graph-based Event Log Repair

Sebastiano Dissegna, Chiara Di Francescomarino, Massimiliano Ronzani

Main category: cs.AI

TL;DR: The paper proposes a Heterogeneous Graph Neural Network model for reconstructing missing event attributes in Process Mining, outperforming state-of-the-art methods.

Details

Motivation: Event logs often contain missing data, and existing methods either rely on process models or machine learning, lacking flexibility. Graph Neural Networks offer a more natural representation for complex traces.

Method: Develops a Heterogeneous Graph Neural Network to reconstruct missing event attributes in traces, evaluated on synthetic and real logs.

Result: The model performs well in reconstructing all event attributes, surpassing autoencoder-based approaches.

Conclusion: The proposed method effectively addresses the challenge of missing data in Process Mining, offering a robust solution.

Abstract: The quality of event logs in Process Mining is crucial when applying any form of analysis to them. In real-world event logs, the acquisition of data can be non-trivial (e.g., due to the execution of manual activities and related manual recording or to issues in collecting, for each event, all its attributes), and often may end up with events recorded with some missing information. Standard approaches to the problem of trace (or log) reconstruction either require the availability of a process model that is used to fill missing values by leveraging different reasoning techniques or employ a Machine Learning/Deep Learning model to restore the missing values by learning from similar cases. In recent years, a new type of Deep Learning model that is capable of handling input data encoded as graphs has emerged, namely Graph Neural Networks. Graph Neural Network models, and even more so Heterogeneous Graph Neural Networks, offer the advantage of working with a more natural representation of complex multi-modal sequences like the execution traces in Process Mining, allowing for more expressive and semantically rich encodings. In this work, we focus on the development of a Heterogeneous Graph Neural Network model that, given a trace containing some incomplete events, will return the full set of attributes missing from those events. We evaluate our work against a state-of-the-art approach leveraging autoencoders on two synthetic logs and four real event logs, on different types of missing values. Different from state-of-the-art model-free approaches, which mainly focus on repairing a subset of event attributes, the proposed approach shows very good performance in reconstructing all different event attributes.

[268] QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering

Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, Qing Li

Main category: cs.AI

TL;DR: QA-Dragon enhances VQA by dynamically retrieving from text and images, improving accuracy in complex tasks.

Details

Motivation: Existing RAG methods retrieve from text or images separately, limiting performance in multi-hop reasoning and up-to-date knowledge tasks.

Method: QA-Dragon uses a domain router for subject identification and a search router for dynamic retrieval, combining text and image search agents.

Result: Outperforms baselines by 5.06% (single-source), 6.35% (multi-source), and 5.03% (multi-turn) in the Meta CRAG-MM Challenge.

Conclusion: QA-Dragon effectively addresses complex VQA tasks with multimodal, multi-turn, and multi-hop reasoning.

Abstract: Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query’s subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
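
The two-router idea can be illustrated with a small keyword-based sketch. All names, keyword lists, and routing rules below are hypothetical stand-ins; the abstract does not say how QA-Dragon's routers are implemented, and a real system would likely use model-based classification rather than keywords.

```python
import re

# Hypothetical domain vocabulary, for illustration only.
DOMAIN_KEYWORDS = {
    "vehicle": {"car", "truck", "engine"},
    "food": {"dish", "recipe", "calories"},
}
# Hypothetical cues that a query refers to the attached image.
IMAGE_CUES = {"this", "that", "shown", "pictured"}
# Hypothetical cues that a query needs external or up-to-date facts.
TEXT_CUES = {"latest", "current", "price", "founded", "released"}


def tokenize(query: str) -> set:
    """Lowercase the query and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", query.lower()))


def route_domain(query: str) -> str:
    """Domain router: pick the first domain whose keywords appear."""
    words = tokenize(query)
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if words & keywords:
            return domain
    return "general"


def route_search(query: str, has_image: bool) -> list:
    """Search router: choose which retrieval agents to call (possibly none)."""
    words = tokenize(query)
    tools = []
    if has_image and words & IMAGE_CUES:
        tools.append("image_search")
    if words & TEXT_CUES:
        tools.append("text_search")
    return tools


query = "What is the latest price of this car?"
plan = (route_domain(query), route_search(query, has_image=True))
```

Returning a list of tools rather than a single choice lets one query trigger both image and text retrieval, matching the hybrid setup the abstract describes.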

[269] An Explainable Natural Language Framework for Identifying and Notifying Target Audiences In Enterprise Communication

Vítor N. Lourenço, Mohnish Dubey, Yunfei Bai, Audrey Depeige, Vivek Jain

Main category: cs.AI

TL;DR: A framework combining RDF graph databases and LLMs improves expert identification and communication in large-scale maintenance organizations by processing natural language queries with transparent reasoning.

Details

Motivation: Traditional communication approaches struggle with information overload and slow response times in complex organizational structures.

Method: Uses RDF graph databases and LLMs for natural language query processing, supported by a planning-orchestration architecture for transparent reasoning.

Result: Enables precise audience targeting with explainable results, improving communication efficiency and trust.

Conclusion: The proposed framework effectively addresses communication challenges in large-scale maintenance organizations.

Abstract: In large-scale maintenance organizations, identifying subject matter experts and managing communications across complex entity relationships poses significant challenges – including information overload and longer response times – that traditional communication approaches fail to address effectively. We propose a novel framework that combines RDF graph databases with LLMs to process natural language queries for precise audience targeting, while providing transparent reasoning through a planning-orchestration architecture. Our solution enables communication owners to formulate intuitive queries combining concepts such as equipment, manufacturers, maintenance engineers, and facilities, delivering explainable results that maintain trust in the system while improving communication efficiency across the organization.

[270] A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents

Andrew Kiruluta

Main category: cs.AI

TL;DR: A hybrid architecture combining decision trees and LLMs improves reasoning performance and interpretability across benchmarks like ProofWriter, GSM8k, and ARC.

Details

Motivation: To bridge the gap between symbolic reasoning and neural modules by embedding decision trees as callable oracles within a unified system for interpretable and robust reasoning.

Method: Integrates decision trees and random forests with LLMs in a multi-agent framework, using a central orchestrator for belief consistency and communication.

Result: Achieves performance gains: +7.2% on ProofWriter, +5.3% on GSM8k, and +6.0% on ARC. Demonstrates effectiveness in clinical and scientific applications.

Conclusion: The hybrid architecture provides a robust, interpretable, and extensible solution for neuro-symbolic reasoning.

Abstract: We propose a hybrid architecture that integrates decision tree-based symbolic reasoning with the generative capabilities of large language models (LLMs) within a coordinated multi-agent framework. Unlike prior approaches that loosely couple symbolic and neural modules, our design embeds decision trees and random forests as callable oracles within a unified reasoning system. Tree-based modules enable interpretable rule inference and causal logic, while LLM agents handle abductive reasoning, generalization, and interactive planning. A central orchestrator maintains belief state consistency and mediates communication across agents and external tools, enabling reasoning over both structured and unstructured inputs. The system achieves strong performance on reasoning benchmarks. On ProofWriter, it improves entailment consistency by +7.2% through logic-grounded tree validation. On GSM8k, it achieves +5.3% accuracy gains in multistep mathematical problems via symbolic augmentation. On ARC, it boosts abstraction accuracy by +6.0% through integration of symbolic oracles. Applications in clinical decision support and scientific discovery show how the system encodes domain rules symbolically while leveraging LLMs for contextual inference and hypothesis generation. This architecture offers a robust, interpretable, and extensible solution for general-purpose neuro-symbolic reasoning.
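
As a rough picture of a decision tree used as a callable oracle, the sketch below has an orchestrator check a proposed answer against a hand-written tree and return the rule path as an explanation. The tree, the feature name, the stub agent, and the override policy are all hypothetical; the paper's actual interfaces are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    feature: str
    threshold: float
    low: Union["Node", Leaf]    # branch taken when value <= threshold
    high: Union["Node", Leaf]   # branch taken when value > threshold


def tree_oracle(tree: Union[Node, Leaf],
                facts: Dict[str, float]) -> Tuple[str, List[str]]:
    """Walk the tree and return (label, human-readable rule path)."""
    node, path = tree, []
    while isinstance(node, Node):
        go_high = facts[node.feature] > node.threshold
        op = ">" if go_high else "<="
        path.append(f"{node.feature} {op} {node.threshold}")
        node = node.high if go_high else node.low
    return node.label, path


def orchestrate(llm_agent: Callable[[Dict[str, float]], str],
                tree: Union[Node, Leaf],
                facts: Dict[str, float]) -> dict:
    """Query the (stub) agent, then validate its proposal against the oracle.

    Assumed policy: the symbolic oracle's label is treated as authoritative.
    """
    proposal = llm_agent(facts)
    label, path = tree_oracle(tree, facts)
    return {
        "answer": label,
        "llm_proposal": proposal,
        "consistent": proposal == label,
        "rule_path": path,
    }


# Hypothetical single-rule tree and a stub standing in for an LLM agent.
fever_tree = Node("temperature_c", 38.0, Leaf("no_fever"), Leaf("fever"))
report = orchestrate(lambda facts: "no_fever", fever_tree,
                     {"temperature_c": 39.2})
```

The returned rule path is what makes the symbolic side inspectable: a disagreement between the agent and the oracle comes with the exact threshold test that decided it.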

[271] DSBC : Data Science task Benchmarking with Context engineering

Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Ali Shafique, Suman Debnath, Hamza Farooq

Main category: cs.AI

TL;DR: The paper introduces a benchmark for evaluating data science agents powered by LLMs, testing three models across multiple task categories and prompting approaches, revealing performance disparities and practical deployment factors.

Details

Motivation: To address the lack of systematic benchmarks for evaluating the efficacy and limitations of data science agents in real-world workflows.

Method: A comprehensive benchmark was developed using real-world user interactions, evaluating three LLMs (Claude-4.0-Sonnet, Gemini-2.5-Flash, OpenAI-o4-Mini) across three approaches (zero-shot, multi-step, SmolAgent) and eight task categories, including sensitivity to prompting issues and temperature parameters.

Result: Distinct performance disparities among models and methodologies were identified, highlighting critical factors for practical deployment.

Conclusion: The benchmark and framework aim to support future research for more robust and effective data science agents.

Abstract: Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.

[272] The Term ‘Agent’ Has Been Diluted Beyond Utility and Requires Redefinition

Brinnae Bent

Main category: cs.AI

TL;DR: The paper proposes redefining the term ‘agent’ in AI to address ambiguity, offering a framework with clear requirements and multidimensional characterization.

Details

Motivation: Ambiguity in the term ‘agent’ hinders research communication, evaluation, reproducibility, and policy development.

Method: Historical analysis and contemporary usage patterns inform a framework defining minimum agent requirements and characterizing systems along multiple dimensions.

Result: A clear, standardized definition and framework for ‘agent’ are proposed, improving research clarity and policy effectiveness.

Conclusion: The framework addresses ambiguity, supports reproducibility, and aids policy, with recommendations for adoption and standardization.

Abstract: The term ‘agent’ in artificial intelligence has long carried multiple interpretations across different subfields. Recent developments in AI capabilities, particularly in large language model systems, have amplified this ambiguity, creating significant challenges in research communication, system evaluation and reproducibility, and policy development. This paper argues that the term ‘agent’ requires redefinition. Drawing from historical analysis and contemporary usage patterns, we propose a framework that defines clear minimum requirements for a system to be considered an agent while characterizing systems along a multidimensional spectrum of environmental interaction, learning and adaptation, autonomy, goal complexity, and temporal coherence. This approach provides precise vocabulary for system description while preserving the term’s historically multifaceted nature. After examining potential counterarguments and implementation challenges, we provide specific recommendations for moving forward as a field, including suggestions for terminology standardization and framework adoption. The proposed approach offers practical tools for improving research clarity and reproducibility while supporting more effective policy development.

[273] NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making

Asutosh Hota, Jussi P. P. Jokinen

Main category: cs.AI

TL;DR: NomicLaw simulates multi-agent LLM interactions in legal settings, revealing social reasoning and persuasive strategies through voting and language analysis.

Details

Motivation: To understand LLM behavior in open-ended, multi-agent legal and ethical dilemmas, which remains empirically limited.

Method: NomicLaw, a structured simulation where LLMs propose, justify, and vote on legal rules, with quantitative (voting patterns) and qualitative (strategic language) analysis.

Result: LLMs form alliances, betray trust, and adapt rhetoric, showcasing social reasoning and persuasive skills across ten models.

Conclusion: The study highlights LLMs’ latent capabilities for autonomous negotiation and legislation, informing future AI system design.

Abstract: Recent advancements in large language models (LLMs) have extended their capabilities from basic text processing to complex reasoning tasks, including legal interpretation, argumentation, and strategic interaction. However, empirical understanding of LLM behavior in open-ended, multi-agent settings, especially those involving deliberation over legal and ethical dilemmas, remains limited. We introduce NomicLaw, a structured multi-agent simulation where LLMs engage in collaborative law-making, responding to complex legal vignettes by proposing rules, justifying them, and voting on peer proposals. We quantitatively measure trust and reciprocity via voting patterns and qualitatively assess how agents use strategic language to justify proposals and influence outcomes. Experiments involving homogeneous and heterogeneous LLM groups demonstrate how agents spontaneously form alliances, betray trust, and adapt their rhetoric to shape collective decisions. Our results highlight the latent social reasoning and persuasive capabilities of ten open-source LLMs and provide insights into the design of future AI systems capable of autonomous negotiation, coordination and drafting legislation in legal settings.
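
The abstract says reciprocity is measured via voting patterns but does not give a formula. As a rough illustration only, the sketch below assumes one possible definition: the share of votes that were returned by their recipient in the same round. The function name, the data layout, and the toy rounds are hypothetical and are not taken from the paper.

```python
def mutual_vote_rate(rounds):
    """Share of cast votes that were reciprocated within the same round.

    `rounds` is a list of dicts mapping voter -> agent voted for.
    This definition is an assumption made for illustration.
    """
    cast = 0
    reciprocated = 0
    for votes in rounds:
        for voter, target in votes.items():
            cast += 1
            # A vote counts as reciprocated if the target voted back for the voter.
            if target != voter and votes.get(target) == voter:
                reciprocated += 1
    return reciprocated / cast if cast else 0.0


# Toy data: three agents, two rounds (made up).
rounds = [
    {"A": "B", "B": "A", "C": "A"},
    {"A": "C", "B": "C", "C": "B"},
]
rate = mutual_vote_rate(rounds)
print(f"mutual vote rate: {rate:.2f}")
```

A per-pair version of the same count could be used to spot alliances that form and later break across rounds.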

[274] Minimal Model Reasoning in Description Logics: Don’t Try This at Home!

Federica Di Stefano, Quentin Manière, Magdalena Ortiz, Mantas Šimkus

Main category: cs.AI

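TL;DR: The paper studies reasoning over ‘pure’ minimal models in Description Logics and shows that concept satisfiability is undecidable already for $\mathcal{EL}$, with decidability regained under acyclicity conditions on the TBox.

Details

Motivation: Minimal model reasoning underlies many knowledge representation techniques, yet it is poorly understood in Description Logics, especially when the extensions of all predicates must be minimal.

Method: Complexity analysis of concept satisfiability in minimal models for popular DLs, covering acyclic TBoxes, a connection to pointwise circumscription, data complexity, and the DL-Lite family.

Result: Undecidability already for $\mathcal{EL}$, extending to a very restricted fragment of tuple-generating dependencies; acyclicity brings the worst-case complexity below double exponential time; ExpSpace-hardness holds for DL-Lite$_{\text{horn}}$.

Conclusion: Pure minimal model reasoning is surprisingly hard even in lightweight DLs, and restrictions such as acyclic TBoxes are needed to regain decidability.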

Abstract: Reasoning with minimal models has always been at the core of many knowledge representation techniques, but we still have only a limited understanding of this problem in Description Logics (DLs). Minimization of some selected predicates, letting the remaining predicates vary or be fixed, as proposed in circumscription, has been explored and exhibits high complexity. The case of ‘pure’ minimal models, where the extension of all predicates must be minimal, has remained largely uncharted. We address this problem in popular DLs and obtain surprisingly negative results: concept satisfiability in minimal models is undecidable already for $\mathcal{EL}$. This undecidability also extends to a very restricted fragment of tuple-generating dependencies. To regain decidability, we impose acyclicity conditions on the TBox that bring the worst-case complexity below double exponential time and allow us to establish a connection with the recently studied pointwise circumscription; we also derive results in data complexity. We conclude with a brief excursion to the DL-Lite family, where a positive result was known for DL-Lite$_{\text{core}}$, but our investigation establishes ExpSpace-hardness already for its extension DL-Lite$_{\text{horn}}$.

[275] StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

Xiangxiang Zhang, Jingxuan Wei, Donghong Zhong, Qi Chen, Caijun Jia, Cheng Tan, Jinming Gu, Xiaobo Qin, Zhiping Liu, Liang Hu, Tong Sun, Yuchen Wu, Zewei Sun, Chenwei Lou, Hua Zheng, Tianyang Zhan, Changbao Wang, Shuangzhi Wu, Zefa Lin, Chang Guo, Sihang Yuan, Riwei Chen, Shixiong Zhao, Yingping Zhang, Gaowei Wu, Bihui Yu, Jiahui Wu, Zhehui Zhao, Qianqian Liu, Ruofeng Tang, Xingyue Huang, Bing Zhao, Mengyang Zhang, Youqiang Zhou

Main category: cs.AI

TL;DR: StructVRM introduces fine-grained, verifiable rewards for multimodal reasoning tasks, improving performance on complex benchmarks.

Details

Motivation: Traditional binary rewards are inadequate for guiding models in complex, multi-question reasoning tasks.

Method: StructVRM uses a model-based verifier for sub-question-level feedback, assessing semantic and mathematical equivalence.

Result: Seed-StructVRM achieves state-of-the-art performance on six out of twelve benchmarks and a new STEM-Bench.

Conclusion: Structured, verifiable rewards effectively enhance multimodal models’ reasoning capabilities in complex domains.

Abstract: Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal reasoning with Structured and Verifiable Reward Models. At its core is a model-based verifier trained to provide fine-grained, sub-question-level feedback, assessing semantic and mathematical equivalence rather than relying on rigid string matching. This allows for nuanced, partial credit scoring in previously intractable problem formats. Extensive experiments demonstrate the effectiveness of StructVRM. Our trained model, Seed-StructVRM, achieves state-of-the-art performance on six out of twelve public multimodal benchmarks and our newly curated, high-difficulty STEM-Bench. The success of StructVRM validates that training with structured, verifiable rewards is a highly effective approach for advancing the capabilities of multimodal models in complex, real-world reasoning domains.
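
To make the contrast between a single binary score and sub-question-level credit concrete, here is a minimal sketch. StructVRM uses a trained model-based verifier; the `equivalent` function below is only a hand-written stand-in (numeric comparison via `fractions.Fraction`, otherwise a normalized string match), and all names and example answers are hypothetical.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    return answer.strip().lower().replace(" ", "")


def equivalent(pred: str, ref: str) -> bool:
    """Stand-in for a learned verifier (an assumption, not the paper's model)."""
    try:
        # Treat "0.5" and "1/2" as the same number.
        return Fraction(normalize(pred)) == Fraction(normalize(ref))
    except (ValueError, ZeroDivisionError):
        # Fall back to a case- and space-insensitive string match.
        return normalize(pred) == normalize(ref)


def binary_reward(preds, refs) -> float:
    """One score for the whole response: 1 only if every sub-answer is right."""
    return 1.0 if all(equivalent(p, r) for p, r in zip(preds, refs)) else 0.0


def structured_reward(preds, refs):
    """Per-sub-question scores plus their mean as a partial-credit reward."""
    scores = [1.0 if equivalent(p, r) else 0.0 for p, r in zip(preds, refs)]
    return sum(scores) / len(scores), scores


# Toy three-part question (made up): two parts right, one wrong.
preds = ["0.5", "Paris", "7"]
refs = ["1/2", "paris", "8"]
coarse = binary_reward(preds, refs)
fine, per_part = structured_reward(preds, refs)
print(coarse, fine, per_part)
```

The binary score gives no signal that two of the three parts were answered correctly, while the structured score does.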

[276] An Explainable Machine Learning Framework for Railway Predictive Maintenance using Data Streams from the Metro Operator of Portugal

Silvia García-Méndez, Francisco de Arriba-Pérez, Fátima Leal, Bruno Veloso, Benedita Malheiro, Juan Carlos Burguillo-Rial

Main category: cs.AI

TL;DR: A real-time predictive maintenance solution for Intelligent Transportation Systems using an online processing pipeline with pre-processing, ML classification, and explainability, achieving high accuracy and F-measure.

Details

Motivation: To enhance railway predictive maintenance by enabling real-time fault prediction with explainability, improving service availability and safety.

Method: Proposes an online processing pipeline with sample pre-processing, incremental ML classification, and explainability modules.

Result: Achieves over 98% F-measure and 99% accuracy on the MetroPT dataset, with robust performance under class imbalance and noise.

Conclusion: The pipeline is methodologically sound and practically applicable, enabling proactive maintenance decisions in real-world railway operations.

Abstract: This work contributes to a real-time data-driven predictive maintenance solution for Intelligent Transportation Systems. The proposed method implements a processing pipeline comprised of sample pre-processing, incremental classification with Machine Learning models, and outcome explanation. This novel online processing pipeline has two main highlights: (i) a dedicated sample pre-processing module, which builds statistical and frequency-related features on the fly, and (ii) an explainability module. This work is the first to perform online fault prediction with natural language and visual explainability. The experiments were performed with the MetroPT data set from the metro operator of Porto, Portugal. The results are above 98 % for F-measure and 99 % for accuracy. In the context of railway predictive maintenance, achieving these high values is crucial due to the practical and operational implications of accurate failure prediction. In the specific case of a high F-measure, this ensures that the system maintains an optimal balance between detecting the highest possible number of real faults and minimizing false alarms, which is crucial for maximizing service availability. Furthermore, the accuracy obtained enables reliability, directly impacting cost reduction and increased safety. The analysis demonstrates that the pipeline maintains high performance even in the presence of class imbalance and noise, and its explanations effectively reflect the decision-making process. These findings validate the methodological soundness of the approach and confirm its practical applicability for supporting proactive maintenance decisions in real-world railway operations. Therefore, by identifying the early signs of failure, this pipeline enables decision-makers to understand the underlying problems and act accordingly swiftly.
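
The abstract stresses that a high F-measure reflects a balance between detecting real faults and limiting false alarms. The sketch below illustrates why accuracy alone can be misleading when faults are rare. The confusion counts are invented for illustration and are not results from the MetroPT experiments.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and accuracy from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up stream of 1000 samples containing 20 faults.
# A classifier that never raises an alarm still looks accurate.
never_alarm = metrics(tp=0, fp=0, fn=20, tn=980)
# A classifier that finds 18 of the 20 faults with 4 false alarms.
useful = metrics(tp=18, fp=4, fn=2, tn=976)

print(never_alarm)
print(useful)
```

In this toy case the first classifier reaches 98% accuracy with an F1 of zero, which is why the two metrics are reported together.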

[277] DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, Bo Zheng

Main category: cs.AI

TL;DR: DeepPHY is a benchmark framework to evaluate VLMs’ understanding of physical principles, revealing their struggles with precise predictive control.

Details

Motivation: VLMs lack attention to detail and precise action planning in dynamic environments, necessitating a systematic evaluation of their physical reasoning.

Method: DeepPHY uses simulated environments with varying difficulty levels and fine-grained metrics to assess VLMs.

Result: State-of-the-art VLMs fail to translate descriptive physical knowledge into precise, predictive control.

Conclusion: DeepPHY highlights VLMs’ limitations in physical reasoning, suggesting a need for improved models.

Abstract: Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs’ understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.

[278] Large Language Models Transform Organic Synthesis From Reaction Prediction to Automation

Kartar Kumar Lohana Tharwani, Rajesh Kumar, Sumita, Numan Ahmed, Yong Tang

Main category: cs.AI

TL;DR: LLMs are transforming organic synthesis by proposing routes, predicting outcomes, and automating experiments, aided by integration with other AI tools and real-time data. Challenges include biases and safety, but community efforts aim to democratize access and maintain human oversight.

Details

Motivation: To explore how LLMs are revolutionizing organic synthesis by integrating AI and automation, while addressing challenges like bias and safety.

Method: Coupling LLMs with graph neural networks, quantum calculations, and real-time spectroscopy to enhance synthetic planning and execution.

Result: LLMs accelerate discovery cycles, enable greener chemistry, and support automation, though limitations like biased datasets and opaque reasoning persist.

Conclusion: Community initiatives and technological integration can democratize AI-powered molecular innovation while ensuring human control and safety.

Abstract: Large language models (LLMs) are beginning to reshape how chemists plan and run reactions in organic synthesis. Trained on millions of reported transformations, these text-based models can propose synthetic routes, forecast reaction outcomes and even instruct robots that execute experiments without human supervision. Here we survey the milestones that turned LLMs from speculative tools into practical lab partners. We show how coupling LLMs with graph neural networks, quantum calculations and real-time spectroscopy shrinks discovery cycles and supports greener, data-driven chemistry. We discuss limitations, including biased datasets, opaque reasoning and the need for safety gates that prevent unintentional hazards. Finally, we outline community initiatives (open benchmarks, federated learning and explainable interfaces) that aim to democratize access while keeping humans firmly in control. These advances chart a path towards rapid, reliable and inclusive molecular innovation powered by artificial intelligence and automation.

[279] Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI

Krzysztof Janowicz, Zilong Liu, Gengchen Mai, Zhangyu Wang, Ivan Majic, Alexandra Fortacz, Grant McKenzie, Song Gao

Main category: cs.AI

TL;DR: The paper discusses the challenge of geographic variability in AI alignment, emphasizing the need for context-aware approaches to ensure AI systems align with regional norms and realities.

Details

Motivation: The motivation is to address the underexplored issue of geographic variability in AI alignment, where cultural, political, and legal differences impact what is considered appropriate or truthful.

Method: The paper reviews key geographic research problems, suggests future work topics, and outlines methods for assessing alignment sensitivity.

Result: The paper highlights the divergence of AI alignment outcomes from statistical realities and the need for spatio-temporally aware alignment.

Conclusion: The conclusion underscores the urgency of adopting context-sensitive alignment approaches for AI systems, especially as AI scales globally.

Abstract: AI (super) alignment describes the challenge of ensuring (future) AI systems behave in accordance with societal norms and goals. While a quickly evolving literature is addressing biases and inequalities, the geographic variability of alignment remains underexplored. Simply put, what is considered appropriate, truthful, or legal can differ widely across regions due to cultural norms, political realities, and legislation. Alignment measures applied to AI/ML workflows can sometimes produce outcomes that diverge from statistical realities, such as text-to-image models depicting balanced gender ratios in company leadership despite existing imbalances. Crucially, some model outputs are globally acceptable, while others, e.g., questions about Kashmir, depend on knowing the user’s location and their context. This geographic sensitivity is not new. For instance, Google Maps renders Kashmir’s borders differently based on user location. What is new is the unprecedented scale and automation with which AI now mediates knowledge, expresses opinions, and represents geographic reality to millions of users worldwide, often with little transparency about how context is managed. As we approach Agentic AI, the need for spatio-temporally aware alignment, rather than one-size-fits-all approaches, is increasingly urgent. This paper reviews key geographic research problems, suggests topics for future work, and outlines methods for assessing alignment sensitivity.

[280] Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?

Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti

Main category: cs.AI

TL;DR: The paper introduces Bench-2-CoP, a framework to evaluate AI benchmarks against the EU AI Act’s requirements, revealing significant gaps in assessing systemic risks like loss-of-control scenarios.

Details

Motivation: Address the mismatch between current AI benchmarks and regulatory needs under the EU AI Act, focusing on systemic risks.

Method: Uses LLM-as-judge analysis to map 194,955 benchmark questions against the EU AI Act’s taxonomy of capabilities and propensities.

Result: Finds major misalignment, with benchmarks overemphasizing behavioral propensities (e.g., hallucination, bias) and neglecting critical functional capabilities (e.g., loss-of-control risks).

Conclusion: Highlights the need for policymakers and developers to refine evaluation tools to better address systemic risks and regulatory compliance.

Abstract: The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this “benchmark-regulation gap.” We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act’s taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem is overwhelmingly focused on a narrow set of behavioral propensities, such as “Tendency to hallucinate” (53.7% of the corpus) and “Discriminatory bias” (28.9%), while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This translates to a near-total evaluation gap for systemic risks like “Loss of Control” (0.4% coverage) and “Cyber Offence” (0.8% coverage). This study provides the first comprehensive, quantitative analysis of this gap, offering critical insights for policymakers to refine the CoP and for developers to build the next generation of evaluation tools, ultimately fostering safer and more compliant AI.
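
The coverage figures above are shares of benchmark questions mapped to each taxonomy entry. A minimal sketch of that bookkeeping follows, assuming each question has already received exactly one label from a judge. The single-label assumption, the function name, and the toy counts are illustrative and do not reproduce the paper's pipeline or numbers.

```python
from collections import Counter


def coverage(labels, taxonomy):
    """Share of labelled questions per taxonomy entry (0.0 if never assigned)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (counts[c] / total if total else 0.0) for c in taxonomy}


# Category names are quoted from the abstract; the counts are made up.
taxonomy = [
    "Tendency to hallucinate",
    "Discriminatory bias",
    "Evading human oversight",
]
labels = ["Tendency to hallucinate"] * 6 + ["Discriminatory bias"] * 4

shares = coverage(labels, taxonomy)
uncovered = [c for c, share in shares.items() if share == 0.0]
print(shares)
print("zero coverage:", uncovered)
```

Listing the zero-coverage entries separately mirrors the abstract's point that some capabilities receive no benchmark questions at all.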

[281] Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations?

Burak Can Kaplan, Hugo Cesar De Castro Carneiro, Stefan Wermter

Main category: cs.AI

TL;DR: The paper addresses the scarcity and bias in ERC datasets by using a small, efficient LLM to generate diverse synthetic datasets, improving classifier robustness and performance.

Details

Motivation: ERC datasets are scarce and biased, and LLMs are expensive for ERC tasks. The goal is to create cost-effective synthetic datasets to enhance ERC classification.

Method: A small, resource-efficient LLM is used to generate six synthetic ERC datasets, supplementing three widely used benchmarks.

Result: ERC classifiers trained on synthetic datasets show robustness and significant performance improvements on existing benchmarks.

Conclusion: Synthetic datasets generated by efficient LLMs effectively address ERC data scarcity and bias, enhancing classifier performance.

Abstract: Emotion recognition in conversations (ERC) focuses on identifying emotion shifts within interactions, representing a significant step toward advancing machine intelligence. However, ERC data remains scarce, and existing datasets face numerous challenges due to their highly biased sources and the inherent subjectivity of soft labels. Even though Large Language Models (LLMs) have demonstrated their quality in many affective tasks, they are typically expensive to train, and their application to ERC tasks, particularly in data generation, remains limited. To address these challenges, we employ a small, resource-efficient, and general-purpose LLM to synthesize ERC datasets with diverse properties, supplementing the three most widely used ERC benchmarks. We generate six novel datasets, with two tailored to enhance each benchmark. We evaluate the utility of these datasets to (1) supplement existing datasets for ERC classification, and (2) analyze the effects of label imbalance in ERC. Our experimental results indicate that ERC classifier models trained on the generated datasets exhibit strong robustness and consistently achieve statistically significant performance improvements on existing ERC benchmarks.

[282] InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities

Shuo Cai, Su Lu, Qi Zhou, Kejing Yang, Zhijie Sang, Congkai Xie, Hongxia Yang

Main category: cs.AI

TL;DR: InfiAlign is a scalable, sample-efficient post-training framework combining SFT and DPO to enhance LLM reasoning, reducing data needs by 88% while matching or outperforming benchmarks.

Details

Motivation: Current methods for improving LLM reasoning are resource-intensive and lack scalability. InfiAlign aims to address this with efficient data curation and alignment.

Method: InfiAlign integrates SFT and DPO, using a data selection pipeline with multidimensional quality metrics to curate high-quality alignment data.

Result: Applied to Qwen2.5-Math-7B-Base, InfiAlign matches DeepSeek-R1-Distill-Qwen-7B with 12% of the data, achieving a 3.89% improvement on AIME benchmarks.

Conclusion: InfiAlign offers a scalable, data-efficient solution for aligning large reasoning models, combining principled data selection with full-stage post-training.

Abstract: Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.

[283] GRAIL: Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning

Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu

Main category: cs.AI

TL;DR: GRAIL is a framework combining LLMs with graph retrieval for structured knowledge, improving accuracy and F1 scores in knowledge graph QA tasks.

Details

Motivation: Existing RAG methods struggle with structured knowledge like graphs, and current graph retrieval lacks precision and holistic structure capture.

Method: GRAIL integrates LLM-guided exploration and path filtering for data synthesis, followed by a two-stage training process for dynamic action decisions.

Result: GRAIL improves accuracy by 21.01% and F1 by 22.43% on knowledge graph QA datasets.

Conclusion: GRAIL effectively balances precision and conciseness in graph retrieval, enhancing reasoning performance.

Abstract: Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets are available at https://github.com/Changgeww/GRAIL.

[284] Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation

Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan

Main category: cs.AI

TL;DR: A modular framework for evaluating agent task completion is proposed, focusing on step-by-step reasoning and outperforming LLM-as-a-Judge baselines.

Details

Motivation: Current evaluation methods overlook step-by-step reasoning and are domain-specific, lacking generalizability.

Method: A modular framework decomposes tasks into sub-tasks, validates each step, and aggregates results for a final verdict.

Result: The framework achieves 4.76% and 10.52% higher alignment accuracy than GPT-4o baselines on GAIA and BigCodeBench benchmarks.

Conclusion: The proposed framework shows promise for general-purpose agent evaluation, improving accuracy over existing methods.

Abstract: The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another’s task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent’s output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework.
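
The evaluation flow described above (decompose the task into sub-tasks, validate each against the available evidence, aggregate into a verdict) can be sketched as follows. The paper's modules are agent-based; this toy uses plain Python predicates instead, and the "all sub-tasks must pass" aggregation rule, the class and function names, and the example evidence are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubTask:
    description: str
    check: Callable[[dict], bool]  # inspects the agent's output and log


def judge(subtasks, evidence):
    """Validate every sub-task, then aggregate into one verdict.

    Aggregation here is a strict AND over sub-task results (an assumption).
    """
    results = [(s.description, s.check(evidence)) for s in subtasks]
    verdict = all(ok for _, ok in results)
    return verdict, results


# Hypothetical decomposition of "report the mean of a column in data.csv".
subtasks = [
    SubTask("read the input file",
            lambda e: any("data.csv" in step for step in e["log"])),
    SubTask("computed the statistic",
            lambda e: any("mean" in step for step in e["log"])),
    SubTask("reported a numeric answer",
            lambda e: e["output"].isdigit()),
]

good = {"output": "42", "log": ["opened data.csv", "computed mean"]}
bad = {"output": "42", "log": []}  # right-looking answer, no supporting steps

good_verdict, good_results = judge(subtasks, good)
bad_verdict, bad_results = judge(subtasks, bad)
print(good_verdict, bad_verdict)
```

The second case shows the motivation stated in the abstract: a judge that looks only at the final output would accept it, while step-level validation does not.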

[285] Streamlining Admission with LOR Insights: AI-Based Leadership Assessment in Online Master’s Program

Meryem Yilmaz Soylu, Adrian Gallard, Jeonghyun Lee, Gayane Grigoryan, Rushil Desai, Stephen Harmon

Main category: cs.AI

TL;DR: LORI is an AI tool using NLP and large language models (RoBERTa, LLAMA) to analyze leadership skills in LORs for graduate admissions, achieving high accuracy (F1: 91.6%).

Details

Motivation: LORs are time-consuming to review manually, and leadership skills are critical in STEM admissions.

Method: Uses NLP and large language models (RoBERTa, LLAMA) to detect leadership attributes like teamwork and communication.

Result: RoBERTa model achieves 91.6% F1, 92.4% precision, and 91.6% recall.

Conclusion: LORI streamlines admissions and ensures comprehensive leadership skill evaluation.

Abstract: Letters of recommendation (LORs) provide valuable insights into candidates’ capabilities and experiences beyond standardized test scores. However, reviewing these text-heavy materials is time-consuming and labor-intensive. To address this challenge and support the admission committee in providing feedback for students’ professional growth, our study introduces LORI: LOR Insights, a novel AI-based detection tool for assessing leadership skills in LORs submitted by online master’s program applicants. By employing natural language processing and leveraging large language models using RoBERTa and LLAMA, we seek to identify leadership attributes such as teamwork, communication, and innovation. Our latest RoBERTa model achieves a weighted F1 score of 91.6%, a precision of 92.4%, and a recall of 91.6%, showing a strong level of consistency in our test data. With the growing importance of leadership skills in the STEM sector, integrating LORI into the graduate admissions process is crucial for accurately assessing applicants’ leadership capabilities. This approach not only streamlines the admissions process but also automates and ensures a more comprehensive evaluation of candidates’ capabilities.

Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian

Main category: cs.AI

TL;DR: MV-Debate is a multi-view agent debate framework for detecting harmful content in social media by leveraging diverse interpretive perspectives and dynamic reflection gating.

Details

Motivation: Identifying harmful intent in multimodal social media content is challenging due to cross-modal contradictions, cultural shifts, and subtle cues.

Method: MV-Debate uses four debate agents (surface analyst, deep reasoner, modality contrast, social contextualist) for iterative debate and reflection under a reflection-gain criterion.

Result: MV-Debate outperforms single-model and multi-agent baselines on three benchmark datasets.

Conclusion: Multi-agent debate frameworks like MV-Debate show promise for reliable harmful content detection in online safety-critical contexts.

Abstract: Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents, a surface analyst, a deep reasoner, a modality contrast, and a social contextualist, to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts.

[287] The Missing Reward: Active Inference in the Era of Experience

Bo Wen

Main category: cs.AI

TL;DR: Active Inference (AIF) can bridge the grounded-agency gap in AI by replacing external rewards with intrinsic free-energy minimization, enabling autonomous learning and alignment with human values.

Details

Motivation: Current AI systems rely heavily on human-engineered rewards and data, creating scalability challenges and limiting autonomy.

Method: Proposes integrating AIF with Large Language Models to create agents that learn from self-generated data and minimize free energy.

Result: AIF offers a unified Bayesian framework for autonomous learning, balancing exploration and exploitation.

Conclusion: AIF provides a scalable and principled path toward autonomous AI agents that align with human values.

Abstract: This paper argues that Active Inference (AIF) provides a crucial foundation for developing autonomous AI agents capable of learning from experience without continuous human reward engineering. As AI systems begin to exhaust high-quality training data and rely on increasingly large human workforces for reward design, the current paradigm faces significant scalability challenges that could impede progress toward genuinely autonomous intelligence. The proposal for an “Era of Experience,” where agents learn from self-generated data, is a promising step forward. However, this vision still depends on extensive human engineering of reward functions, effectively shifting the bottleneck from data curation to reward curation. This highlights what we identify as the grounded-agency gap: the inability of contemporary AI systems to autonomously formulate, adapt, and pursue objectives in response to changing circumstances. We propose that AIF can bridge this gap by replacing external reward signals with an intrinsic drive to minimize free energy, allowing agents to naturally balance exploration and exploitation through a unified Bayesian objective. By integrating Large Language Models as generative world models with AIF’s principled decision-making framework, we can create agents that learn efficiently from experience while remaining aligned with human values. This synthesis offers a compelling path toward AI systems that can develop autonomously while adhering to both computational and physical constraints.
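
For context, the “intrinsic drive to minimize free energy” is commonly formalized in the Active Inference literature through the expected free energy of a policy. The decomposition below is the usual textbook form rather than anything stated in this abstract, and the notation is an assumption introduced here: $\pi$ is a policy, $s$ hidden states, $o$ observations, $q$ the agent's approximate posterior, and $p(o)$ a prior over preferred observations.

```latex
G(\pi) =
  - \underbrace{\mathbb{E}_{q(o, s \mid \pi)}\big[\ln q(s \mid o, \pi) - \ln q(s \mid \pi)\big]}_{\text{epistemic value (expected information gain)}}
  - \underbrace{\mathbb{E}_{q(o \mid \pi)}\big[\ln p(o)\big]}_{\text{pragmatic value (expected log preference)}}
```

Under this reading, a policy with low $G(\pi)$ either reduces uncertainty about hidden states (exploration) or makes preferred observations more likely (exploitation), which is how a single objective can trade the two off without a hand-designed reward.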

[288] Simulating Human-Like Learning Dynamics with LLM-Empowered Agents

Yu Yuan, Lili Zhao, Wei Chen, Guangting Zheng, Kai Zhang, Mengdi Zhang, Qi Liu

Main category: cs.AI

TL;DR: LearnerAgent, a multi-agent framework using LLMs, simulates human learning dynamics with psychologically grounded profiles, revealing insights into cognitive growth and LLM behavior.

Details

Motivation: To address limitations in capturing learning dynamics and explainability in human learning behavior research.

Method: Introduces LearnerAgent, a multi-agent framework with psychologically grounded learner profiles (Deep, Surface, Lazy, General) and tracks progress through knowledge acquisition, tests, and peer interaction.

Result: Findings include sustained cognitive growth in Deep Learners, shallow knowledge in Surface Learners, realistic self-concept evolution, and the base LLM’s default behavior as a brittle Surface Learner.

Conclusion: LearnerAgent effectively simulates real learning scenarios, providing deeper insights into LLM behavior and human-like learning dynamics.

Abstract: Capturing human learning behavior based on deep learning methods has become a major research focus in both psychology and intelligent systems. Recent approaches rely on controlled experiments or rule-based models to explore cognitive processes. However, they struggle to capture learning dynamics, track progress over time, or provide explainability. To address these challenges, we introduce LearnerAgent, a novel multi-agent framework based on Large Language Models (LLMs) to simulate a realistic teaching environment. To explore human-like learning dynamics, we construct learners with psychologically grounded profiles, such as Deep, Surface, and Lazy, as well as a persona-free General Learner to inspect the base LLM’s default behavior. Through weekly knowledge acquisition, monthly strategic choices, periodic tests, and peer interaction, we can track the dynamic learning progress of individual learners over a full-year journey. Our findings are fourfold: 1) Longitudinal analysis reveals that only Deep Learner achieves sustained cognitive growth. Our specially designed “trap questions” effectively diagnose Surface Learner’s shallow knowledge. 2) The behavioral and cognitive patterns of distinct learners align closely with their psychological profiles. 3) Learners’ self-concept scores evolve realistically, with the General Learner developing surprisingly high self-efficacy despite its cognitive limitations. 4) Critically, the default profile of base LLM is a “diligent but brittle Surface Learner”: an agent that mimics the behaviors of a good student but lacks true, generalizable understanding. Extensive simulation experiments demonstrate that LearnerAgent aligns well with real scenarios, yielding more insightful findings about LLMs’ behavior.

[289] Unified Bayesian Frameworks for Multi-criteria Decision-making Problems

Majid Mohammadi

Main category: cs.AI

TL;DR: Bayesian frameworks for multi-criteria decision-making (MCDM) address challenges like group decision-making and criteria correlation, accommodating diverse uncertainties in preferences. A probabilistic mixture model identifies DM subgroups, and a ranking scheme assesses criteria importance. Validated through experiments, the frameworks outperform alternatives.

Details

Motivation: To provide statistically elegant solutions for MCDM challenges, such as group decision-making and criteria correlation, while accommodating diverse uncertainties in decision makers' preferences.

Method: Develops Bayesian frameworks, including a probabilistic mixture model for large-scale group MCDM and a probabilistic ranking scheme for criteria and alternatives.

Result: Validated through numerical examples, the frameworks effectively address MCDM challenges and outperform alternative methods.

Conclusion: The proposed Bayesian frameworks offer flexible, statistically robust solutions for MCDM, demonstrating effectiveness in handling uncertainties and group dynamics.

Abstract: This paper introduces Bayesian frameworks for tackling various aspects of multi-criteria decision-making (MCDM) problems, leveraging a probabilistic interpretation of MCDM methods and challenges. By harnessing the flexibility of Bayesian models, the proposed frameworks offer statistically elegant solutions to key challenges in MCDM, such as group decision-making problems and criteria correlation. Additionally, these models can accommodate diverse forms of uncertainty in decision makers’ (DMs) preferences, including normal and triangular distributions, as well as interval preferences. To address large-scale group MCDM scenarios, a probabilistic mixture model is developed, enabling the identification of homogeneous subgroups of DMs. Furthermore, a probabilistic ranking scheme is devised to assess the relative importance of criteria and alternatives based on DM(s) preferences. Through experimentation on various numerical examples, the proposed frameworks are validated, demonstrating their effectiveness and highlighting their distinguishing features in comparison to alternative methods.
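
One common way to turn a Bayesian posterior over criteria weights into a probabilistic ranking is to estimate, for each pair of criteria, the posterior probability that one weight exceeds the other. The sketch below illustrates that idea only; it is not necessarily the paper's scheme. The Dirichlet parameters standing in for a fitted posterior and the helper names `sample_dirichlet` and `pairwise_dominance` are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one weight vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def pairwise_dominance(samples):
    """dominance[i][j] = fraction of samples in which weight i exceeds weight j."""
    k = len(samples[0])
    n = len(samples)
    return [
        [sum(1 for w in samples if w[i] > w[j]) / n for j in range(k)]
        for i in range(k)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
posterior_alpha = [8.0, 3.0, 1.0]  # hypothetical posterior over three criteria
samples = [sample_dirichlet(posterior_alpha, rng) for _ in range(5000)]
dominance = pairwise_dominance(samples)

for i, row in enumerate(dominance):
    print(f"criterion {i}: " + " ".join(f"{p:.3f}" for p in row))
```

Reading the matrix row by row gives a ranking with an attached confidence for every pairwise comparison, instead of a single point estimate of the ordering.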

[290] Toward A Causal Framework for Modeling Perception

Jose M. Alvarez, Salvatore Ruggieri

Main category: cs.AI

TL;DR: The paper introduces a causal modeling approach to perception in ML, addressing how human experts interpret ML outputs differently and its implications for fairness.

Details

Motivation: Perception's role in ML decision-making is understudied, yet critical, as human experts often interpret ML outputs differently, leading to potential biases.

Method: The authors model perception causally using structural causal models (SCMs), defining structural and parametrical probabilistic causal perception.

Result: The framework is demonstrated through examples of decision flows, highlighting its applicability and relevance to fair ML.

Conclusion: Addressing perception in ML is essential for fairness, and the proposed causal approach provides a foundational step toward this goal.

Abstract: Perception occurs when individuals interpret the same information differently. It is a known cognitive phenomenon with implications for bias in human decision-making. Perception, however, remains understudied in machine learning (ML). This is problematic as modern decision flows, whether partially or fully automated by ML applications, always involve human experts. For instance, how might we account for cases in which two experts interpret differently the same deferred instance or explanation from a ML model? Addressing this and similar questions requires first a formulation of perception, particularly, in a manner that integrates with ML-enabled decision flows. In this work, we present a first approach to modeling perception causally. We define perception under causal reasoning using structural causal models (SCMs). Our approach formalizes individual experience as additional causal knowledge that comes with and is used by the expert decision-maker in the form of a SCM. We define two kinds of probabilistic causal perception: structural and parametrical. We showcase our framework through a series of examples of modern decision flows. We also emphasize the importance of addressing perception in fair ML, discussing relevant fairness implications and possible applications.

[291] Advancing Multi-Organ Disease Care: A Hierarchical Multi-Agent Reinforcement Learning Framework

Daniel J. Tan, Qianyi Xu, Kay Choong See, Dilruk Perera, Mengling Feng

Main category: cs.AI

TL;DR: Proposes a Hierarchical Multi-Agent Reinforcement Learning (HMARL) framework for multi-organ disease treatment, addressing gaps in current AI systems by enabling inter-organ communication and dual-layer state representation.

Details

Motivation: Current AI clinical decision support systems focus on single-organ systems, ignoring interdependencies, which limits their effectiveness in holistic treatment recommendations.

Method: Introduces HMARL with specialized agents for each organ and inter-agent communication, plus a dual-layer state representation for global and organ-specific patient conditions.

Result: Evaluated on sepsis management, the method improves patient survival by learning clinically aligned treatment policies.

Conclusion: HMARL advances clinical decision support by addressing multi-organ complexity, surpassing single-organ models.

Abstract: In healthcare, multi-organ system diseases pose unique and significant challenges as they impact multiple physiological systems concurrently, demanding complex and coordinated treatment strategies. Despite recent advancements in the AI based clinical decision support systems, these solutions only focus on individual organ systems, failing to account for complex interdependencies between them. This narrow focus greatly hinders their effectiveness in recommending holistic and clinically actionable treatments in the real world setting. To address this critical gap, we propose a novel Hierarchical Multi-Agent Reinforcement Learning (HMARL) framework. Our architecture deploys specialized and dedicated agents for each organ system and facilitates inter-agent communication to enable synergistic decision-making across organ systems. Furthermore, we introduce a dual-layer state representation technique that contextualizes patient conditions at both global and organ-specific levels, improving the accuracy and relevance of treatment decisions. We evaluate our HMARL solution on the task of sepsis management, a common and critical multi-organ disease, using both qualitative and quantitative metrics. Our method learns effective, clinically aligned treatment policies that considerably improve patient survival. We believe this framework represents a significant advancement in clinical decision support systems, introducing the first RL solution explicitly designed for multi-organ treatment recommendations. Our solution moves beyond prevailing simplified, single-organ models that fall short in addressing the complexity of multi-organ diseases.

[292] DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, Dong Yu

Main category: cs.AI

TL;DR: DOTS enhances LLM reasoning by dynamically searching for optimal reasoning trajectories tailored to each question and LLM capability, outperforming static methods.

Details

Motivation: Static reasoning actions in LLMs lack adaptability to question specifics and LLM capabilities, limiting performance.

Method: DOTS involves defining atomic reasoning actions, searching optimal trajectories, and training LLMs to plan reasoning for unseen questions via fine-tuning.

Result: Outperforms static reasoning and vanilla instruction tuning across eight tasks, adapting computation to problem complexity.

Conclusion: DOTS improves LLM reasoning by dynamically optimizing reasoning trajectories, demonstrating adaptability and superior performance.

Abstract: Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called “reasoning actions”), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.

[293] ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov

Main category: cs.AI

TL;DR: ST-WebAgentBench is a benchmark for evaluating safety and trustworthiness (ST) of web agents, introducing metrics like Completion Under Policy (CuP) and Risk Ratio to measure adherence to ST policies.

Details

Motivation: Existing benchmarks for web agents focus only on task completion, ignoring safety and trustworthiness, which are critical for enterprise adoption.

Method: The authors introduce ST-WebAgentBench, a suite of 222 tasks paired with ST policies, scored across six dimensions. Metrics like CuP and Risk Ratio are proposed to evaluate ST compliance.

Result: Evaluation of three state-of-the-art agents shows their average CuP is less than two-thirds of their nominal completion rate, highlighting safety gaps.

Conclusion: ST-WebAgentBench provides tools and metrics to advance the deployment of trustworthy web agents in enterprise workflows.

Abstract: Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents reveals that their average CuP is less than two-thirds of their nominal completion rate, exposing critical safety gaps. By releasing code, evaluation templates, and a policy-authoring interface, ST-WebAgentBench (https://sites.google.com/view/st-webagentbench/home) provides an actionable first step toward deploying trustworthy web agents at scale.
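
The difference between nominal completion and CuP can be made concrete with a small sketch. It follows only the one-sentence definition in the abstract (a completion counts toward CuP only if no applicable policy was violated). The `TaskResult` record, its field names, and the toy results are hypothetical and are not the benchmark's actual data format. Risk Ratio is left out because the abstract does not define it precisely.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the benchmark's real schema."""
    task_id: str
    completed: bool
    violated_policies: list = field(default_factory=list)


def completion_rate(results):
    """Share of tasks the agent finished, regardless of policy violations."""
    return sum(r.completed for r in results) / len(results)


def completion_under_policy(results):
    """Share of tasks finished with no policy violated."""
    return sum(r.completed and not r.violated_policies for r in results) / len(results)


# Toy results, invented for illustration only.
results = [
    TaskResult("t1", True),
    TaskResult("t2", True, ["user_consent"]),
    TaskResult("t3", False),
    TaskResult("t4", True),
    TaskResult("t5", False, ["robustness"]),
    TaskResult("t6", True, ["user_consent"]),
]
nominal = completion_rate(results)
cup = completion_under_policy(results)
print(f"completion={nominal:.2f} CuP={cup:.2f}")
```

By construction CuP can never exceed the nominal completion rate, so the gap between the two is a direct measure of how many "successes" came with a policy breach.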

[294] Interactive Data Harmonization with LLM Agents: Opportunities and Challenges

Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire

Main category: cs.AI

TL;DR: The paper proposes Harmonia, an agentic data harmonization system using LLM-based reasoning and interactive tools to automate pipeline synthesis, demonstrated in clinical data.

Details

Motivation: Data harmonization is complex due to schema mismatches and varying terminologies, requiring expert involvement and efficiency improvements.

Method: Introduces Harmonia, combining LLM reasoning, a user interface, and harmonization primitives to automate pipeline creation.

Result: Demonstrated in clinical data, Harmonia successfully creates reusable pipelines for standardizing datasets.

Conclusion: Highlights challenges and suggests research directions to advance agentic data harmonization.

Abstract: Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means to both empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.

Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, Maosong Sun

Main category: cs.AI

TL;DR: The paper introduces M²RAG, a benchmark for evaluating Multi-modal Large Language Models (MLLMs) in Retrieval-Augmented Generation (RAG), and proposes MM-RAIT, an instruction tuning method to enhance MLLMs’ multi-modal context utilization.

Details

Motivation: To explore the underexplored potential of MLLMs in leveraging multi-modal contextual information for RAG.

Method: Introduces M²RAG benchmark with four tasks and MM-RAIT, an instruction tuning method for MLLMs.

Result: MM-RAIT significantly improves the response quality of RAG models, with gains of 34% and 33% over MiniCPM-V 2.6 and Qwen2-VL, respectively.

Conclusion: M²RAG and MM-RAIT effectively enhance MLLMs’ multi-modal RAG capabilities, with promising experimental results.

Abstract: With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M²RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as contextual input for RAG modeling. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT), an instruction tuning method that optimizes MLLMs within multi-modal contexts. Our experiments demonstrate the effectiveness of MM-RAIT by significantly improving the quality of responses generated by different RAG models, outperforming MiniCPM-V 2.6 and Qwen2-VL with 34% and 33% gains, respectively. All data and code are available at https://github.com/NEUIR/M2RAG.

[296] Agent Guide: A Simple Agent Behavioral Watermarking Framework

Kaibo Huang, Zipei Zhang, Zhongliang Yang, Linna Zhou

Main category: cs.AI

TL;DR: Agent Guide is a behavioral watermarking framework for intelligent agents, embedding watermarks in high-level decisions while preserving action naturalness, ensuring traceability in digital ecosystems.

Details

Motivation: Addressing traceability and accountability challenges in intelligent agents, especially in cybersecurity and content protection, where traditional LLM watermarking fails due to behavior tokenization issues.

Method: Decouples agent behavior into behavior and action levels, applying watermark-guided biases to behavior probability distribution. Uses z-statistic for reliable watermark detection.

Result: Effective watermark detection with low false positives in social media scenarios, demonstrating robustness.

Conclusion: Agent Guide offers a practical solution for agent watermarking, useful for identifying malicious agents and protecting proprietary systems.

Abstract: The increasing deployment of intelligent agents in digital ecosystems, such as social media platforms, has raised significant concerns about traceability and accountability, particularly in cybersecurity and digital content protection. Traditional large language model (LLM) watermarking techniques, which rely on token-level manipulations, are ill-suited for agents due to the challenges of behavior tokenization and information loss during behavior-to-action translation. To address these issues, we propose Agent Guide, a novel behavioral watermarking framework that embeds watermarks by guiding the agent’s high-level decisions (behavior) through probability biases, while preserving the naturalness of specific executions (action). Our approach decouples agent behavior into two levels, behavior (e.g., choosing to bookmark) and action (e.g., bookmarking with specific tags), and applies watermark-guided biases to the behavior probability distribution. We employ a z-statistic-based statistical analysis to detect the watermark, ensuring reliable extraction over multiple rounds. Experiments in a social media scenario with diverse agent profiles demonstrate that Agent Guide achieves effective watermark detection with a low false positive rate. Our framework provides a practical and robust solution for agent watermarking, with applications in identifying malicious agents and protecting proprietary agent systems.
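
The detection step can be illustrated with a standard one-proportion z-test, the same kind of statistic used in token-level watermark detection. This is a minimal sketch under stated assumptions, not the paper's exact procedure: it assumes that, without a watermark, the agent picks a watermark-favored behavior in each round with a known probability `null_rate`, and the function names, counts, and threshold are all hypothetical.

```python
import math


def watermark_z_score(favored_count, rounds, null_rate):
    """z-statistic for observing `favored_count` favored behaviors in `rounds` rounds.

    Under the null hypothesis (no watermark) the count is Binomial(rounds, null_rate).
    """
    expected = rounds * null_rate
    std = math.sqrt(rounds * null_rate * (1.0 - null_rate))
    return (favored_count - expected) / std


def is_watermarked(favored_count, rounds, null_rate, threshold=3.0):
    """Flag the agent when the z-score exceeds a (hypothetical) threshold."""
    return watermark_z_score(favored_count, rounds, null_rate) > threshold


# Toy counts, invented for illustration only.
z_marked = watermark_z_score(70, 100, 0.5)  # biased agent: 70 of 100 rounds favored
z_clean = watermark_z_score(52, 100, 0.5)   # unbiased agent: 52 of 100 rounds favored
print(f"z_marked={z_marked:.2f} z_clean={z_clean:.2f}")
```

Because the standard deviation grows only with the square root of the number of rounds, even a modest probability bias becomes detectable as more rounds are observed, which fits the abstract's emphasis on extraction over multiple rounds.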

[297] Graph-Based Fault Diagnosis for Rotating Machinery: Adaptive Segmentation and Structural Feature Integration

Moirangthem Tiken Singh

Main category: cs.AI

TL;DR: A graph-based framework for multiclass fault diagnosis in rotating machinery achieves high accuracy (up to 100%) and noise resilience, with interpretability and scalability.

Details

Motivation: To develop a robust, interpretable, and scalable method for fault diagnosis in rotating machinery without relying on deep learning.

Method: Integrates entropy-optimized signal segmentation, time-frequency feature extraction, and graph-theoretic modeling to classify faults using graph metrics and local features.

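Result: Achieves up to 99.8% accuracy on CWRU and 100% on the SU datasets with a logistic regression classifier, with strong noise resilience (over 95.4% accuracy at a noise standard deviation of 0.5) and cross-domain transferability (up to 99.7% F1-score in load-transfer scenarios).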

Conclusion: The method is reliable, scalable, and suitable for real-time industrial diagnostics, outperforming traditional techniques.

Abstract: This paper proposes a novel graph-based framework for robust and interpretable multiclass fault diagnosis in rotating machinery. The method integrates entropy-optimized signal segmentation, time-frequency feature extraction, and graph-theoretic modeling to transform vibration signals into structured representations suitable for classification. Graph metrics, such as average shortest path length, modularity, and spectral gap, are computed and combined with local features to capture global and segment-level fault characteristics. The proposed method achieves high diagnostic accuracy when evaluated on two benchmark datasets, the CWRU bearing dataset (under 0-3 HP loads) and the SU gearbox and bearing datasets (under different speed-load configurations). Classification scores reach up to 99.8% accuracy on Case Western Reserve University (CWRU) and 100% accuracy on the Southeast University datasets using a logistic regression classifier. Furthermore, the model exhibits strong noise resilience, maintaining over 95.4% accuracy at high noise levels (standard deviation = 0.5), and demonstrates excellent cross-domain transferability with up to 99.7% F1-score in load-transfer scenarios. Compared to traditional techniques, this approach requires no deep learning architecture, enabling lower complexity while ensuring interpretability. The results confirm the method’s scalability, reliability, and potential for real-time deployment in industrial diagnostics.
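
Of the graph metrics named above, average shortest path length is the simplest to show in code. The sketch below computes it with breadth-first search on an unweighted, connected graph. How the paper builds its graphs from vibration segments is not described in the abstract, so the adjacency dictionaries here are hypothetical toy graphs, and modularity and spectral gap are not shown.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from `source` to every reachable node of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_shortest_path_length(adjacency):
    """Mean hop distance over all ordered pairs of distinct nodes."""
    nodes = list(adjacency)
    n = len(nodes)
    total = 0
    for source in nodes:
        dist = bfs_distances(adjacency, source)
        if len(dist) != n:
            raise ValueError("graph must be connected")
        total += sum(dist.values())
    return total / (n * (n - 1))


# Toy graphs, invented for illustration: a 4-node chain and a 3-node triangle.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
chain_aspl = average_shortest_path_length(chain)
triangle_aspl = average_shortest_path_length(triangle)
print(f"chain={chain_aspl:.3f} triangle={triangle_aspl:.3f}")
```

A scalar like this summarizes the global shape of a signal's graph in one interpretable feature, which is the kind of input a simple classifier such as logistic regression can use directly.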

[298] Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Ming Shen, Raphael Shu, Anurag Pratik, James Gung, Yubin Ge, Monica Sunkara, Yi Zhang

Main category: cs.AI

TL;DR: The paper presents a method for optimizing role-based multi-agent systems (MAS) using natural language feedback, focusing on software development tasks. It introduces a two-step pipeline for prompt optimization and evaluates various settings.

Details

Motivation: Optimizing LLM-based multi-agent systems is challenging, especially for complex tasks requiring diverse expertise. This work aims to improve system performance through empirical case studies.

Method: A two-step pipeline: (1) identify underperforming agents using textual feedback, (2) optimize their prompts based on failure explanations. Evaluates online vs. offline and individual vs. group optimization, including one-pass and multi-pass strategies.

Result: The method effectively optimizes role-based MAS for software tasks across diverse evaluation dimensions. Different optimization settings impact group behaviors, providing practical insights.

Conclusion: The proposed optimization pipeline is effective for role-based MAS, with findings offering guidance for future development of such systems.

Abstract: We have seen remarkable progress in large language models (LLMs) empowered multi-agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompts optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.

[299] MAGIK: Mapping to Analogous Goals via Imagination-enabled Knowledge Transfer

Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana

Main category: cs.AI

TL;DR: MAGIK enables RL agents to transfer knowledge to analogous tasks without retraining, using imagination-based analogy mapping and minimal human-labeled examples.

Details

Motivation: Humans excel at analogical reasoning, but RL agents struggle with transferring knowledge to similar tasks without extensive retraining.

Method: MAGIK uses an imagination mechanism to map entities in the target task to analogues in the source domain, reusing the original policy.

Result: Experiments show MAGIK achieves effective zero-shot transfer on MiniGrid and MuJoCo tasks with few human-labeled examples.

Conclusion: MAGIK offers a novel and effective mechanism for knowledge transfer via imagination-based analogy mapping.

Abstract: Humans excel at analogical reasoning - applying knowledge from one task to a related one with minimal relearning. In contrast, reinforcement learning (RL) agents typically require extensive retraining even when new tasks share structural similarities with previously learned ones. In this work, we propose MAGIK, a novel framework that enables RL agents to transfer knowledge to analogous tasks without interacting with the target environment. Our approach leverages an imagination mechanism to map entities in the target task to their analogues in the source domain, allowing the agent to reuse its original policy. Experiments on custom MiniGrid and MuJoCo tasks show that MAGIK achieves effective zero-shot transfer using only a small number of human-labelled examples. We compare our approach to related baselines and highlight how it offers a novel and effective mechanism for knowledge transfer via imagination-based analogy mapping.

[300] Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives

Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong

Main category: cs.AI

TL;DR: The paper reviews value alignment in LLM-based multi-agent systems, proposing a multi-level framework to address socio-governance demands and suggesting future research directions.

Details

Motivation: The shift to agentic AI and complex multi-agent systems raises risks, necessitating alignment of AI goals with human values and societal norms.

Method: A comprehensive survey structured around three dimensions: value principles hierarchy, application scenarios, and alignment methods/evaluation.

Result: The study maps value alignment to a tiered framework, examines coordination among agents, and identifies benchmarking datasets.

Conclusion: The paper highlights the importance of value alignment in agentic AI and suggests future research directions.

Abstract: The ongoing evolution of AI paradigms has propelled AI research into the agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasing situational and systemic risks. This has brought significant attention to value alignment for agentic AI systems, which aims to ensure that an agent’s goals, preferences, and behaviors align with human values and societal norms. Addressing socio-governance demands through a Multi-level Value framework, this study comprehensively reviews value alignment in LLM-based multi-agent systems as the representative archetype of agentic AI systems. Our survey systematically examines three interconnected dimensions: First, value principles are structured via a top-down hierarchy across macro, meso, and micro levels. Second, application scenarios are categorized along a general-to-specific continuum explicitly mirroring these value tiers. Third, value alignment methods and evaluation are mapped to this tiered framework through systematic examination of benchmarking datasets and relevant methodologies. Additionally, we delve into value coordination among multiple agents within agentic AI systems. Finally, we propose several potential research directions in this field.

[301] Style-Preserving Policy Optimization for Game Agents

Lingfeng Li, Yunlong Lu, Yongyi Wang, Wenxin Li

Main category: cs.AI

TL;DR: MPPO improves suboptimal game agents’ proficiency while retaining their play styles, outperforming pure online algorithms.

Details

Motivation: Existing methods either focus on proficiency (RL) or diversity (evolution algorithms), but not both. MPPO bridges this gap.

Method: MPPO unifies loss objectives for online/offline samples and uses an implicit constraint to approximate demonstrator policies.

Result: MPPO achieves proficiency comparable or superior to pure online algorithms while preserving play styles.

Conclusion: MPPO effectively generates proficient and diverse game agents, enhancing gameplay.

Abstract: Proficient game agents with diverse play styles enrich the gaming experience and enhance the replay value of games. However, recent advancements in game AI based on reinforcement learning have predominantly focused on improving proficiency, whereas methods based on evolution algorithms generate agents with diverse play styles but exhibit subpar performance compared to RL methods. To address this gap, this paper proposes Mixed Proximal Policy Optimization (MPPO), a method designed to improve the proficiency of existing suboptimal agents while retaining their distinct styles. MPPO unifies loss objectives for both online and offline samples and introduces an implicit constraint to approximate demonstrator policies by adjusting the empirical distribution of samples. Empirical results across environments of varying scales demonstrate that MPPO achieves proficiency levels comparable to, or even superior to, pure online algorithms while preserving demonstrators’ play styles. This work presents an effective approach for generating highly proficient and diverse game agents, ultimately contributing to more engaging gameplay experiences.

[302] Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Jungkoo Kang

Main category: cs.AI

TL;DR: NL2Flow automates the generation of planning problems for LLMs, evaluating their performance in workflow generation and translation tasks.

Details

Motivation: Address the scarcity of scalable, reliable evaluation data for LLM planning and reasoning by identifying a suitable workflow domain.

Method: Introduce NL2Flow, a system for generating planning problems in natural language, structured representation, and PDDL, and evaluate LLMs on 2296 low-difficulty problems.

Result: On problems that have feasible plans, the best model generated valid plans in 86% of cases and optimal plans in 69%. Translating natural language to JSON had a lower success rate than direct plan generation.

Conclusion: LLMs perform better reasoning directly from natural language to action, and understanding bottlenecks is key for scaling to complex problems.

Abstract: Effective agent performance relies on the ability to compose tools and agents into effective workflows. However, progress in Large Language Model (LLM) planning and reasoning is limited by the scarcity of scalable, reliable evaluation data. This study addresses this limitation by identifying a suitable workflow domain for LLM application. I introduce NL2Flow, a fully automated system for parametrically generating planning problems, which are expressed in natural language, a structured intermediate representation, and formal PDDL, and rigorously evaluating the quality of generated plans. NL2Flow generates a dataset of 2296 low-difficulty problems in automated workflow generation and evaluates multiple open-sourced, instruct-tuned LLMs without task-specific optimization or architectural modifications. Results reveal that the highest performing model achieved 86% success in generating valid plans and 69% in generating optimal plans, specifically for problems with feasible plans. Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. To investigate the potential of LLMs as natural language-to-JSON translators for workflow definition, and to facilitate integration with downstream symbolic computation tools and a symbolic planner, I evaluated the LLM’s translation performance on natural language workflow descriptions. I observed that translating natural language into a JSON representation of a workflow problem yielded a lower success rate than generating a plan directly, suggesting that unnecessary decomposition of the reasoning task may degrade performance and highlighting the benefit of models capable of reasoning directly from natural language to action. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
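To make "valid plan" concrete, here is a minimal sketch of a precondition-based plan check. It is not NL2Flow's evaluator: the action format (add-only effects, no deletes), the function name `is_valid_plan`, and the toy workflow are all assumptions for illustration.

```python
from __future__ import annotations


def is_valid_plan(
    initial: set[str],
    goal: set[str],
    actions: dict[str, tuple[set[str], set[str]]],
    plan: list[str],
) -> bool:
    """Return True if every step's preconditions hold and the goal is reached.

    `actions` maps an action name to (preconditions, added_facts).
    Effects are add-only here, which is a simplifying assumption.
    """
    state = set(initial)
    for name in plan:
        if name not in actions:
            return False  # unknown action
        preconditions, added = actions[name]
        if not preconditions <= state:
            return False  # a required fact is missing at this step
        state |= added
    return goal <= state


# Hypothetical two-step workflow: look up a user, then send an email.
ACTIONS = {
    "lookup_user": ({"user_id"}, {"email_address"}),
    "send_email": ({"email_address"}, {"email_sent"}),
}
```

Under these assumptions, the plan `["lookup_user", "send_email"]` is valid from the initial fact `user_id`, while `["send_email"]` alone fails its precondition.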

[303] Establishing Best Practices for Building Rigorous Agentic Benchmarks

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

Main category: cs.AI

TL;DR: The paper highlights flaws in current AI agent benchmarks, introduces the Agentic Benchmark Checklist (ABC) to improve evaluation rigor, and demonstrates its effectiveness by reducing performance overestimation in CVE-Bench by 33%.

Details

Motivation: Existing agentic benchmarks often have issues in task setup or reward design, leading to inaccurate performance evaluations of AI agents.

Method: The authors introduce the Agentic Benchmark Checklist (ABC), synthesized from benchmark-building experience, best practices, and reported issues, to address these flaws.

Result: Applying ABC to CVE-Bench reduces performance overestimation by 33%, demonstrating its effectiveness.

Conclusion: ABC provides a practical solution to improve the rigor and accuracy of agentic benchmarks in AI evaluation.

Abstract: Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.
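The empty-response problem mentioned in the abstract can be illustrated with a toy grader. This is a sketch under stated assumptions, not the implementation of TAU-bench or ABC: the function names, the `forbidden` and `required` arguments, and the sample responses are hypothetical.

```python
from __future__ import annotations


def lenient_success(response: str, forbidden: list[str]) -> bool:
    """Flawed check: success means 'nothing forbidden appears'.

    An empty response trivially satisfies this condition.
    """
    return not any(term in response for term in forbidden)


def strict_success(response: str, forbidden: list[str], required: str) -> bool:
    """Stricter check: the response must be non-empty and contain the
    required outcome, in addition to avoiding forbidden terms."""
    if not response.strip():
        return False
    return required in response and lenient_success(response, forbidden)


def success_rate(results: list[bool]) -> float:
    """Fraction of tasks counted as successful."""
    return sum(results) / len(results)


# Hypothetical agent outputs: two empty responses and one real answer.
responses = ["", "refund issued", ""]
lenient = success_rate([lenient_success(r, ["delete account"]) for r in responses])
strict = success_rate(
    [strict_success(r, ["delete account"], "refund") for r in responses]
)
```

In this toy setup the lenient grader scores every response as a success, while the strict grader credits only the one non-empty answer, which is the kind of gap a reward-design check is meant to catch.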

[304] Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Yufeng Zhong, Lin Ma

Main category: cs.AI

TL;DR: Chart-R1 introduces a chart-domain vision-language model with reinforcement learning fine-tuning for complex reasoning, supported by programmatic data synthesis and a two-stage training strategy (Chart-COT and Chart-RFT). It outperforms existing methods and rivals large-scale models like GPT-4o and Claude-3.5.

Details

Motivation: To extend R1-Style methods beyond mathematical reasoning and code intelligence to multimodal data, particularly charts, which present unique reasoning challenges.

Method: 1. Programmatic data synthesis for high-quality chart reasoning data. 2. Two-stage training: Chart-COT (step-by-step supervision) and Chart-RFT (numerically sensitive reinforcement fine-tuning).

Result: Chart-R1 outperforms chart-domain methods and competes with large-scale models like GPT-4o and Claude-3.5.

Conclusion: Chart-R1 successfully addresses chart reasoning challenges and demonstrates superior performance, validating the effectiveness of its data synthesis and training strategies.

Abstract: Recently, inspired by OpenAI-o1/o3 and Deepseek-R1, the R1-Style method based on reinforcement learning fine-tuning has received widespread attention from the community. Previous R1-Style methods mainly focus on mathematical reasoning and code intelligence. It is of great research significance to verify their advantages on more general multimodal data. Charts are an important multimodal data type with rich information, which brings important research challenges in complex reasoning. In this work, we introduce Chart-R1, a chart-domain vision-language model with reinforcement learning fine-tuning to enable complex chart reasoning. To support Chart-R1, we first propose a novel programmatic data synthesis technology to generate high-quality step-by-step chart reasoning data covering single- and multi-subcharts, which makes up for the lack of reasoning data in the chart domain. Then we develop a two-stage training strategy: Chart-COT with step-by-step chain-of-thought supervision, and Chart-RFT with numerically sensitive reinforcement fine-tuning. Chart-COT aims to decompose complex chart reasoning tasks into fine-grained, understandable subtasks through step-by-step supervision, which lays a good foundation for improving the reasoning level of reinforcement learning. Chart-RFT utilizes the typical group relative policy optimization strategy, in which a relatively soft reward is adopted for numerical responses to emphasize the numerical sensitivity in the chart domain. We conduct extensive experiments on open-source benchmarks and a self-built chart reasoning dataset (i.e., ChartRQA). Experimental results show that Chart-R1 has significant advantages compared to chart-domain methods, and is even comparable to open/closed source large-scale models (e.g., GPT-4o, Claude-3.5).
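As a rough illustration of a "soft" numerical reward combined with group-relative advantages, consider the sketch below. The abstract does not give the reward formula, so the linear decay and the 5% `tolerance` are assumptions; the advantage normalization (reward minus group mean, divided by group standard deviation) follows the usual group relative policy optimization recipe.

```python
from __future__ import annotations

from statistics import mean, pstdev


def soft_numeric_reward(pred: float, target: float, tolerance: float = 0.05) -> float:
    """Reward that decays linearly with relative error (assumed shape).

    Returns 1.0 for an exact match and 0.0 once the relative error
    reaches `tolerance`.
    """
    if target == 0:
        return 1.0 if pred == 0 else 0.0
    relative_error = abs(pred - target) / abs(target)
    return max(0.0, 1.0 - relative_error / tolerance)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one sampled group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]
```

With a binary reward, a prediction of 102 against a target of 100 would score the same as a wildly wrong answer. A graded reward of this kind lets near-misses still rank above distant ones within a group.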

[305] Hierarchical Budget Policy Optimization for Adaptive Reasoning

Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang

Main category: cs.AI

TL;DR: HBPO is a reinforcement learning framework that optimizes reasoning efficiency by learning problem-specific reasoning depths, reducing average token usage by up to 60.6% while improving accuracy by 3.14%.

Details

Motivation: Current large reasoning models inefficiently apply uniform reasoning depth regardless of problem complexity, leading to unnecessary computational costs.

Method: HBPO uses hierarchical budget-constrained exploration spaces (512-2560 tokens) with differentiated rewards to balance efficiency and capability.

Result: HBPO reduces average token usage by up to 60.6% and improves accuracy by 3.14% across four reasoning benchmarks, and it adapts reasoning depth to problem complexity.

Conclusion: Reasoning efficiency and capability can coexist through hierarchical training that preserves exploration diversity.

Abstract: Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet they suffer from a critical inefficiency: applying uniformly extensive reasoning regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. Unlike existing approaches that impose rigid constraints or rely on discrete mode selection, HBPO partitions the exploration space into budget-constrained hierarchies (512-2560 tokens), each with differentiated reward structures that preserve both efficiency incentives and reasoning capabilities. This design addresses a fundamental challenge in efficient reasoning training: traditional length penalties systematically bias models away from necessary long reasoning paths, causing exploration space collapse. Through hierarchical sampling and budget-aware rewards, HBPO maintains exploration diversity while teaching models to recognize when extended deliberation is warranted. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Most notably, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
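The idea of budget-conditioned rewards can be sketched as follows. This is not HBPO's reward function, which the abstract does not specify. Only the 512 and 2560 token endpoints come from the abstract; the intermediate budgets, the linear overshoot penalty, and the `penalty` weight are assumptions.

```python
# Endpoints (512, 2560) are from the abstract; intermediate tiers are assumed.
BUDGETS = (512, 1024, 2048, 2560)


def budget_aware_reward(
    correct: bool, n_tokens: int, budget: int, penalty: float = 0.5
) -> float:
    """Correctness reward, reduced only when a response exceeds its own budget.

    A response sampled under a large budget is not penalized for being long,
    so long reasoning paths remain reachable during exploration.
    """
    base = 1.0 if correct else 0.0
    if n_tokens <= budget:
        return base
    overshoot = (n_tokens - budget) / budget
    return max(0.0, base - penalty * overshoot)
```

Under this sketch, a correct 2,000-token answer keeps its full reward in the 2,048-token tier but is penalized in the 512-token tier. A single global length penalty cannot make that distinction.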

[306] SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law

Shanghai AI Lab: Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, Yu Cheng, Dengke Deng, Yizhuo Ding, Dan Ding, Xiaoshan Ding, Yi Ding, Zhichen Dong, Lingxiao Du, Yuyu Fan, Xinshun Feng, Yanwei Fu, Yuxuan Gao, Ruijun Ge, Tianle Gu, Lujun Gui, Jiaxuan Guo, Qianxi He, Yuenan Hou, Xuhao Hu, Hong Huang, Kaichen Huang, Shiyang Huang, Yuxian Jiang, Shanzhe Lei, Jie Li, Lijun Li, Hao Li, Juncheng Li, Xiangtian Li, Yafu Li, Lingyu Li, Xueyan Li, Haotian Liang, Dongrui Liu, Qihua Liu, Zhixuan Liu, Bangwei Liu, Huacan Liu, Yuexiao Liu, Zongkai Liu, Chaochao Lu, Yudong Lu, Xiaoya Lu, Zhenghao Lu, Qitan Lv, Caoyuan Ma, Jiachen Ma, Xiaoya Ma, Zhongtian Ma, Lingyu Meng, Ziqi Miao, Yazhe Niu, Yuezhang Peng, Yuan Pu, Han Qi, Chen Qian, Xingge Qiao, Jingjing Qu, Jiashu Qu, Wanying Qu, Wenwen Qu, Xiaoye Qu, Qihan Ren, Qingnan Ren, Qingyu Ren, Jing Shao, Wenqi Shao, Shuai Shao, Dongxing Shi, Xin Song, Xinhao Song, Yan Teng, Xuan Tong, Yingchun Wang, Xuhong Wang, Shujie Wang, Xin Wang, Yige Wang, Yixu Wang, Yuanfu Wang, Futing Wang, Ruofan Wang, Wenjie Wang, Yajie Wang, Muhao Wei, Xiaoyu Wen, Fenghua Weng, Yuqi Wu, Yingtong Xiong, Xingcheng Xu, Chao Yang, Yue Yang, Yang Yao, Yulei Ye, Zhenyun Yin, Yi Yu, Bo Zhang, Qiaosheng Zhang, Jinxuan Zhang, Yexin Zhang, Yinqiang Zheng, Hefeng Zhou, Zhanhui Zhou, Pengyu Zhu, Qingzi Zhu, Yubo Zhu, Bowen Zhou

Main category: cs.AI

TL;DR: SafeWork-R1 is a multimodal reasoning model co-evolving capabilities and safety via the SafeLadder framework, outperforming base models and proprietary systems like GPT-4.1 in safety benchmarks.

Details

Motivation: To develop AI models where safety and capabilities co-evolve, addressing limitations of existing alignment methods like RLHF.

Method: Uses the SafeLadder framework with progressive safety-oriented reinforcement learning and multi-principled verifiers, plus inference-time interventions and deliberative search.

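Result: Achieves an average improvement of 46.54% over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, with state-of-the-art safety performance compared to proprietary models such as GPT-4.1 and Claude Opus 4.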

Conclusion: Demonstrates that safety and capability can synergistically co-evolve, validating the generalizability of the SafeLadder framework for robust AI.

Abstract: We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety ‘aha’ moments. Notably, SafeWork-R1 achieves an average improvement of 46.54% over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.

[307] Multi-Representation Diagrams for Pain Recognition: Integrating Various Electrodermal Activity Signals into a Single Image

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper proposes a pipeline using electrodermal activity signals for automatic pain assessment, outperforming traditional fusion methods.

Details

Motivation: Reliable pain assessment is crucial for effective management, and automated systems can provide continuous, objective monitoring.

Method: The method uses electrodermal activity signals, creating multiple representations visualized in a multi-representation diagram, and tests various processing techniques.

Result: The approach consistently matches or surpasses traditional fusion methods in performance.

Conclusion: The proposed pipeline is a robust alternative for integrating signal representations in pain-assessment systems.

Abstract: Pain is a multifaceted phenomenon that affects a substantial portion of the population. Reliable and consistent evaluation benefits those experiencing pain and underpins the development of effective and advanced management strategies. Automatic pain-assessment systems deliver continuous monitoring, inform clinical decision-making, and aim to reduce distress while preventing functional decline. By incorporating physiological signals, these systems provide objective, accurate insights into an individual’s condition. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages electrodermal activity signals as input modality. Multiple representations of the signal are created and visualized as waveforms, and they are jointly visualized within a single multi-representation diagram. Extensive experiments incorporating various processing and filtering techniques, along with multiple representation combinations, demonstrate the effectiveness of the proposed approach. It consistently yields comparable, and in several cases superior, results to traditional fusion methods, establishing it as a robust alternative for integrating different signal representations or modalities.

[308] Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: A pipeline using respiration signals and a cross-attention transformer with multi-windowing for pain assessment, showing strong performance with compact models.

Details

Motivation: Accurate pain assessment is crucial for effective management; automated systems can aid continuous monitoring and clinical decisions.

Method: Respiration-based pipeline with cross-attention transformer and multi-windowing to capture short/long-term and global features.

Result: Respiration is valuable for pain assessment; optimized compact models outperform larger ones.

Conclusion: The method enhances representational capacity and demonstrates efficiency, supporting practical pain assessment applications.

Abstract: Pain is a complex condition affecting a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain, and it supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring and support clinical decision-making, aiming to reduce distress and prevent functional decline. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages respiration as the input signal and incorporates a highly efficient cross-attention transformer alongside a multi-windowing strategy. Extensive experiments demonstrate that respiration is a valuable physiological modality for pain assessment. Moreover, experiments revealed that compact and efficient models, when properly optimized, can achieve strong performance, often surpassing larger counterparts. The proposed multi-window approach effectively captures both short-term and long-term features, as well as global characteristics, thereby enhancing the model’s representational capacity.

[309] NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset

Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen

Main category: cs.AI

TL;DR: NaturalGAIA is a new benchmark for evaluating LLM-driven GUI agents, addressing accuracy and reproducibility issues. LightManus, a hierarchical agent, was developed to improve task performance, and Reinforcement Fine-Tuning (RFT) was applied to a model, showing limitations in complex scenarios.

Details

Motivation: Existing benchmarks for LLM-driven GUI agents lack accuracy, reproducibility, and scalability, hindering progress in the field.

Method: Introduced NaturalGAIA, a benchmark based on Causal Pathways, and developed LightManus, a hierarchical agent for long-horizon tasks. Used RFT on the Qwen2.5-VL-7B model with a human-verified dataset.

Result: NaturalGAIA proved challenging for top LLMs (e.g., Claude-sonnet-4 achieved only 34.6% WPSR). RFT raised the WPSR of Qwen2.5-VL-7B from 3.3% to 10.8%, but its performance degraded sharply in complex scenarios.

Conclusion: The research provides a robust evaluation standard and dataset, highlighting the limitations of smaller models in complex tasks and guiding future GUI agent development.

Abstract: The rapid advancement of Large Language Model (LLM)-driven Graphical User Interface (GUI) agents is significantly hampered by the profound limitations of existing evaluation benchmarks in terms of accuracy, reproducibility, and scalability. To address this critical gap, we introduce NaturalGAIA, a novel benchmark engineered on the principle of Causal Pathways. This design paradigm structures complex tasks into a series of programmatically verifiable atomic steps, ensuring a rigorous, fully automated, and reproducible standard for assessment. Concurrently, to mitigate the inherent capability deficits of agents, we developed LightManus, a hierarchical agent architecture specifically optimized for long-horizon tasks. We leveraged this agent to generate a high-quality, human-verified trajectory dataset that uniquely captures diverse and even self-correcting interaction patterns of LLMs. We then utilized this dataset to perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Our experiments reveal that NaturalGAIA presents a formidable challenge to current state-of-the-art LLMs; even the top-performing Claude-sonnet-4 achieved a Weighted Pathway Success Rate (WPSR) of only 34.6%. Moreover, while RFT substantially improved the smaller model’s GUI execution capabilities (WPSR increased from 3.3% to 10.8%), its performance degraded sharply when handling complex scenarios. This outcome highlights the inherent capability ceiling of smaller models when faced with comprehensive tasks that integrate perception, decision-making, and execution. This research contributes a rigorous evaluation standard and a high-quality dataset to the community, aiming to guide the future development of GUI agents.

[310] Getting out of the Big-Muddy: Escalation of Commitment in LLMs

Emilio Barkett, Olivia Long, Paul Kröger

Main category: cs.AI

TL;DR: LLMs exhibit context-dependent cognitive biases like escalation of commitment, with minimal bias in individual decisions but high bias in multi-agent or pressured scenarios.

Details

Motivation: To understand if and when LLMs inherit human cognitive biases, specifically escalation of commitment, given their increasing use in high-stakes decision-making.

Method: A two-stage investment task tested across four conditions: individual decision-making, advisory roles, multi-agent deliberation, and compound pressure scenarios, involving 6,500 trials.

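Result: LLMs showed minimal escalation in individual decisions, but symmetrical peer-based deliberation produced near-universal escalation (99.2%, versus 46.2% in asymmetrical hierarchies), and under compound pressure models allocated 68.95% on average to failing divisions.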

Conclusion: LLM bias depends on social and organizational context, impacting deployment in multi-agent or unsupervised systems.

Abstract: Large Language Models (LLMs) are increasingly deployed in autonomous decision-making roles across high-stakes domains. However, since models are trained on human-generated data, they may inherit cognitive biases that systematically distort human judgment, including escalation of commitment, where decision-makers continue investing in failing courses of action due to prior investment. Understanding when LLMs exhibit such biases presents a unique challenge. While these biases are well-documented in humans, it remains unclear whether they manifest consistently in LLMs or require specific triggering conditions. This paper investigates this question using a two-stage investment task across four experimental conditions: model as investor, model as advisor, multi-agent deliberation, and compound pressure scenario. Across N = 6,500 trials, we find that bias manifestation in LLMs is highly context-dependent. In individual decision-making contexts (Studies 1-2, N = 4,000), LLMs demonstrate strong rational cost-benefit logic with minimal escalation of commitment. However, multi-agent deliberation reveals a striking hierarchy effect (Study 3, N = 500): while asymmetrical hierarchies show moderate escalation rates (46.2%), symmetrical peer-based decision-making produces near-universal escalation (99.2%). Similarly, when subjected to compound organizational and personal pressures (Study 4, N = 2,000), models exhibit high degrees of escalation of commitment (68.95% average allocation to failing divisions). These findings reveal that LLM bias manifestation depends critically on social and organizational context rather than being inherent, with significant implications for the deployment of multi-agent systems and unsupervised operations where such conditions may emerge naturally.
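The trial counts reported in the abstract can be cross-checked with a few lines, and the Study 4 metric (average allocation to the failing division) is a simple mean of shares. The sketch below assumes each trial is recorded as a pair of amounts; the function name and the sample allocations are hypothetical, not the paper's data.

```python
from __future__ import annotations

from statistics import mean

# Per-study trial counts as stated in the abstract.
STUDY_TRIALS = {"Studies 1-2": 4000, "Study 3": 500, "Study 4": 2000}


def mean_failing_share(allocations: list[tuple[float, float]]) -> float:
    """Average fraction of the budget sent to the failing division.

    Each item is (amount_to_failing_division, total_budget).
    """
    return mean(to_failing / total for to_failing, total in allocations)


total_trials = sum(STUDY_TRIALS.values())

# Hypothetical trials, for illustration only.
example_share = mean_failing_share([(7.0, 10.0), (5.0, 10.0)])
```

The per-study counts sum to the reported N = 6,500. A figure like 68.95% is an average allocation share of this kind, which is a different quantity from the escalation rates reported for Study 3.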

[311] SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang

Main category: cs.AI

TL;DR: SE-Agent is a self-evolution framework for LLM-based agents that improves reasoning by revisiting and enhancing past trajectories through revision, recombination, and refinement, achieving up to 55% relative improvement on SWE-bench Verified.

Details

Motivation: Current LLM-based agents lack efficient exploitation of interaction trajectories, leading to redundant reasoning and suboptimal outcomes.

Method: Proposes SE-Agent, which uses revision, recombination, and refinement of past trajectories to expand search space and enhance reasoning.

Result: Achieves up to 55% relative improvement on SWE-bench Verified, outperforming other open-source agents.

Conclusion: SE-Agent enables continuous self-evolution, improving reasoning quality and performance in complex tasks.

Abstract: Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents’ interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.
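Because the headline figure is a relative improvement, not a gain in percentage points, a short worked example may help. The baseline and improved resolve rates below are hypothetical numbers chosen to illustrate the arithmetic and do not come from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical resolve rates: 20% before, 31% after.
gain = relative_improvement(0.20, 0.31)
```

Here an 11-point absolute gain on a 20% baseline corresponds to a 55% relative improvement, so the same relative figure can reflect very different absolute gains depending on the baseline.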

[312] CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge

Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan

Main category: cs.AI

TL;DR: CAMA is a two-stage causal framework enhancing LLMs’ mathematical reasoning by constructing and refining a Mathematical Causal Graph (MCG) and dynamically guiding LLMs with task-relevant subgraphs.

Details

Motivation: LLMs struggle with complex mathematical reasoning due to deep structural dependencies, prompting the need for explicit, reusable mathematical structure.

Method: CAMA constructs and refines an MCG using LLM priors and causal discovery, then dynamically extracts task-relevant subgraphs to guide LLM reasoning.

Result: CAMA significantly improves LLM performance on challenging math problems, with structured guidance and asymmetric causal relationships proving most effective.

Conclusion: CAMA successfully enhances LLMs’ mathematical reasoning by leveraging explicit causal structures, outperforming unstructured alternatives.

Abstract: Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose CAusal MAthematician (CAMA), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the Mathematical Causal Graph (MCG), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM’s intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.
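The reasoning-stage step of extracting a task-relevant subgraph can be pictured as selecting the knowledge points matched to a question together with their causal ancestors. The sketch below is one such interpretation under stated assumptions, not CAMA's actual procedure: the graph, the node names, and the idea of seeding the search from matched knowledge points are all hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical causal graph: each knowledge point maps to its direct causes.
# Edges are directed (asymmetric), so a child does not imply its parent's parent
# unless we walk the chain explicitly.
PARENTS: Dict[str, List[str]] = {
    "quadratic_formula": ["completing_the_square"],
    "completing_the_square": ["algebraic_expansion"],
    "algebraic_expansion": [],
    "triangle_area": [],
}


def extract_subgraph(parents: Dict[str, List[str]],
                     seeds: Iterable[str]) -> Dict[str, List[str]]:
    """Return the seed nodes plus all of their ancestors, with parent edges."""
    visited: Set[str] = set()
    stack = list(seeds)
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(parents.get(node, []))
    return {node: parents.get(node, []) for node in visited}


# Assume the question was matched to a single knowledge point (illustrative).
subgraph = extract_subgraph(PARENTS, ["quadratic_formula"])
print(sorted(subgraph))
```

In this toy graph the unrelated node `triangle_area` is left out, which is the point of passing the model only the dependencies relevant to the question at hand.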

[313] Polymath: A Self-Optimizing Agent with Dynamic Hierarchical Workflow

Chia-Tung Ho, Jing Gong, Xufeng Yao, Yunsheng Bai, Abhishek B Akkur, Haoxing Ren

Main category: cs.AI

TL;DR: Polymath is a self-optimizing agent with dynamic hierarchical workflows, improving performance by 8.1% over baselines without labeled data.

Details

Motivation: Manual embedding of foundation models into agentic systems limits scalability and efficiency, and existing methods rely on labeled datasets, which are ineffective for dynamic problems.

Method: Polymath uses task flow graphs and code-represented workflows, optimized via multi-grid-inspired graph optimization and a self-reflection-guided evolutionary algorithm.

Result: Achieves 8.1% average improvement over state-of-the-art baselines on six benchmark datasets.

Conclusion: Polymath addresses the limitations of existing methods by enabling self-optimization without labeled data, proving effective for dynamic problems.

Abstract: Large language models (LLMs) excel at solving complex tasks by executing agentic workflows composed of detailed instructions and structured operations. Yet, building general-purpose agents by manually embedding foundation models into agentic systems such as Chain-of-Thought, Self-Reflection, and ReACT through text interfaces limits scalability and efficiency. Recently, many researchers have sought to automate the generation and optimization of these workflows through code-based representations. However, existing methods often rely on labeled datasets to train and optimize workflows, making them ineffective and inflexible for solving real-world, dynamic problems where labeled data is unavailable. To address this challenge, we introduce Polymath, a self-optimizing agent with dynamic hierarchical workflow that leverages the flexibility of task flow graphs and the expressiveness of code-represented workflows to solve a wide range of real-world, dynamic problems. The proposed optimization methodology integrates multi-grid-inspired graph optimization with a self-reflection-guided evolutionary algorithm to refine workflows without labeled data. Experimental results on six benchmark datasets across coding, math, and multi-turn QA tasks show that Polymath achieves 8.1% average improvement over state-of-the-art baselines.

[314] Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science

Jiayan Nan, Wenquan Ma, Wenlong Wu, Yize Chen

Main category: cs.AI

TL;DR: Nemori is a self-organizing memory architecture for LLMs, addressing memory granularity and adaptive learning through principles inspired by human cognition, outperforming existing systems in long-term contexts.

Details

Motivation: LLMs lack persistent memory for long-term interactions, and existing memory systems are limited by arbitrary granularity and passive knowledge extraction.

Method: Nemori uses the Two-Step Alignment Principle for organizing conversational streams into coherent episodes and the Predict-Calibrate Principle for adaptive learning from prediction gaps.

Result: Nemori outperforms state-of-the-art systems on LoCoMo and LongMemEval benchmarks, especially in longer contexts.

Conclusion: Nemori provides a viable solution for enhancing LLMs’ long-term memory and adaptive learning capabilities.

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities, yet their inability to maintain persistent memory in long contexts limits their effectiveness as autonomous agents in long-term interactions. While existing memory systems have made progress, their reliance on arbitrary granularity for defining the basic memory unit and passive, rule-based mechanisms for knowledge extraction limits their capacity for genuine learning and evolution. To address these foundational limitations, we present Nemori, a novel self-organizing memory architecture inspired by human cognitive principles. Nemori’s core innovation is twofold: First, its Two-Step Alignment Principle, inspired by Event Segmentation Theory, provides a principled, top-down method for autonomously organizing the raw conversational stream into semantically coherent episodes, solving the critical issue of memory granularity. Second, its Predict-Calibrate Principle, inspired by the Free-energy Principle, enables the agent to proactively learn from prediction gaps, moving beyond pre-defined heuristics to achieve adaptive knowledge evolution. This offers a viable path toward handling the long-term, dynamic workflows of autonomous agents. Extensive experiments on the LoCoMo and LongMemEval benchmarks demonstrate that Nemori significantly outperforms prior state-of-the-art systems, with its advantage being particularly pronounced in longer contexts.

Fuqing Bie, Shiyu Huang, Xijia Tao, Zhiqin Fang, Leyi Pan, Junzhe Chen, Min Ren, Liuyu Xiang, Zhaofeng He

Main category: cs.AI

TL;DR: OmniPlay is a diagnostic benchmark for evaluating multi-modal agentic models, revealing their strengths in memory tasks but weaknesses in reasoning and planning due to brittle fusion mechanisms.

Details

Motivation: Existing evaluations for multi-modal models lack dynamic, interactive testing, ignoring auditory and temporal cues, creating a gap in assessing true cross-modal reasoning.

Method: OmniPlay introduces five game environments to test synergy and conflict scenarios, evaluating six omni-modal models.

Result: Models show superhuman memory performance but fail in reasoning and planning due to brittle fusion, with a paradox where less sensory input improves performance.

Conclusion: Robust AGI requires focusing on synergistic fusion beyond scaling, as highlighted by OmniPlay’s findings.

Abstract: While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive “less is more” paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark.

cs.SD

[316] Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS

Vignesh Ethiraj, Ashwath David, Sidhanth Menon, Divya Vijay

Main category: cs.SD

TL;DR: A low-latency telecom AI voice agent pipeline is introduced, combining specialized models (TSLAM, T-VEC, TTE, T-Synth) for real-time, domain-adapted voice AI in telecom applications.

Details

Motivation: To enable advanced, low-latency voice AI for telecom use cases like call center automation and customer support, addressing the need for responsive, domain-specific solutions.

Method: Integration of four specialized models (TSLAM, T-VEC, TTE, T-Synth) for streaming ASR, conversational intelligence, RAG, and real-time TTS, evaluated on a telecom-specific dataset.

Result: The system achieves real-time factors (RTF) below 1.0, demonstrating low latency and high domain relevance for telecom applications.

Conclusion: The pipeline sets a benchmark for telecom voice assistants, enabling next-generation AI-driven customer support and diagnostics.

Abstract: We introduce a low-latency telecom AI voice agent pipeline for real-time, interactive telecommunications use, enabling advanced voice AI for call center automation, intelligent IVR (Interactive Voice Response), and AI-driven customer support. The solution is built for telecom, combining four specialized models by NetoAI: TSLAM, a 4-bit quantized Telecom-Specific Large Language Model (LLM); T-VEC, a Telecom-Specific Embedding Model; TTE, a Telecom-Specific Automatic Speech Recognition (ASR) model; and T-Synth, a Telecom-Specific Text-to-Speech (TTS) model. These models enable highly responsive, domain-adapted voice AI agents supporting knowledge-grounded spoken interactions with low latency. The pipeline integrates streaming ASR (TTE), conversational intelligence (TSLAM), retrieval augmented generation (RAG) over telecom documents, and real-time TTS (T-Synth), setting a new benchmark for telecom voice assistants. To evaluate the system, we built a dataset of 500 human-recorded telecom questions from RFCs, simulating real telecom agent queries. This framework allows analysis of latency, domain relevance, and real-time performance across the stack. Results show that TSLAM, TTE, and T-Synth deliver real-time factors (RTF) below 1.0, supporting enterprise, low-latency telecom deployments. These AI agents – powered by TSLAM, TTE, and T-Synth – provide a foundation for next-generation telecom AI, enabling automated customer support, diagnostics, and more.
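The real-time factor cited above is conventionally the ratio of processing time to audio duration, so a value below 1.0 means a component runs faster than real time. A minimal sketch of that calculation follows; the function name and the sample timings are illustrative assumptions, not figures from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings for illustration only (not reported in the paper).
rtf = real_time_factor(processing_seconds=2.5, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1.0}")
```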

[317] Wearable Music2Emotion: Assessing Emotions Induced by AI-Generated Music through Portable EEG-fNIRS Fusion

Sha Zhao, Song Yi, Yangxuan Zhou, Jiadong Pan, Jiquan Wang, Jie Xia, Shijian Li, Shurong Dong, Gang Pan

Main category: cs.SD

TL;DR: MEEtBrain is a portable, multimodal framework for emotion analysis using AI-generated music and synchronized EEG-fNIRS signals, addressing limitations of prior methods.

Details

Motivation: Overcome constraints in music-based affective computing, including stimulus limitations, modality specificity, and portability issues.

Method: Integrates AI-generated music with EEG-fNIRS acquisition via a wireless headband, using dry electrodes for portability.

Result: Collected a 14-hour dataset from 20 participants, validating efficacy, with plans to expand and share the dataset publicly.

Conclusion: MEEtBrain offers a scalable, portable solution for emotion analysis, promoting further research with its open dataset.

Abstract: Emotions critically influence mental health, driving interest in music-based affective computing via neurophysiological signals with Brain-computer Interface techniques. While prior studies leverage music’s accessibility for emotion induction, three key limitations persist: (1) Stimulus Constraints: Music stimuli are confined to small corpora due to copyright and curation costs, with selection biases from heuristic emotion-music mappings that ignore individual affective profiles. (2) Modality Specificity: Overreliance on unimodal neural data (e.g., EEG) ignores complementary insights from cross-modal signal fusion. (3) Portability Limitation: Cumbersome setups (e.g., 64+ channel gel-based EEG caps) hinder real-world applicability due to procedural complexity and portability barriers. To address these limitations, we propose MEEtBrain, a portable and multimodal framework for emotion analysis (valence/arousal), integrating AI-generated music stimuli with synchronized EEG-fNIRS acquisition via a wireless headband. With MEEtBrain, the music stimuli can be automatically generated by AI on a large scale, eliminating subjective selection biases while ensuring music diversity. We use a portable device that we developed, designed as a lightweight headband with dry electrodes, to simultaneously collect EEG and fNIRS recordings. A 14-hour dataset from 20 participants was collected in the first recruitment to validate the framework’s efficacy, with AI-generated music eliciting target emotions (valence/arousal). We are actively expanding our multimodal dataset (44 participants in the latest dataset) and making it publicly available to promote further research and practical applications. The dataset is available at https://zju-bmi-lab.github.io/ZBra.

[318] Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation

Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu

Main category: cs.SD

TL;DR: A novel RL framework for lyric-to-song generation reduces content hallucination using preference optimization, achieving significant PER reductions.

Details

Motivation: Addressing content hallucination and misalignment in lyric-to-song generation models, which undermines musical coherence.

Method: Proposes a reinforcement learning framework with three preference optimization strategies (DPO, PPO, GRPO) and a hallucination preference dataset.

Result: DPO reduces PER by 7.4%, PPO by 4.9%, and GRPO by 4.7%, effectively suppressing hallucinations while preserving musical quality.

Conclusion: The RL-based framework systematically controls hallucinations and offers potential for enhancing music style adherence and musicality.

Abstract: Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework’s transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.
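Phoneme error rate is usually computed as the edit distance between a reference and a hypothesis phoneme sequence, divided by the reference length. The sketch below follows that common convention; how the paper obtains phonemes from generated audio and how it normalizes PER are not stated here, so the function and the example phonemes are assumptions for illustration.

```python
from typing import Sequence


def phoneme_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Levenshtein distance between phoneme sequences, divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] is the edit distance between the reference prefix processed so far
    # and hypothesis[:j]; it starts as the cost of inserting j phonemes.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]  # deleting i reference phonemes to match an empty hypothesis
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical phoneme sequences: one substitution out of four phonemes.
reference = ["HH", "AH", "L", "OW"]
hypothesis = ["HH", "AH", "L", "AW"]
print(phoneme_error_rate(reference, hypothesis))
```

A lower PER for one sample than another generated from the same lyrics is the kind of signal that could rank them as preferred and rejected when building a preference dataset.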

[319] SpectroStream: A Versatile Neural Codec for General Audio

Yunpeng Li, Kehang Han, Brian McWilliams, Zalan Borsos, Marco Tagliasacchi

Main category: cs.SD

TL;DR: SpectroStream is a neural audio codec for 48 kHz stereo music at 4-16 kbps, improving on SoundStream with time-frequency domain representation and delayed-fusion for multi-channel audio.

Details

Motivation: Extend SoundStream's capabilities to handle higher sample rates (48 kHz) and stereo audio while maintaining quality at low bit rates.

Method: Uses a neural architecture with time-frequency domain representation and delayed-fusion strategy for multi-channel audio.

Result: Achieves high-quality reconstruction of 48 kHz stereo music at 4-16 kbps.

Conclusion: SpectroStream advances audio codec technology by supporting higher sample rates and multi-channel audio with improved quality.

Abstract: We propose SpectroStream, a full-band multi-channel neural audio codec. Successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4–16 kbps. This is accomplished with a new neural architecture that leverages audio representation in the time-frequency domain, which leads to better audio quality, especially at higher sample rates. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial in balancing per-channel acoustic quality and cross-channel phase consistency.
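To put the 4 to 16 kbps range in perspective, the sketch below compares it with uncompressed PCM at 48 kHz stereo. The 16-bit sample depth is an assumption made for illustration; the abstract does not state the bit depth of the source audio.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, channels: int, bits_per_sample: int) -> float:
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * channels * bits_per_sample / 1000


# 48 kHz stereo from the abstract; 16 bits per sample is assumed.
raw_kbps = pcm_bitrate_kbps(sample_rate_hz=48_000, channels=2, bits_per_sample=16)

for codec_kbps in (4, 16):
    ratio = raw_kbps / codec_kbps
    print(f"{codec_kbps} kbps is about {ratio:.0f}x smaller than {raw_kbps:.0f} kbps PCM")
```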

[320] Estimating Musical Surprisal from Audio in Autoregressive Diffusion Model Noise Spaces

Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer

Main category: cs.SD

TL;DR: The paper explores using autoregressive diffusion models (ADMs) to estimate information content (IC) for modeling musical expectancy and surprisal, matching or exceeding a Generative Infinite-Vocabulary Transformer (GIVT) on monophonic pitch surprisal and segment boundary detection.

Details

Motivation: To investigate the effectiveness of IC modeling in audio using ADMs, comparing their performance to GIVT in capturing musical surprisal.

Method: Uses IC estimates from ADMs based on diffusion ODEs, evaluating them on monophonic pitch surprisal and segment boundary detection tasks.

Result: ADMs match or exceed GIVT in both tasks and describe diverse data better in terms of negative log-likelihood; task results improve further at appropriate diffusion noise levels.

Conclusion: ADMs are at least as effective as GIVT for IC-based musical surprisal modeling, with the choice of noise level influencing task performance.

Abstract: Recently, the information content (IC) of predictions from a Generative Infinite-Vocabulary Transformer (GIVT) has been used to model musical expectancy and surprisal in audio. We investigate the effectiveness of such modelling using IC calculated with autoregressive diffusion models (ADMs). We empirically show that IC estimates of models based on two different diffusion ordinary differential equations (ODEs) describe diverse data better, in terms of negative log-likelihood, than a GIVT. We evaluate diffusion model IC’s effectiveness in capturing surprisal aspects by examining two tasks: (1) capturing monophonic pitch surprisal, and (2) detecting segment boundaries in multi-track audio. In both tasks, the diffusion models match or exceed the performance of a GIVT. We hypothesize that the surprisal estimated at different diffusion process noise levels corresponds to the surprisal of music and audio features present at different audio granularities. Testing our hypothesis, we find that, for appropriate noise levels, the studied musical surprisal tasks’ results improve. Code is provided on github.com/SonyCSLParis/audioic.
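Information content, or surprisal, is the negative log-probability a model assigns to the event that actually occurs. The models in this abstract work on continuous audio representations, where IC is a negative log-likelihood; the sketch below uses a discrete toy distribution purely to illustrate the definition, and the pitch probabilities are invented.

```python
import math


def information_content(probability: float) -> float:
    """Surprisal in bits: the negative base-2 log of the event's probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical predictive distribution over the next pitch (illustration only).
next_pitch_probs = {"C4": 0.5, "E4": 0.25, "G4": 0.125, "F#4": 0.125}

for pitch, p in next_pitch_probs.items():
    print(f"{pitch}: p = {p}, IC = {information_content(p)} bits")
```

Less probable continuations carry more bits, which is why IC is used as a proxy for how surprising a musical event is.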

[321] A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding

Runchuan Ye, Yixuan Zhou, Renjie Yu, Zijian Lin, Kehan Li, Xiang Li, Xin Liu, Guoyang Zeng, Zhiyong Wu

Main category: cs.SD

TL;DR: The paper introduces NonVerbalSpeech-38K, a large dataset for non-verbal speech understanding and generation, validated by fine-tuning state-of-the-art models.

Details

Motivation: Existing speech systems lack non-verbal vocalizations (NVs) like laughter or sighs, limiting emotional intelligence and communicative richness.

Method: A dataset of 38,718 samples (131 hours) with 10 NV categories was collected from real-world media and annotated automatically. State-of-the-art models (F5-TTS, Qwen2-Audio) were fine-tuned for validation.

Result: The dataset improves non-verbal speech synthesis and captioning, enhancing human-computer interaction.

Conclusion: The work provides a pipeline for dataset creation, releases a large-scale dataset, and validates its effectiveness, advancing NV research.

Abstract: Human spoken communication involves not only lexical content but also non-verbal vocalizations (NVs) such as laughter, sighs, and coughs, which convey emotions, intentions, and social signals. However, most existing speech systems focus solely on verbal content and lack the ability to understand and generate such non-verbal cues, reducing the emotional intelligence and communicative richness of spoken interfaces. In this work, we introduce NonVerbalSpeech-38K, a large and diverse dataset for non-verbal speech generation and understanding, collected from real-world media and annotated using an automatic pipeline. The dataset contains 38,718 samples (about 131 hours) with 10 categories of non-verbal cues, such as laughter, sniff, and throat clearing. We further validate the dataset by fine-tuning state-of-the-art models, including F5-TTS and Qwen2-Audio, demonstrating its effectiveness in non-verbal speech generation and understanding tasks. Our contributions are threefold: (1) We propose a practical pipeline for building natural and diverse non-verbal speech datasets; (2) We release a large-scale dataset to advance research on non-verbal speech generation and understanding; (3) We validate the dataset’s effectiveness by demonstrating improvements in both non-verbal speech synthesis and captioning, thereby facilitating richer human-computer interaction.

[322] SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription

Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg

Main category: cs.SD

TL;DR: SPGISpeech 2.0 is an enhanced dataset for speaker-tagged transcription in finance, adding 3,780 hours of earnings call audio with transcriptions and speaker info, improving multi-talker ASR performance.

Details

Motivation: To expand the diversity of modeling tasks in speech recognition while retaining the core features of the original SPGISpeech dataset.

Method: The dataset includes professionally transcribed earnings calls with call and speaker metadata, enabling multi-talker ASR.

Result: Fine-tuning on SPGISpeech 2.0 improves speaker-tagged ASR performance of existing models.

Conclusion: SPGISpeech 2.0, freely available for non-commercial use, aims to advance speech recognition research and inspire diverse applications.

Abstract: We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.

[323] Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries

Serkan Sulun, Paula Viana, Matthew E. P. Davies

Main category: cs.SD

TL;DR: EMSYNC is a video-based music generation model that aligns music with video emotions and timing using a two-stage framework, outperforming existing models.

Details

Motivation: To create music that aligns with a video's emotional content and temporal boundaries, addressing limitations in existing models.

Method: Uses a pretrained video emotion classifier for emotional features and a conditional music generator for MIDI sequences, with boundary offsets for timing alignment.

Result: Outperforms state-of-the-art models in subjective listening tests for both music theory-aware and general listeners.

Conclusion: EMSYNC effectively combines emotional and temporal alignment for video-based music generation, achieving superior results.

Abstract: We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video’s emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for music theory-aware participants as well as general listeners.

[324] AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Le Wang, Jun Wang, Chunyu Qiang, Feng Deng, Chen Zhang, Di Zhang, Kun Gai

Main category: cs.SD

TL;DR: AudioGen-Omni is a unified model using multimodal diffusion transformers for high-fidelity audio, speech, and song generation synchronized with video, achieving state-of-the-art results.

Details

Motivation: To create a versatile model for generating diverse audio types (speech, song) coherently aligned with video, overcoming limitations of text-frozen paradigms.

Method: Uses a joint training paradigm with multimodal inputs, a unified lyrics-transcription encoder, and AdaLN-based joint attention with PAAPI for cross-modal alignment.

Result: Achieves high audio quality, semantic alignment, and lip-sync accuracy, with fast inference (1.91s for 8s audio).

Conclusion: AudioGen-Omni offers efficient, generalizable, and high-quality audio generation across tasks.

Abstract: We present AudioGen-Omni - a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and song coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both song and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
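As a worked check of the efficiency figure, and assuming the usual definition of real-time factor as processing time over audio duration (a term the abstract itself does not use), the reported timing corresponds to:

```latex
\[
\mathrm{RTF}
  = \frac{t_{\text{inference}}}{t_{\text{audio}}}
  = \frac{1.91\,\text{s}}{8\,\text{s}}
  \approx 0.24
\]
```

That is, generation runs roughly 4.2 times faster than real time (8 / 1.91 ≈ 4.19). The hardware behind this measurement is not given in the abstract, so the figure should not be compared across systems without that context.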

[325] Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework

Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato

Main category: cs.SD

TL;DR: LAVA is a hierarchical framework for detecting and attributing audio deepfakes, achieving high accuracy in identifying generation technologies and specific models.

Details

Motivation: The increasing threat of audio deepfakes to digital trust necessitates advanced detection and attribution methods.

Method: LAVA uses a convolutional autoencoder for latent representations and two classifiers (ADA and ADMR) for technology and model recognition, with confidence-based rejection for robustness.

Result: The ADA classifier achieves F1-scores over 95% across all datasets and the ADMR module reaches 96.31% macro F1 across six classes, with robustness confirmed on unseen attacks.

Conclusion: LAVA advances deepfake attribution under open-set conditions, validated on benchmarks with publicly available models and code.

Abstract: The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA’s robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at https://www.github.com/adipiz99/lava-framework.
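Two ideas in this abstract lend themselves to a short illustration: confidence-based rejection for open-set inputs, and macro-averaged F1. The sketch below is not the LAVA implementation; the threshold value, class names, and probabilities are hypothetical.

```python
from typing import Dict, List, Optional


def predict_with_rejection(probs: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the top class, or None (reject) if its confidence is below threshold."""
    label, confidence = max(probs.items(), key=lambda item: item[1])
    return label if confidence >= threshold else None


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical classifier outputs; 0.8 is an arbitrary illustrative threshold.
confident = predict_with_rejection({"tts_a": 0.9, "vocoder_b": 0.1}, threshold=0.8)
uncertain = predict_with_rejection({"tts_a": 0.55, "vocoder_b": 0.45}, threshold=0.8)
print(confident, uncertain)

score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(round(score, 4))
```

Rejecting low-confidence predictions trades coverage for reliability: an input from an unseen generator is more likely to be flagged as unknown than forced into one of the known classes.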

cs.LG

[326] NAEx: A Plug-and-Play Framework for Explaining Network Alignment

Shruti Saxena, Arijit Khan, Joydeep Chandra

Main category: cs.LG

TL;DR: NAEx is a model-agnostic framework for explaining network alignment decisions by identifying influential subgraphs and features, ensuring interpretability and trust.

Details

Motivation: Limited interpretability of network alignment models hinders trust, especially in high-stakes domains.

Method: NAEx jointly parameterizes graph structures and feature spaces using learnable masks and optimizes for faithful explanations.

Result: NAEx effectively explains alignment decisions and integrates with four NA models, demonstrating efficiency on benchmarks.

Conclusion: NAEx enhances interpretability and trust in network alignment models through faithful, model-agnostic explanations.

Abstract: Network alignment (NA) identifies corresponding nodes across multiple networks, with applications in domains like social networks, co-authorship, and biology. Despite advances in alignment models, their interpretability remains limited, making it difficult to understand alignment decisions and posing challenges in building trust, particularly in high-stakes domains. To address this, we introduce NAEx, a plug-and-play, model-agnostic framework that explains alignment models by identifying key subgraphs and features influencing predictions. NAEx addresses the key challenge of preserving the joint cross-network dependencies on alignment decisions by: (1) jointly parameterizing graph structures and feature spaces through learnable edge and feature masks, and (2) introducing an optimization objective that ensures explanations are both faithful to the original predictions and enable meaningful comparisons of structural and feature-based similarities between networks. NAEx is an inductive framework that efficiently generates NA explanations for previously unseen data. We introduce evaluation metrics tailored to alignment explainability and demonstrate NAEx’s effectiveness and efficiency on benchmark datasets by integrating it with four representative NA models.

[327] LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation

Xiaoqi Dong, Xiangyu Zhou, Nicholas Evans, Yujia Lin

Main category: cs.LG

TL;DR: LumiGen is a novel LVLM-enhanced iterative framework that improves T2I generation by integrating LVLM-driven feedback for fine-grained control and semantic consistency, outperforming baselines on the LongBench-T2I Benchmark.

Details

Motivation: Existing T2I models struggle with complex instructions, fine-grained control, and semantic consistency, while LVLMs show strong cross-modal understanding. LumiGen aims to bridge this gap.

Method: LumiGen uses an Intelligent Prompt Parsing & Augmentation (IPPA) module for prompt enhancement and an Iterative Visual Feedback & Refinement (IVFR) module for iterative image correction.

Result: LumiGen achieves a superior average score of 3.08 on LongBench-T2I, excelling in text rendering and pose expression.

Conclusion: LVLM integration in LumiGen significantly enhances T2I model performance, validating its effectiveness for controllable, high-quality image generation.

Abstract: Text-to-Image (T2I) generation has made significant advancements with diffusion models, yet challenges persist in handling complex instructions, ensuring fine-grained content control, and maintaining deep semantic consistency. Existing T2I models often struggle with tasks like accurate text rendering, precise pose generation, or intricate compositional coherence. Concurrently, Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. We propose LumiGen, a novel LVLM-enhanced iterative framework designed to elevate T2I model performance, particularly in areas requiring fine-grained control, through a closed-loop, LVLM-driven feedback mechanism. LumiGen comprises an Intelligent Prompt Parsing & Augmentation (IPPA) module for proactive prompt enhancement and an Iterative Visual Feedback & Refinement (IVFR) module, which acts as a “visual critic” to iteratively correct and optimize generated images. Evaluated on the challenging LongBench-T2I Benchmark, LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines. Notably, our framework demonstrates significant improvements in critical dimensions such as text rendering and pose expression, validating the effectiveness of LVLM integration for more controllable and higher-quality image generation.

[328] MissMecha: An All-in-One Python Package for Studying Missing Data Mechanisms

Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal

Main category: cs.LG

TL;DR: MissMecha is a Python toolkit for simulating, visualizing, and evaluating missing data under MCAR, MAR, and MNAR assumptions, supporting both numerical and categorical features.

Details

Motivation: Addressing the fragmented and limited tools for simulating missingness in real-world datasets, which often overlook heterogeneous data types.

Method: Developed MissMecha, an open-source toolkit with visual diagnostics, MCAR testing, and type-aware imputation evaluation metrics.

Result: MissMecha provides a unified platform for mechanism-aware studies on mixed-type tabular data.

Conclusion: MissMecha supports data quality research, benchmarking, and education, offering a comprehensive solution for incomplete data analysis.

Abstract: Incomplete data is a persistent challenge in real-world datasets, often governed by complex and unobservable missing mechanisms. Simulating missingness has become a standard approach for understanding its impact on learning and analysis. However, existing tools are fragmented, mechanism-limited, and typically focus only on numerical variables, overlooking the heterogeneous nature of real-world tabular data. We present MissMecha, an open-source Python toolkit for simulating, visualizing, and evaluating missing data under MCAR, MAR, and MNAR assumptions. MissMecha supports both numerical and categorical features, enabling mechanism-aware studies across mixed-type tabular datasets. It includes visual diagnostics, MCAR testing utilities, and type-aware imputation evaluation metrics. Designed to support data quality research, benchmarking, and education, MissMecha offers a unified platform for researchers and practitioners working with incomplete data.
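The difference between the three mechanisms is easiest to see in a toy simulation. The sketch below uses only the Python standard library and does not use or imitate the MissMecha API; the helper `mask_column`, the column names, the cutoffs, and the rates are all illustrative assumptions.

```python
import random

def mask_column(rows, col, rate, rng, eligible=lambda row: True):
    """Return copies of `rows` where `col` is set to None with probability
    `rate`, but only in rows for which `eligible(row)` is true."""
    masked = []
    for row in rows:
        row = dict(row)  # copy so the complete data stays untouched
        if eligible(row) and rng.random() < rate:
            row[col] = None
        masked.append(row)
    return masked

rng = random.Random(0)
rows = [{"age": rng.randint(20, 80), "income": rng.randint(20, 120)}
        for _ in range(1000)]

# MCAR: missingness is independent of every value.
mcar = mask_column(rows, "income", 0.3, rng)
# MAR: missingness in income depends only on the observed age column.
mar = mask_column(rows, "income", 0.6, rng, eligible=lambda r: r["age"] > 50)
# MNAR: missingness in income depends on the income value itself.
mnar = mask_column(rows, "income", 0.6, rng, eligible=lambda r: r["income"] > 70)
```

Under MNAR the cause of missingness (a high income) is exactly the value that disappears, so it cannot be recovered from the masked table alone. That is why simulation from complete data is the standard way to study it.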

[329] Edge-Assisted Collaborative Fine-Tuning for Multi-User Personalized Artificial Intelligence Generated Content (AIGC)

Nan Li, Wanting Yang, Marie Siew, Zehui Xiong, Binbin Chen, Shiwen Mao, Kwok-Yan Lam

Main category: cs.LG

TL;DR: A novel cluster-aware hierarchical federated aggregation framework is proposed to address the inefficiency and scalability issues in edge-AIGC, leveraging LoRA for personalized content generation while ensuring privacy and communication efficiency.

Details

Motivation: Existing edge-AIGC applications struggle with efficiency, scalability, and privacy in multi-user scenarios, prompting the need for a better solution.

Method: The framework clusters clients by task similarity, performs intra-cluster aggregation for personalization, and enables inter-cluster knowledge interaction. It uses LoRA for local fine-tuning and FL for collaborative training.

Result: The framework achieves accelerated convergence and practical viability for scalable multi-user personalized AIGC services under edge constraints.

Conclusion: The proposed framework effectively balances personalization, efficiency, and privacy in edge-AIGC, demonstrating its potential for real-world deployment.

Abstract: Diffusion models (DMs) have emerged as powerful tools for high-quality content generation, yet their intensive computational requirements for inference pose challenges for resource-constrained edge devices. Cloud-based solutions aid in computation but often fall short in addressing privacy risks, personalization efficiency, and communication costs in multi-user edge-AIGC scenarios. To bridge this gap, we first analyze existing edge-AIGC applications in personalized content synthesis, revealing their limitations in efficiency and scalability. We then propose a novel cluster-aware hierarchical federated aggregation framework. Based on parameter-efficient local fine-tuning via Low-Rank Adaptation (LoRA), the framework first clusters clients based on the similarity of their uploaded task requirements, followed by an intra-cluster aggregation for enhanced personalization at the server-side. Subsequently, an inter-cluster knowledge interaction paradigm is implemented to enable hybrid-style content generation across diverse clusters. Building upon federated learning (FL) collaboration, our framework simultaneously trains personalized models for individual users at the devices and a shared global model enhanced with multiple LoRA adapters on the server, enabling efficient edge inference; meanwhile, all prompts for clustering and inference are encoded prior to transmission, thereby further mitigating the risk of plaintext leakage. Our evaluations demonstrate that the framework achieves accelerated convergence while maintaining practical viability for scalable multi-user personalized AIGC services under edge constraints.
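The cluster-then-aggregate step can be sketched as follows. The abstract does not specify the clustering rule or the aggregation weights, so this sketch assumes a greedy cosine-similarity threshold over encoded task embeddings and an unweighted mean of flattened LoRA updates; all names and numbers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster_clients(task_embeddings, threshold=0.8):
    """Greedy clustering: a client joins the first cluster whose seed
    (first member) is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for cid, emb in task_embeddings.items():
        for members in clusters:
            if cosine(emb, task_embeddings[members[0]]) >= threshold:
                members.append(cid)
                break
        else:
            clusters.append([cid])
    return clusters

def aggregate(cluster, lora_updates):
    """Unweighted element-wise mean of the flattened LoRA updates."""
    n = len(cluster)
    dim = len(lora_updates[cluster[0]])
    return [sum(lora_updates[c][i] for c in cluster) / n for i in range(dim)]

# Toy encoded task requirements and flattened LoRA updates per client.
task_embeddings = {"u1": [1.0, 0.0], "u2": [0.9, 0.1], "u3": [0.0, 1.0]}
lora_updates = {"u1": [0.2, 0.4], "u2": [0.4, 0.0], "u3": [1.0, 1.0]}

clusters = cluster_clients(task_embeddings)
cluster_adapters = [aggregate(c, lora_updates) for c in clusters]
```

Keeping one aggregated adapter per cluster, rather than one global average, is what lets clients with similar tasks share updates without being diluted by unrelated ones.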

Pengtao Dang, Tingbo Guo, Sha Cao, Chi Zhang

Main category: cs.LG

TL;DR: The paper introduces a Large Multi-Modal Model (LMMM) and a framework (M3F) for few-shot learning (FSL), outperforming meta-learning methods, supported by a diverse dataset (M3FD).

Details

Motivation: FSL is vital in data-scarce scientific fields, but existing methods lack generalization. The study aims to improve FSL using multi-modal models.

Method: Developed M3F, a modular LMMM framework, and M3FD, a diverse dataset (10K+ samples). Fine-tuned LMMM on M3FD for FSL tasks.

Result: M3F outperforms conventional meta-learning methods, demonstrating improved generalization for FSL in scientific applications.

Conclusion: M3F and M3FD provide a scalable, unified solution for FSL in data-scarce domains, with open-source tools for accessibility.

Abstract: Few-shot learning (FSL) is a machine learning paradigm that aims to generalize models from a small number of labeled examples, typically fewer than 10 per class. FSL is particularly crucial in biomedical, environmental, materials, and mechanical sciences, where samples are limited and data collection is often prohibitively costly, time-consuming, or ethically constrained. In this study, we present an innovative approach to FSL by demonstrating that a Large Multi-Modal Model (LMMM), trained on a set of independent tasks spanning diverse domains, task types, and input modalities, can substantially improve the generalization of FSL models, outperforming models based on conventional meta-learning on tasks of the same type. To support this, we first constructed a Multi-Modal Model Few-shot Dataset (M3FD, over 10K few-shot samples), which includes 2D RGB images, 2D/3D medical scans, tabular and time-course datasets, from which we manually curated FSL tasks such as classification. We further introduced M3F (Multi-Modal Model for Few-shot learning framework), a novel Large Multi-Modal Model framework tailored for data-constrained scientific applications. M3F supports a wide range of scientific data types through a modular pipeline. By fine-tuning the model on M3FD, M3F improves model performance, making LMMM feasible for real-world FSL deployment. The source code is located at https://github.com/ptdang1001/M3F. To democratize access to complex FSL data and promote reproducibility for public usage, M3FD is paired with a flexible and user-friendly tool that enables efficient querying, task-specific sampling, and preprocessing. Together, our dataset and framework offer a unified, scalable solution that significantly lowers the barrier to applying LMMMs in data-scarce scientific domains.

[331] AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models

Xuan Lin, Long Chen, Yile Wang

Main category: cs.LG

TL;DR: AttriLens-Mol is an attribute-guided reinforcement learning framework for molecular property prediction with LLMs, improving performance and interpretability by steering reasoning with structured rewards.

Details

Motivation: Current LLM-based methods for molecular property prediction rely on human prompts and verbose reasoning, lacking relevance. AttriLens-Mol addresses this by guiding the model's reasoning with structured rewards.

Method: AttriLens-Mol uses three rewards: format (structured output), count (avoid irrelevant attributes), and rationality (verified relatedness). It trains 7B-size models on 4,000 samples.

Result: The method outperforms supervised fine-tuning and advanced models (GPT-3.5, GPT-4o, etc.) on in-distribution and out-of-distribution datasets. Extracted attributes also improve interpretable decision tree performance.

Conclusion: AttriLens-Mol effectively elicits relevant molecular attributes, enhancing both performance and interpretability in molecular property prediction.

Abstract: Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended “thinking” process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model’s reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model’s inherent knowledge of relevant molecular attributes during reasoning, enabling more effective prediction of molecular properties. Experiments on both in-distribution and out-of-distribution datasets show that training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts performance, achieving comparable or better results than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code at https://github.com/szu-tera/AttriLens-Mol.

[332] MoMA: A Mixture-of-Multimodal-Agents Architecture for Enhancing Clinical Prediction Modelling

Jifan Gao, Mahmudur Rahman, John Caskey, Madeline Oguss, Ann O’Rourke, Randy Brown, Anne Stey, Anoop Mayampurath, Matthew M. Churpek, Guanhua Chen, Majid Afshar

Main category: cs.LG

TL;DR: MoMA is a novel architecture using multiple LLM agents to integrate multimodal EHR data for clinical predictions, outperforming state-of-the-art methods.

Details

Motivation: Effectively integrating diverse EHR data modalities for clinical prediction is challenging due to high data requirements.

Method: MoMA uses specialist LLM agents to convert non-textual data into structured summaries, an aggregator LLM to unify them, and a predictor LLM for clinical predictions.

Result: MoMA outperforms current methods in accuracy and flexibility across three real-world prediction tasks.

Conclusion: MoMA demonstrates superior performance in leveraging multimodal EHR data for clinical predictions.

Abstract: Multimodal electronic health record (EHR) data provide richer, complementary insights into patient health compared to single-modality data. However, effectively integrating diverse data modalities for clinical prediction modeling remains challenging due to the substantial data requirements. We introduce a novel architecture, Mixture-of-Multimodal-Agents (MoMA), designed to leverage multiple large language model (LLM) agents for clinical prediction tasks using multimodal EHR data. MoMA employs specialized LLM agents (“specialist agents”) to convert non-textual modalities, such as medical images and laboratory results, into structured textual summaries. These summaries, together with clinical notes, are combined by another LLM (“aggregator agent”) to generate a unified multimodal summary, which is then used by a third LLM (“predictor agent”) to produce clinical predictions. Evaluating MoMA on three prediction tasks using real-world datasets with different modality combinations and prediction settings, MoMA outperforms current state-of-the-art methods, highlighting its enhanced accuracy and flexibility across various tasks.

[333] PA-RNet: Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting

Chanjuan Liu, Shengzhi Wang, Enqiang Zhu

Main category: cs.LG

TL;DR: PA-RNet is a robust multimodal forecasting framework addressing textual noise in time series data, outperforming baselines with its perturbation-aware and cross-modal attention mechanisms.

Details

Motivation: Existing methods neglect textual perturbations, degrading performance; PA-RNet aims to mitigate this by handling noise and maintaining semantic integrity.

Method: PA-RNet uses a perturbation-aware projection module and cross-modal attention to separate noise from textual embeddings while preserving meaningful representations.

Result: PA-RNet outperforms state-of-the-art baselines across diverse domains, with theoretical guarantees of stability under noise.

Conclusion: PA-RNet effectively addresses textual noise in multimodal time series forecasting, enhancing robustness and generalization.

Abstract: In real-world applications, multimodal time series data often suffer from interference, especially in the textual modality. Existing methods for multimodal time series forecasting often neglect the inherent perturbations within textual data, where irrelevant, noisy, or ambiguous content can significantly degrade model performance, particularly when the noise exhibits varying intensity or stems from structural inconsistencies. To address this challenge, we propose PA-RNet (Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting), a robust multimodal forecasting framework. PA-RNet features a perturbation-aware projection module and a cross-modal attention mechanism to effectively separate noise from the textual embeddings while maintaining semantically meaningful representations, thereby enhancing the model’s generalization ability. Theoretically, we establish the Lipschitz continuity of PA-RNet with respect to textual inputs and prove that the proposed perturbation module can reduce expected prediction error, offering strong guarantees of stability under noisy conditions. Furthermore, we introduce a textual perturbation pipeline that can be seamlessly incorporated into existing multimodal time series forecasting tasks, allowing for systematic evaluation of the model’s robustness in the presence of varying levels of textual noise. Extensive experiments across diverse domains and temporal settings demonstrate that PA-RNet consistently outperforms state-of-the-art baselines.

[334] InfoQ: Mixed-Precision Quantization via Global Information Flow

Mehmet Emre Akbulut, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri

Main category: cs.LG

TL;DR: InfoQ is a training-free MPQ framework that measures layer sensitivity via mutual information changes, optimizing bit-width allocation efficiently for high compression and accuracy.

Details

Motivation: Current MPQ methods rely on expensive searches or local proxies, missing global quantization effects. InfoQ addresses this by focusing on layer impact on network-wide information flow.

Method: InfoQ quantizes layers at different bit-widths, measures mutual information changes in a single forward pass, and formulates bit-width allocation as an integer linear programming problem.

Result: Achieves superior search-time/accuracy trade-off, using less data, and improves accuracy by up to 1% for MobileNetV2 and ResNet18 on ImageNet at high compression rates.

Conclusion: InfoQ offers an efficient, training-free solution for MPQ, outperforming state-of-the-art methods in both speed and accuracy.

Abstract: Mixed-precision quantization (MPQ) is crucial for deploying deep neural networks on resource-constrained devices, but finding the optimal bit-width for each layer represents a complex combinatorial optimization problem. Current state-of-the-art methods rely on computationally expensive search algorithms or local sensitivity heuristic proxies like the Hessian, which fail to capture the cascading global effects of quantization error. In this work, we argue that the quantization sensitivity of a layer should not be measured by its local properties, but by its impact on the information flow throughout the entire network. We introduce InfoQ, a novel framework for MPQ that is training-free in the bit-width search phase. InfoQ assesses layer sensitivity by quantizing each layer at different bit-widths and measuring, through a single forward pass, the resulting change in mutual information in the subsequent layers. This quantifies how much each layer quantization impacts the network information flow. The resulting scores are used to formulate bit-width allocation as an integer linear programming problem, which is solved efficiently to minimize total sensitivity under a given budget (e.g., model size or BitOps). Our retraining-free search phase provides a superior search-time/accuracy trade-off (using two orders of magnitude less data compared to state-of-the-art methods such as LIMPQ), while yielding up to a 1% accuracy improvement for MobileNetV2 and ResNet18 on ImageNet at high compression rates (14X and 10.66X).
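The final allocation step, choosing one bit-width per layer to minimize total sensitivity under a size budget, can be sketched on a toy network. The sensitivity scores and parameter counts below are made up, and exhaustive enumeration stands in for the integer linear programming solver the abstract refers to; it gives the same optimum on a problem this small but does not scale.

```python
from itertools import product

def allocate_bits(sensitivity, params, budget_bits):
    """Pick one bit-width per layer, minimizing summed sensitivity subject
    to sum(params[layer] * bits) <= budget_bits.

    sensitivity[layer][bits] is the layer's score at that bit-width.
    Brute force over all combinations; a stand-in for an ILP solver."""
    layers = list(sensitivity)
    best_cost, best_choice = float("inf"), None
    for choice in product(*(sensitivity[l].keys() for l in layers)):
        size = sum(params[l] * b for l, b in zip(layers, choice))
        if size > budget_bits:
            continue  # violates the model-size budget
        cost = sum(sensitivity[l][b] for l, b in zip(layers, choice))
        if cost < best_cost:
            best_cost, best_choice = cost, dict(zip(layers, choice))
    return best_choice, best_cost

# Hypothetical scores: conv1 is small but sensitive, conv2 large but robust.
sensitivity = {
    "conv1": {2: 9.0, 4: 3.0, 8: 0.5},
    "conv2": {2: 2.0, 4: 1.0, 8: 0.2},
    "fc":    {2: 1.5, 4: 0.4, 8: 0.1},
}
params = {"conv1": 100, "conv2": 1000, "fc": 500}

# Budget equal to a uniform 4-bit model (1600 weights * 4 bits).
choice, cost = allocate_bits(sensitivity, params, budget_bits=6400)
```

With these toy numbers the optimum spends bits on the small, sensitive layer and takes them from the large, robust one, which is the behaviour a mixed-precision search is meant to find.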

[335] REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Nameer Hirschkind, Joseph Liu, Mahesh Kumar Nandwana, Xiao Yu

Main category: cs.LG

TL;DR: REINA, a novel loss function for Simultaneous Speech Translation (SimulST), optimizes the tradeoff between translation quality and latency by waiting for more input only when it provides useful information. It achieves SOTA streaming results for models of comparable size and improves the latency/quality tradeoff by up to 21%.

Details

Motivation: Balancing translation quality and latency in SimulST systems is challenging. The goal is to optimize this tradeoff by leveraging information theory.

Method: Introduces REINA, a loss function derived from information theory, to train an adaptive policy using a non-streaming translation model.

Result: Achieves SOTA results on French, Spanish, and German translations, improving the latency/quality tradeoff by up to 21%.

Conclusion: REINA effectively optimizes SimulST systems, demonstrating significant improvements in both quality and latency.

Abstract: Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
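The stated strategy, wait for more input only if doing so gains information, can be illustrated with a simple entropy comparison. This is not the REINA loss itself, whose exact form the abstract does not give. The sketch assumes that at training time a non-streaming model can produce next-token distributions from both the partial and the full audio; the function names and the `min_gain` threshold are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_wait(p_partial, p_full, min_gain=0.1):
    """Return (wait, gain), where gain is the drop in next-token entropy
    obtained by seeing the full input instead of the partial input."""
    gain = entropy(p_partial) - entropy(p_full)
    return gain > min_gain, gain

# Case 1: the partial audio leaves the next token ambiguous.
wait_1, gain_1 = should_wait([0.4, 0.3, 0.3], [0.9, 0.05, 0.05])
# Case 2: the model is already confident, so more audio adds little.
wait_2, gain_2 = should_wait([0.9, 0.05, 0.05], [0.92, 0.04, 0.04])
```

At inference time the full input is not available, so a streaming policy has to be trained to predict this kind of signal from the partial input alone.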

[336] Are Large Language Models Dynamic Treatment Planners? An In Silico Study from a Prior Knowledge Injection Angle

Zhiyao Luo, Tingting Zhu

Main category: cs.LG

TL;DR: LLMs show promise in clinical decision-making for insulin dosing but require careful prompt engineering and validation due to limitations like aggressive dosing and reasoning errors.

Details

Motivation: To evaluate the feasibility of using LLMs for dynamic insulin dosing in Type 1 diabetes, comparing their zero-shot performance to trained RL agents.

Method: Open-source LLMs were tested in a diabetes simulator, with zero-shot prompts compared to trained small neural network RL agents.

Result: Smaller LLMs matched or outperformed trained RL agents in stable patients but showed limitations like aggressive dosing and reasoning errors.

Conclusion: LLMs can complement clinical workflows but need hybrid approaches combining linguistic reasoning and structured modeling for safety and effectiveness.

Abstract: Reinforcement learning (RL)-based dynamic treatment regimes (DTRs) hold promise for automating complex clinical decision-making, yet their practical deployment remains hindered by the intensive engineering required to inject clinical knowledge and ensure patient safety. Recent advancements in large language models (LLMs) suggest a complementary approach, where implicit prior knowledge and clinical heuristics are naturally embedded through linguistic prompts without requiring environment-specific training. In this study, we rigorously evaluate open-source LLMs as dynamic insulin dosing agents in an in silico Type 1 diabetes simulator, comparing their zero-shot inference performance against small neural network-based RL agents (SRAs) explicitly trained for the task. Our results indicate that carefully designed zero-shot prompts enable smaller LLMs (e.g., Qwen2.5-7B) to achieve comparable or superior clinical performance relative to extensively trained SRAs, particularly in stable patient cohorts. However, LLMs exhibit notable limitations, such as overly aggressive insulin dosing when prompted with chain-of-thought (CoT) reasoning, highlighting critical failure modes including arithmetic hallucination, temporal misinterpretation, and inconsistent clinical logic. Incorporating explicit reasoning about latent clinical states (e.g., meals) yielded minimal performance gains, underscoring the current model’s limitations in capturing complex, hidden physiological dynamics solely through textual inference. Our findings advocate for cautious yet optimistic integration of LLMs into clinical workflows, emphasising the necessity of targeted prompt engineering, careful validation, and potentially hybrid approaches that combine linguistic reasoning with structured physiological modelling to achieve safe, robust, and clinically effective decision-support systems.

[337] Uncertainty-aware Predict-Then-Optimize Framework for Equitable Post-Disaster Power Restoration

Lin Jiang, Dahai Yu, Rongchao Xu, Tian Tang, Guang Wang

Main category: cs.LG

TL;DR: EPOPR is a predict-then-optimize framework for equitable power restoration, reducing outage duration by 3.60% and inequity by 14.19%.

Details

Motivation: Current power restoration methods are inequitable, as disadvantaged communities submit fewer requests, leading to longer outages.

Method: EPOPR combines Equity-Conformalized Quantile Regression for repair prediction and Spatial-Temporal Attentional RL for equitable decision-making.

Result: EPOPR reduces average outage duration by 3.60% and inequity by 14.19%.

Conclusion: EPOPR successfully balances efficiency and equity in power restoration.

Abstract: The increasing frequency of extreme weather events, such as hurricanes, highlights the urgent need for efficient and equitable power system restoration. Many electricity providers make restoration decisions primarily based on the volume of power restoration requests from each region. However, our data-driven analysis reveals significant disparities in request submission volume, as disadvantaged communities tend to submit fewer restoration requests. This disparity makes the current restoration solution inequitable, leaving these communities vulnerable to extended power outages. To address this, we aim to develop an equity-aware power restoration strategy that balances both restoration efficiency and equity across communities. However, achieving this goal is challenging for two reasons: the difficulty of predicting repair durations under dataset heteroscedasticity, and the tendency of reinforcement learning agents to favor low-uncertainty actions, which can undermine equity. To overcome these challenges, we design a predict-then-optimize framework called EPOPR with two key components: (1) Equity-Conformalized Quantile Regression for uncertainty-aware repair duration prediction, and (2) Spatial-Temporal Attentional RL that adapts to varying uncertainty levels across regions for equitable decision-making. Experimental results show that our EPOPR effectively reduces the average power outage duration by 3.60% and decreases inequity between different communities by 14.19% compared to state-of-the-art baselines.
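For readers unfamiliar with the prediction component, the sketch below shows standard split conformalized quantile regression, on which the paper's equity-aware variant presumably builds. The equity-specific adjustment is not described in the abstract and is not reproduced here. The calibration numbers are invented, and the lower and upper quantile predictions are assumed to come from an already fitted quantile regressor.

```python
import math

def cqr_margin(lo_preds, hi_preds, y_true, alpha=0.25):
    """Calibration step of split conformalized quantile regression.

    Each conformity score measures how far the true value falls outside
    the predicted [lo, hi] band (negative when it lies inside). The margin
    is the ceil((n + 1) * (1 - alpha))-th smallest score."""
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(lo_preds, hi_preds, y_true))
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]

def conformalize(lo, hi, margin):
    """Widen (or shrink, if margin is negative) a predicted interval."""
    return lo - margin, hi + margin

# Hypothetical calibration set: predicted repair-duration bands (hours)
# from a quantile regressor, and the durations actually observed.
lo_cal = [2.0, 3.0, 4.0, 5.0, 6.0, 2.0, 3.0]
hi_cal = [4.0, 5.0, 6.0, 7.0, 8.0, 4.0, 5.0]
y_cal = [3.0, 5.5, 4.0, 8.0, 7.0, 1.5, 4.0]

margin = cqr_margin(lo_cal, hi_cal, y_cal, alpha=0.25)
interval = conformalize(3.0, 5.0, margin)  # adjusted band for a new outage
```

Because the margin is computed from held-out errors, the adjusted band carries a marginal coverage guarantee of at least 1 - alpha even when the underlying quantile model is miscalibrated, which is useful under the heteroscedasticity the abstract mentions.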

[338] Federated Continual Recommendation

Jaehyung Lim, Wonbin Kweon, Woojoo Kim, Junyoung Kim, Seongjin Choi, Dongha Kim, Hwanjo Yu

Main category: cs.LG

TL;DR: F3CRec integrates federated and continual learning for privacy-preserving recommendations in non-stationary data streams.

Details

Motivation: Address the gap between privacy-preserving FedRec and evolving-preference CLRec methods.

Method: Proposes F3CRec with Adaptive Replay Memory (client) and Item-wise Temporal Mean (server).

Result: F3CRec outperforms existing methods in maintaining recommendation quality over time.

Conclusion: F3CRec effectively balances privacy and adaptation in federated continual recommendation.

Abstract: The increasing emphasis on privacy in recommendation systems has led to the adoption of Federated Learning (FL) as a privacy-preserving solution, enabling collaborative training without sharing user data. While Federated Recommendation (FedRec) effectively protects privacy, existing methods struggle with non-stationary data streams, failing to maintain consistent recommendation quality over time. On the other hand, Continual Learning Recommendation (CLRec) methods address evolving user preferences but typically assume centralized data access, making them incompatible with FL constraints. To bridge this gap, we introduce Federated Continual Recommendation (FCRec), a novel task that integrates FedRec and CLRec, requiring models to learn from streaming data while preserving privacy. As a solution, we propose F3CRec, a framework designed to balance knowledge retention and adaptation under the strict constraints of FCRec. F3CRec introduces two key components: Adaptive Replay Memory on the client side, which selectively retains past preferences based on user-specific shifts, and Item-wise Temporal Mean on the server side, which integrates new knowledge while preserving prior information. Extensive experiments demonstrate that F3CRec outperforms existing approaches in maintaining recommendation quality over time in a federated environment.

[339] HCRide: Harmonizing Passenger Fairness and Driver Preference for Human-Centered Ride-Hailing

Lin Jiang, Yu Yang, Guang Wang

Main category: cs.LG

TL;DR: HCRide, a human-centered ride-hailing system, balances passenger fairness, driver preference, and system efficiency using a novel multi-agent reinforcement learning algorithm (Habic).

Details

Motivation: Existing systems prioritize operator revenue, often neglecting passenger and driver satisfaction. This work aims to address this gap by designing a fairer and more driver-friendly system.

Method: Developed HCRide using Habic, a multi-agent reinforcement learning algorithm with a competition mechanism, dynamic Actor network, and Bi-Critic network.

Result: HCRide improved system efficiency by 2.02%, fairness by 5.39%, and driver preference by 10.21% in real-world datasets.

Conclusion: HCRide successfully balances efficiency, fairness, and driver preference, offering a more human-centered approach to ride-hailing.

Abstract: Order dispatch systems play a vital role in ride-hailing services, which directly influence operator revenue, driver profit, and passenger experience. Most existing work focuses on improving system efficiency in terms of operator revenue, which may cause a bad experience for both passengers and drivers. Hence, in this work, we aim to design a human-centered ride-hailing system by considering both passenger fairness and driver preference without compromising the overall system efficiency. However, it is nontrivial to achieve this target due to the potential conflicts between passenger fairness and driver preference since optimizing one may sacrifice the other. To address this challenge, we design HCRide, a Human-Centered Ride-hailing system based on a novel multi-agent reinforcement learning algorithm called Harmonization-oriented Actor-Bi-Critic (Habic), which includes three major components (i.e., a multi-agent competition mechanism, a dynamic Actor network, and a Bi-Critic network) to optimize system efficiency and passenger fairness with driver preference consideration. We extensively evaluate our HCRide using two real-world ride-hailing datasets from Shenzhen and New York City. Experimental results show our HCRide effectively improves system efficiency by 2.02%, fairness by 5.39%, and driver preference by 10.21% compared to state-of-the-art baselines.

[340] Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning

Dravyansh Sharma, Alec Sun

Main category: cs.LG

TL;DR: The paper explores learnability in strategic classification, focusing on genuine agent improvements rather than deception, and extends prior work by characterizing learnability across new axes, including proper and improper learning under natural assumptions.

Details

Motivation: To understand how machine learning performs when classified agents genuinely improve (not deceive) and to extend prior work by addressing open questions and natural settings like Euclidean ball improvements.

Method: Introduces asymmetric minimally consistent concept classes for proper learning in realizable settings, studies learnability under Euclidean ball improvements, and addresses bounded noise models and online learning.

Result: Characterizes proper and improper learning with improvements, resolves open questions, and achieves lower generalization error in bounded noise models and mistake bounds in online learning.

Conclusion: The work advances understanding of strategic classification with genuine agent improvements, providing new theoretical insights and practical learning guarantees.

Abstract: Machine learning is now ubiquitous in societal decision-making, for example in evaluating job candidates or loan applications, and it is increasingly important to take into account how classified agents will react to the learning algorithms. The majority of recent literature on strategic classification has focused on reducing and countering deceptive behaviors by the classified agents, but recent work of Attias et al. identifies surprising properties of learnability when the agents genuinely improve in order to attain the desirable classification, such as smaller generalization error than standard PAC-learning. In this paper we characterize so-called learnability with improvements across multiple new axes. We introduce an asymmetric variant of minimally consistent concept classes and use it to provide an exact characterization of proper learning with improvements in the realizable setting. While prior work studies learnability only under general, arbitrary agent improvement regions, we give positive results for more natural Euclidean ball improvement sets. In particular, we characterize improper learning under a mild generative assumption on the data distribution. We further show how to learn in more challenging settings, achieving lower generalization error under well-studied bounded noise models and obtaining mistake bounds in realizable and agnostic online learning. We resolve open questions posed by Attias et al. for both proper and improper learning.

[341] Unified Flow Matching for Long Horizon Event Forecasting

Xiao Shou

Main category: cs.LG

TL;DR: Proposes a non-autoregressive flow matching framework for modeling long horizon marked event sequences, improving accuracy and efficiency over autoregressive and diffusion-based methods.

Details

Motivation: Existing autoregressive models for temporal point processes are inefficient and error-prone for long-range forecasting.

Method: Uses continuous and discrete flow matching to jointly model inter-event times and event types non-autoregressively.

Result: Outperforms autoregressive and diffusion-based baselines in accuracy and generation efficiency on six benchmarks.

Conclusion: The framework enables coherent long horizon event trajectory generation without sequential decoding.

Abstract: Modeling long horizon marked event sequences is a fundamental challenge in many real-world applications, including healthcare, finance, and user behavior modeling. Existing neural temporal point process models are typically autoregressive, predicting the next event one step at a time, which limits their efficiency and leads to error accumulation in long-range forecasting. In this work, we propose a unified flow matching framework for marked temporal point processes that enables non-autoregressive, joint modeling of inter-event times and event types, via continuous and discrete flow matching. By learning continuous-time flows for both components, our method generates coherent long horizon event trajectories without sequential decoding. We evaluate our model on six real-world benchmarks and demonstrate significant improvements over autoregressive and diffusion-based baselines in both accuracy and generation efficiency.

[342] Multi-Stage Knowledge-Distilled VGAE and GAT for Robust Controller-Area-Network Intrusion Detection

Robert Frenken, Sidra Ghayour Bhatti, Hanqin Zhang, Qadeer Ahmed

Main category: cs.LG

TL;DR: A multi-stage intrusion detection framework for CAN protocol combines unsupervised anomaly detection and supervised graph learning, achieving high accuracy and efficiency.

Details

Motivation: The CAN protocol lacks built-in security, making it vulnerable to cyber-attacks, necessitating robust intrusion detection.

Method: Uses a Variational Graph Autoencoder (VGAE) for anomaly detection and a Knowledge-Distilled Graph Attention Network (KD-GAT) for attack classification, encoding CAN traffic as graph sequences.

Result: Achieves 96% parameter reduction and 16.2% average F1-score improvement over existing methods, excelling on imbalanced datasets.

Conclusion: The framework is effective for securing CAN networks, offering competitive performance and efficiency.

Abstract: The Controller Area Network (CAN) protocol is a standard for in-vehicle communication but remains susceptible to cyber-attacks due to its lack of built-in security. This paper presents a multi-stage intrusion detection framework leveraging unsupervised anomaly detection and supervised graph learning tailored for automotive CAN traffic. Our architecture combines a Variational Graph Autoencoder (VGAE) for structural anomaly detection with a Knowledge-Distilled Graph Attention Network (KD-GAT) for robust attack classification. CAN bus activity is encoded as graph sequences to model temporal and relational dependencies. The pipeline applies VGAE-based selective undersampling to address class imbalance, followed by GAT classification with optional score-level fusion. The compact student GAT achieves 96% parameter reduction compared to the teacher model while maintaining strong predictive performance. Experiments on six public CAN intrusion datasets–Car-Hacking, Car-Survival, and can-train-and-test–demonstrate competitive accuracy and efficiency, with average improvements of 16.2% in F1-score over existing methods, particularly excelling on highly imbalanced datasets with up to 55% F1-score improvements.

[343] Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos

Haoyu Zhang, Shihao Zhang, Ian Colbert, Rayan Saab

Main category: cs.LG

TL;DR: This paper provides the first quantitative error bounds for OPTQ (GPTQ) and Qronos, two leading post-training quantization methods, offering theoretical insights into their performance and practical design choices.

Details

Motivation: Despite OPTQ's widespread use in reducing neural network costs, it lacks rigorous theoretical guarantees. This work aims to fill that gap by analyzing its error bounds and extending the analysis to Qronos.

Method: The study derives non-asymptotic 2-norm error bounds for deterministic and stochastic OPTQ, analyzing its iterative procedure and regularization. It also extends this analysis to Qronos.

Result: The paper justifies practical design choices like feature ordering by norm and provides guidance for parameter selection. Stronger infinity-norm bounds for stochastic OPTQ enable better control over quantization.

Conclusion: The theoretical bounds for OPTQ and Qronos enhance understanding of their performance, supporting their empirical advantages and guiding future applications.

Abstract: Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework, also known as GPTQ, has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ’s iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.

[344] Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment

Aleksander Boruch-Gruszecki, Yangtian Zi, Zixuan Wu, Tejas Oberoi, Carolyn Jane Anderson, Joydeep Biswas, Arjun Guha

Main category: cs.LG

TL;DR: Agnostics is a language-agnostic post-training pipeline for LLMs that improves performance on low-resource languages by judging code behavior, eliminating per-language engineering, and using RL with verifiable rewards.

Details

Motivation: LLMs struggle with low-resource languages due to data shortages and per-language engineering bottlenecks. Agnostics aims to simplify post-training for diverse languages.

Method: Agnostics uses an LLM to rewrite unit tests into I/O format, configures a verifier for compilation/execution, and applies RL with verifiable rewards in a robust execution environment.

Result: Agnostics improves Qwen-3 4B to rival larger models, scales to diverse model families, and sets new SOTA pass@1 results on MultiPL-E and LiveCodeBench for low-resource languages.

Conclusion: Agnostics simplifies RL post-training for any language, with released datasets, code, and configurations to make the process accessible via a YAML file.

Abstract: Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages–Lua, Julia, R, OCaml, and Fortran–Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le} 16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and on a new multi-language version of LiveCodeBench that we introduce. We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
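
The behavior-only verification idea can be illustrated with a minimal sketch. It is not the Agnostics verifier: the function names, the binary all-tests-pass reward, and the whitespace-insensitive comparison are assumptions made for illustration, and the "target language" is simulated by invoking the current Python interpreter.

```python
import subprocess
import sys


def run_io_test(command, stdin_text, expected_stdout, timeout_s=5.0):
    """Run `command`, feed it stdin_text, and compare its stdout to the expectation.

    `command` is an argument list such as ["lua", "solution.lua"]; only the
    program's observable behavior (exit code and stdout) is inspected.
    """
    try:
        result = subprocess.run(
            command,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a program that hangs fails the test
    if result.returncode != 0:
        return False  # crashes and runtime errors fail the test
    # Assumption: surrounding whitespace is ignored when comparing outputs.
    return result.stdout.strip() == expected_stdout.strip()


def reward(command, test_cases):
    """Hypothetical binary reward: 1.0 only if every (stdin, stdout) pair passes."""
    passed = all(run_io_test(command, i, o) for i, o in test_cases)
    return 1.0 if passed else 0.0


# Stand-in for a solution in any language: a one-line program that doubles its input.
DOUBLE_PROGRAM = [sys.executable, "-c", "print(int(input()) * 2)"]

if __name__ == "__main__":
    print(reward(DOUBLE_PROGRAM, [("3\n", "6\n"), ("10\n", "20\n")]))
```

Because the check only touches the command line, stdin, and stdout, switching languages in this sketch means changing the `command` list and nothing else.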

[345] Hilbert Neural Operator: Operator Learning in the Analytic Signal Domain

Saman Pordanesh, Pejman Shahsavari, Hossein Ghadjari

Main category: cs.LG

TL;DR: The paper introduces the Hilbert Neural Operator (HNO), a new neural operator architecture for PDEs, leveraging the Hilbert transform to enhance learning with instantaneous amplitude and phase features.

Details

Motivation: Existing methods like FNO rely on Fourier transforms, which assume periodicity and lack phase-sensitive features. HNO addresses these limitations by incorporating signal processing insights.

Method: HNO maps input signals to their analytic representation via the Hilbert transform, then applies spectral convolution to this representation.

Result: HNO is hypothesized to better model causal, phase-sensitive, and non-stationary systems compared to FNO.

Conclusion: HNO offers a theoretically grounded alternative to FNO, with potential advantages in specific PDE applications.

Abstract: Neural operators have emerged as a powerful, data-driven paradigm for learning solution operators of partial differential equations (PDEs). State-of-the-art architectures, such as the Fourier Neural Operator (FNO), have achieved remarkable success by performing convolutions in the frequency domain, making them highly effective for a wide range of problems. However, this method has some limitations, including the periodicity assumption of the Fourier transform. In addition, there are other ways of analysing a signal, beyond the phase and amplitude perspective, that provide other useful information for learning an effective network. We introduce the Hilbert Neural Operator (HNO), a new neural operator architecture that addresses some of these limitations by incorporating a strong inductive bias from signal processing. HNO operates by first mapping the input signal to its analytic representation via the Hilbert transform, thereby making instantaneous amplitude and phase information explicit features for the learning process. The core learnable operation – a spectral convolution – is then applied to this Hilbert-transformed representation. We hypothesize that this architecture enables HNO to model operators more effectively for causal, phase-sensitive, and non-stationary systems. We formalize the HNO architecture and provide the theoretical motivation for its design, rooted in analytic signal theory.
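
The analytic-signal step that HNO builds on can be sketched as follows. This shows only the standard discrete construction (zero the negative frequencies, double the positive ones), not the HNO architecture or its learnable spectral convolution. The naive O(n^2) DFT is used purely to keep the example dependency-free, and all function names are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for short signals."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def analytic_signal(x):
    """Return the complex analytic signal x + i*H[x] of a real sequence x."""
    n = len(x)
    weights = [0.0] * n          # negative frequencies stay at zero
    weights[0] = 1.0             # keep the DC component once
    if n % 2 == 0:
        weights[n // 2] = 1.0    # keep the Nyquist component once
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        weights[k] = 2.0         # double the positive frequencies
    spectrum = dft(x)
    return idft([spectrum[k] * weights[k] for k in range(n)])


def amplitude_and_phase(x):
    """Instantaneous amplitude and (wrapped) phase of a real sequence."""
    z = analytic_signal(x)
    return [abs(v) for v in z], [cmath.phase(v) for v in z]


if __name__ == "__main__":
    n = 64
    signal = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
    amplitude, phase = amplitude_and_phase(signal)
    print(max(amplitude), min(amplitude))
```

For a pure cosine that completes a whole number of cycles, the instantaneous amplitude is constant and the phase advances linearly, which is the kind of explicit feature the abstract refers to.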

[346] Gaussian mixture layers for neural networks

Sinho Chewi, Philippe Rigollet, Yuling Yan

Main category: cs.LG

TL;DR: The paper explores using Gaussian mixture models and Wasserstein gradient flows to implement dynamics over probability measures, introducing Gaussian mixture (GM) layers in neural networks.

Details

Motivation: To investigate whether dynamics can be directly implemented over probability measures, moving beyond the mean-field theory for neural networks.

Method: Uses Gaussian mixture models and Wasserstein gradient flows to derive training dynamics for probability measures, introducing GM layers.

Result: GM layers achieve comparable test performance to fully connected networks and exhibit distinct behavior even in the mean-field regime.

Conclusion: The proposed GM layers offer a novel approach with unique dynamics, validated through experiments.

Abstract: The mean-field theory for two-layer neural networks considers infinitely wide networks that are linearly parameterized by a probability measure over the parameter space. This nonparametric perspective has significantly advanced both the theoretical and conceptual understanding of neural networks, with substantial efforts made to validate its applicability to networks of moderate width. In this work, we explore the opposite direction, investigating whether dynamics can be directly implemented over probability measures. Specifically, we employ Gaussian mixture models as a flexible and expressive parametric family of distributions together with the theory of Wasserstein gradient flows to derive training dynamics for such measures. Our approach introduces a new type of layer – the Gaussian mixture (GM) layer – that can be integrated into neural network architectures. As a proof of concept, we validate our proposal through experiments on simple classification tasks, where a GM layer achieves test performance comparable to that of a two-layer fully connected network. Furthermore, we examine the behavior of these dynamics and demonstrate numerically that GM layers exhibit markedly different behavior compared to classical fully connected layers, even when the latter are large enough to be considered in the mean-field regime.

[347] Uncertainty Quantification for Surface Ozone Emulators using Deep Learning

Kelsey Doerksen, Yuliya Marchetti, Steven Lu, Kevin Bowman, James Montgomery, Kazuyuki Miyazaki, Yarin Gal, Freddie Kalaitzis

Main category: cs.LG

TL;DR: A deep learning-based U-Net architecture with uncertainty quantification is used to model surface ozone residuals, addressing interpretability gaps in traditional physics-based models.

Details

Motivation: Air pollution, especially surface ozone, poses global health risks, but current models lack interpretability for policy and health decisions.

Method: An uncertainty-aware U-Net with Bayesian and quantile regression predicts MOMO-Chem model’s ozone residuals, tested in North America and Europe.

Result: The method provides regional bias estimates and identifies optimal ground stations for bias correction, with insights on land-use impact.

Conclusion: The approach improves interpretability and decision-making support for ozone pollution modeling, though further validation is needed.

Abstract: Air pollution is a global hazard, and as of 2023, 94% of the world’s population is exposed to unsafe pollution levels. Surface Ozone (O3), an important pollutant, and the drivers of its trends are difficult to model, and traditional physics-based models fall short in their practical use for scales relevant to human-health impacts. Deep Learning-based emulators have shown promise in capturing complex climate patterns, but overall lack the interpretability necessary to support critical decision making for policy changes and public health measures. We implement an uncertainty-aware U-Net architecture to predict the Multi-mOdel Multi-cOnstituent Chemical data assimilation (MOMO-Chem) model’s surface ozone residuals (bias) using Bayesian and quantile regression methods. We demonstrate the capability of our techniques in regional estimation of bias in North America and Europe for June 2019. We highlight the uncertainty quantification (UQ) scores between our two UQ methodologies and discern which ground stations are optimal and sub-optimal candidates for MOMO-Chem bias correction, and evaluate the impact of land-use information in surface ozone residual modeling.
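
As background for the quantile regression branch, the sketch below shows the pinball (quantile) loss, the standard objective for fitting a given quantile, together with a simple interval-coverage check. Whether the authors use exactly this loss or this coverage measure is an assumption; the function names and toy values are illustrative only.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1).

    Under-predictions are weighted by q and over-predictions by (1 - q),
    so minimizing the loss pushes y_pred toward the q-th quantile.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


if __name__ == "__main__":
    residuals = [1.0, -0.5, 2.0, 0.0]   # toy ozone residuals, made up for illustration
    lower = [0.0, -1.0, 0.5, -0.5]      # e.g. predicted 5% quantiles
    upper = [1.5, 0.0, 1.5, 0.5]        # e.g. predicted 95% quantiles
    print(pinball_loss(residuals, upper, 0.95))
    print(interval_coverage(residuals, lower, upper))
```

Pairing a low and a high quantile prediction gives a prediction interval, and comparing its empirical coverage with the nominal level is one common way to score an uncertainty estimate.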

[348] Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates

Kelsey Doerksen, Yuliya Marchetti, Kevin Bowman, Steven Lu, James Montgomery, Yarin Gal, Freddie Kalaitzis, Kazuyuki Miyazaki

Main category: cs.LG

TL;DR: A 2D CNN-based method improves surface ozone modeling by estimating model bias, outperforming traditional ML and incorporating satellite data for better urban-scale insights.

Details

Motivation: Air pollution, especially surface ozone, is a major health risk, but current models struggle at human-relevant scales, limiting policy use.

Method: A 2D Convolutional Neural Network estimates ozone model bias from MOMO-Chem residuals, tested in North America and Europe, with added satellite land use data.

Result: The CNN outperforms traditional ML in capturing model residuals, and satellite data enhances estimates, improving urban-scale ozone understanding.

Conclusion: This approach advances ozone modeling for health impacts and policy, highlighting urban-scale drivers of ozone bias.

Abstract: Air pollution is the world’s largest environmental risk factor for human disease and premature death, resulting in more than 6 million premature deaths in 2019. Currently, there is still a challenge to model one of the most important air pollutants, surface ozone, particularly at scales relevant for human health impacts, with the drivers of global ozone trends at these scales largely unknown, limiting the practical use of physics-based models. We employ a 2D Convolutional Neural Network based architecture that estimates surface ozone MOMO-Chem model residuals, referred to as model bias. We demonstrate the potential of this technique in North America and Europe, highlighting its ability to better capture physical model residuals compared to a traditional machine learning method. We assess the impact of incorporating land use information from high-resolution satellite imagery to improve model estimates. Importantly, we discuss how our results can improve our scientific understanding of the factors impacting ozone bias at urban scales that can be used to improve environmental policy.

[349] Retrieval-Augmented Water Level Forecasting for Everglades

Rahuul Rangaraj, Jimeng Shi, Rajendra Paudel, Giri Narasimhan, Yanzhao Wu

Main category: cs.LG

TL;DR: The paper introduces Retrieval-Augmented Forecasting (RAF) for water level forecasting in hydrology, improving accuracy by leveraging historical data without retraining.

Details

Motivation: Accurate water level forecasting is vital for ecosystem management, but existing deep learning models struggle with generalization and adaptation in hydrology.

Method: Proposes RAF, a framework that retrieves analogous historical hydrological data to enhance forecasting. Compares similarity-based and mutual information-based RAF methods.

Result: RAF significantly improves forecasting accuracy in real-world Everglades data.

Conclusion: RAF shows promise for hydrology and encourages adaptive AI adoption in ecosystem management.

Abstract: Accurate water level forecasting is crucial for managing ecosystems such as the Everglades, a subtropical wetland vital for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent advances in deep learning, particularly time series foundation models, have demonstrated success in general-domain forecasting, their application in hydrology remains underexplored. Furthermore, they often struggle to generalize across diverse unseen datasets and domains, due to the lack of effective mechanisms for adaptation. To address this gap, we introduce Retrieval-Augmented Forecasting (RAF) into the hydrology domain, proposing a framework that retrieves historically analogous multivariate hydrological episodes to enrich the model input before forecasting. By maintaining an external archive of past observations, RAF identifies and incorporates relevant patterns from historical data, thereby enhancing contextual awareness and predictive accuracy without requiring task-specific retraining or fine-tuning of the model. Furthermore, we explore and compare both similarity-based and mutual information-based RAF methods. We conduct a comprehensive evaluation on real-world data from the Everglades, demonstrating that the RAF framework yields substantial improvements in water level forecasting accuracy. This study highlights the potential of RAF approaches in environmental hydrology and paves the way for broader adoption of adaptive AI methods by domain experts in ecosystem management. The code and data are available at https://github.com/rahuul2992000/WaterRAF.
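
A similarity-based retrieval step of the kind described above might look roughly like the sketch below. It is a simplification under stated assumptions, not the WaterRAF code: the series is univariate rather than multivariate, similarity is plain Euclidean distance, and the retrieved episodes are simply prepended to the query window. All names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_analogs(query, archive, horizon, k=2):
    """Return the k archived (context, continuation) pairs closest to `query`.

    `archive` is a list of past observations; each candidate context has the
    same length as `query` and is followed by `horizon` known future values.
    """
    window = len(query)
    candidates = []
    for start in range(len(archive) - window - horizon + 1):
        context = archive[start:start + window]
        future = archive[start + window:start + window + horizon]
        candidates.append((euclidean(query, context), context, future))
    candidates.sort(key=lambda c: c[0])  # smallest distance first
    return [(context, future) for _, context, future in candidates[:k]]


def augment_input(query, analogs):
    """Prepend each retrieved episode (context plus continuation) to the query."""
    augmented = []
    for context, future in analogs:
        augmented.extend(context)
        augmented.extend(future)
    augmented.extend(query)
    return augmented


if __name__ == "__main__":
    water_levels = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]  # toy archive
    recent = [2.0, 3.0, 4.0]
    analogs = retrieve_analogs(recent, water_levels, horizon=2, k=1)
    print(augment_input(recent, analogs))
```

The forecasting model itself is untouched in this sketch. Only its input grows, which is consistent with the abstract's claim that no retraining or fine-tuning is required.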

[350] Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection

Jovana Kljajic, John M. O’Toole, Robert Hogan, Tamara Skoric

Main category: cs.LG

TL;DR: The paper evaluates performance metrics for neonatal seizure detection models, proposing best practices to address biases and inconsistencies.

Details

Motivation: Current metrics for evaluating neonatal seizure detection models are inconsistent and biased, leading to unreliable claims about AI performance.

Method: The study assesses standard metrics, consensus strategies, and human-expert equivalence tests using real and synthetic seizure annotations under varying conditions.

Result: Matthews and Pearson’s correlation coefficients performed better under class imbalance, and the multi-rater Turing test using Fleiss' kappa best captured expert-level AI performance.

Conclusion: The paper recommends specific reporting practices to ensure thorough and honest evaluation of AI methods for clinical validation.

Abstract: Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews and Pearson’s correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus types are sensitive to the number of raters and agreement level among them. Among human-expert level equivalence tests, the multi-rater Turing test using Fleiss' kappa best captured expert-level AI performance. We recommend reporting: (1) at least one balanced metric, (2) sensitivity, specificity, PPV and NPV, (3) multi-rater Turing test results using Fleiss' kappa, and (4) all of the above on a held-out validation set. This proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection.
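
Two of the quantities recommended above have short closed forms, sketched here from their textbook definitions. The code covers the Matthews correlation coefficient and Fleiss' kappa only; the multi-rater Turing test procedure itself is not reproduced, and the helper names and toy counts are illustrative.

```python
import math


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    assigning item i to category j; every row must sum to the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall share of all ratings that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items        # mean observed agreement
    p_exp = sum(p * p for p in p_cat)    # agreement expected by chance
    if p_exp >= 1.0:
        return float("nan")  # undefined when every rating falls in one category
    return (p_bar - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # A detector that never flags a seizure on 5% prevalence data:
    # accuracy is 0.95, yet the MCC is 0.
    print(matthews_corrcoef(tp=0, fp=0, fn=5, tn=95))
    # Three raters, two categories (seizure / non-seizure), four epochs.
    print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```

The first call illustrates why a balanced metric is recommended under class imbalance: a trivial all-negative detector scores high on accuracy while the MCC exposes that it carries no information.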

[351] Sensitivity of Stability: Theoretical & Empirical Analysis of Replicability for Adaptive Data Selection in Transfer Learning

Prabhav Singh, Jessica Sorrell

Main category: cs.LG

TL;DR: The paper analyzes replicability in transfer learning, introducing a measure called selection sensitivity ($\Delta_Q$) to quantify the trade-off between adaptation effectiveness and consistency. It shows that replicability failure increases with selection sensitivity and decreases with sample size, validated through experiments on the MultiNLI corpus.

Details

Motivation: To understand the reliability of transfer learning adaptations, especially with dynamic data selection strategies, and to provide guidelines for balancing performance and replicability.

Method: Theoretical framework for selection sensitivity ($\Delta_Q$) and empirical validation using six adaptive selection strategies on the MultiNLI corpus.

Result: Highly adaptive strategies improve performance but increase replicability failure, while less adaptive ones keep failure rates low. Source domain pretraining mitigates failure rates by up to 30%.

Conclusion: The study offers guidelines for practitioners to manage the performance-replicability trade-off and emphasizes the need for replicability-aware design in transfer learning.

Abstract: The widespread adoption of transfer learning has revolutionized machine learning by enabling efficient adaptation of pre-trained models to new domains. However, the reliability of these adaptations remains poorly understood, particularly when using adaptive data selection strategies that dynamically prioritize training examples. We present a comprehensive theoretical and empirical analysis of replicability in transfer learning, introducing a mathematical framework that quantifies the fundamental trade-off between adaptation effectiveness and result consistency. Our key contribution is the formalization of selection sensitivity ($\Delta_Q$), a measure that captures how adaptive selection strategies respond to perturbations in training data. We prove that the replicability failure probability, i.e., the likelihood that two independent training runs produce models differing in performance by more than a threshold, increases quadratically with selection sensitivity while decreasing exponentially with sample size. Through extensive experiments on the MultiNLI corpus using six adaptive selection strategies, ranging from uniform sampling to gradient-based selection, we demonstrate that this theoretical relationship holds precisely in practice. Our results reveal that highly adaptive strategies like gradient-based and curriculum learning achieve superior task performance but suffer from high replicability failure rates, while less adaptive approaches maintain failure rates below 7%. Crucially, we show that source domain pretraining provides a powerful mitigation mechanism, reducing failure rates by up to 30% while preserving performance gains. These findings establish principled guidelines for practitioners to navigate the performance-replicability trade-off and highlight the need for replicability-aware design in modern transfer learning systems.

[352] Advancing Hate Speech Detection with Transformers: Insights from the MetaHate

Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi

Main category: cs.LG

TL;DR: The paper explores transformer-based models for hate speech detection, achieving high performance with ELECTRA, while identifying challenges like sarcasm and coded language.

Details

Motivation: Hate speech on social media has serious real-world impacts, necessitating robust automated detection methods.

Method: Evaluated transformer models (BERT, RoBERTa, GPT-2, ELECTRA) on the MetaHate dataset (1.2M samples).

Result: Fine-tuned ELECTRA achieved the highest F1 score (0.8980). Challenges include sarcasm and label noise.

Conclusion: Transformer models, especially ELECTRA, are effective for hate speech detection, but challenges remain in nuanced cases.

Abstract: Hate speech is a widespread and harmful form of online discourse, encompassing slurs and defamatory posts that can have serious social, psychological, and sometimes physical impacts on targeted individuals and communities. As social media platforms such as X (formerly Twitter), Facebook, Instagram, Reddit, and others continue to facilitate widespread communication, they also become breeding grounds for hate speech, which has increasingly been linked to real-world hate crimes. Addressing this issue requires the development of robust automated methods to detect hate speech in diverse social media environments. Deep learning approaches, such as vanilla recurrent neural networks (RNNs), long short-term memory (LSTM), and convolutional neural networks (CNNs), have achieved good results, but are often limited by issues such as long-term dependencies and inefficient parallelization. This study presents a comprehensive exploration of transformer-based models for hate speech detection using the MetaHate dataset, a meta-collection of 36 datasets with 1.2 million social media samples. We evaluate multiple state-of-the-art transformer models, including BERT, RoBERTa, GPT-2, and ELECTRA, with fine-tuned ELECTRA achieving the highest performance (F1 score: 0.8980). We also analyze classification errors, revealing challenges with sarcasm, coded language, and label noise.

[353] ALScope: A Unified Toolkit for Deep Active Learning

Chenkai Wu, Yuanyuan Qi, Xiaohao Yang, Jueqing Lu, Gang Liu, Wray Buntine, Lan Du

Main category: cs.LG

TL;DR: ALScope is a new Deep Active Learning (DAL) platform for classification tasks, integrating diverse datasets and algorithms to evaluate performance under varied conditions like distribution shifts and data imbalance.

Details

Motivation: The lack of a unified platform for fair and systematic evaluation of DAL algorithms under diverse conditions (e.g., distribution shifts, data imbalance) motivated the creation of ALScope.

Method: ALScope integrates 10 datasets from CV and NLP, and 21 DAL algorithms, supporting flexible configuration of experimental factors like OOD ratio and class imbalance.

Result: Experiments show DAL performance varies by domain and setting, with room for improvement in non-standard scenarios, and trade-offs between performance and selection time.

Conclusion: ALScope enables comprehensive DAL evaluation, highlighting challenges and trade-offs, and calls for further research in non-standard scenarios.

Abstract: Deep Active Learning (DAL) reduces annotation costs by selecting the most informative unlabeled samples during training. As real-world applications become more complex, challenges stemming from distribution shifts (e.g., open-set recognition) and data imbalance have gained increasing attention, prompting the development of numerous DAL algorithms. However, the lack of a unified platform has hindered fair and systematic evaluation under diverse conditions. Therefore, we present a new DAL platform ALScope for classification tasks, integrating 10 datasets from computer vision (CV) and natural language processing (NLP), and 21 representative DAL algorithms, including both classical baselines and recent approaches designed to handle challenges such as distribution shifts and data imbalance. This platform supports flexible configuration of key experimental factors, ranging from algorithm and dataset choices to task-specific factors like out-of-distribution (OOD) sample ratio, and class imbalance ratio, enabling comprehensive and realistic evaluation. We conduct extensive experiments on this platform under various settings. Our findings show that: (1) DAL algorithms’ performance varies significantly across domains and task settings; (2) in non-standard scenarios such as imbalanced and open-set settings, DAL algorithms show room for improvement and require further investigation; and (3) some algorithms achieve good performance, but require significantly longer selection time.

[354] Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset

Vladimir Frants, Sos Agaian

Main category: cs.LG

TL;DR: The paper proposes QHNet, a model-agnostic defense against adversarial attacks on weather-removal deep-learning models, using novel quaternion-based blocks and a new dataset (AWCVD).

Details

Motivation: Deep-learning models for weather removal are vulnerable to adversarial attacks, and existing defenses are costly or impractical.

Method: Introduces QHNet with Quaternion Hadamard Denoising Convolutional Block (QHDCB) and Quaternion Denoising Residual Block (QDRB), leveraging polynomial thresholding in an encoder-decoder architecture.

Result: QHNet outperforms state-of-the-art techniques in robustness against adversarial attacks, validated by PSNR and SSIM metrics.

Conclusion: QHNet effectively defends against adversarial attacks on weather-removal models, with plans to release the source code and dataset.

Abstract: This paper addresses the vulnerability of deep-learning models designed for rain, snow, and haze removal. Despite enhancing image quality in adverse weather, these models are susceptible to adversarial attacks that compromise their effectiveness. Traditional defenses such as adversarial training and model distillation often require extensive retraining, making them costly and impractical for real-world deployment. While denoising and super-resolution techniques can aid image classification models, they impose high computational demands and introduce visual artifacts that hinder image processing tasks. We propose a model-agnostic defense against first-order white-box adversarial attacks using the Quaternion-Hadamard Network (QHNet) to tackle these challenges. White-box attacks are particularly difficult to defend against since attackers have full access to the model’s architecture, weights, and training procedures. Our defense introduces the Quaternion Hadamard Denoising Convolutional Block (QHDCB) and the Quaternion Denoising Residual Block (QDRB), leveraging polynomial thresholding. QHNet incorporates these blocks within an encoder-decoder architecture, enhanced by feature refinement, to effectively neutralize adversarial noise. Additionally, we introduce the Adversarial Weather Conditions Vision Dataset (AWCVD), created by applying first-order gradient attacks on state-of-the-art weather removal techniques in scenarios involving haze, rain streaks, and snow. Using PSNR and SSIM metrics, we demonstrate that QHNet significantly enhances the robustness of low-level computer vision models against adversarial attacks compared with state-of-the-art denoising and super-resolution techniques. The source code and dataset will be released alongside the final version of this paper.

[355] Self-Error Adjustment: Theory and Practice of Balancing Individual Performance and Diversity in Ensemble Learning

Rui Zou

Main category: cs.LG

TL;DR: The paper introduces Self-Error Adjustment (SEA), a novel ensemble learning framework that decomposes errors into individual performance and diversity terms, enabling precise control over the accuracy-diversity trade-off. SEA outperforms traditional methods like NCL, offering broader adjustment ranges and tighter theoretical bounds.

Details

Motivation: Traditional ensemble methods like Bagging and Boosting lack precise control over the accuracy-diversity trade-off, while NCL has loose theoretical bounds. SEA aims to address these limitations by providing a more flexible and theoretically grounded approach.

Method: SEA decomposes ensemble errors into individual performance and diversity terms, introducing an adjustable parameter in the loss function for precise control. This allows finer regulation of ensemble performance.

Result: SEA outperforms baseline methods (e.g., NCL) on regression and classification tasks, offering more consistent diversity changes and superior performance. Theoretical bounds are tighter and empirically validated.

Conclusion: SEA provides a flexible, theoretically sound framework for ensemble learning, improving performance and control over the accuracy-diversity trade-off compared to existing methods.

Abstract: Ensemble learning boosts performance by aggregating predictions from multiple base learners. A core challenge is balancing individual learner accuracy with diversity. Traditional methods like Bagging and Boosting promote diversity through randomness but lack precise control over the accuracy-diversity trade-off. Negative Correlation Learning (NCL) introduces a penalty to manage this trade-off but suffers from loose theoretical bounds and limited adjustment range. To overcome these limitations, we propose a novel framework called Self-Error Adjustment (SEA), which decomposes ensemble errors into two distinct components: individual performance terms, representing the self-error of each base learner, and diversity terms, reflecting interactions among learners. This decomposition allows us to introduce an adjustable parameter into the loss function, offering precise control over the contribution of each component, thus enabling finer regulation of ensemble performance. Compared to NCL and its variants, SEA provides a broader range of effective adjustments and more consistent changes in diversity. Furthermore, we establish tighter theoretical bounds for adjustable ensemble methods and validate them through empirical experiments. Experimental results on several public regression and classification datasets demonstrate that SEA consistently outperforms baseline methods across all tasks. Ablation studies confirm that SEA offers more flexible adjustment capabilities and superior performance in fine-tuning strategies.
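
The abstract does not give SEA's decomposition or its loss function, so the following is only a sketch of the general idea of splitting ensemble error into an individual term and a diversity term. It uses the classical ambiguity decomposition for an averaging ensemble under squared error, where ensemble error equals mean individual error minus mean spread around the ensemble prediction. The function names and the weight `lam` are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def ambiguity_decomposition(preds, target):
    """Split the squared error of an averaging ensemble into two terms.

    Classical identity (not SEA's decomposition, which the abstract omits):
        (mean(preds) - target)^2 = individual - diversity
    """
    ensemble = fmean(preds)
    # Mean squared error of the base learners taken one by one.
    individual = fmean([(p - target) ** 2 for p in preds])
    # Mean squared spread of the base learners around the ensemble output.
    diversity = fmean([(p - ensemble) ** 2 for p in preds])
    ensemble_error = (ensemble - target) ** 2
    return ensemble_error, individual, diversity


def adjusted_loss(preds, target, lam):
    """Hypothetical adjustable objective: individual - lam * diversity.

    lam = 1 recovers the ensemble error; lam = 0 trains learners independently.
    """
    _, individual, diversity = ambiguity_decomposition(preds, target)
    return individual - lam * diversity


if __name__ == "__main__":
    err, ind, div = ambiguity_decomposition([1.0, 2.0, 4.0], target=2.0)
    print(err, ind - div)  # the two values agree up to rounding
```

Varying `lam` between 0 and 1 moves the objective from purely individual accuracy toward the full ensemble error, which is the kind of trade-off control the abstract describes.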

[356] Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization

Wei Liu, Anweshit Panda, Ujwal Pandey, Christopher Brissette, Yikang Shen, George M. Slota, Naigang Wang, Jie Chen, Yangyang Xu

Main category: cs.LG

TL;DR: The paper introduces two compressed decentralized algorithms for nonconvex stochastic optimization, combining momentum and message-compression techniques. The first is a compressed decentralized adaptive method for bounded gradients, and the second is a compressed decentralized heavy-ball method for data heterogeneity. Both achieve optimal convergence rates and outperform state-of-the-art methods in training DNNs and Transformers.

Details

Motivation: The motivation is to address the challenge of combining momentum acceleration and compressed communication in decentralized algorithms while controlling consensus, compression, and momentum gradient errors.

Method: Two algorithms are proposed: (1) a compressed decentralized adaptive method for bounded gradients, and (2) a compressed decentralized heavy-ball method with gradient tracking for data heterogeneity. Both use momentum and compression techniques.

Result: Both algorithms achieve optimal convergence rates, linear speedup, and topology-independent parameters within error tolerance. They outperform existing methods in training DNNs and Transformers.

Conclusion: The proposed algorithms effectively combine momentum and compression, achieving theoretical and empirical superiority in decentralized nonconvex stochastic optimization.

Abstract: In this paper, we design two compressed decentralized algorithms for solving nonconvex stochastic optimization under two different scenarios. Both algorithms adopt a momentum technique to achieve fast convergence and a message-compression technique to save communication costs. Though momentum acceleration and compressed communication have been used in the literature, it is highly nontrivial to theoretically prove the effectiveness of their composition in a decentralized algorithm that can maintain the benefits of both sides, because of the need to simultaneously control the consensus error, the compression error, and the bias from the momentum gradient. For the scenario where gradients are bounded, our proposal is a compressed decentralized adaptive method. To the best of our knowledge, this is the first decentralized adaptive stochastic gradient method with compressed communication. For the scenario of data heterogeneity without bounded gradients, our proposal is a compressed decentralized heavy-ball method, which applies a gradient tracking technique to address the challenge of data heterogeneity. Notably, both methods achieve an optimal convergence rate, and they can achieve linear speedup and adopt topology-independent algorithmic parameters within a certain regime of the user-specified error tolerance. Superior empirical performance is observed over state-of-the-art methods on training deep neural networks (DNNs) and Transformers.

[357] MENDR: Manifold Explainable Neural Data Representations

Matthew Chen, Micky Nnamdi, Justin Shao, Andrew Hornback, Hongyun Huang, Ben Tamo, Yishan Zhong, Benoit Marteau, Wenqi Shi, May Dongmei Wang

Main category: cs.LG

TL;DR: MENDR is a novel EEG foundation model using Riemannian Manifold Transformer and wavelet transforms for interpretable, efficient EEG analysis.

Details

Motivation: Current EEG foundation models lack transparency and interpretability, hindering clinical integration.

Method: MENDR uses a Riemannian Manifold Transformer and wavelet packet transforms to create symmetric positive definite matrix embeddings.

Result: MENDR achieves near state-of-the-art performance with fewer parameters and enhances interpretability.

Conclusion: MENDR offers a promising, efficient, and interpretable approach for EEG analysis in clinical settings.

Abstract: Foundation models for electroencephalography (EEG) signals have recently demonstrated success in learning generalized representations of EEGs, outperforming specialized models in various downstream tasks. However, many of these models lack transparency in their pretraining dynamics and offer limited insight into how well EEG information is preserved within their embeddings. For successful clinical integration, EEG foundation models must ensure transparency in pretraining, downstream fine-tuning, and the interpretability of learned representations. Current approaches primarily operate in the temporal domain, overlooking advancements in digital signal processing that enable the extraction of deterministic and traceable features, such as wavelet-based representations. We propose MENDR (Manifold Explainable Neural Data Representations), a filter bank-based EEG foundation model built on a novel Riemannian Manifold Transformer architecture to resolve these issues. MENDR learns symmetric positive definite matrix embeddings of EEG signals and is pretrained on a large corpus comprising over 4,000 hours of EEG data, decomposed via discrete wavelet packet transforms into multi-resolution coefficients. MENDR significantly enhances interpretability by visualizing symmetric positive definite embeddings as geometric ellipsoids and supports accurate reconstruction of EEG signals from learned embeddings. Evaluations across multiple clinical EEG tasks demonstrate that MENDR achieves near state-of-the-art performance with substantially fewer parameters, underscoring its potential for efficient, interpretable, and clinically applicable EEG analysis.

[358] RCUKF: Data-Driven Modeling Meets Bayesian Estimation

Kumar Anurag, Kasra Azizi, Francesco Sorrentino, Wenbin Wan

Main category: cs.LG

TL;DR: A novel framework, RCUKF, combines reservoir computing and unscented Kalman filtering for accurate modeling of complex systems.

Details

Motivation: Accurate modeling of complex systems is challenging, especially where nominal models fail.

Method: Integrates reservoir computing (RC) for data-driven modeling with unscented Kalman filter (UKF) for Bayesian estimation.

Result: Demonstrated effectiveness on benchmark problems and real-time vehicle trajectory estimation.

Conclusion: RCUKF provides a robust solution for modeling complex systems by combining data-driven and Bayesian approaches.

Abstract: Accurate modeling is crucial in many engineering and scientific applications, yet obtaining a reliable process model for complex systems is often challenging. To address this challenge, we propose a novel framework, reservoir computing with unscented Kalman filtering (RCUKF), which integrates data-driven modeling via reservoir computing (RC) with Bayesian estimation through the unscented Kalman filter (UKF). The RC component learns the nonlinear system dynamics directly from data, serving as a surrogate process model in the UKF prediction step to generate state estimates in high-dimensional or chaotic regimes where nominal mathematical models may fail. Meanwhile, the UKF measurement update integrates real-time sensor data to correct potential drift in the data-driven model. We demonstrate the effectiveness of RCUKF on well-known benchmark problems and a real-time vehicle trajectory estimation task in a high-fidelity simulation environment.

Menghua Jiang, Yuxia Lin, Baoliang Chen, Haifeng Hu, Yuncheng Jiang, Sijie Mai

Main category: cs.LG

TL;DR: The paper proposes MMCI, a model using causal theory to address spurious correlations in multimodal sentiment analysis, improving generalization.

Details

Motivation: Existing MSA methods rely on spurious correlations, undermining generalization. The goal is to mitigate this by leveraging causal theory.

Method: MMCI models multimodal inputs as a multi-relational graph, disentangles causal and shortcut features using attention, and applies backdoor adjustment for stable predictions.

Result: Experiments show MMCI suppresses biases and improves performance on standard and OOD datasets.

Conclusion: MMCI effectively addresses spurious correlations in MSA, enhancing generalization and robustness.

Abstract: Multimodal sentiment analysis (MSA) aims to understand human emotions by integrating information from multiple modalities, such as text, audio, and visual data. However, existing methods often suffer from spurious correlations both within and across modalities, leading models to rely on statistical shortcuts rather than true causal relationships, thereby undermining generalization. To mitigate this issue, we propose a Multi-relational Multimodal Causal Intervention (MMCI) model, which leverages the backdoor adjustment from causal theory to address the confounding effects of such shortcuts. Specifically, we first model the multimodal inputs as a multi-relational graph to explicitly capture intra- and inter-modal dependencies. Then, we apply an attention mechanism to separately estimate and disentangle the causal features and shortcut features corresponding to these intra- and inter-modal relations. Finally, by applying the backdoor adjustment, we stratify the shortcut features and dynamically combine them with the causal features to encourage MMCI to produce stable predictions under distribution shifts. Extensive experiments on several standard MSA datasets and out-of-distribution (OOD) test sets demonstrate that our method effectively suppresses biases and improves performance.

[360] R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu

Main category: cs.LG

TL;DR: R-Zero is a fully autonomous framework for self-evolving LLMs, eliminating the need for human-curated data by using two co-evolving models, Challenger and Solver, to generate and solve tasks.

Details

Motivation: Existing LLM training relies on human-curated tasks, limiting progress toward super-intelligence. R-Zero aims to overcome this by enabling autonomous self-improvement.

Method: R-Zero uses two models: a Challenger to propose tasks and a Solver to solve them. They co-evolve, creating a self-improving curriculum without external data.

Result: R-Zero improves reasoning in LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math and +7.54 on general reasoning benchmarks.

Conclusion: R-Zero demonstrates scalable, autonomous LLM evolution, advancing AI beyond human-dependent training methods.

Abstract: Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.

[361] SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models

Dai Do, Manh Nguyen, Svetha Venkatesh, Hung Le

Main category: cs.LG

TL;DR: SPaRFT is a self-paced learning framework for LLMs that optimizes data selection and timing, achieving high accuracy with fewer samples.

Details

Motivation: Current RL-based fine-tuning for LLMs is resource-intensive, and heuristic-driven methods lack scalability. SPaRFT addresses this by enabling efficient, capability-aware learning.

Method: SPaRFT uses cluster-based data reduction to partition data by semantics/difficulty and a multi-armed bandit to adaptively allocate training samples.

Result: SPaRFT matches or outperforms baselines with up to 100x fewer samples, demonstrating the effectiveness of data clustering and adaptive selection.

Conclusion: Performance-driven curricula can unlock LLM reasoning abilities with minimal resources, as shown by SPaRFT.

Abstract: Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical for smaller models. Current approaches to curriculum learning or data selection are largely heuristic-driven or demand extensive computational resources, limiting their scalability and generalizability. We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained through optimizing which data to use and when. First, we apply cluster-based data reduction to partition training data by semantics and difficulty, extracting a compact yet diverse subset that reduces redundancy. Then, a multi-armed bandit treats data clusters as arms, optimized to allocate training samples based on the model's current performance. Experiments across multiple reasoning benchmarks show that SPaRFT achieves comparable or better accuracy than state-of-the-art baselines while using up to 100x fewer samples. Ablation studies and analyses further highlight the importance of both data clustering and adaptive selection. Our results demonstrate that carefully curated, performance-driven training curricula can unlock strong reasoning abilities in LLMs with minimal resources.
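
The abstract says a multi-armed bandit treats data clusters as arms but does not name the bandit algorithm or the reward signal. The sketch below substitutes the standard UCB1 rule purely for illustration. The class `ClusterBandit`, the function `simulate`, and the fixed per-cluster rewards are all hypothetical stand-ins, not SPaRFT's implementation.

```python
import math


class ClusterBandit:
    """UCB1 over data clusters (an assumed choice; the abstract names no algorithm)."""

    def __init__(self, n_clusters):
        self.counts = [0] * n_clusters   # times each cluster was sampled
        self.means = [0.0] * n_clusters  # running mean reward per cluster

    def select(self):
        # Sample every cluster once before applying the UCB1 score.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean reward.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(cluster_rewards, rounds):
    """Run the bandit against fixed, hypothetical per-cluster rewards."""
    bandit = ClusterBandit(len(cluster_rewards))
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, cluster_rewards[arm])
    return bandit.counts


if __name__ == "__main__":
    # Cluster 1 stands in for the cluster that is currently most useful.
    print(simulate([0.2, 0.8, 0.5], rounds=200))
```

In an actual training loop the reward would come from the model being trained, for example a measure of improvement after training on a batch from the chosen cluster; that choice is an assumption here.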

[362] Will You Be Aware? Eye Tracking-Based Modeling of Situational Awareness in Augmented Reality

Zhehan Qu, Tianyi Hu, Christian Fronk, Maria Gorlatova

Main category: cs.LG

TL;DR: The paper explores how AR systems in CPR training can cause cognitive tunneling, reducing situational awareness (SA). It introduces an AR app for CPR feedback and evaluates SA using eye tracking, proposing a graph neural network (FixGraphPool) to predict SA with high accuracy.

Details

Motivation: AR systems improve task performance but may compromise SA in safety-critical scenarios like CPR. The study aims to understand and mitigate this risk.

Method: Developed an AR app for CPR guidance, conducted a user study with simulated incidents, and used eye tracking to analyze SA. Proposed FixGraphPool, a graph neural network, to predict SA from gaze data.

Result: Higher SA correlated with specific eye movement patterns. FixGraphPool achieved 83.0% accuracy in predicting SA, outperforming other models.

Conclusion: Eye tracking and FixGraphPool effectively model SA in AR, aiding the design of safer AR systems for critical tasks like CPR.

Abstract: Augmented Reality (AR) systems, while enhancing task performance through real-time guidance, pose risks of inducing cognitive tunneling, a hyperfocus on virtual content that compromises situational awareness (SA) in safety-critical scenarios. This paper investigates SA in AR-guided cardiopulmonary resuscitation (CPR), where responders must balance effective compressions with vigilance to unpredictable hazards (e.g., patient vomiting). We developed an AR app on a Magic Leap 2 that overlays real-time CPR feedback (compression depth and rate) and conducted a user study with simulated unexpected incidents (e.g., bleeding) to evaluate SA, in which SA metrics were collected via observation and questionnaires administered during freeze-probe events. Eye tracking analysis revealed that higher SA levels were associated with greater saccadic amplitude and velocity, and with reduced proportion and frequency of fixations on virtual content. To predict SA, we propose FixGraphPool, a graph neural network that structures gaze events (fixations, saccades) into spatiotemporal graphs, effectively capturing dynamic attentional patterns. Our model achieved 83.0% accuracy (F1=81.0%), outperforming feature-based machine learning and state-of-the-art time-series models by leveraging domain knowledge and spatial-temporal information encoded in ET data. These findings demonstrate the potential of eye tracking for SA modeling in AR and highlight its utility in designing AR systems that ensure user safety and situational awareness.

[363] Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of Forgetting

Jinhyeok Jang, Jaehong Kim, Jung Uk Kim

Main category: cs.LG

TL;DR: KNOW prediction leverages structured forgetting and inversion to enhance pre-trained weights, outperforming naive fine-tuning and simple weight prediction.

Details

Motivation: Improve pre-trained weights by encapsulating more knowledge beyond the given dataset, especially in data-scarce scenarios.

Method: Introduces KNOW prediction, using structured forgetting and meta-learning (KNOWN hyper-model) to predict knowledge-enriched weights.

Result: KNOW prediction consistently outperforms naive fine-tuning and simple weight prediction across diverse datasets and architectures.

Conclusion: Reinterprets forgetting dynamics to enhance knowledge transfer in deep learning.

Abstract: Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our KNowledge Overflowed Weights Nowcaster (KNOWN) acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms naive fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer in deep learning.

[364] TANGO: Graph Neural Dynamics via Learned Energy and Tangential Flows

Moshe Eliasof, Eldad Haber, Carola-Bibiane Schönlieb

Main category: cs.LG

TL;DR: TANGO is a graph representation learning framework using energy landscapes and descent dynamics, ensuring stability and convergence through a Lyapunov function. It introduces tangential evolution for flexibility and achieves strong performance in graph tasks.

Details

Motivation: To address challenges like oversquashing and ill-conditioned energy regions in graph learning by combining energy-based dynamics with flexible tangential evolution.

Method: Uses a learnable Lyapunov function for energy reduction and introduces a tangential component via message passing to evolve features while maintaining energy.

Result: Achieves strong performance in node and graph classification/regression tasks, demonstrating effective signal propagation and stability.

Conclusion: TANGO’s joint learning of energy functions and tangential flows enhances graph neural networks, offering flexibility and robustness.

Abstract: We introduce TANGO, a dynamical-systems-inspired framework for graph representation learning that governs node feature evolution through a learned energy landscape and its associated descent dynamics. At the core of our approach is a learnable Lyapunov function over node embeddings, whose gradient defines an energy-reducing direction that guarantees convergence and stability. To enhance flexibility while preserving the benefits of energy-based dynamics, we incorporate a novel tangential component, learned via message passing, that evolves features while maintaining the energy value. This decomposition into orthogonal flows of energy gradient descent and tangential evolution yields a flexible form of graph dynamics, and enables effective signal propagation even in flat or ill-conditioned energy regions that often appear in graph learning. Our method mitigates oversquashing and is compatible with different graph neural network backbones. Empirically, TANGO achieves strong performance across a diverse set of node and graph classification and regression benchmarks, demonstrating the effectiveness of jointly learned energy functions and tangential flows for graph neural networks.
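
The decomposition into an energy-descent flow and an orthogonal tangential flow can be illustrated with a toy update. This is a minimal sketch under stated assumptions, not the paper's model: the quadratic `energy` stands in for the learned Lyapunov function, the `proposal` vector stands in for the direction learned via message passing, and the explicit Euler step and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def energy(h):
    # Toy stand-in for the learned energy: V(h) = 0.5 * ||h||^2.
    return 0.5 * dot(h, h)


def energy_grad(h):
    # Gradient of the toy energy above.
    return list(h)


def project_tangential(direction, grad, eps=1e-12):
    """Remove the component of `direction` that lies along `grad`.

    The result is orthogonal to the energy gradient, so to first order
    it moves the features without changing the energy value.
    """
    gg = dot(grad, grad)
    if gg < eps:
        return list(direction)
    coef = dot(direction, grad) / gg
    return [d - coef * g for d, g in zip(direction, grad)]


def step(h, proposal, dt=0.01):
    """One explicit Euler step of descent plus tangential evolution."""
    g = energy_grad(h)
    t = project_tangential(proposal, g)
    return [x + dt * (-gi + ti) for x, gi, ti in zip(h, g, t)]


if __name__ == "__main__":
    h = [1.0, 0.0]
    h_next = step(h, proposal=[0.0, 1.0])
    print(energy(h), energy(h_next))  # energy decreases for this small step
```

Because the tangential part is orthogonal to the gradient, the features can keep moving even where the gradient is small, which matches the abstract's point about flat or ill-conditioned energy regions.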

[365] ULU: A Unified Activation Function

Simin Huo

Main category: cs.LG

TL;DR: ULU and AULU are novel activation functions outperforming ReLU and Mish, with AULU introducing adaptive learnable parameters and a new metric (LIB) for inductive bias.

Details

Motivation: To address limitations of existing activation functions (e.g., ReLU, Mish) by proposing non-monotonic, piecewise functions (ULU, AULU) that treat positive and negative inputs differently.

Method: ULU is defined piecewise with fixed parameters, while AULU uses learnable parameters for adaptability. The LIB metric quantifies inductive bias.

Result: ULU and AULU outperform ReLU and Mish in image classification and object detection tasks.

Conclusion: ULU and AULU offer superior performance and adaptability, with LIB providing a new tool for analyzing model bias.

Abstract: We propose ULU, a novel non-monotonic, piecewise activation function defined as $\{f(x;\alpha_1),\ x<0;\ f(x;\alpha_2),\ x\ge 0\}$, where $f(x;\alpha)=0.5x(\tanh(\alpha x)+1)$, $\alpha>0$. ULU treats positive and negative inputs differently. Extensive experiments demonstrate ULU significantly outperforms ReLU and Mish across image classification and object detection tasks. Its variant Adaptive ULU (AULU) is expressed as $\{f(x;\beta_1^2),\ x<0;\ f(x;\beta_2^2),\ x\ge 0\}$, where $\beta_1$ and $\beta_2$ are learnable parameters, enabling it to adapt its response separately for positive and negative inputs. Additionally, we introduce the LIB (Like Inductive Bias) metric from AULU to quantitatively measure the inductive bias of the model.
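
The piecewise definition in the abstract translates directly into a scalar sketch. The formulas follow the abstract; the function names and any concrete parameter values are illustrative assumptions, since the abstract does not state which values of $\alpha_1$ and $\alpha_2$ the paper uses. In a real model, $\beta_1$ and $\beta_2$ would be trainable tensors rather than plain floats.

```python
import math

def f(x: float, alpha: float) -> float:
    """Base function f(x; alpha) = 0.5 * x * (tanh(alpha * x) + 1), with alpha > 0."""
    return 0.5 * x * (math.tanh(alpha * x) + 1.0)

def ulu(x: float, alpha1: float, alpha2: float) -> float:
    """ULU: f(x; alpha1) for x < 0 and f(x; alpha2) for x >= 0."""
    return f(x, alpha1) if x < 0 else f(x, alpha2)

def aulu(x: float, beta1: float, beta2: float) -> float:
    """AULU: ULU with alpha_i = beta_i ** 2; squaring keeps the slope parameter non-negative."""
    return ulu(x, beta1 ** 2, beta2 ** 2)
```

For large positive inputs the function approaches the identity, for large negative inputs it approaches zero from below, and the dip for small negative inputs is what makes it non-monotonic.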

[366] Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning

Luai Abuelsamen, Temitope Lukman Adebanjo

Main category: cs.LG

TL;DR: Multimodal imitation learning benefits from integrated perception (RGB-D, proprioception, language), improving generalization and optimization compared to unimodal methods.

Details

Motivation: To understand how multimodal perception impacts sample complexity and optimization in imitation learning, leveraging statistical learning theory.

Method: Analyzes multimodal policies using Rademacher complexity, PAC learning, and information theory, reviewing frameworks like PerAct and CLIPort.

Result: Properly integrated multimodal policies can achieve tighter generalization bounds and more favorable optimization landscapes than unimodal ones.

Conclusion: Properly integrated multimodal architectures can outperform unimodal ones, which the paper attributes to tighter generalization bounds and more favorable optimization landscapes.

Abstract: This paper examines the theoretical foundations of multimodal imitation learning through the lens of statistical learning theory. We analyze how multimodal perception (RGB-D, proprioception, language) affects sample complexity and optimization landscapes in imitation policies. Building on recent advances in multimodal learning theory, we show that properly integrated multimodal policies can achieve tighter generalization bounds and more favorable optimization landscapes than their unimodal counterparts. We provide a comprehensive review of theoretical frameworks that explain why multimodal architectures like PerAct and CLIPort achieve superior performance, connecting these empirical results to fundamental concepts in Rademacher complexity, PAC learning, and information theory.

[367] Integrated Influence: Data Attribution with Baseline

Linxiao Yang, Xinyu Gu, Liang Sun

Main category: cs.LG

TL;DR: Proposes Integrated Influence, a data attribution method with a baseline approach, addressing limitations of LOO-based methods and improving reliability.

Details

Motivation: Existing LOO-based data attribution methods lack collective influence analysis and baseline flexibility, limiting transparency and counterfactual explanations.

Method: Introduces Integrated Influence, which uses a baseline dataset, degenerates data to this baseline, and accumulates sample influence during the process.

Result: Outperforms existing methods in data attribution and mislabeled example identification tasks.

Conclusion: Integrated Influence offers a more reliable and flexible data attribution framework, generalizing existing methods like influence functions.

Abstract: As an effective approach to quantifying how training samples influence a test sample, data attribution is crucial for understanding data and models and for further enhancing the transparency of machine learning models. We find that prevailing data attribution methods based on the leave-one-out (LOO) strategy suffer from local-based explanation, as these LOO-based methods only perturb a single training sample and overlook the collective influence in the training set. On the other hand, the lack of a baseline in many data attribution methods reduces the flexibility of the explanation, e.g., failing to provide counterfactual explanations. In this paper, we propose Integrated Influence, a novel data attribution method that incorporates a baseline approach. Our method defines a baseline dataset, follows a data degeneration process to transition the current dataset to the baseline, and accumulates the influence of each sample throughout this process. We provide a solid theoretical framework for our method, and further demonstrate that popular methods, such as influence functions, can be viewed as special cases of our approach. Experimental results show that Integrated Influence generates more reliable data attributions compared to existing methods in both the data attribution task and the mislabeled example identification task.

[368] Cold Start Active Preference Learning in Socio-Economic Domains

Mojtaba Fayaz-Bakhsh, Danial Ataee, MohammadAmin Fazli

Main category: cs.LG

TL;DR: The paper proposes a self-supervised pre-training framework using PCA to address the cold-start problem in active preference learning, outperforming standard methods with fewer labeled pairs.

Details

Motivation: The cold-start problem in active preference learning hinders performance when no initial labeled data is available, especially in data-scarce domains like social systems and economics.

Method: The method involves a self-supervised pre-training phase using PCA to generate pseudo-labels, followed by an active learning loop with a simulated noisy oracle.

Result: Experiments show the framework outperforms standard active learning, achieving higher accuracy with fewer labeled pairs.

Conclusion: The proposed framework effectively mitigates the cold-start problem, improving sample efficiency in preference learning for data-constrained environments.

Abstract: Active preference learning is a powerful paradigm for efficiently modeling preferences, yet it suffers from the cold-start problem: a significant drop in performance when no initial labeled data is available. This challenge is particularly acute in computational social systems and economic analysis, where labeled data is often scarce, expensive, and subject to expert noise. To address this gap, we propose a novel framework for cold-start active preference learning. Our method initiates the learning process through a self-supervised pre-training phase, utilizing Principal Component Analysis (PCA) to derive initial pseudo-labels from the data’s inherent structure, thereby creating a cold-start model without any initial oracle interaction. Subsequently, the model is refined through an active learning loop that strategically queries a simulated noisy oracle for labels. We conduct extensive experiments on diverse datasets from different domains, including financial credibility, career success rate, and socio-economic status. The results demonstrate that our cold-start approach outperforms standard active learning strategies that begin from a blank slate, achieving higher accuracy with substantially fewer labeled pairs. Our framework offers a practical and effective solution to mitigate the cold-start problem, enhancing the sample efficiency and applicability of preference learning in data-constrained environments. We release our code at https://github.com/Dan-A2/cold-start-preference-learning
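
As a rough illustration of the PCA pre-training step, the sketch below scores each item by its projection onto the first principal component and turns score comparisons into pairwise pseudo-labels. The abstract does not specify how PCA output becomes labels, so this mapping is an assumption, as are the names `first_pc_scores` and `pseudo_label`. The sign of a principal component is arbitrary, so a real pipeline would need some rule to orient it. The authors' actual procedure is in their released code.

```python
import math

def first_pc_scores(rows, iters=200):
    """Project mean-centred rows onto their first principal component.

    Uses power iteration on X^T X, which is proportional to the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d  # start direction; it also determines the sign of the result
    for _ in range(iters):
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centred]            # X v
        w = [sum(centred[i][j] * proj[i] for i in range(n)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data: keep the current direction
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centred]

def pseudo_label(scores, i, j):
    """Pseudo-preference for the pair (i, j): 1 if item i scores higher, else 0."""
    return 1 if scores[i] > scores[j] else 0
```

A preference model could be pre-trained on such pseudo-labelled pairs before any oracle query is issued, which is the cold-start idea the abstract describes.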

[369] Learning from Similarity-Confidence and Confidence-Difference

Tomoya Tate, Kosuke Sugiyama, Masato Uchida

Main category: cs.LG

TL;DR: A novel Weakly Supervised Learning (WSL) framework leverages multiple weak supervision signals (similarity-confidence and confidence-difference) for improved performance with limited labeled data. It introduces unbiased risk estimators and a risk correction approach, outperforming existing methods.

Details

Motivation: Addressing the challenge of limited labeled data in machine learning by integrating multiple weak supervision signals for more robust training.

Method: Proposes SconfConfDiff Classification, integrating two weak labels (similarity-confidence and confidence-difference) and deriving unbiased risk estimators (convex combination and interaction-based). Includes risk correction to prevent overfitting.

Result: The method achieves optimal convergence rates, robustness against label noise, and outperforms baselines in experiments.

Conclusion: The framework effectively leverages multiple weak supervision signals, offering a practical solution for limited labeled data scenarios.

Abstract: In practical machine learning applications, it is often challenging to assign accurate labels to data, and the ability to increase the number of labeled instances is often limited. In such cases, Weakly Supervised Learning (WSL), which enables training with incomplete or imprecise supervision, provides a practical and effective solution. However, most existing WSL methods focus on leveraging a single type of weak supervision. In this paper, we propose a novel WSL framework that leverages complementary weak supervision signals from multiple relational perspectives, which can be especially valuable when labeled data is limited. Specifically, we introduce SconfConfDiff Classification, a method that integrates two distinct forms of weak labels: similarity-confidence and confidence-difference, which are assigned to unlabeled data pairs. To implement this method, we derive two types of unbiased risk estimators for classification: one based on a convex combination of existing estimators, and another newly designed by modeling the interaction between two weak labels. We prove that both estimators achieve optimal convergence rates with respect to estimation error bounds. Furthermore, we introduce a risk correction approach to mitigate overfitting caused by negative empirical risk, and provide theoretical analysis on the robustness of the proposed method against inaccurate class prior probability and label noise. Experimental results demonstrate that the proposed method consistently outperforms existing baselines across a variety of settings.

[370] Exploring Superior Function Calls via Reinforcement Learning

Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, Chenyi Zhuang

Main category: cs.LG

TL;DR: A novel reinforcement learning framework improves function calling in LLMs by addressing exploration, reasoning, and parameter verification challenges, achieving 86.02% accuracy.

Details

Motivation: Current training methods for LLMs in function calling lack robust reasoning and struggle with complex action spaces, necessitating a better approach.

Method: A two-stage data preparation pipeline with iterative LLM evaluation and AST validation, combined with strategic entropy-based exploration in reinforcement learning.

Result: Achieves 86.02% overall accuracy on the Berkeley Function Calling Leaderboard, outperforming standard GRPO by up to 6% on complex multi-function scenarios.

Conclusion: The framework enhances function calling performance, especially for code-pretrained models, and will be released to the community.

Abstract: Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all the code, models and dataset to benefit the community.
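
The abstract mentions abstract syntax tree validation of training samples without giving details. The sketch below shows one way such a check might look using Python's standard `ast` module: parse a candidate call, then compare its function name and keyword arguments against a schema. The schema layout, the `validate_call` name, and the keyword-only rule are assumptions for illustration, not the paper's pipeline.

```python
import ast

def validate_call(call_src: str, schema: dict) -> bool:
    """Return True if call_src is a single call to a known function whose
    keyword arguments match the (hypothetical) schema."""
    try:
        tree = ast.parse(call_src, mode="eval")
    except SyntaxError:
        return False  # not parseable as a Python expression
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False
    spec = schema.get(call.func.id)
    if spec is None or call.args:
        return False  # unknown function, or positional arguments (disallowed in this sketch)
    seen = set()
    for kw in call.keywords:
        if kw.arg is None or kw.arg not in spec["params"]:
            return False  # **kwargs expansion or unknown parameter name
        try:
            value = ast.literal_eval(kw.value)  # only literal values are accepted
        except (ValueError, TypeError):
            return False
        if not isinstance(value, spec["params"][kw.arg]):
            return False  # wrong literal type
        seen.add(kw.arg)
    return all(name in seen for name in spec["required"])

# Hypothetical tool schema: parameter name -> expected Python type.
EXAMPLE_SCHEMA = {
    "get_weather": {"params": {"city": str, "days": int}, "required": ["city"]},
}
```

Because the check works on the parsed tree rather than on raw strings, formatting differences such as spacing or argument order do not affect the outcome.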

[371] HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation

Thinh Nguyen, Trung Phan, Binh T. Nguyen, Khoa D Doan, Kok-Seng Wong

Main category: cs.LG

TL;DR: The paper introduces HFedDG, a problem setting for studying domain shift in hierarchical federated learning, and proposes HFedATM, a method combining filter alignment and aggregation to improve performance and efficiency.

Details

Motivation: Conventional FL and HFL face scalability and domain shift issues, limiting model performance on unseen data. The paper aims to address these by integrating domain generalization into HFL.

Method: Proposes HFedATM, which aligns convolutional filters using Filter-wise Optimal Transport Alignment and aggregates models with Shrinkage-aware Regularized Mean Aggregation.

Result: HFedATM outperforms FedDG baselines, maintains efficiency, and achieves tighter generalization bounds, enabling faster convergence.

Conclusion: HFedATM effectively tackles domain shift in HFL, enhancing performance and scalability while ensuring computational efficiency.

Abstract: Federated Learning (FL) is a decentralized approach where multiple clients collaboratively train a shared global model without sharing their raw data. Despite its effectiveness, conventional FL faces scalability challenges due to excessive computational and communication demands placed on a single central server as the number of participating devices grows. Hierarchical Federated Learning (HFL) addresses these issues by distributing model aggregation tasks across intermediate nodes (stations), thereby enhancing system scalability and robustness against single points of failure. However, HFL still suffers from a critical yet often overlooked limitation: domain shift, where data distributions vary significantly across different clients and stations, reducing model performance on unseen target domains. While Federated Domain Generalization (FedDG) methods have emerged to improve robustness to domain shifts, their integration into HFL frameworks remains largely unexplored. In this paper, we formally introduce Hierarchical Federated Domain Generalization (HFedDG), a novel scenario designed to investigate domain shift within hierarchical architectures. Specifically, we propose HFedATM, a hierarchical aggregation method that first aligns the convolutional filters of models from different stations through Filter-wise Optimal Transport Alignment and subsequently merges aligned models using a Shrinkage-aware Regularized Mean Aggregation. Our extensive experimental evaluations demonstrate that HFedATM significantly boosts the performance of existing FedDG baselines across multiple datasets and maintains computational and communication efficiency. Moreover, theoretical analyses indicate that HFedATM achieves tighter generalization error bounds compared to standard hierarchical averaging, resulting in faster convergence and stable training behavior.

[372] Deep Neural Networks with General Activations: Super-Convergence in Sobolev Norms

Yahong Yang, Juncai He

Main category: cs.LG

TL;DR: Deep neural networks with general activations outperform classical methods in approximating PDE solutions, achieving super-convergence in Sobolev spaces.

Details

Motivation: To bridge the gap in error-estimation theory for neural-network-based PDE approaches and demonstrate their superior accuracy over traditional methods.

Method: Analysis of deep fully-connected neural networks with general activation functions in Sobolev spaces, comparing errors in $W^{m,p}$-norm for $m < n$.

Result: Deep networks achieve super-convergence, surpassing finite element and spectral methods in approximating PDE weak solutions.

Conclusion: The work provides a unified theoretical foundation for neural networks in scientific computing, closing a gap in PDE error estimation.

Abstract: This paper establishes a comprehensive approximation result for deep fully-connected neural networks with commonly-used and general activation functions in Sobolev spaces $W^{n,\infty}$, with errors measured in the $W^{m,p}$-norm for $m < n$ and $1\le p \le \infty$. The derived rates surpass those of classical numerical approximation techniques, such as finite element and spectral methods, exhibiting a phenomenon we refer to as \emph{super-convergence}. Our analysis shows that deep networks with general activations can approximate weak solutions of partial differential equations (PDEs) with superior accuracy compared to traditional numerical methods at the approximation level. Furthermore, this work closes a significant gap in the error-estimation theory for neural-network-based approaches to PDEs, offering a unified theoretical foundation for their use in scientific computing.
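
For readers unfamiliar with the notation, the standard definition of the Sobolev norm in which the errors are measured is reproduced below. The domain symbol $\Omega$ is an assumption, since the abstract does not name the domain, and $\alpha$ here denotes a multi-index of partial derivatives.

```latex
\[
\|u\|_{W^{m,p}(\Omega)}
  = \Big( \sum_{|\alpha| \le m} \|D^{\alpha} u\|_{L^{p}(\Omega)}^{p} \Big)^{1/p},
  \quad 1 \le p < \infty,
\qquad
\|u\|_{W^{m,\infty}(\Omega)}
  = \max_{|\alpha| \le m} \|D^{\alpha} u\|_{L^{\infty}(\Omega)}.
\]
```

Measuring error in $W^{m,p}$ with $m < n$ therefore controls the approximation of the target function together with its derivatives up to order $m$, for targets that have $n$ bounded weak derivatives.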

[373] PSEO: Optimizing Post-hoc Stacking Ensemble Through Hyperparameter Tuning

Beicheng Xu, Wei Liu, Keyao Ding, Yupeng Lu, Bin Cui

Main category: cs.LG

TL;DR: PSEO optimizes post-hoc stacking ensembles in AutoML by balancing diversity and performance, outperforming 16 methods with a top average test rank.

Details

Motivation: Existing CASH methods use fixed ensemble strategies, limiting adaptability to task-specific needs. PSEO aims to enhance ensemble optimization.

Method: PSEO selects base models via binary quadratic programming, introduces multi-layer stacking mechanisms, and searches for optimal ensemble strategies.

Result: PSEO achieved the best average test rank (2.96) on 80 datasets, surpassing other AutoML and ensemble methods.

Conclusion: PSEO effectively addresses the limitations of fixed ensemble strategies, demonstrating superior performance in AutoML tasks.

Abstract: The Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem is fundamental in Automated Machine Learning (AutoML). Inspired by the success of ensemble learning, recent AutoML systems construct post-hoc ensembles for final predictions rather than relying on the best single model. However, while most CASH methods conduct extensive searches for the optimal single model, they typically employ fixed strategies during the ensemble phase that fail to adapt to specific task characteristics. To tackle this issue, we propose PSEO, a framework for post-hoc stacking ensemble optimization. First, we conduct base model selection through binary quadratic programming, with a trade-off between diversity and performance. Furthermore, we introduce two mechanisms to fully realize the potential of multi-layer stacking. Finally, PSEO builds a hyperparameter space and searches for the optimal post-hoc ensemble strategy within it. Empirical results on 80 public datasets show that PSEO achieves the best average test rank (2.96) among 16 methods, including post-hoc designs in recent AutoML systems and state-of-the-art ensemble learning methods.
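
To make the diversity/performance trade-off concrete, the sketch below solves a tiny binary quadratic selection problem by brute force. The objective (sum of individual scores minus a weighted sum of pairwise similarities), the fixed subset size `k`, and the weight `lam` are assumptions chosen for illustration. The abstract does not give PSEO's exact formulation, and a real system would use a proper solver instead of enumeration.

```python
from itertools import combinations

def select_base_models(perf, sim, k, lam=1.0):
    """Pick k models maximising sum(perf) - lam * (sum of pairwise similarities).

    perf[i]   : validation score of model i (higher is better)
    sim[i][j] : similarity between the predictions of models i and j
    Enumerates all subsets, so it is only practical for small model pools.
    """
    best, best_val = None, float("-inf")
    for subset in combinations(range(len(perf)), k):
        val = sum(perf[i] for i in subset)
        val -= lam * sum(sim[i][j] for i, j in combinations(subset, 2))
        if val > best_val:
            best, best_val = subset, val
    return best
```

With `lam=0` the selection reduces to taking the top-k models by score, while larger values penalise picking models whose predictions are nearly identical.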

[374] Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models

Mason Nakamura, Saaduddin Mahmud, Kyle H. Wray, Hamed Zamani, Shlomo Zilberstein

Main category: cs.LG

TL;DR: HIA (Heuristic-Guided Inference-time Alignment) is a tuning-free method that balances alignment quality and computational cost by using heuristic reward models and two-stage filtering, outperforming existing baselines under the same inference budget.

Details

Motivation: Aligning LLMs with user preferences is costly in terms of fine-tuning or inference, creating a trade-off between alignment quality and computational expense. Existing methods often ignore this balance.

Method: HIA employs a lightweight prompt optimizer, heuristic reward models, and two-stage filtering to reduce inference calls while maintaining alignment quality.

Result: HIA outperforms baselines like best-of-N sampling, beam search, and greedy search on real-world datasets (HelpSteer and ComPRed) under the same inference budget, especially effective with low budgets (1-2 queries).

Conclusion: HIA offers a practical, scalable solution for personalized LLM deployment by efficiently balancing alignment quality and computational cost.

Abstract: Aligning LLMs with user preferences is crucial for real-world use but often requires costly fine-tuning or expensive inference, forcing trade-offs between alignment quality and computational cost. Existing inference-time methods typically ignore this balance, focusing solely on the optimized policy’s performance. We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt optimizer, heuristic reward models, and two-stage filtering to reduce inference calls while preserving alignment quality. On real-world prompt datasets, HelpSteer and ComPRed, HIA outperforms best-of-N sampling, beam search, and greedy search baselines in multi-objective, goal-conditioned tasks under the same inference budget. We also find that HIA is effective under low-inference budgets with as little as one or two response queries, offering a practical solution for scalable, personalized LLM deployment.

[375] Domain-driven Metrics for Reinforcement Learning: A Case Study on Epidemic Control using Agent-based Simulation

Rishabh Gaur, Gaurav Deshkar, Jayanta Kshirsagar, Harshal Hayatnagarkar, Janani Venugopalan

Main category: cs.LG

TL;DR: The paper introduces domain-driven metrics for evaluating RL-based agent-based models (ABMs) and rational ABMs (RABMs), addressing challenges like system complexity and lack of standardized metrics.

Details

Motivation: The complexity and stochasticity of ABMs/RABMs, along with the absence of standardized metrics for RL algorithms, motivate the need for domain-driven evaluation methods.

Method: The study develops domain-driven RL metrics, building on existing ones, and applies them to a rational ABM case study involving pandemic behaviors like masking and vaccination.

Result: The proposed metrics, combined with traditional ones, are demonstrated in simulation scenarios, such as varying mask availability, showing their effectiveness.

Conclusion: Domain-driven RL metrics enhance the evaluation of ABMs/RABMs, providing more meaningful insights for complex, stochastic systems.

Abstract: For the development and optimization of agent-based models (ABMs) and rational agent-based models (RABMs), optimization algorithms such as reinforcement learning are extensively used. However, assessing the performance of RL-based ABMs and RABMs is challenging due to the complexity and stochasticity of the modeled systems, and the lack of well-standardized metrics for comparing RL algorithms. In this study, we develop domain-driven metrics for RL, while building on state-of-the-art metrics. We demonstrate our “Domain-driven-RL-metrics” using policy optimization on a rational ABM disease modeling case study to model masking behavior, vaccination, and lockdown in a pandemic. Our results show the use of domain-driven rewards in conjunction with traditional and state-of-the-art metrics for a few different simulation scenarios such as the differential availability of masks.

[376] pFedDSH: Enabling Knowledge Transfer in Personalized Federated Learning through Data-free Sub-Hypernetwork

Thinh Nguyen, Le Huy Khiem, Van-Tuan Tran, Khoa D Doan, Nitesh V Chawla, Kok-Seng Wong

Main category: cs.LG

TL;DR: The paper introduces pFedDSH, a framework for dynamic client onboarding in Federated Learning, addressing challenges like performance stability and knowledge transfer.

Details

Motivation: Existing Personalized Federated Learning (pFL) methods assume static clients, but real-world scenarios involve dynamic onboarding of new clients, necessitating solutions for stability and transfer.

Method: Proposes pFedDSH, using a central hypernetwork with batch-specific masks and data-free replay for knowledge preservation and backward transfer.

Result: Outperforms state-of-the-art pFL and Federated Continual Learning baselines on CIFAR-10, CIFAR-100, and Tiny-ImageNet, showing robust performance and adaptation.

Conclusion: pFedDSH effectively handles dynamic client onboarding, ensuring stability for existing clients and efficient adaptation for new ones.

Abstract: Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, offering a significant privacy benefit. However, most existing Personalized Federated Learning (pFL) methods assume static client participation, which does not reflect real-world scenarios where new clients may continuously join the federated system (i.e., dynamic client onboarding). In this paper, we explore a practical scenario in which a new batch of clients is introduced incrementally while the learning task remains unchanged. This dynamic environment poses various challenges, including preserving performance for existing clients without retraining and enabling efficient knowledge transfer between client batches. To address these issues, we propose Personalized Federated Data-Free Sub-Hypernetwork (pFedDSH), a novel framework based on a central hypernetwork that generates personalized models for each client via embedding vectors. To maintain knowledge stability for existing clients, pFedDSH incorporates batch-specific masks, which activate subsets of neurons to preserve knowledge. Furthermore, we introduce a data-free replay strategy motivated by DeepInversion to facilitate backward transfer, enhancing existing clients’ performance without compromising privacy. Extensive experiments conducted on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that pFedDSH outperforms the state-of-the-art pFL and Federated Continual Learning baselines in our investigation scenario. Our approach achieves robust performance stability for existing clients, as well as adaptation for new clients and efficient utilization of neural resources.

[377] FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance

Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang

Main category: cs.LG

TL;DR: A framework for evaluating hallucinations in financial LLMs using context-aware masked span prediction on real-world financial data.

Details

Motivation: Addressing the challenge of hallucinations in LLMs for finance, where minor numerical errors can impact decision-making and compliance.

Method: Develops a scalable framework with automated dataset creation (masking strategy), a new hallucination evaluation dataset from S&P 500 reports, and evaluates state-of-the-art LLMs.

Result: Provides a robust methodology for in-house LLM evaluation, identifying intrinsic hallucination patterns in financial tabular data.

Conclusion: A critical step toward building trustworthy and reliable financial Generative AI systems.

Abstract: Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. Accurate extraction and precise calculation from tabular data are essential for reliable financial analysis, since even minor numerical errors can undermine decision-making and regulatory compliance. Financial applications have unique requirements, often relying on context-dependent, numerical, and proprietary tabular data that existing hallucination benchmarks rarely capture. In this study, we develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs, conceptualized as a context-aware masked span prediction task over real-world financial documents. Our main contributions are: (1) a novel, automated dataset creation paradigm using a masking strategy; (2) a new hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a comprehensive evaluation of intrinsic hallucination patterns in state-of-the-art LLMs on financial tabular data. Our work provides a robust methodology for in-house LLM evaluation and serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.

[378] S$^2$M-Former: Spiking Symmetric Mixing Branchformer for Brain Auditory Attention Detection

Jiaqi Wang, Zhengyu Ma, Xiongri Shen, Chenlin Zhou, Leilei Zhao, Han Zhang, Yi Zhong, Siqi Cai, Zhenxi Song, Zhiguo Zhang

Main category: cs.LG

TL;DR: S$^2$M-Former is a novel spiking symmetric mixing framework for EEG-based auditory attention detection, offering energy efficiency and high performance.

Details

Motivation: Current EEG-based AAD lacks synergistic frameworks to leverage complementary EEG features efficiently under energy constraints.

Method: Proposes S$^2$M-Former with a spike-driven symmetric architecture and lightweight 1D token sequences to reduce parameters and power consumption.

Result: Achieves 5.8× energy reduction, 14.7× parameter reduction, and comparable SOTA accuracy on AAD benchmarks.

Conclusion: S$^2$M-Former is a promising low-power, high-performance solution for AAD tasks.

Abstract: Auditory attention detection (AAD) aims to decode listeners’ focus in complex auditory environments from electroencephalography (EEG) recordings, which is crucial for developing neuro-steered hearing devices. Despite recent advancements, EEG-based AAD remains hindered by the absence of synergistic frameworks that can fully leverage complementary EEG features under energy-efficiency constraints. We propose S$^2$M-Former, a novel spiking symmetric mixing framework to address this limitation through two key innovations: i) Presenting a spike-driven symmetric architecture composed of parallel spatial and frequency branches with mirrored modular design, leveraging biologically plausible token-channel mixers to enhance complementary learning across branches; ii) Introducing lightweight 1D token sequences to replace conventional 3D operations, reducing parameters by 14.7$\times$. The brain-inspired spiking architecture further reduces power consumption, achieving a 5.8$\times$ energy reduction compared to recent ANN methods, while also surpassing existing SNN baselines in terms of parameter efficiency and performance. Comprehensive experiments on three AAD benchmarks (KUL, DTU and AV-GC-AAD) across three settings (within-trial, cross-trial and cross-subject) demonstrate that S$^2$M-Former achieves comparable state-of-the-art (SOTA) decoding accuracy, making it a promising low-power, high-performance solution for AAD tasks.

[379] Near Optimal Inference for the Best-Performing Algorithm

Amichai Painsky

Main category: cs.LG

TL;DR: The paper introduces a framework for selecting the best-performing algorithm from a benchmark by formulating it as a subset selection problem for multinomial distributions, offering improved methods and matching lower bounds.

Details

Motivation: To identify the best machine learning algorithm from a benchmark when performance differences are marginal, requiring a robust selection method.

Method: Formulates the problem as subset selection for multinomial distributions, introducing asymptotic and finite-sample schemes.

Result: Proposed schemes significantly outperform existing methods, with matching lower bounds confirming their effectiveness.

Conclusion: The framework provides a reliable solution for selecting top-performing algorithms, validated by theoretical and empirical results.

Abstract: Consider a collection of competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best-performing algorithm. Specifically, we ask which algorithm is most likely to rank highest on a future, unseen dataset. A natural approach is to select the algorithm that demonstrates the best performance on the benchmark. However, in many cases the performance differences are marginal and additional candidates may also be considered. This problem is formulated as subset selection for multinomial distributions. Formally, given a sample from a countable alphabet, our goal is to identify a minimal subset of symbols that includes the most frequent symbol in the population with high confidence. In this work, we introduce a novel framework for the subset selection problem. We provide both asymptotic and finite-sample schemes that significantly improve upon currently known methods. In addition, we provide matching lower bounds, demonstrating the favorable performance of our proposed schemes.
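
The reduction to a multinomial sample can be sketched as follows: each benchmark dataset contributes one "symbol", namely the algorithm that ranks highest on it, and the resulting win counts form the sample on which subset selection operates. The data layout, the `winner_counts` name, and the tie-breaking rule are assumptions. The paper's actual selection schemes and confidence guarantees are not reproduced here.

```python
from collections import Counter

def winner_counts(scores):
    """Count how many datasets each algorithm wins.

    scores : list of dicts, one per dataset, mapping algorithm name -> performance
             (higher is better). Ties go to the alphabetically first name.
    """
    counts = Counter()
    for row in scores:
        # sorted() makes tie-breaking deterministic; max() keeps the first maximum.
        winner = max(sorted(row), key=row.get)
        counts[winner] += 1
    return counts
```

The naive choice is the algorithm with the largest count. The subset selection framing instead asks for the smallest set of algorithms that contains the true most frequent winner with high confidence, which matters when the counts are close.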

[380] Human Activity Recognition from Smartphone Sensor Data for Clinical Trials

Stefania Russo, Rafał Klimas, Marta Płonka, Hugo Le Gall, Sven Holm, Dimitar Stanev, Florian Lipsmeier, Mattia Zanon, Lito Kriara

Main category: cs.LG

TL;DR: A ResNet-based HAR model detects gait vs. non-gait activities and everyday activities with high accuracy and robustness across smartphone wear locations, outperforming state-of-the-art models.

Details

Motivation: To develop a lightweight HAR model for detecting gait and everyday activities, applicable to both healthy individuals and people with multiple sclerosis (PwMS), with high accuracy and robustness.

Method: The model uses ResNet architecture, trained and evaluated on smartphone sensor data from healthy controls and PwMS, incorporating datasets from GaitLab, Roche, and public sources.

Result: Achieved 98.4%-99.6% accuracy for gait detection and 96.2% for everyday activities, outperforming state-of-the-art models. Maintained high performance across 9 wear locations.

Conclusion: The proposed HAR model is accurate, robust, and practical for real-world applications, especially for diverse smartphone placements.

Abstract: We developed a ResNet-based human activity recognition (HAR) model with minimal overhead to detect gait versus non-gait activities and everyday activities (walking, running, stairs, standing, sitting, lying, sit-to-stand transitions). The model was trained and evaluated using smartphone sensor data from adult healthy controls (HC) and people with multiple sclerosis (PwMS) with Expanded Disability Status Scale (EDSS) scores between 0.0-6.5. Datasets included the GaitLab study (ISRCTN15993728), an internal Roche dataset, and publicly available data sources (training only). Data from 34 HC and 68 PwMS (mean [SD] EDSS: 4.7 [1.5]) were included in the evaluation. The HAR model showed 98.4% and 99.6% accuracy in detecting gait versus non-gait activities in the GaitLab and Roche datasets, respectively, similar to a comparative state-of-the-art ResNet model (99.3% and 99.4%). For everyday activities, the proposed model not only demonstrated higher accuracy than the state-of-the-art model (96.2% vs 91.9%; internal Roche dataset) but also maintained high performance across 9 smartphone wear locations (handbag, shopping bag, crossbody bag, backpack, hoodie pocket, coat/jacket pocket, hand, neck, belt), outperforming the state-of-the-art model by 2.8% - 9.0%. In conclusion, the proposed HAR model accurately detects everyday activities and shows high robustness to various smartphone wear locations, demonstrating its practical applicability.

[381] Fairy$\pm i$: the First 2-bit Complex LLM with All Parameters in $\{\pm1, \pm i\}$

Feiyu Wang, Guoan Wang, Yihao Zhang, Shengfan Wang, Weitao Li, Bokai Huang, Shimao Chen, Zihan Jiang, Rui Xu, Tong Yang

Main category: cs.LG

TL;DR: Fairy±i introduces a 2-bit quantization framework for complex-valued LLMs. It raises the full-precision accuracy ceiling by moving to the complex domain, then quantizes to 2 bits, outperforming the ceiling of existing 2-bit approaches.

Details

Motivation: Current QAT research focuses on minimizing quantization error without surpassing full-precision accuracy. This work aims to break that ceiling.

Method: Uses complex domain to boost accuracy, mapping weights to fourth roots of unity for symmetric 2-bit representation, enabling multiplication-free inference.

Result: Fairy±i outperforms existing 2-bit methods in PPL and downstream tasks while maintaining efficiency.

Conclusion: This work pioneers highly accurate, practical LLMs under extreme low-bit constraints.

Abstract: Quantization-Aware Training (QAT) integrates quantization into the training loop, enabling LLMs to learn robust low-bit representations, and is widely recognized as one of the most promising research directions. All current QAT research focuses on minimizing quantization error on full-precision models, where the full-precision accuracy acts as an upper bound (accuracy ceiling). No existing method has even attempted to surpass this ceiling. To break this ceiling, we propose a new paradigm: raising the ceiling (full-precision model), and then still quantizing it efficiently into 2 bits. We propose Fairy$\pm i$, the first 2-bit quantization framework for complex-valued LLMs. Specifically, our method leverages the representational advantages of the complex domain to boost full-precision accuracy. We map weights to the fourth roots of unity $\{\pm1, \pm i\}$, forming a perfectly symmetric and information-theoretically optimal 2-bit representation. Importantly, each quantized weight has either a zero real or imaginary part, enabling multiplication-free inference using only additions and element swaps. Experimental results show that Fairy$\pm i$ outperforms the ceiling of existing 2-bit quantization approaches in terms of both PPL and downstream tasks, while maintaining strict storage and compute efficiency. This work opens a new direction for building highly accurate and practical LLMs under extremely low-bit constraints.
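
To make the "multiplication-free" claim concrete, the sketch below computes a matrix-vector product with weights restricted to the fourth roots of unity using only sign flips, real/imaginary swaps, and additions. It relies on the identities (+i)(a+bi) = -b+ai and (-i)(a+bi) = b-ai. The 2-bit code assignment and the function name `matvec_no_mul` are assumptions made for illustration; this is not the authors' inference kernel.

```python
import numpy as np

# Assumed 2-bit encoding (illustrative only): 0 -> +1, 1 -> -1, 2 -> +i, 3 -> -i
UNITS = np.array([1, -1, 1j, -1j])

def matvec_no_mul(codes, x_re, x_im):
    """Compute y = W @ x for W with entries in {+1, -1, +i, -i}
    using only sign flips, real/imaginary swaps, and additions."""
    rows, cols = codes.shape
    out_re = np.zeros(rows)
    out_im = np.zeros(rows)
    for r in range(rows):
        for c in range(cols):
            a, b = x_re[c], x_im[c]
            k = codes[r, c]
            if k == 0:      # (+1)(a + bi) = a + bi
                out_re[r] += a
                out_im[r] += b
            elif k == 1:    # (-1)(a + bi) = -a - bi
                out_re[r] -= a
                out_im[r] -= b
            elif k == 2:    # (+i)(a + bi) = -b + ai
                out_re[r] -= b
                out_im[r] += a
            else:           # (-i)(a + bi) = b - ai
                out_re[r] += b
                out_im[r] -= a
    return out_re, out_im

rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=(3, 5))
x = rng.normal(size=5) + 1j * rng.normal(size=5)
y_re, y_im = matvec_no_mul(codes, x.real, x.imag)

# Reference result using ordinary complex multiplication.
reference = UNITS[codes] @ x
```

The explicit loops are for readability; a practical kernel would vectorise the four cases.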

[382] Physics-Informed Time-Integrated DeepONet: Temporal Tangent Space Operator Learning for High-Accuracy Inference

Luis Mandl, Dibyajyoti Nayak, Tim Ricken, Somdatta Goswami

Main category: cs.LG

TL;DR: PITI-DeepONet improves long-term PDE solutions by learning time-derivative operators, reducing errors significantly compared to traditional methods.

Details

Motivation: Addressing poor generalization and error accumulation in traditional PDE solution methods (FR and AR).

Method: Introduces PITI-DeepONet, a dual-output architecture trained with physics-informed or hybrid objectives, integrating time-derivative operators via classical schemes.

Result: Reduced mean relative errors by up to 98% for benchmark problems like heat, Burgers, and Allen-Cahn equations.

Conclusion: PITI-DeepONet enables more reliable long-term PDE integration, outperforming FR and AR methods.

Abstract: Accurately modeling and inferring solutions to time-dependent partial differential equations (PDEs) over extended horizons remains a core challenge in scientific machine learning. Traditional full rollout (FR) methods, which predict entire trajectories in one pass, often fail to capture the causal dependencies and generalize poorly outside the training time horizon. Autoregressive (AR) approaches, evolving the system step by step, suffer from error accumulation, limiting long-term accuracy. These shortcomings limit the long-term accuracy and reliability of both strategies. To address these issues, we introduce the Physics-Informed Time-Integrated Deep Operator Network (PITI-DeepONet), a dual-output architecture trained via fully physics-informed or hybrid physics- and data-driven objectives to ensure stable, accurate long-term evolution well beyond the training horizon. Instead of forecasting future states, the network learns the time-derivative operator from the current state, integrating it using classical time-stepping schemes to advance the solution in time. Additionally, the framework can leverage residual monitoring during inference to estimate prediction quality and detect when the system transitions outside the training domain. Applied to benchmark problems, PITI-DeepONet shows improved accuracy over extended inference time horizons when compared to traditional methods. Mean relative $\mathcal{L}_2$ errors reduced by 84% (vs. FR) and 79% (vs. AR) for the one-dimensional heat equation; by 87% (vs. FR) and 98% (vs. AR) for the one-dimensional Burgers equation; and by 42% (vs. FR) and 89% (vs. AR) for the two-dimensional Allen-Cahn equation. By moving beyond classic FR and AR schemes, PITI-DeepONet paves the way for more reliable, long-term integration of complex, time-dependent PDEs.
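
The core idea of predicting a time derivative and advancing it with a classical integrator can be sketched as follows. This is a minimal illustration under stated assumptions: the function `time_derivative` is a hand-written finite-difference stand-in for the learned operator (the paper uses a trained DeepONet, which is not reproduced here), and the grid size, diffusivity, and step size are arbitrary choices.

```python
import numpy as np

def time_derivative(u, dx, nu=0.1):
    """Stand-in for the learned operator: returns du/dt for the 1D heat
    equation u_t = nu * u_xx on a periodic grid (central differences)."""
    return nu * (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2

def rk4_step(u, dt, f):
    """One classical fourth-order Runge-Kutta step for du/dt = f(u)."""
    k1 = f(u)
    k2 = f(u + 0.5 * dt * k1)
    k3 = f(u + 0.5 * dt * k2)
    k4 = f(u + dt * k3)
    return u + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
dx = x[1] - x[0]
u = np.sin(x)                      # initial condition
dt, steps = 0.01, 100              # integrate to t = 1

for _ in range(steps):
    u = rk4_step(u, dt, lambda v: time_derivative(v, dx))

# Analytical solution for this initial condition: exp(-nu * t) * sin(x).
exact = np.exp(-0.1 * dt * steps) * np.sin(x)
max_err = np.max(np.abs(u - exact))
```

Because the integrator only ever asks for the derivative at the current state, the same loop can in principle run past the training horizon, which is the property the abstract emphasises.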

[383] Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models

Guilherme Seidyo Imai Aldeia, Daniel S. Herman, William G. La Cava

Main category: cs.LG

TL;DR: LLMs can generate interpretable and accurate computable phenotypes for clinical decision support, with iterative refinement improving performance.

Details

Motivation: To explore LLMs' potential for generating computable phenotypes (CPs) for scalable clinical decision support in hypertension care.

Method: Proposed a synthesize, execute, debug, instruct strategy for iterative refinement of CPs using LLMs and data-driven feedback.

Result: LLMs with iterative learning produced interpretable, reasonably accurate CPs, nearing state-of-the-art ML methods with fewer training examples.

Conclusion: LLMs, combined with iterative refinement, are promising for generating CPs, offering efficiency and scalability in clinical applications.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities for medical question answering and programming, but their potential for generating interpretable computable phenotypes (CPs) is under-explored. In this work, we investigate whether LLMs can generate accurate and concise CPs for six clinical phenotypes of varying complexity, which could be leveraged to enable scalable clinical decision support to improve care for patients with hypertension. In addition to evaluating zero-shot performance, we propose and test a synthesize, execute, debug, instruct strategy that uses LLMs to generate and iteratively refine CPs using data-driven feedback. Our results show that LLMs, coupled with iterative learning, can generate interpretable and reasonably accurate programs that approach the performance of state-of-the-art ML methods while requiring significantly fewer training examples.

[384] Bidding-Aware Retrieval for Multi-Stage Consistency in Online Advertising

Bin Liu, Yunfei Liu, Ziru Xu, Zhaoyu Zhou, Zhi Kou, Yeqiu Yang, Han Zhu, Jian Xu, Bo Zheng

Main category: cs.LG

TL;DR: BAR (Bidding-Aware Retrieval) addresses inconsistency in online ad systems by incorporating bid values into retrieval, improving revenue and ad performance.

Details

Motivation: The inconsistency between retrieval and ranking stages in ad systems, due to lack of real-time bid access, harms revenue and advertiser outcomes.

Method: BAR uses Bidding-Aware Modeling (monotonicity-constrained learning, multi-task distillation) and Asynchronous Near-Line Inference for real-time updates, plus Task-Attentive Refinement for feature disentanglement.

Result: BAR increased platform revenue by 4.32% and impressions by 22.2% in Alibaba’s ad platform.

Conclusion: BAR effectively resolves multi-stage inconsistency, enhancing both platform revenue and advertiser performance.

Abstract: Online advertising systems typically use a cascaded architecture to manage massive requests and candidate volumes, where the ranking stages allocate traffic based on eCPM (predicted CTR $\times$ Bid). With the increasing popularity of auto-bidding strategies, the inconsistency between the computationally sensitive retrieval stage and the ranking stages becomes more pronounced, as the former cannot access precise, real-time bids for the vast ad corpus. This discrepancy leads to sub-optimal platform revenue and advertiser outcomes. To tackle this problem, we propose Bidding-Aware Retrieval (BAR), a model-based retrieval framework that addresses multi-stage inconsistency by incorporating ad bid value into the retrieval scoring function. The core innovation is Bidding-Aware Modeling, incorporating bid signals through monotonicity-constrained learning and multi-task distillation to ensure economically coherent representations, while Asynchronous Near-Line Inference enables real-time updates to the embedding for market responsiveness. Furthermore, the Task-Attentive Refinement module selectively enhances feature interactions to disentangle user interest and commercial value signals. Extensive offline experiments and full-scale deployment across Alibaba’s display advertising platform validated BAR’s efficacy: 4.32% platform revenue increase with 22.2% impression lift for positively-operated advertisements.
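
The retrieval/ranking inconsistency described above can be shown with a toy example: a retrieval stage that scores by predicted CTR alone may discard exactly the ads that the ranking stage, which scores by eCPM = pCTR × bid, would place first. The ad records and the helper `top_k` are hypothetical and exist only to illustrate the mismatch; they do not model BAR itself.

```python
def top_k(ads, k, key):
    """Return the ids of the k highest-scoring ads under the given key."""
    return [ad["id"] for ad in sorted(ads, key=key, reverse=True)[:k]]

# Hypothetical candidates: predicted click-through rate and advertiser bid.
ads = [
    {"id": "A", "pctr": 0.10, "bid": 0.5},
    {"id": "B", "pctr": 0.05, "bid": 2.0},
    {"id": "C", "pctr": 0.08, "bid": 1.0},
    {"id": "D", "pctr": 0.02, "bid": 6.0},
]

# Bid-unaware retrieval: keep the two ads with the highest pCTR.
retrieved_by_interest = top_k(ads, 2, key=lambda ad: ad["pctr"])

# What the ranking stage would prefer: the two ads with the highest eCPM.
ranked_by_ecpm = top_k(ads, 2, key=lambda ad: ad["pctr"] * ad["bid"])

# Ads the ranker wanted but never saw.
missed = [ad_id for ad_id in ranked_by_ecpm if ad_id not in retrieved_by_interest]
```

Here the two lists do not overlap at all, so the ranking stage can only choose among candidates it would not have preferred.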

[385] Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction

Saddam Hussain Khan

Main category: cs.LG

TL;DR: A hybrid deep learning model combining LSTM, Transformer encoders, TS-Mixer, and attention mechanisms improves ROP prediction accuracy and real-time utility in drilling operations.

Details

Motivation: Traditional models fail to capture complex temporal and contextual relationships in drilling data, leading to suboptimal ROP predictions.

Method: Proposes a hybrid architecture integrating LSTM, Transformer encoders, TS-Mixer, and attention mechanisms to model temporal dependencies and feature interactions.

Result: Achieved an R-squared score of 0.9988 and MAPE of 1.447%, outperforming benchmarks.

Conclusion: The hybrid model enables reliable real-time ROP prediction, advancing intelligent drilling optimization.

Abstract: The Rate of Penetration (ROP) is crucial for optimizing drilling operations; however, accurately predicting it is hindered by the complex, dynamic, and high-dimensional nature of drilling data. Traditional empirical, physics-based, and basic machine learning models often fail to capture intricate temporal and contextual relationships, resulting in suboptimal predictions and limited real-time utility. To address this gap, we propose a novel hybrid deep learning architecture integrating Long Short-Term Memory (LSTM) networks, Transformer encoders, Time-Series Mixer (TS-Mixer) blocks, and attention mechanisms to synergistically model temporal dependencies, static feature interactions, global context, and dynamic feature importance. Evaluated on a real-world drilling dataset, our model outperformed benchmarks (standalone LSTM, TS-Mixer, and simpler hybrids) with an R-squared score of 0.9988 and a Mean Absolute Percentage Error of 1.447%, as measured by standard regression metrics (R-squared, MAE, RMSE, MAPE). Model interpretability was ensured using SHAP and LIME, while actual vs. predicted curves and bias checks confirmed accuracy and fairness across scenarios. This advanced hybrid approach enables reliable real-time ROP prediction, paving the way for intelligent, cost-effective drilling optimization systems with significant operational impact.

[386] DFW: A Novel Weighting Scheme for Covariate Balancing and Treatment Effect Estimation

Ahmad Saeed Khan, Erik Schaffernicht, Johannes Andreas Stork

Main category: cs.LG

TL;DR: Proposes Deconfounding Factor Weighting (DFW) to address instability in propensity score-based methods for causal effect estimation, improving covariate balance and treatment effect accuracy.

Details

Motivation: Selection bias in observational data complicates causal effect estimation, and existing propensity score methods like IPW can be unstable due to high variance in weights.

Method: DFW leverages a deconfounding factor to construct stable weights, prioritizing less confounded samples and bounding weights to improve balance and reduce variance.

Result: DFW outperforms IPW and CBPS in experiments, achieving better covariate balance and more accurate treatment effect estimation.

Conclusion: DFW is a robust alternative to traditional propensity score methods, effective for binary and multi-treatment settings.

Abstract: Estimating causal effects from observational data is challenging due to selection bias, which leads to imbalanced covariate distributions across treatment groups. Propensity score-based weighting methods are widely used to address this issue by reweighting samples to simulate a randomized controlled trial (RCT). However, the effectiveness of these methods heavily depends on the observed data and the accuracy of the propensity score estimator. For example, inverse propensity weighting (IPW) assigns weights based on the inverse of the propensity score, which can lead to unstable weights when propensity scores have high variance (either due to data or model misspecification), ultimately degrading the ability to handle selection bias and estimate treatment effects. To overcome these limitations, we propose Deconfounding Factor Weighting (DFW), a novel propensity score-based approach that leverages the deconfounding factor to construct stable and effective sample weights. DFW prioritizes less confounded samples while mitigating the influence of highly confounded ones, producing a pseudopopulation that better approximates an RCT. Our approach ensures bounded weights, lower variance, and improved covariate balance. While DFW is formulated for binary treatments, it naturally extends to multi-treatment settings, as the deconfounding factor is computed based on the estimated probability of the treatment actually received by each sample. Through extensive experiments on real-world benchmark and synthetic datasets, we demonstrate that DFW outperforms existing methods, including IPW and CBPS, in both covariate balancing and treatment effect estimation.
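
The instability of IPW mentioned above is easy to see numerically. The sketch below implements the standard IPW weights (1/e for treated samples, 1/(1-e) for controls, where e is the propensity score) on made-up values. It does not implement DFW, whose weighting formula is not given in this summary; the function name `ipw_weights` and the sample data are assumptions.

```python
import numpy as np

def ipw_weights(treatment, propensity):
    """Standard inverse propensity weights: 1/e for treated samples,
    1/(1 - e) for control samples."""
    t = np.asarray(treatment)
    e = np.asarray(propensity, dtype=float)
    return np.where(t == 1, 1.0 / e, 1.0 / (1.0 - e))

# Two well-overlapped samples (e = 0.5) and two extreme ones:
# a treated unit with e = 0.01 and a control unit with e = 0.99.
treatment = [1, 1, 0, 0]
propensity = [0.5, 0.01, 0.5, 0.99]
w = ipw_weights(treatment, propensity)

# Ratio between the largest and smallest weight in the sample.
spread = w.max() / w.min()
```

A single sample whose received treatment was estimated as very unlikely ends up weighted about 50 times more than a well-overlapped one, which is the variance problem that bounded weights are meant to avoid.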

[387] ML-based Short Physical Performance Battery future score prediction based on questionnaire data

Marcin Kolakowski, Seif Ben Bader

Main category: cs.LG

TL;DR: Predicting SPPB scores in older adults using ML, with XGBoost achieving the best performance (MAE: 0.79). Retraining on 10-20 selected features gave a slightly higher MAE (0.82).

Details

Motivation: Early intervention is crucial to slow physical deterioration in older adults. Predicting SPPB scores can aid timely action.

Method: Tested ML algorithms (Random Forest, XGBoost, Linear Regression, dense and TabNet neural networks) on questionnaire data. Used Shapley values for feature selection.

Result: XGBoost performed best (MAE: 0.79). Feature subsets (10-20) yielded slightly higher MAE (0.82).

Conclusion: XGBoost is effective for SPPB prediction. Feature selection maintains performance with fewer inputs, aiding practical application.

Abstract: Effective slowing down of older adults' physical capacity deterioration requires intervention as soon as the first symptoms surface. In this paper, we analyze the possibility of predicting the Short Physical Performance Battery (SPPB) score at a four-year horizon based on questionnaire data. The ML algorithms tested included Random Forest, XGBoost, Linear Regression, dense and TabNet neural networks. The best results were achieved for the XGBoost (mean absolute error of 0.79 points). Based on the Shapley values analysis, we selected smaller subsets of features (from 10 to 20) and retrained the XGBoost regressor, achieving a mean absolute error of 0.82.

[388] Don’t Reach for the Stars: Rethinking Topology for Resilient Federated Learning

Mirko Konstantin, Anirban Mukhopadhyay

Main category: cs.LG

TL;DR: Proposes a decentralized P2P FL framework (LIGHTYEAR) using agreement scores for personalized, robust model updates, outperforming centralized and existing P2P methods.

Details

Motivation: Addresses limitations of centralized FL (single point of failure, poor personalization, vulnerability to adversarial clients) and unreliable update selection in heterogeneous data settings.

Method: Introduces a P2P FL framework with agreement scores for semantic alignment of updates, personalized selection, and regularization for stability.

Result: Outperforms centralized and existing P2P FL methods, especially under adversarial and heterogeneous conditions.

Conclusion: Decentralized P2P FL with agreement-based update selection and regularization enhances robustness and personalization in heterogeneous settings.

Abstract: Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy by keeping data local. Traditional FL approaches rely on a centralized, star-shaped topology, where a central server aggregates model updates from clients. However, this architecture introduces several limitations, including a single point of failure, limited personalization, and poor robustness to distribution shifts or vulnerability to malfunctioning clients. Moreover, update selection in centralized FL often relies on low-level parameter differences, which can be unreliable when client data is not independent and identically distributed, and offers clients little control. In this work, we propose a decentralized, peer-to-peer (P2P) FL framework. It leverages the flexibility of the P2P topology to enable each client to identify and aggregate a personalized set of trustworthy and beneficial updates. This framework is the Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization (LIGHTYEAR). Central to our method is an agreement score, computed on a local validation set, which quantifies the semantic alignment of incoming updates in the function space with respect to the client's reference model. Each client uses this score to select a tailored subset of updates and performs aggregation with a regularization term that further stabilizes the training. Our empirical evaluation across two datasets shows that the proposed approach consistently outperforms both centralized baselines and existing P2P methods in terms of client-level performance, particularly under adversarial and heterogeneous conditions.

[389] Cross-LoRA: A Data-Free LoRA Transfer Framework across Heterogeneous LLMs

Feifan Xia, Mingyang Liao, Yuyang Fang, Defang Li, Yantong Xie, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang

Main category: cs.LG

TL;DR: Cross-LoRA is a data-free framework for transferring LoRA modules between diverse LLMs, using LoRA-Align and LoRA-Shift, achieving up to 5.26% performance gains.

Details

Motivation: Overcome the limitation of traditional PEFT methods like LoRA, which are architecture-dependent and not easily transferable across heterogeneous LLMs.

Method: Introduces Cross-LoRA with LoRA-Align (subspace alignment via SVD and linear transformation) and LoRA-Shift (projecting weight updates). Both are data-free and training-free.

Result: Achieves relative gains of up to 5.26% over base models on ARCs, OBQA, and HellaSwag, with performance comparable to directly trained LoRA adapters on other commonsense reasoning benchmarks.

Conclusion: Cross-LoRA enables efficient, lightweight adaptation of LoRA modules across diverse LLMs without additional data or training.

Abstract: Traditional parameter-efficient fine-tuning (PEFT) methods such as LoRA are tightly coupled with the base model architecture, which constrains their applicability across heterogeneous pretrained large language models (LLMs). To address this limitation, we introduce Cross-LoRA, a data-free framework for transferring LoRA modules between diverse base models without requiring additional training data. Cross-LoRA consists of two key components: (a) LoRA-Align, which performs subspace alignment between source and target base models through rank-truncated singular value decomposition (SVD) and Frobenius-optimal linear transformation, ensuring compatibility under dimension mismatch; and (b) LoRA-Shift, which applies the aligned subspaces to project source LoRA weight updates into the target model parameter space. Both components are data-free, training-free, and enable lightweight adaptation on a commodity GPU in 20 minutes. Experiments on ARCs, OBQA and HellaSwag show that Cross-LoRA achieves relative gains of up to 5.26% over base models. Across other commonsense reasoning benchmarks, Cross-LoRA maintains performance comparable to that of directly trained LoRA adapters.

[390] MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Xiaodong Chen, Mingming Ha, Zhenzhong Lan, Jing Zhang, Jianguo Li

Main category: cs.LG

TL;DR: The paper introduces Mixture-of-Basis-Experts (MoBE), a novel method for compressing large MoE-based LLMs with minimal accuracy loss, outperforming existing techniques.

Details

Motivation: Addressing the high memory requirements of large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct, for which current compression methods degrade accuracy significantly.

Method: Decomposes expert matrices via rank decomposition (W = AB), with shared basis matrices {Bi} across experts, learned by minimizing reconstruction error.

Result: MoBE reduces parameter counts by 24%-30% with only 1%-2% accuracy drop, significantly better than prior methods.

Conclusion: MoBE offers an effective solution for compressing MoE models with minimal performance trade-offs.

Abstract: The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% relatively) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only 1%-2% accuracy drop (about 2% drops when measured relatively).
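
The parameterization described above, W = AB with B expressed as a linear combination of shared basis matrices, can be sketched with small arrays to show where the parameter savings come from. All sizes, the random values, and the variable names are illustrative assumptions, and the sketch only builds the factorized form. It does not perform the reconstruction-error minimization that learns the factors, and it follows the plain linear form stated in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_basis = 8, 2        # illustrative sizes, not taken from the paper
p, r, q = 16, 16, 64             # W is p x q, A is p x r, B is r x q

A = rng.normal(size=(n_experts, p, r))          # unique to each expert
basis = rng.normal(size=(n_basis, r, q))        # {B_i}, shared within the layer
alpha = rng.normal(size=(n_experts, n_basis))   # per-expert mixing coefficients

# B_e = sum_i alpha[e, i] * basis[i]
B = np.einsum("ei,irq->erq", alpha, basis)
# W_e = A_e @ B_e
W = np.einsum("epr,erq->epq", A, B)

# Parameter counts: dense experts versus the factorized form.
dense_params = n_experts * p * q
mobe_params = A.size + basis.size + alpha.size
ratio = mobe_params / dense_params
```

With these toy sizes the factorized layer stores 4112 values instead of 8192; the savings grow as more experts share the same few basis matrices.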

[391] Marine Chlorophyll Prediction and Driver Analysis based on LSTM-RF Hybrid Models

Zhouyao Qian, Yang Chen, Baodian Li, Shuyi Zhang, Zhen Tian, Gongsen Wang, Tianyue Gu, Xinyu Zhou, Huilin Chen, Xinyi Li, Hao Zhu, Shuyao Zhang, Zongheng Li, Siyuan Wang

Main category: cs.LG

TL;DR: A hybrid LSTM-RF model is proposed for predicting marine chlorophyll concentration, outperforming standalone LSTM and RF models with higher accuracy metrics.

Details

Motivation: Accurate prediction of marine chlorophyll concentration is vital for ecosystem health monitoring and red tide warnings.

Method: A hybrid LSTM-RF model trained on multi-source ocean data (temperature, salinity, dissolved oxygen) with standardized treatment and sliding window approach.

Result: The LSTM-RF model achieved R²=0.5386, MSE=0.005806, and MAE=0.057147, outperforming standalone LSTM (R²=0.0208) and RF (R²=0.4934).

Conclusion: The hybrid model provides an innovative solution for high-frequency prediction of marine ecological variables, improving accuracy.

Abstract: Marine chlorophyll concentration is an important indicator of ecosystem health and carbon cycle strength, and its accurate prediction is crucial for red tide warning and ecological response. In this paper, we propose an LSTM-RF hybrid model that combines the advantages of LSTM and RF, addressing the deficiencies of a single model in time-series modelling and nonlinear feature representation. Trained with multi-source ocean data (temperature, salinity, dissolved oxygen, etc.), the LSTM-RF model achieves an R^2 of 0.5386, an MSE of 0.005806, and an MAE of 0.057147 on the test set, which is significantly better than using LSTM (R^2 = 0.0208) or RF (R^2 = 0.4934) alone. The standardised treatment and sliding window approach improved the prediction accuracy of the model and provided an innovative solution for high-frequency prediction of marine ecological variables.
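
The standardisation and sliding-window preprocessing mentioned above might look like the following sketch. The window length, the choice of next-step targets for every feature, and the function name `sliding_windows` are assumptions, since the summary does not give these details. How the LSTM and RF components are combined is also not specified here and is not modelled.

```python
import numpy as np

def sliding_windows(series, window):
    """Standardise a (T, F) array per feature, then return inputs of shape
    (T - window, window, F) and next-step targets of shape (T - window, F)."""
    series = np.asarray(series, dtype=float)
    mean = series.mean(axis=0)
    std = series.std(axis=0)
    z = (series - mean) / std      # zero mean, unit variance per feature
    X = np.stack([z[i:i + window] for i in range(len(z) - window)])
    y = z[window:]                 # value immediately after each window
    return X, y

# Toy data: 10 time steps, 2 features (placeholders for real ocean variables).
data = np.arange(20, dtype=float).reshape(10, 2)
X, y = sliding_windows(data, window=3)
```

In a real pipeline the mean and standard deviation would be computed on the training split only, so that test-set statistics do not leak into the inputs.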

[392] FlowState: Sampling Rate Invariant Time Series Forecasting

Lars Graf, Thomas Ortner, Stanisław Woźniak, Angeliki Pantazi

Main category: cs.LG

TL;DR: FlowState, a novel time series foundation model (TSFM), addresses generalization and efficiency issues in existing TSFMs using a state space model encoder and functional basis decoder, achieving state-of-the-art performance.

Details

Motivation: Existing TSFMs struggle with generalization across varying temporal scales, adaptability to sampling rates, and computational inefficiency.

Method: FlowState uses a state space model (SSM) encoder and functional basis decoder for continuous-time modeling and dynamic time-scale adjustment.

Result: FlowState outperforms other models on GIFT-ZS and Chronos-ZS benchmarks, with smaller size and reduced data needs.

Conclusion: FlowState’s design enables superior generalization, efficiency, and adaptability, making it a state-of-the-art TSFM.

Abstract: Foundation models (FMs) have transformed natural language processing, but their success has not yet translated to time series forecasting. Existing time series foundation models (TSFMs), often based on transformer variants, struggle with generalization across varying context and target lengths, lack adaptability to different sampling rates, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that addresses these challenges through two key innovations: a state space model (SSM) based encoder and a functional basis decoder. This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons. In contrast to other state-of-the-art TSFMs, which require training data across all possible sampling rates to memorize patterns at each scale, FlowState inherently adapts its internal dynamics to the input scale, enabling smaller models, reduced data requirements, and improved efficiency. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being the smallest model, FlowState outperforms all other models and is state-of-the-art for the GIFT-ZS and the Chronos-ZS benchmarks. Ablation studies confirm the effectiveness of its components, and we demonstrate its unique ability to adapt online to varying input sampling rates.

[393] RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders

Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou

Main category: cs.LG

TL;DR: The paper proposes using RLHF to align LLM-based CRS with implicit user feedback, improving recommendation accuracy and user satisfaction.

Details

Motivation: Traditional supervised fine-tuning fails to capture implicit user feedback like dwell time or sentiment, necessitating a better alignment method.

Method: Uses RLHF with a reward model trained on weakly-labeled engagement data and optimizes the LLM via PPO to maximize user-centric utility.

Result: RLHF-fine-tuned models outperform others in top-k accuracy, coherence, and user satisfaction on synthetic and real-world datasets.

Conclusion: Implicit signal alignment via RLHF is efficient for scalable and user-adaptive CRS design.

Abstract: Conversational recommender systems (CRS) based on Large Language Models (LLMs) need to be constantly aligned with user preferences to provide satisfying and context-relevant item recommendations. Traditional supervised fine-tuning cannot capture implicit feedback signals, e.g., dwell time, sentiment polarity, or engagement patterns. In this paper, we share a fine-tuning solution using reinforcement learning from human feedback (RLHF) to maximize implied user feedback (IUF) in a multi-turn recommendation context. We specify a reward model $R_{\phi}$ learnt on weakly-labelled engagement information and maximize user-centric utility by optimizing the foundational LLM $M_{\theta}$ through a proximal policy optimization (PPO) approach. The architecture models conversational state transitions $s_t \to a_t \to s_{t+1}$, where the action $a_t$ is associated with LLM-generated item suggestions conditioned only on the past conversation history. The evaluation across synthetic and real-world datasets (e.g., REDIAL, OpenDialKG) demonstrates that our RLHF-fine-tuned models perform better in terms of top-$k$ recommendation accuracy, coherence, and user satisfaction than the compared models. This paper shows that implicit signal alignment can be efficient in achieving scalable and user-adaptive design of CRS.

[394] Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity

Hikaru Umeda, Hideaki Iiduka

Main category: cs.LG

TL;DR: The paper explores optimal batch-size and learning-rate scheduling for efficient deep learning training, balancing computational efficiency and convergence.

Details

Motivation: The growth of deep learning models has created computational bottlenecks, and naive hyperparameter scheduling can degrade efficiency and generalization.

Method: The study uses stochastic first-order oracle (SFO) complexity to derive optimal batch-size and learning-rate growth schedules, validated through experiments.

Result: The derived schedules reduce SFO complexity, improving training efficiency without compromising convergence.

Conclusion: The work provides theoretical and practical guidelines for scalable large-batch training in deep learning.

Abstract: The unprecedented growth of deep learning models has enabled remarkable advances but introduced substantial computational bottlenecks. A key factor contributing to training efficiency is batch-size and learning-rate scheduling in stochastic gradient methods. However, naive scheduling of these hyperparameters can degrade optimization efficiency and compromise generalization. Motivated by recent theoretical insights, we investigated how the batch size and learning rate should be increased during training to balance efficiency and convergence. We analyzed this problem on the basis of stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations needed to reach an $\epsilon$-approximate stationary point of the empirical loss. We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity and validated them through extensive experiments. Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
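
To make the cost measure concrete, the sketch below counts gradient evaluations for a fixed schedule in which the batch size grows in stages. It only illustrates how SFO cost accumulates (one evaluation per sample per step). The geometric growth rule, the function names, and the numbers are assumptions for illustration, not the schedules derived in the paper.

```python
def growing_batch_schedule(b0, growth, steps_per_stage, stages):
    """Hypothetical schedule: multiply the batch size by `growth` after each stage."""
    schedule = []
    batch_size = b0
    for _ in range(stages):
        schedule.extend([batch_size] * steps_per_stage)
        batch_size *= growth
    return schedule


def sfo_cost(batch_sizes):
    """Total stochastic gradient evaluations: one per sample in every step."""
    return sum(batch_sizes)


schedule = growing_batch_schedule(b0=32, growth=2, steps_per_stage=100, stages=3)
total = sfo_cost(schedule)  # 100*32 + 100*64 + 100*128 evaluations
constant_total = sfo_cost([128] * len(schedule))  # same step count at the final batch size
```

Comparing `total` with `constant_total` shows why the schedule matters: for the same number of steps, starting small and growing uses fewer gradient evaluations than running at the largest batch size throughout.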

[395] Adaptive Batch Size and Learning Rate Scheduler for Stochastic Gradient Descent Based on Minimization of Stochastic First-order Oracle Complexity

Hikaru Umeda, Hideaki Iiduka

Main category: cs.LG

TL;DR: The paper explores how adjusting batch size and learning rate in SGD, based on critical batch size theory, improves convergence speed.

Details

Motivation: To optimize SGD convergence by leveraging theoretical insights on critical batch size and adaptive scheduling.

Method: Introduces an adaptive joint scheduler that adjusts batch size and learning rate based on observed gradient norm decay.

Result: Experiments show faster convergence compared to existing schedulers.

Conclusion: Adaptive scheduling based on critical batch size theory enhances SGD performance.

Abstract: The convergence behavior of mini-batch stochastic gradient descent (SGD) is highly sensitive to the batch size and learning rate settings. Recent theoretical studies have identified the existence of a critical batch size that minimizes stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations required to reach a stationary point of the empirical loss function in a deep neural network. An adaptive scheduling strategy is introduced to accelerate SGD that leverages theoretical findings on the critical batch size. The batch size and learning rate are adjusted on the basis of the observed decay in the full gradient norm during training. Experiments using an adaptive joint scheduler based on this strategy demonstrated improved convergence speed compared with that of existing schedulers.

[396] ASkDAgger: Active Skill-level Data Aggregation for Interactive Imitation Learning

Jelle Luijkx, Zlatan Ajanović, Laura Ferranti, Jens Kober

Main category: cs.LG

TL;DR: ASkDAgger reduces human teaching effort in interactive imitation learning by leveraging novice plans and teacher feedback through SAG, FIER, and PIER.

Details

Motivation: Human teaching effort is a bottleneck in interactive imitation learning. Existing methods waste novice plans' valuable information.

Method: ASkDAgger uses S-Aware Gating (SAG), Foresight Interactive Experience Replay (FIER), and Prioritized Interactive Experience Replay (PIER) to optimize teacher feedback and novice plans.

Result: Validated in language-conditioned manipulation tasks, ASkDAgger reduces queries, improves generalization, and speeds adaptation.

Conclusion: ASkDAgger effectively balances query frequency and failure incidence, enhancing interactive imitation learning.

Abstract: Human teaching effort is a significant bottleneck for the broader applicability of interactive imitation learning. To reduce the number of required queries, existing methods employ active learning to query the human teacher only in uncertain, risky, or novel situations. However, during these queries, the novice’s planned actions are not utilized despite containing valuable information, such as the novice’s capabilities, as well as corresponding uncertainty levels. To this end, we allow the novice to say: “I plan to do this, but I am uncertain.” We introduce the Active Skill-level Data Aggregation (ASkDAgger) framework, which leverages teacher feedback on the novice plan in three key ways: (1) S-Aware Gating (SAG): Adjusts the gating threshold to track sensitivity, specificity, or a minimum success rate; (2) Foresight Interactive Experience Replay (FIER), which recasts valid and relabeled novice action plans into demonstrations; and (3) Prioritized Interactive Experience Replay (PIER), which prioritizes replay based on uncertainty, novice success, and demonstration age. Together, these components balance query frequency with failure incidence, reduce the number of required demonstration annotations, improve generalization, and speed up adaptation to changing domains. We validate the effectiveness of ASkDAgger through language-conditioned manipulation tasks in both simulation and real-world environments. Code, data, and videos are available at https://askdagger.github.io.

[397] Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning

Yue Duan, Taicai Chen, Lei Qi, Yinghuan Shi

Main category: cs.LG

TL;DR: USP is a framework for semi-supervised continual learning (SSCL) that enhances learning plasticity, unlabeled learning, and memory stability through three strategies: FSR, DCP, and CUD. It outperforms prior methods by up to 5.94%.

Details

Motivation: To address the challenges of SSCL, including balancing unlabeled learning, memory stability, and learning plasticity, which previous methods tackled in isolation.

Method: USP uses three strategies: (1) FSR for learning plasticity, (2) DCP for unlabeled learning, and (3) CUD for memory stability.

Result: USP outperforms prior SSCL methods, achieving up to 5.94% higher accuracy.

Conclusion: USP effectively addresses SSCL challenges by synergistically improving learning plasticity, unlabeled learning, and memory stability.

Abstract: Semi-supervised continual learning (SSCL) seeks to leverage both labeled and unlabeled data in a sequential learning setup, aiming to reduce annotation costs while managing continual data arrival. SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP’s outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. Comprehensive evaluations show USP outperforms prior SSCL methods, with gains up to 5.94% in the last accuracy, validating its effectiveness. The code is available at https://github.com/NJUyued/USP4SSCL.
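
The equiangular tight frame mentioned for FSR can be illustrated with the standard simplex construction, sketched below in plain Python. This is not the paper's FSR procedure, whose details the abstract does not give. It only shows the geometry: unit-norm class prototypes whose pairwise cosine similarity is the same value, $-1/(K-1)$, for every pair. The function names are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k prototypes in R^k (one per row) forming a simplex ETF.

    Construction: sqrt(k / (k - 1)) * (I - (1/k) * ones), the standard form.
    """
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


prototypes = simplex_etf(5)
norm_sq = dot(prototypes[0], prototypes[0])  # squared norm of a prototype
cosine = dot(prototypes[0], prototypes[1])   # cosine between two distinct prototypes
```

For five classes every prototype has unit norm and every pair has cosine $-1/4$, so the prototypes are spread as far apart as possible.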

[398] Optimal Corpus Aware Training for Neural Machine Translation

Yi-Hsiu Liao, Cheng Shen, Brenda Yang

Main category: cs.LG

TL;DR: OCAT improves CAT by fine-tuning only corpus-related parameters, boosting accuracy and resilience to overfitting, with significant improvements in translation tasks.

Details

Motivation: To address the inefficiency and error-proneness of pre-defining high-quality data in CAT, OCAT fine-tunes a CAT model more effectively.

Method: OCAT fine-tunes a CAT pre-trained model by freezing most parameters and tuning only corpus-related ones.

Result: OCAT achieves +3.6 and +1.8 chrF improvements in WMT23 translation tasks and matches or surpasses other fine-tuning methods.

Conclusion: OCAT is lightweight, effective, and less sensitive to hyperparameters, making it a robust alternative to CAT.

Abstract: Corpus Aware Training (CAT) leverages valuable corpus metadata during training by injecting corpus information into each training example, and has been found effective in the literature, commonly known as the “tagging” approach. Models trained with CAT inherently learn the quality, domain and nuance between corpora directly from data, and can easily switch to different inference behavior. To achieve the best evaluation, CAT models pre-define a group of high quality data before training starts, which can be error-prone and inefficient. In this work, we propose Optimal Corpus Aware Training (OCAT), which fine-tunes a CAT pre-trained model by freezing most of the model parameters and only tuning a small set of corpus-related parameters. We show that OCAT is lightweight, resilient to overfitting, and effective in boosting model accuracy. We use WMT23 English to Chinese and English to German translation tasks as our test ground and show +3.6 and +1.8 chrF improvement, respectively, over vanilla training. Furthermore, our approach is on par with or slightly better than other state-of-the-art fine-tuning techniques while being less sensitive to hyperparameter settings.
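
A minimal sketch of the "tagging" idea follows: each training example carries a marker naming the corpus it came from. The tag format `<corpus:NAME>` and the function name are hypothetical. The abstract does not specify how the corpus information is injected, and OCAT's parameter freezing is not shown.

```python
def tag_example(source, corpus_name):
    """Prepend a corpus tag to a source sentence (tag format is an assumption)."""
    return f"<corpus:{corpus_name}> {source}"


training_pairs = [
    ("Hello world", "news"),
    ("How are you?", "subtitles"),
]
tagged = [tag_example(text, corpus) for text, corpus in training_pairs]
```

At inference time the same mechanism selects a behavior: choosing the tag of the desired corpus conditions the model on that corpus.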

[399] Latent Preference Bandits

Newton Mwai, Emil Carlsson, Fredrik D. Johansson

Main category: cs.LG

TL;DR: The paper proposes relaxing the assumptions of latent bandits by requiring only a model of action preference ordering in each latent state, allowing varied reward distributions. A posterior-sampling algorithm is introduced, showing competitive performance with latent bandits when reward scales differ.

Details

Motivation: Learning from scratch is costly for personalization tasks with limited decision points. Latent bandits reduce exploration time but require accurate joint distribution models, which are hard to find and may not fit all individuals.

Method: The paper relaxes latent bandit assumptions to require only action preference ordering models in each latent state. A posterior-sampling algorithm is proposed for this setup.

Result: The algorithm performs competitively with latent bandits when reward distributions are well-specified and outperforms them when reward scales differ for the same latent state.

Conclusion: Relaxing assumptions to focus on preference ordering allows more flexibility and better performance in scenarios with varying reward scales, making the approach practical for real-world personalization tasks.

Abstract: Bandit algorithms are guaranteed to solve diverse sequential decision-making problems, provided that a sufficient exploration budget is available. However, learning from scratch is often too costly for personalization tasks where a single individual faces only a small number of decision points. Latent bandits offer substantially reduced exploration times for such problems, given that the joint distribution of a latent state and the rewards of actions is known and accurate. In practice, finding such a model is non-trivial, and there may not exist a small number of latent states that explain the responses of all individuals. For example, patients with similar latent conditions may have the same preference in treatments but rate their symptoms on different scales. With this in mind, we propose relaxing the assumptions of latent bandits to require only a model of the preference ordering of actions in each latent state. This allows problem instances with the same latent state to vary in their reward distributions, as long as their preference orderings are equal. We give a posterior-sampling algorithm for this problem and demonstrate that its empirical performance is competitive with latent bandits that have full knowledge of the reward distribution when this is well-specified, and outperforms them when reward scales differ between instances with the same latent state.
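
The relaxed assumption can be illustrated with a small sketch: two individuals whose mean rewards sit on different scales still share the same preference ordering over actions. The reward values and the function name are made up for illustration, and the posterior-sampling algorithm itself is not shown.

```python
def preference_ordering(mean_rewards):
    """Action indices sorted from most to least preferred."""
    return sorted(range(len(mean_rewards)), key=lambda a: mean_rewards[a], reverse=True)


# Hypothetical patients with the same latent condition but different rating scales.
patient_a = [0.2, 0.9, 0.5]
patient_b = [2.0, 9.0, 5.0]

order_a = preference_ordering(patient_a)
order_b = preference_ordering(patient_b)
same_ordering = order_a == order_b
```

A model that fixes reward distributions per latent state would treat these two patients as different, whereas a model of orderings treats them as the same.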

[400] Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms

Jie Xiao, Shaoduo Gan, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai

Main category: cs.LG

TL;DR: Echo decouples RL-based post-training for LLMs into separate inference and training phases, improving efficiency on heterogeneous hardware.

Details

Motivation: Current RL systems violate SPMD assumptions by mixing inference and training workloads on the same GPU cluster.

Method: Echo uses two synchronization protocols: sequential pull for minimal bias and asynchronous push-pull for hardware efficiency.

Result: Echo matches co-located baselines in convergence and reward while utilizing edge hardware.

Conclusion: Decentralized, heterogeneous resources can achieve datacentre-grade performance for large-scale RL in LLMs.

Abstract: Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today’s distributed training systems. We present Echo, the RL system that cleanly decouples these two phases across heterogeneous “inference” and “training” swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes sampler weights on every API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training three representative RL workloads with Qwen3-4B, Qwen2.5-7B and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources.

[401] NT-ML: Backdoor Defense via Non-target Label Training and Mutual Learning

Wenjie Huo, Katinka Wolter

Main category: cs.LG

TL;DR: Proposes NT-ML, a defense against backdoor attacks in DNNs, using non-target label training and mutual learning to purify models.

Details

Motivation: DNNs are vulnerable to backdoor attacks, requiring robust defense mechanisms.

Method: NT-ML involves retraining with standard outputs (teacher-student models) and mutual learning to purify the student model.

Result: Effectively defends against 6 backdoor attacks with minimal clean samples, outperforming 5 state-of-the-art defenses.

Conclusion: NT-ML is a promising defense against advanced backdoor attacks.

Abstract: Recent studies have shown that deep neural networks (DNNs) are vulnerable to backdoor attacks, where a designed trigger is injected into the dataset, causing erroneous predictions when activated. In this paper, we propose a novel defense mechanism, Non-target label Training and Mutual Learning (NT-ML), which can successfully restore the poisoned model under advanced backdoor attacks. NT aims to reduce the harm of poisoned data by retraining the model with the outputs of the standard training. At this stage, a teacher model with high accuracy on clean data and a student model with higher confidence in correct prediction on poisoned data are obtained. Then, the teacher and student can learn the strengths from each other through ML to obtain a purified student model. Extensive experiments show that NT-ML can effectively defend against 6 backdoor attacks with a small number of clean samples, and outperforms 5 state-of-the-art backdoor defenses.

[402] Cumulative Learning Rate Adaptation: Revisiting Path-Based Schedules for SGD and Adam

Asma Atamna, Tom Maus, Fabian Kievelitz, Tobias Glasmachers

Main category: cs.LG

TL;DR: The paper revisits a 2017 adaptive learning rate method, identifies inconsistencies in its application to Adam, proposes a corrected variant, and benchmarks it against other methods to evaluate practical benefits.

Details

Motivation: To improve the utility of adaptive learning rate mechanisms in deep learning by addressing inconsistencies in existing methods and evaluating their practical benefits.

Method: Revisits a cumulative path-based adaptation scheme, proposes a corrected variant for Adam, and benchmarks it against SGD, Adam, and a recent alternative method.

Result: The corrected variant better reflects Adam’s dynamics, and benchmarking clarifies when adaptive strategies are beneficial.

Conclusion: The study provides insights into the practical utility of adaptive learning rate mechanisms and offers a corrected approach for Adam.

Abstract: The learning rate is a crucial hyperparameter in deep learning, with its ideal value depending on the problem and potentially changing during training. In this paper, we investigate the practical utility of adaptive learning rate mechanisms that adjust step sizes dynamically in response to the loss landscape. We revisit a cumulative path-based adaptation scheme proposed in 2017, which adjusts the learning rate based on the discrepancy between the observed path length, computed as a time-discounted sum of normalized gradient steps, and the expected length of a random walk. While the original approach offers a compelling intuition, we show that its adaptation mechanism for Adam is conceptually inconsistent due to the optimizer’s internal preconditioning. We propose a corrected variant that better reflects Adam’s update dynamics. To assess the practical value of online learning rate adaptation, we benchmark SGD and Adam, with and without cumulative adaptation, and compare them to a recent alternative method. Our results aim to clarify when and why such adaptive strategies offer practical benefits.

[403] MolSnap: Snap-Fast Molecular Generation with Latent Variational Mean Flow

Md Atik Ahamed, Qiang Ye, Qiang Cheng

Main category: cs.LG

TL;DR: A causality-aware framework for molecular generation from text, combining a Causality-Aware Transformer and Variational Mean Flow for high-quality, diverse, and fast generation.

Details

Motivation: Existing methods struggle with balancing quality, diversity, and speed in molecular generation from text.

Method: Proposes a Causality-Aware Transformer (CAT) for joint encoding of molecular graphs and text, and a Variational Mean Flow (VMF) framework for efficient latent space modeling.

Result: Outperforms baselines in novelty (74.5%), diversity (70.3%), and validity (100%), with efficient one-step inference.

Conclusion: The framework achieves superior performance and computational efficiency, advancing molecular generation tasks.

Abstract: Molecular generation conditioned on textual descriptions is a fundamental task in computational chemistry and drug discovery. Existing methods often struggle to simultaneously ensure high-quality, diverse generation and fast inference. In this work, we propose a novel causality-aware framework that addresses these challenges through two key innovations. First, we introduce a Causality-Aware Transformer (CAT) that jointly encodes molecular graph tokens and text instructions while enforcing causal dependencies during generation. Second, we develop a Variational Mean Flow (VMF) framework that generalizes existing flow-based methods by modeling the latent space as a mixture of Gaussians, enhancing expressiveness beyond unimodal priors. VMF enables efficient one-step inference while maintaining strong generation quality and diversity. Extensive experiments on four standard molecular benchmarks demonstrate that our model outperforms state-of-the-art baselines, achieving higher novelty (up to 74.5%), diversity (up to 70.3%), and 100% validity across all datasets. Moreover, VMF requires only one number of function evaluation (NFE) during conditional generation and up to five NFEs for unconditional generation, offering substantial computational efficiency over diffusion-based methods.

[404] Echo State Networks for Bitcoin Time Series Prediction

Mansi Sharma, Enrico Sartor, Marc Cavazza, Helmut Prendinger

Main category: cs.LG

TL;DR: The paper explores using Echo State Networks (ESNs) for forecasting cryptocurrency prices, especially during extreme volatility, and shows superior performance over existing methods.

Details

Motivation: Forecasting stock and cryptocurrency prices is difficult due to high volatility and non-stationarity, influenced by economic and market factors.

Method: The study employs Echo State Networks (ESNs) for modeling and includes chaos analysis via the Lyapunov exponent during volatile periods.

Result: ESNs outperform other machine learning methods (Boosting and Naïve) significantly, particularly in chaotic conditions.

Conclusion: ESNs are robust during chaotic periods and excel in high-volatility scenarios, making them effective for cryptocurrency forecasting.

Abstract: Forecasting stock and cryptocurrency prices is challenging due to high volatility and non-stationarity, influenced by factors like economic changes and market sentiment. Previous research shows that Echo State Networks (ESNs) can effectively model short-term stock market movements, capturing nonlinear patterns in dynamic data. To the best of our knowledge, this work is among the first to explore ESNs for cryptocurrency forecasting, especially during extreme volatility. We also conduct chaos analysis through the Lyapunov exponent in chaotic periods and show that our approach outperforms existing machine learning methods by a significant margin. Our findings are consistent with the Lyapunov exponent analysis, showing that ESNs are robust during chaotic periods and excel under high chaos compared to Boosting and Naïve methods.

[405] Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling

Yixuan Zhang, Wenxin Zhang, Hua Jiang, Quyu Kong, Feng Zhou

Main category: cs.LG

TL;DR: NegBio-VAE extends VAEs by using a negative binomial distribution for spike count modeling, improving accuracy by addressing overdispersion in neural activity.

Details

Motivation: Existing Poisson-VAEs impose rigid constraints (equal mean and variance) that don't reflect neural activity's stochastic nature.

Method: Introduces NegBio-VAE with negative binomial distribution, two ELBO optimization schemes, and differentiable reparameterization strategies.

Result: Significant gains in reconstruction fidelity, demonstrating the importance of modeling overdispersion.

Conclusion: NegBio-VAE provides a more accurate and flexible framework for neural spike train modeling.

Abstract: Biological neurons communicate through spike trains, discrete, irregular bursts of activity that exhibit variability far beyond the modeling capacity of conventional variational autoencoders (VAEs). Recent work, such as the Poisson-VAE, makes a biologically inspired move by modeling spike counts using the Poisson distribution. However, they impose a rigid constraint: equal mean and variance, which fails to reflect the true stochastic nature of neural activity. In this work, we challenge this constraint and introduce NegBio-VAE, a principled extension of the VAE framework that models spike counts using the negative binomial distribution. This shift grants explicit control over dispersion, unlocking a broader and more accurate family of neural representations. We further develop two ELBO optimization schemes and two differentiable reparameterization strategies tailored to the negative binomial setting. By introducing one additional dispersion parameter, NegBio-VAE generalizes the Poisson latent model to a negative binomial formulation. Empirical results demonstrate this minor yet impactful change leads to significant gains in reconstruction fidelity, highlighting the importance of explicitly modeling overdispersion in spike-like activations.
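
The constraint being relaxed can be stated in two lines. The mean/dispersion parameterization below (mean $\mu$, dispersion $r > 0$) is one common convention and is an assumption here; the paper's own parameterization may differ.

```latex
\text{Poisson: } \mathbb{E}[z] = \lambda,\quad \operatorname{Var}[z] = \lambda
\qquad\qquad
\text{Negative binomial: } \mathbb{E}[z] = \mu,\quad \operatorname{Var}[z] = \mu + \frac{\mu^{2}}{r}
```

Because $\mu^{2}/r > 0$, the negative binomial variance always exceeds its mean (overdispersion), and as $r \to \infty$ the variance tends to $\mu$, recovering the Poisson case.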

[406] Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees

Zuyuan Zhang, Arnob Ghosh, Tian Lan

Main category: cs.LG

TL;DR: The paper introduces CVaR-MCTS and W-MCTS to address tail-risk in MCTS, providing rigorous safety guarantees and outperforming baselines.

Details

Motivation: Existing safety-aware MCTS methods lack rigorous tail-safety guarantees, risking serious consequences in high-stake scenarios.

Method: Proposes CVaR-MCTS for explicit tail-risk control and W-MCTS to address estimation bias using Wasserstein ambiguity sets.

Result: Both methods outperform baselines, achieving robust tail-risk guarantees with improved rewards and stability.

Conclusion: The proposed solutions effectively address tail-risk in MCTS, providing safety guarantees and better performance.

Abstract: Making decisions with respect to just the expected returns in Monte Carlo Tree Search (MCTS) cannot account for the potential range of high-risk, adverse outcomes associated with a decision. To this end, safety-aware MCTS methods often consider constrained variants, introducing some form of mean risk measure or hard cost threshold. These approaches fail to provide rigorous tail-safety guarantees with respect to extreme or high-risk outcomes (denoted as tail-risk), potentially resulting in serious consequences in high-stakes scenarios. This paper addresses the problem by developing two novel solutions. We first propose CVaR-MCTS, which embeds a coherent tail risk measure, Conditional Value-at-Risk (CVaR), into MCTS. Our CVaR-MCTS with parameter $\alpha$ achieves explicit tail-risk control over the expected loss in the “worst $(1-\alpha)\%$ scenarios.” Second, we further address the estimation bias of tail-risk due to limited samples. We propose Wasserstein-MCTS (or W-MCTS) by introducing a first-order Wasserstein ambiguity set $\mathcal{P}_{\varepsilon_{s}}(s,a)$ with radius $\varepsilon_{s}$ to characterize the uncertainty in tail-risk estimates. We prove PAC tail-safety guarantees for both CVaR-MCTS and W-MCTS and establish their regret. Evaluations on diverse simulated environments demonstrate that our proposed methods outperform existing baselines, effectively achieving robust tail-risk guarantees with improved rewards and stability.
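
As a concrete reading of the tail measure, the sketch below computes an empirical CVaR from sampled losses: the mean of the worst $(1-\alpha)$ fraction of outcomes. It is a plain sample estimator and does not reproduce the paper's tree-search integration or its Wasserstein correction. The function name, the rounding guard, and the sample values are illustrative assumptions.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of losses (larger loss = worse)."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    # Round before taking the ceiling so float noise (e.g. 3.0000000000000004)
    # does not add a spurious extra sample to the tail.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


losses = [1.0, 2.0, 3.0, 4.0, 10.0, 0.5, 0.2, 0.1, 6.0, 8.0]
cvar_80 = empirical_cvar(losses, alpha=0.8)  # mean of the two largest losses
mean_loss = empirical_cvar(losses, alpha=0.0)  # alpha = 0 reduces to the plain mean
```

With few samples the tail contains only one or two points, which is the limited-sample estimation problem the abstract says W-MCTS is designed to address.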

[407] Federated Multi-Objective Learning with Controlled Pareto Frontiers

Jiansheng Rao, Jiayi Li, Zhizhi Gong, Soummya Kar, Haoxuan Li

Main category: cs.LG

TL;DR: CR-FMOL introduces a federated MOO framework with client-wise Pareto optimality via preference-cone constraints, improving fairness in FL.

Details

Motivation: Address the under-serving of minority clients in FL by enforcing client-wise Pareto optimality.

Method: Uses cone-constrained Pareto-MTL sub-problem after local FMGDA/FSMGDA steps to ensure Pareto-stationary directions for all clients.

Result: Enhances client fairness; early-stage performance lags FedAvg but expected to match accuracy with more rounds.

Conclusion: CR-FMOL is a promising approach for fairer FL, balancing client needs while maintaining performance.

Abstract: Federated learning (FL) is a widely adopted paradigm for privacy-preserving model training, but FedAvg optimises for the majority while under-serving minority clients. Existing methods such as federated multi-objective learning (FMOL) attempt to import multi-objective optimisation (MOO) into FL. However, FMOL merely delivers task-wise Pareto-stationary points, leaving client fairness to chance. In this paper, we introduce Conically-Regularised FMOL (CR-FMOL), the first federated MOO framework that enforces client-wise Pareto optimality through a novel preference-cone constraint. After local federated multi-gradient descent averaging (FMGDA) / federated stochastic multi-gradient descent averaging (FSMGDA) steps, each client transmits its aggregated task-loss vector as an implicit preference; the server then solves a cone-constrained Pareto-MTL sub-problem centred at the uniform vector, producing a descent direction that is Pareto-stationary for every client within its cone. Experiments on non-IID benchmarks show that CR-FMOL enhances client fairness, and although the early-stage performance is slightly inferior to FedAvg, it is expected to achieve comparable accuracy given sufficient training rounds.

[408] EnergyPatchTST: Multi-scale Time Series Transformers with Uncertainty Estimation for Energy Forecasting

Wei Li, Zixin Wang, Qizheng Sun, Qixiang Gao, Fenglei Yang

Main category: cs.LG

TL;DR: EnergyPatchTST improves energy time series prediction with multi-scale feature extraction, probability prediction, and future variable integration, reducing errors by 7-12%.

Details

Motivation: Existing deep learning methods struggle with multi-scale dynamics and irregular data in energy forecasting, necessitating a more robust solution.

Method: EnergyPatchTST extends Patch Time Series Transformer with multi-scale feature extraction, Monte Carlo uncertainty estimation, future variable integration, and pre-training/fine-tuning.

Result: Outperforms common methods, reducing prediction error by 7-12% and providing reliable uncertainty estimation.

Conclusion: EnergyPatchTST offers a superior approach for energy time series prediction, addressing key limitations of existing methods.

Abstract: Accurate and reliable energy time series prediction is of great significance for power generation planning and allocation. At present, deep learning time series prediction has become the mainstream method. However, the multi-scale time dynamics and the irregularity of real data limit the existing methods. Therefore, we propose EnergyPatchTST, an extension of the Patch Time Series Transformer specially designed for energy forecasting. The main innovations of our method are as follows: (1) a multi-scale feature extraction mechanism to capture patterns with different time resolutions; (2) a probability prediction framework to estimate uncertainty through Monte Carlo elimination; (3) an integration path for future known variables (such as temperature and wind conditions); and (4) pre-training and fine-tuning examples to enhance performance on limited energy data sets. A series of experiments on common energy data sets show that EnergyPatchTST is superior to other commonly used methods: the prediction error is reduced by 7-12%, and reliable uncertainty estimation is provided, which offers an important reference for time series prediction in the energy field.

[409] Group Causal Policy Optimization for Post-Training Large Language Models

Ziyin Gu, Jingyao Wang, Ran Zuo, Chuxiong Sun, Zeen Song, Changwen Zheng, Wenwen Qiang

Main category: cs.LG

TL;DR: GCPO improves upon GRPO by incorporating causal dependencies among candidate responses, enhancing prediction quality and outperforming existing methods.

Details

Motivation: Specialized domains require targeted post-training for LLMs, and GRPO's independence assumption overlooks semantic interactions among responses.

Method: Introduces a Structural Causal Model (SCM) to reveal dependencies, leading to causally informed reward adjustment and KL regularization in GCPO.

Result: GCPO outperforms GRPO and other methods across multiple reasoning benchmarks.

Conclusion: Integrating causal structure into policy optimization (GCPO) significantly improves performance in specialized domains.

Abstract: Recent advances in large language models (LLMs) have broadened their applicability across diverse tasks, yet specialized domains still require targeted post-training. Among existing methods, Group Relative Policy Optimization (GRPO) stands out for its efficiency, leveraging groupwise relative rewards while avoiding costly value function learning. However, GRPO treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction. To address this challenge, we first introduce a Structural Causal Model (SCM) that reveals hidden dependencies among candidate responses induced by conditioning on a final integrated output forming a collider structure. Then, our causal analysis leads to two insights: (1) projecting responses onto a causally informed subspace improves prediction quality, and (2) this projection yields a better baseline than query-only conditioning. Building on these insights, we propose Group Causal Policy Optimization (GCPO), which integrates causal structure into optimization through two key components: a causally informed reward adjustment and a novel KL regularization term that aligns the policy with a causally projected reference distribution. Comprehensive experimental evaluations demonstrate that GCPO consistently surpasses existing methods, including GRPO, across multiple reasoning benchmarks.
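Illustrative sketch (not from the paper): the groupwise relative reward that GRPO uses in place of a learned value function is commonly computed by standardizing each response's reward against the other responses sampled for the same prompt. The sketch below shows only that baseline step, assuming a population standard deviation and a small `eps` for stability; GCPO's causally informed reward adjustment and KL term are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Each response is scored only against the group mean and spread, so the
    responses are treated as independent draws (the assumption GCPO questions).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses with binary correctness rewards (toy values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every response gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```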

[410] Task complexity shapes internal representations and robustness in neural networks

Robert Jankowski, Filippo Radicchi, M. Ángeles Serrano, Marián Boguñá, Santo Fortunato

Main category: cs.LG

TL;DR: The paper introduces five data-agnostic probes to study how task difficulty affects neural network representations, revealing insights into robustness and topology.

Details

Motivation: To understand how neural network representations are influenced by task complexity and input data, given their opaque nature.

Method: Uses five probes (pruning, binarization, noise injection, sign flipping, bipartite network randomization) on MLPs, analyzed as bipartite graphs, for easy vs. hard tasks on MNIST and Fashion-MNIST.

Result: Hard-task models collapse under binarization, while easy-task models remain robust. Pruning reveals phase transitions, noise can enhance accuracy, and sign structure alone can maintain performance.

Conclusion: Task complexity can be measured by performance gaps in binarized/shuffled networks, highlighting the importance of signed bipartite topology for model compression and interpretability.

Abstract: Neural networks excel across a wide range of tasks, yet remain black boxes. In particular, how their internal representations are shaped by the complexity of the input data and the problems they solve remains obscure. In this work, we introduce a suite of five data-agnostic probes (pruning, binarization, noise injection, sign flipping, and bipartite network randomization) to quantify how task difficulty influences the topology and robustness of representations in multilayer perceptrons (MLPs). MLPs are represented as signed, weighted bipartite graphs from a network science perspective. We contrast easy and hard classification tasks on the MNIST and Fashion-MNIST datasets. We show that binarizing weights in hard-task models collapses accuracy to chance, whereas easy-task models remain robust. We also find that pruning low-magnitude edges in binarized hard-task models reveals a sharp phase transition in performance. Moreover, moderate noise injection can enhance accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights. Finally, preserving only the sign structure, rather than precise weight magnitudes, through bipartite network randomizations suffices to maintain high accuracy. These phenomena define a model- and modality-agnostic measure of task complexity: the performance gap between full-precision and binarized or shuffled neural network performance. Our findings highlight the crucial role of signed bipartite topology in learned representations and suggest practical strategies for model compression and interpretability that align with task complexity.

[411] Discovering Interpretable Programmatic Policies via Multimodal LLM-assisted Evolutionary Search

Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang

Main category: cs.LG

TL;DR: MLES combines multimodal large language models with evolutionary search to create interpretable, high-performance control policies, matching PPO’s efficiency while offering transparency.

Details

Motivation: Deep reinforcement learning lacks interpretability, hindering trust and deployment in safety-critical tasks. MLES aims to address this by integrating interpretability with high performance.

Method: MLES uses multimodal large language models as policy generators, evolutionary mechanisms for optimization, and visual feedback for behavior analysis and targeted improvements.

Result: MLES matches PPO’s performance in control tasks while providing transparent logic and traceable design, overcoming limitations of predefined domain-specific languages.

Conclusion: MLES is a scalable, promising approach for interpretable control policy discovery, facilitating knowledge transfer and reuse.

Abstract: Interpretability and high performance are essential goals in designing control policies, particularly for safety-critical tasks. Deep reinforcement learning has greatly enhanced performance, yet its inherent lack of interpretability often undermines trust and hinders real-world deployment. This work addresses these dual challenges by introducing a novel approach for programmatic policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as policy generators, combining them with evolutionary mechanisms for automatic policy optimization. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and facilitate targeted improvements, enhancing the efficiency of policy discovery and producing adaptable, human-aligned policies. Experimental results show that MLES achieves policy discovery capabilities and efficiency comparable to Proximal Policy Optimization (PPO) across two control tasks, while offering transparent control logic and traceable design processes. This paradigm overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various control tasks. MLES shows promise as a leading approach for the next generation of interpretable control policy discovery.

[412] Tractable Sharpness-Aware Learning of Probabilistic Circuits

Hrithik Suresh, Sahil Sidheekh, Vishnu Shreeram M. P, Sriraam Natarajan, Narayanan C. Krishnan

Main category: cs.LG

TL;DR: The paper addresses overfitting in Probabilistic Circuits (PCs) by proposing a Hessian-based regularizer, improving generalization by guiding PCs toward flatter minima.

Details

Motivation: PCs can overfit, especially with limited data, due to convergence to sharp optima. The goal is to improve generalization.

Method: A Hessian-based regularizer is introduced, leveraging the tractable trace of the Hessian of log-likelihood in PCs. This induces a gradient-norm-based regularizer.

Result: Experiments show the method consistently leads to flatter minima and better generalization on synthetic and real-world datasets.

Conclusion: The proposed regularizer effectively mitigates overfitting in PCs, enhancing their generalization performance.

Abstract: Probabilistic Circuits (PCs) are a class of generative models that allow exact and tractable inference for a wide range of queries. While recent developments have enabled the learning of deep and expressive PCs, this increased capacity can often lead to overfitting, especially when data is limited. We analyze PC overfitting from a log-likelihood-landscape perspective and show that it is often caused by convergence to sharp optima that generalize poorly. Inspired by sharpness-aware minimization in neural networks, we propose a Hessian-based regularizer for training PCs. As a key contribution, we show that the trace of the Hessian of the log-likelihood, a sharpness proxy that is typically intractable in deep neural networks, can be computed efficiently for PCs. Minimizing this Hessian trace induces a gradient-norm-based regularizer that yields simple closed-form parameter updates for EM, and integrates seamlessly with gradient-based learning methods. Experiments on synthetic and real-world datasets demonstrate that our method consistently guides PCs toward flatter minima and improves generalization performance.
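Illustrative sketch (not from the paper): the link between the Hessian trace and a gradient norm can be checked by hand on the smallest possible case, a single sum node with unconstrained weights evaluated on one data point. There the log-likelihood is log(sum_i w_i p_i), its gradient is p_i / s with s = sum_i w_i p_i, and the diagonal Hessian entries are -(p_i / s)^2, so the trace equals minus the squared gradient norm. The code below verifies this toy identity numerically; it is an assumption-laden simplification, not the paper's derivation for full circuits.

```python
import math


def log_lik(weights, child_probs):
    """Log-likelihood of one data point under a single sum node."""
    return math.log(sum(w * p for w, p in zip(weights, child_probs)))


def gradient(weights, child_probs):
    s = sum(w * p for w, p in zip(weights, child_probs))
    return [p / s for p in child_probs]


def hessian_trace_closed_form(weights, child_probs):
    """Trace of the Hessian, which here equals minus the squared gradient norm."""
    return -sum(g * g for g in gradient(weights, child_probs))


def hessian_trace_finite_diff(weights, child_probs, h=1e-4):
    """Central second differences along each weight, summed over weights."""
    base = log_lik(weights, child_probs)
    total = 0.0
    for i in range(len(weights)):
        up = list(weights)
        dn = list(weights)
        up[i] += h
        dn[i] -= h
        total += (log_lik(up, child_probs) - 2 * base + log_lik(dn, child_probs)) / h ** 2
    return total


w = [0.3, 0.7]   # sum-node weights (toy values)
p = [0.2, 0.6]   # child likelihoods of one data point (toy values)
analytic = hessian_trace_closed_form(w, p)
numeric = hessian_trace_finite_diff(w, p)
```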

[413] Competing Risks: Impact on Risk Estimation and Algorithmic Fairness

Vincent Jeanselme, Brian Tom, Jessica Barrett

Main category: cs.LG

TL;DR: The paper highlights the bias and fairness issues arising from treating competing risks as censoring in survival analysis, demonstrating systematic overestimation of risk and amplified disparities.

Details

Motivation: To address the overlooked consequences of misclassifying competing risks as censoring in survival analysis, which leads to biased estimates and exacerbates disparities.

Method: The authors formalize the problem, quantify the error in survival estimates, and develop a framework to assess predictive performance and fairness implications. Empirical analysis of cardiovascular management supports the findings.

Result: Ignoring competing risks introduces substantial bias, disproportionately affecting high-risk individuals and worsening disparities.

Conclusion: Practitioners must account for competing risks to improve accuracy, reduce disparities, and inform better decisions.

Abstract: Accurate time-to-event prediction is integral to decision-making, informing medical guidelines, hiring decisions, and resource allocation. Survival analysis, the quantitative framework used to model time-to-event data, accounts for patients who do not experience the event of interest during the study period, known as censored patients. However, many patients experience events that prevent the observation of the outcome of interest. These competing risks are often treated as censoring, a practice frequently overlooked due to a limited understanding of its consequences. Our work theoretically demonstrates why treating competing risks as censoring introduces substantial bias in survival estimates, leading to systematic overestimation of risk and, critically, amplifying disparities. First, we formalize the problem of misclassifying competing risks as censoring and quantify the resulting error in survival estimates. Specifically, we develop a framework to estimate this error and demonstrate the associated implications for predictive performance and algorithmic fairness. Furthermore, we examine how differing risk profiles across demographic groups lead to group-specific errors, potentially exacerbating existing disparities. Our findings, supported by an empirical analysis of cardiovascular management, demonstrate that ignoring competing risks disproportionately impacts the individuals most at risk of these events, potentially accentuating inequity. By quantifying the error and highlighting the fairness implications of the common practice of considering competing risks as censoring, our work provides a critical insight into the development of survival models: practitioners must account for competing risks to improve accuracy, reduce disparities in risk assessment, and better inform downstream decisions.
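Illustrative sketch (not from the paper): the overestimation can be seen on a handful of records by comparing one minus the Kaplan-Meier survival estimate, computed with competing events treated as censoring, against the Aalen-Johansen cumulative incidence, which removes subjects who had a competing event from future risk. The event coding (0 = censored, 1 = event of interest, 2 = competing event) and the toy cohort are assumptions for illustration.

```python
def risk_estimates(times, events):
    """Return (naive_risk, cumulative_incidence) for event type 1 at the last time.

    naive_risk: 1 - Kaplan-Meier, with competing events (2) treated as censoring.
    cumulative_incidence: Aalen-Johansen estimate that accounts for competing events.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv_naive = 1.0   # KM survival for event 1 only
    surv_all = 1.0     # survival free of any event, just before the current time
    cif = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for (u, e) in data if u == t]
        d1 = tied.count(1)
        d2 = tied.count(2)
        cif += surv_all * d1 / at_risk
        surv_naive *= 1 - d1 / at_risk
        surv_all *= 1 - (d1 + d2) / at_risk
        at_risk -= len(tied)
        i += len(tied)
    return 1 - surv_naive, cif


# Toy cohort of six patients (hypothetical data).
times = [1, 2, 3, 4, 5, 6]
events = [1, 2, 1, 2, 0, 1]
naive_risk, cumulative_incidence = risk_estimates(times, events)
```

On this cohort the naive estimate reaches 1.0 while the cumulative incidence is 2/3, because the naive estimator implicitly assumes that patients removed by the competing event could still go on to experience the event of interest.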

[414] Adapting Vision-Language Models Without Labels: A Comprehensive Survey

Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi, Olga Fink

Main category: cs.LG

TL;DR: A survey on unsupervised adaptation methods for Vision-Language Models (VLMs), categorizing approaches into four paradigms and analyzing methodologies, benchmarks, and future directions.

Details

Motivation: VLMs show strong generalization but underperform in specific tasks without adaptation. Unsupervised methods are data-efficient but lack a unified survey.

Method: Proposes a taxonomy for unsupervised VLM adaptation: Data-Free Transfer, Unsupervised Domain Transfer, Episodic Test-Time Adaptation, and Online Test-Time Adaptation. Analyzes methodologies and benchmarks.

Result: Provides a structured overview of the field, highlighting key paradigms and adaptation strategies.

Conclusion: Identifies open challenges and future directions, with a maintained repository for ongoing research.

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite the growing interest in this area, there remains a lack of a unified, task-oriented survey dedicated to unsupervised VLM adaptation. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at https://github.com/tim-learn/Awesome-LabelFree-VLMs.

[415] Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes

Zachary Robertson, Sanmi Koyejo

Main category: cs.LG

TL;DR: The paper introduces gaming-resistant mechanisms for evaluating AI systems without ground truth, using f-mutual information to ensure robustness against adversarial manipulation.

Details

Motivation: Current AI evaluation methods lack ground truth and are vulnerable to gaming, necessitating robust mechanisms that resist manipulation while maintaining output quality.

Method: The paper leverages f-mutual information measures, proving their uniqueness under natural conditions, and evaluates them empirically across ten domains.

Result: Information-theoretic mechanisms achieve perfect discrimination between faithful and strategic agents, outperforming LLM judges and showing 10-100x better robustness to adversarial manipulation.

Conclusion: The proposed mechanisms resist gaming, with performance peaking at a 10:1 compression ratio; a bias-variance perspective indicates when the approach is expected to be most effective.

Abstract: We develop mechanisms for evaluating AI systems without ground truth by exploiting a connection between gaming resistance and output quality. The data processing inequality ensures post-hoc attempts to game a metric degrade both information content and task performance. We prove that f-mutual information measures are the unique gaming-resistant mechanisms under natural conditions, with the overseer acting as an agent. While Shannon mutual information faces exponential sample complexity, bounded measures like total variation distance remain tractable. Empirically, across ten domains from translation to peer review, all information-theoretic mechanisms achieve perfect discrimination (d > 0.5) between faithful and strategic agents. In contrast, LLM judges exhibit systematic evaluation inversion, preferring fabricated content over accurate summaries. Our mechanisms show 10-100x better robustness to adversarial manipulation than current practices. We also find performance follows an inverted-U curve with compression ratio, peaking at 10:1 where agent responses exhibit optimal information diversity (3 effective dimensions), giving a bias-variance perspective on when our approach is expected to be most effective.
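Illustrative sketch (not from the paper): a bounded f-mutual information based on total variation distance can be computed for a discrete joint distribution as half the L1 distance between the joint and the product of its marginals. The sketch also passes one variable through a noisy channel to show the data processing inequality at work: post-processing cannot increase the measure. The joint tables and the channel are toy assumptions; the paper's estimators over LLM outputs are not reproduced here.

```python
def tvd_mutual_information(joint):
    """0.5 * sum |p(x, y) - p(x) p(y)| for a joint given as a nested list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return 0.5 * sum(
        abs(joint[i][j] - px[i] * py[j])
        for i in range(len(px))
        for j in range(len(py))
    )


def apply_channel(joint, channel):
    """Post-process Y through a row-stochastic channel: new[i][k] = sum_j joint[i][j] * channel[j][k]."""
    return [
        [sum(row[j] * channel[j][k] for j in range(len(row))) for k in range(len(channel[0]))]
        for row in joint
    ]


faithful = [[0.5, 0.0], [0.0, 0.5]]        # response perfectly tracks the source
independent = [[0.25, 0.25], [0.25, 0.25]]  # response ignores the source
noisy_channel = [[0.75, 0.25], [0.25, 0.75]]
garbled = apply_channel(faithful, noisy_channel)

mi_faithful = tvd_mutual_information(faithful)
mi_garbled = tvd_mutual_information(garbled)
mi_independent = tvd_mutual_information(independent)
```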

[416] Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai

Main category: cs.LG

TL;DR: Shuffle-R1 improves RL fine-tuning efficiency by addressing Advantage Collapsing and Rollout Silencing through dynamic trajectory sampling and batch restructuring.

Details

Motivation: Current RL pipelines for MLLMs suffer from inefficiencies due to Advantage Collapsing and Rollout Silencing, hindering long-term learning.

Method: Proposes Pairwise Trajectory Sampling and Advantage-based Trajectory Shuffle to enhance gradient signal and rollout exposure.

Result: Outperforms RL baselines on reasoning benchmarks with minimal overhead.

Conclusion: Data-centric adaptations like Shuffle-R1 are crucial for efficient RL training in MLLMs.

Abstract: Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLMs.
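Illustrative sketch (not from the paper): one way to read "selects high-contrast trajectories with large advantages" is to sort a pool of rollouts by advantage, pair the highest with the lowest, and keep only the most contrasting pairs, which discards the near-zero advantages behind Advantage Collapsing. The pairing rule and the `num_pairs` parameter below are assumptions; the paper's actual sampling procedure may differ.

```python
def pairwise_trajectory_sampling(advantages, num_pairs):
    """Pair the largest remaining advantage with the smallest, most contrasting first.

    Returns a list of (high_index, low_index) pairs into `advantages`.
    """
    order = sorted(range(len(advantages)), key=lambda i: advantages[i])
    pairs = []
    lo, hi = 0, len(order) - 1
    while lo < hi and len(pairs) < num_pairs:
        pairs.append((order[hi], order[lo]))
        lo += 1
        hi -= 1
    return pairs


# Six rollouts for one prompt; most advantages sit near zero (toy values).
rollout_advantages = [0.0, 1.2, -0.9, 0.1, -0.05, 0.02]
selected = pairwise_trajectory_sampling(rollout_advantages, num_pairs=2)
```

Here the pair (1, 2) contrasts advantages 1.2 and -0.9, and the second pair (3, 4) contrasts 0.1 and -0.05; the two rollouts closest to zero are dropped.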

[417] Prediction of Survival Outcomes under Clinical Presence Shift: A Joint Neural Network Architecture

Vincent Jeanselme, Glen Martin, Matthew Sperrin, Niels Peek, Brian Tom, Jessica Barrett

Main category: cs.LG

TL;DR: The paper proposes a multi-task recurrent neural network to model clinical presence (inter-observation time and missingness) alongside survival outcomes, improving prediction model performance and transportability.

Details

Motivation: Clinical presence (interaction between patients and healthcare systems) impacts observed outcomes in electronic health records, but is often overlooked in prediction models, limiting their transportability.

Method: A multi-task recurrent neural network jointly models inter-observation time, missingness processes, and survival outcomes.

Result: The proposed method outperforms state-of-the-art models in mortality prediction (MIMIC-III dataset) and improves transportability under clinical presence shifts.

Conclusion: Incorporating clinical presence into prediction models enhances performance and transportability, especially when deployed in new settings.

Abstract: Electronic health records arise from the complex interaction between patients and the healthcare system. This observation process of interactions, referred to as clinical presence, often impacts observed outcomes. When using electronic health records to develop clinical prediction models, it is standard practice to overlook clinical presence, impacting performance and limiting the transportability of models when this interaction evolves. We propose a multi-task recurrent neural network that jointly models the inter-observation time and the missingness processes characterising this interaction in parallel to the survival outcome of interest. Our work formalises the concept of clinical presence shift when the prediction model is deployed in new settings (e.g. different hospitals, regions or countries), and we theoretically justify why the proposed joint modelling can improve transportability under changes in clinical presence. We demonstrate, in a real-world mortality prediction task in the MIMIC-III dataset, how the proposed strategy improves performance and transportability compared to state-of-the-art prediction models that do not incorporate the observation process. These results emphasise the importance of leveraging clinical presence to improve performance and create more transportable clinical prediction models.

[418] TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution

Zhikai Zhao, Chuanbo Hua, Federico Berto, Kanghoon Lee, Zihan Ma, Jiachen Li, Jinkyoo Park

Main category: cs.LG

TL;DR: TrajEvo uses LLMs and evolutionary algorithms to automate trajectory prediction heuristics, outperforming traditional and deep learning methods, especially in OOD scenarios.

Details

Motivation: Traditional heuristics lack accuracy and generalizability, while deep learning methods are computationally expensive and less explainable. TrajEvo aims to address these gaps.

Method: TrajEvo employs an evolutionary algorithm with Cross-Generation Elite Sampling and a Statistics Feedback Loop to refine heuristics using LLMs.

Result: TrajEvo outperforms existing methods in real-world datasets and excels in OOD generalization.

Conclusion: TrajEvo advances automated, fast, explainable, and generalizable trajectory prediction heuristics, with open-source code for future research.

Abstract: Trajectory prediction is a critical task in modeling human behavior, especially in safety-critical domains such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy and generalizability. Although deep learning approaches offer improved performance, they typically suffer from high computational cost, limited explainability, and, importantly, poor generalization to out-of-distribution (OOD) scenarios. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We propose two key innovations: Cross-Generation Elite Sampling to encourage population diversity, and a Statistics Feedback Loop that enables the LLM to analyze and improve alternative predictions. Our evaluations demonstrate that TrajEvo outperforms existing heuristic methods across multiple real-world datasets, and notably surpasses both heuristic and deep learning methods in generalizing to an unseen OOD real-world dataset. TrajEvo marks a promising step toward the automated design of fast, explainable, and generalizable trajectory prediction heuristics. We release our source code to facilitate future research at https://github.com/ai4co/trajevo.
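Illustrative sketch (not from the paper): a trajectory prediction heuristic is a short, readable function from past positions to future positions. The constant-velocity rule below is a classic handcrafted example of the kind of program an LLM-driven evolutionary search could start from and then refine; it is an assumption for illustration, not a heuristic reported by TrajEvo.

```python
def constant_velocity_predict(history, horizon):
    """Extrapolate the last observed displacement for `horizon` future steps.

    history: list of (x, y) positions at uniform time steps, at least two entries.
    """
    if len(history) < 2:
        raise ValueError("need at least two past positions")
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]


# A pedestrian moving right and slightly up (toy coordinates).
future = constant_velocity_predict([(0, 0), (1, 0.5)], horizon=2)
```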

[419] Parameter-free entropy-regularized multi-view clustering with hierarchical feature selection

Kristina P. Sinaga, Sara Colantonio, Miin-Shen Yang

Main category: cs.LG

TL;DR: The paper introduces two parameter-free algorithms, AMVFCM-U and AAMVFCM-U, for multi-view clustering, using entropy regularization and SNR-based feature weighting to improve pattern discovery and computational efficiency.

Details

Motivation: Challenges in multi-view clustering include handling high-dimensional data and integrating heterogeneous views without manual tuning. Traditional methods lack principled mechanisms for these tasks.

Method: The proposed algorithms replace fuzzification parameters with entropy regularization and SNR-based feature weighting. AAMVFCM-U adds hierarchical dimensionality reduction via adaptive thresholding.

Result: The methods outperform 15 state-of-the-art techniques on five benchmarks; AAMVFCM-U achieves up to 97% computational efficiency gains, reduces dimensionality to 0.45% of the original size, and identifies critical view combinations.

Conclusion: The unified framework offers a robust, parameter-free solution for multi-view clustering, enhancing pattern discovery and efficiency.

Abstract: Multi-view clustering faces critical challenges in automatically discovering patterns across heterogeneous data while managing high-dimensional features and eliminating irrelevant information. Traditional approaches suffer from manual parameter tuning and lack principled cross-view integration mechanisms. This work introduces two complementary algorithms: AMVFCM-U and AAMVFCM-U, providing a unified parameter-free framework. Our approach replaces fuzzification parameters with entropy regularization terms that enforce adaptive cross-view consensus. The core innovation employs signal-to-noise ratio based regularization ($\delta_j^h = \frac{\bar{x}_j^h}{(\sigma_j^h)^2}$) for principled feature weighting with convergence guarantees, coupled with dual-level entropy terms that automatically balance view and feature contributions. AAMVFCM-U extends this with hierarchical dimensionality reduction operating at feature and view levels through adaptive thresholding ($\theta^{h^{(t)}} = \frac{d_h^{(t)}}{n}$). Evaluation across five diverse benchmarks demonstrates superiority over 15 state-of-the-art methods. AAMVFCM-U achieves up to 97% computational efficiency gains, reduces dimensionality to 0.45% of original size, and automatically identifies critical view combinations for optimal pattern discovery.
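Illustrative sketch (not from the paper): the two formulas quoted in the abstract can be evaluated directly on a small view. `snr_weights` computes the mean of each feature divided by its variance, and `adaptive_threshold` computes the ratio of the current feature count to the number of samples. The use of population variance, the function names, and the toy data are assumptions; how the threshold is compared against feature weights during pruning is not specified in the abstract and is left out.

```python
from statistics import mean, pvariance


def snr_weights(view):
    """Per-feature signal-to-noise ratio: feature mean divided by feature variance.

    view: list of samples, each a list of feature values for one view h.
    Features with zero variance would need special handling and are not covered.
    """
    columns = list(zip(*view))
    return [mean(col) / pvariance(col) for col in columns]


def adaptive_threshold(num_features, num_samples):
    """Threshold theta = d_h / n for a view with d_h remaining features."""
    return num_features / num_samples


# One view with three samples and two features (toy data).
view_h = [[1.0, 10.0], [3.0, 10.5], [2.0, 9.5]]
delta = snr_weights(view_h)
theta = adaptive_threshold(num_features=2, num_samples=3)
```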

[420] X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment

Qinghua Yao, Xiangrui Xu, Zhize Li

Main category: cs.LG

TL;DR: X-VFL is a new VFL framework addressing non-aligned data and missing features, enabling local inference with novel modules XCom and DS-Align, and outperforming existing methods.

Details

Motivation: Overcome limitations of VFL: perfectly aligned data and joint inference requirements.

Method: Proposes X-VFL with XCom for feature completion and DS-Align for decision subspace alignment, supporting local inference.

Result: Achieves 15% and 43% accuracy improvements on CIFAR-10 and MIMIC-III datasets, respectively.

Conclusion: X-VFL is effective for scenarios with missing features and local inference, validated by superior performance.

Abstract: Vertical Federated Learning (VFL) enables collaborative learning by integrating disjoint feature subsets from multiple clients/parties. However, VFL typically faces two key challenges: i) the requirement for perfectly aligned data samples across all clients (missing features are not allowed); ii) the requirement for joint collaborative inference/prediction involving all clients (it does not support locally independent inference on a single client). To address these challenges, we propose X-VFL, a new VFL framework designed to deal with the non-aligned data samples with (partially) missing features and to support locally independent inference of new data samples for each client. In particular, we design two novel modules in X-VFL: Cross Completion (XCom) and Decision Subspace Alignment (DS-Align). XCom can complete/reconstruct missing features for non-aligned data samples by leveraging information from other clients. DS-Align aligns local features with completed and global features across all clients within the decision subspace, thus enabling locally independent inference at each client. Moreover, we provide convergence theorems for different algorithms used in training X-VFL, showing an $O(1/\sqrt{T})$ convergence rate for SGD-type algorithms and an $O(1/T)$ rate for PAGE-type algorithms, where $T$ denotes the number of training update steps. Extensive experiments on real-world datasets demonstrate that X-VFL significantly outperforms existing methods, e.g., achieving a 15% improvement in accuracy on the image CIFAR-10 dataset and a 43% improvement on the medical MIMIC-III dataset. These results validate the practical effectiveness and superiority of X-VFL, particularly in scenarios involving partially missing features and locally independent inference.

[421] Enhancing PyKEEN with Multiple Negative Sampling Solutions for Knowledge Graph Embedding Models

Claudia d’Amato, Ivan Diliso, Nicola Fanizzi, Zafar Saeed

Main category: cs.LG

TL;DR: The paper introduces an extension for PyKEEN to integrate advanced negative sampling strategies for knowledge graph embedding, enhancing performance and flexibility.

Details

Motivation: Negative sampling is crucial for training embedding models, but existing libraries lack advanced strategies, limiting performance and customization.

Method: Developed an extension for PyKEEN with a suite of advanced negative samplers (static and dynamic) within a modular architecture.

Result: The extension improves PyKEEN’s capabilities, enabling better performance in link prediction tasks and easier customization of embedding methods.

Conclusion: The study demonstrates the impact of advanced negative sampling on embedding performance and provides insights for designing more effective strategies.

Abstract: Embedding methods have become popular due to their scalability on link prediction and/or triple classification tasks on Knowledge Graphs. Embedding models are trained relying on both positive and negative samples of triples. However, in the absence of negative assertions, these must usually be generated artificially using various negative sampling strategies, ranging from random corruption to more sophisticated techniques which have an impact on the overall performance. Most of the popular libraries for knowledge graph embedding support only basic strategies of this kind and lack advanced solutions. To address this gap, we deliver an extension for the popular KGE framework PyKEEN that integrates a suite of several advanced negative samplers (including both static and dynamic corruption strategies), within a consistent modular architecture, to generate meaningful negative samples, while remaining compatible with existing PyKEEN-based workflows and pipelines. The developed extension not only enhances PyKEEN itself but also allows for easier and comprehensive development of embedding methods and/or for their customization. As a proof of concept, we present a comprehensive empirical study of the developed extensions and their impact on the performance (link prediction tasks) of different embedding methods, which also provides useful insights for the design of more effective strategies.
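Illustrative sketch (not from the paper): the basic strategy mentioned above, random corruption, replaces the head or tail of a positive triple with a randomly drawn entity and optionally filters out candidates that are known positives. The sketch is plain Python and deliberately does not use the PyKEEN API or the extension's samplers; the function name, parameters, and toy graph are assumptions.

```python
import random


def corrupt_triples(triples, entities, num_negs=1, seed=0, filtered=True, max_tries=100):
    """Generate negatives by replacing the head or the tail with a random entity.

    With filtered=True, candidates that appear among the known positives are rejected.
    """
    rng = random.Random(seed)
    known = set(triples)
    negatives = []
    for h, r, t in triples:
        made, tries = 0, 0
        while made < num_negs and tries < max_tries:
            tries += 1
            e = rng.choice(entities)
            candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
            if filtered and candidate in known:
                continue  # would be a false negative, so draw again
            negatives.append(candidate)
            made += 1
    return negatives


# Toy knowledge graph (hypothetical entities and relations).
entities = ["a", "b", "c", "d"]
positives = [("a", "likes", "b"), ("b", "likes", "c"), ("c", "knows", "a")]
negatives = corrupt_triples(positives, entities, num_negs=2, seed=42)
```

More advanced samplers differ mainly in how the replacement entity is chosen, for example by restricting it to entities of a compatible type or by scoring candidates with the current model.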

[422] Optimizing IoT Threat Detection with Kolmogorov-Arnold Networks (KANs)

Natalia Emelianova, Carlos Kamienski, Ronaldo C. Prati

Main category: cs.LG

TL;DR: Kolmogorov-Arnold Networks (KANs) outperform traditional MLPs and match state-of-the-art models like Random Forest and XGBoost for IoT intrusion detection, with added interpretability.

Details

Motivation: Addressing IoT security concerns due to increasing cyberattacks by exploring KANs as a novel solution.

Method: Evaluating KANs with learnable activation functions against traditional MLPs and state-of-the-art models (Random Forest, XGBoost) for intrusion detection.

Result: KANs achieve competitive accuracy and superior interpretability compared to other models.

Conclusion: KANs are a promising alternative for IoT intrusion detection, balancing performance and interpretability.

Abstract: The exponential growth of the Internet of Things (IoT) has led to the emergence of substantial security concerns, with IoT networks becoming the primary target for cyberattacks. This study examines the potential of Kolmogorov-Arnold Networks (KANs) as an alternative to conventional machine learning models for intrusion detection in IoT networks. The study demonstrates that KANs, which employ learnable activation functions, outperform traditional MLPs and achieve competitive accuracy compared to state-of-the-art models such as Random Forest and XGBoost, while offering superior interpretability for intrusion detection in IoT networks.

[423] Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification

Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi

Main category: cs.LG

TL;DR: The paper explores the ‘one-poison hypothesis,’ showing that a single poison sample can successfully inject a backdoor into machine learning models with minimal impact on benign task performance, proven for linear models.

Details

Motivation: To address the open question of how much poison data is needed for successful backdoor attacks, challenging prior assumptions that require many samples or extensive data knowledge.

Method: Formulates the one-poison hypothesis, proves it for linear regression and classification, and validates with experiments on benchmark datasets.

Result: Demonstrates that a single poison sample can achieve zero backdooring-error and negligible impact on benign task performance under certain conditions.

Conclusion: The one-poison hypothesis holds for linear models, offering insights into the minimal requirements for successful backdoor attacks.

Abstract: Backdoor injection attacks are a threat to machine learning models that are trained on large data collected from untrusted sources; these attacks enable attackers to inject malicious behavior into the model that can be triggered by specially crafted inputs. Prior work has established bounds on the success of backdoor attacks and their impact on the benign learning task; however, an open question is what amount of poison data is needed for a successful backdoor attack. Typical attacks either use few samples but need much information about the data points, or need to poison many data points. In this paper, we formulate the one-poison hypothesis: An adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error and without significantly impacting the benign learning task performance. Moreover, we prove the one-poison hypothesis for linear regression and linear classification. For adversaries that utilize a direction that is unused by the benign data distribution for the poison sample, we show that the resulting model is functionally equivalent to a model where the poison was excluded from training. We build on prior work on statistical backdoor learning to show that in all other cases, the impact on the benign learning task is still limited. We also validate our theoretical results experimentally with realistic benchmark data sets.

[424] On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang

Main category: cs.LG

TL;DR: The paper introduces Dynamic Fine-Tuning (DFT), a simple but effective improvement to Supervised Fine-Tuning (SFT) for LLMs, addressing generalization issues by dynamically rescaling gradients. It outperforms SFT and competes with RL methods.

Details

Motivation: Standard SFT has limited generalization compared to RL due to problematic reward structures in its gradients. The goal is to improve SFT's generalization without complex RL methods.

Method: Proposes DFT, which dynamically rescales the objective function for each token based on its probability, stabilizing gradient updates. This is a minimal code change.

Result: DFT significantly outperforms standard SFT across benchmarks and base models, showing improved generalization. It also competes with offline RL methods.

Conclusion: DFT bridges theory and practice, advancing SFT performance with a simple, effective solution. The code will be open-sourced.

Abstract: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
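
The per-token rescaling described in the abstract can be sketched in a few lines. This is only an illustrative reading of the abstract, not the authors' released code: it assumes the rescaling factor is the model's own probability of the target token and that this factor is treated as a constant (a stop-gradient in an autograd framework). The function names `sft_loss` and `dft_loss` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sft_loss(logits, targets):
    # Standard SFT: mean negative log-likelihood of the target tokens.
    logp = log_softmax(logits)
    token_logp = logp[np.arange(len(targets)), targets]
    return -token_logp.mean()

def dft_loss(logits, targets):
    # Assumed DFT-style objective: each token's log-likelihood is rescaled
    # by that token's probability. In an autograd framework the weight
    # would be wrapped in a stop-gradient so it acts as a constant.
    logp = log_softmax(logits)
    token_logp = logp[np.arange(len(targets)), targets]
    weights = np.exp(token_logp)
    return -(weights * token_logp).mean()

# Two positions, vocabulary of four tokens, uniform predictions.
logits = np.zeros((2, 4))
targets = np.array([0, 1])
print(sft_loss(logits, targets), dft_loss(logits, targets))
```

Because the weight is a probability, each token's rescaled term is never larger than its plain SFT term, and low-probability tokens contribute the least.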

[425] Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications

Yang Li, Daniel Agyei Asante, Changsheng Zhao, Ernie Chang, Yangyang Shi, Vikas Chandra

Main category: cs.LG

TL;DR: The paper introduces a low-rank decomposition method to compress large language models (LLMs) by removing redundant components, reducing size while maintaining accuracy.

Details

Motivation: LLMs are computationally intensive and energy-demanding, making deployment on resource-limited devices or cost-effective cloud use challenging.

Method: The approach identifies and removes redundant parts of pretrained LLMs, representing weight matrices as a linear combination of base components, pruning irrelevant ones, and adding beneficial bases for specific applications.

Result: Tests on Llama 2-7b and -13B models for tasks like mathematical reasoning and code generation show significant size reduction with comparable accuracy to existing compression techniques.

Conclusion: The proposed method effectively compresses LLMs for specific applications, enabling efficient deployment without sacrificing performance.

Abstract: Large language models (LLMs) significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes it challenging to deploy them on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach to effectively compress these models, tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components not needed for particular applications. Our method focuses on identifying and removing these redundant parts, retaining only the necessary elements for the target applications. Specifically, we represent the weight matrices of LLMs as a linear combination of base components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression results on the Llama 2-7b and -13B models, conducted on target applications including mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining comparable accuracy to state-of-the-art low-rank compression techniques.

[426] Teaching LLMs How to Learn with Contextual Fine-Tuning

Younwoo Choi, Muhammad Adil Asif, Ziwen Han, John Willes, Rahul G. Krishnan

Main category: cs.LG

TL;DR: The paper explores contextual fine-tuning, a method using instructional prompts to improve LLMs’ learning and reasoning in evolving domains like medicine and finance.

Details

Motivation: To enhance LLMs' ability to learn new concepts by mimicking human cognitive strategies, addressing the challenge of rapid adaptation in dynamic fields.

Method: Introduces contextual fine-tuning, leveraging instructional prompts to guide LLMs’ learning during training, improving domain-specific knowledge and reasoning.

Result: Empirical results show improved fine-tuning efficiency and performance in medical and financial domains.

Conclusion: Contextual fine-tuning is a simple yet effective approach to enhance LLMs’ adaptability and learning in new domains.

Abstract: Prompting Large Language Models (LLMs), or providing context on the expected model of operation, is an effective way to steer the outputs of such models to satisfy human desiderata after they have been trained. But in rapidly evolving domains, there is often a need to fine-tune LLMs to improve either the kind of knowledge in their memory or their abilities to perform open-ended reasoning in new domains. When humans learn new concepts, we often do so by linking the new material that we are studying to concepts we have already learned before. To that end, we ask, “can prompting help us teach LLMs how to learn?” In this work, we study a novel generalization of instruction tuning, called contextual fine-tuning, to fine-tune LLMs. Our method leverages instructional prompts designed to mimic human cognitive strategies in learning and problem-solving to guide the learning process during training, aiming to improve the model’s interpretation and understanding of domain-specific knowledge. We empirically demonstrate that this simple yet effective modification improves the ability of LLMs to be fine-tuned rapidly on new datasets within both the medical and financial domains.

[427] Unsupervised Graph Deep Learning Reveals Emergent Flood Risk Profile of Urban Areas

Kai Yin, Junwei Ma, Ali Mostafavi

Main category: cs.LG

TL;DR: The paper introduces FloodRisk-Net, an unsupervised graph deep learning model, to address gaps in urban flood-risk assessment by capturing spatial dependencies and feature interactions. It identifies six city-specific flood-risk levels and highlights hierarchical risk distribution, with core cities bearing the highest risk.

Details

Motivation: Existing flood-risk models focus narrowly on hazard and exposure features, ignoring spatial dependencies and feature interactions, leading to incomplete risk assessment.

Method: The study uses FloodRisk-Net, an unsupervised graph deep learning model, to analyze data from U.S. metropolitan areas, capturing spatial dependencies and feature interactions for flood-risk rating.

Result: The model identifies six distinct flood-risk levels per city, reveals hierarchical spatial risk distribution, and identifies archetypes of high-risk areas. Core cities disproportionately face the highest risk.

Conclusion: The findings emphasize the need for integrated flood-risk strategies addressing spatial inequalities and feature interactions, particularly in core cities with high risk.

Abstract: Urban flood risk emerges from complex and nonlinear interactions among multiple features related to flood hazard, flood exposure, and social and physical vulnerabilities, along with the complex spatial flood dependence relationships. Existing approaches for characterizing urban flood risk, however, are primarily based on flood plain maps, focusing on a limited number of features, primarily hazard and exposure features, without consideration of feature interactions or the dependence relationships among spatial areas. To address this gap, this study presents an integrated urban flood-risk rating model based on a novel unsupervised graph deep learning model (called FloodRisk-Net). FloodRisk-Net is capable of capturing spatial dependence among areas and complex and nonlinear interactions among flood hazards and urban features for specifying emergent flood risk. Using data from multiple metropolitan statistical areas (MSAs) in the United States, the model characterizes their flood risk into six distinct city-specific levels. The model is interpretable and enables feature analysis of areas within each flood-risk level, allowing for the identification of the three archetypes shaping the highest flood risk within each MSA. Flood risk is found to be spatially distributed in a hierarchical structure within each MSA, where the core city disproportionately bears the highest flood risk. Multiple cities are found to have high overall flood-risk levels and low spatial inequality, indicating limited options for balancing urban development and flood-risk reduction. Relevant flood-risk reduction strategies are discussed considering ways that the highest flood risk and uneven spatial distribution of flood risk are formed.

[428] SincVAE: A new semi-supervised approach to improve anomaly detection on EEG data using SincNet and variational autoencoder

Andrea Pollastro, Francesco Isgrò, Roberto Prevete

Main category: cs.LG

TL;DR: Proposes SincVAE, a semi-supervised deep learning method for EEG-based seizure detection, addressing data imbalance and labeling challenges.

Details

Motivation: EEG monitoring is crucial for epilepsy, but supervised ML struggles with labeling and data imbalance. Semi-supervised methods can mitigate these issues.

Method: Uses SincVAE, a Variational Autoencoder with a bandpass filter layer, to detect seizures without preprocessing.

Result: SincVAE improves seizure detection, including preictal and postictal stages.

Conclusion: SincVAE offers a promising semi-supervised solution for EEG seizure detection, overcoming limitations of supervised methods.

Abstract: Over the past few decades, electroencephalography (EEG) monitoring has become a pivotal tool for diagnosing neurological disorders, particularly for detecting seizures. Epilepsy, one of the most prevalent neurological diseases worldwide, affects approximately 1% of the population. These patients face significant risks, underscoring the need for reliable, continuous seizure monitoring in daily life. Most of the techniques discussed in the literature rely on supervised Machine Learning (ML) methods. However, the challenge of accurately labeling variations in epileptic EEG waveforms complicates the use of these approaches. Additionally, the rarity of ictal events introduces a high imbalance within the data, which could lead to poor prediction performance in supervised learning approaches. Instead, a semi-supervised approach allows the model to be trained only on data not containing seizures, thus avoiding the issues related to data imbalance. This work proposes a semi-supervised approach for detecting epileptic seizures from EEG data, utilizing a novel Deep Learning-based method called SincVAE. This proposal incorporates the learning of an ad-hoc array of bandpass filters as the first layer of a Variational Autoencoder (VAE), potentially eliminating the preprocessing stage where informative band frequencies are identified and isolated. Results indicate that SincVAE improves seizure detection in EEG data and is capable of identifying early seizures during the preictal stage as well as monitoring patients throughout the postictal stage.
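
To make the "bandpass filter as a first layer" idea concrete, the sketch below builds one SincNet-style kernel as the difference of two windowed sinc low-pass filters. It is an illustration under assumptions, not the paper's layer: in a SincNet-style layer the two cutoff frequencies would be learnable parameters, whereas here they are fixed, and the 8-30 Hz band, 128 Hz sampling rate, kernel size, and Hamming window are all hypothetical choices.

```python
import numpy as np

def sinc_bandpass_kernel(f_low_hz, f_high_hz, fs_hz, kernel_size=129):
    # Sample indices centred on zero so the kernel is symmetric.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    f1 = f_low_hz / fs_hz   # cutoffs in cycles per sample
    f2 = f_high_hz / fs_hz
    # Ideal band-pass = low-pass(f2) - low-pass(f1).
    # np.sinc(x) is the normalised sinc, sin(pi x) / (pi x).
    ideal = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window tapers the truncated kernel to reduce ripple.
    return ideal * np.hamming(kernel_size)

def gain_at(kernel, freq_hz, fs_hz):
    # Magnitude of the kernel's frequency response at a single frequency.
    n = np.arange(len(kernel))
    return abs(np.sum(kernel * np.exp(-2j * np.pi * freq_hz * n / fs_hz)))

kernel = sinc_bandpass_kernel(8.0, 30.0, fs_hz=128.0)
for f in (0.0, 19.0, 50.0):
    print(f, round(gain_at(kernel, f, 128.0), 3))
```

Convolving a raw EEG channel with such a kernel (for example via `np.convolve(signal, kernel, mode="same")`) keeps the chosen band and suppresses the rest, which is the job a manual band-selection preprocessing step would otherwise perform.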

[429] PL-DCP: A Pairwise Learning framework with Domain and Class Prototypes for EEG emotion recognition under unseen target conditions

Guangli Li, Canbiao Wu, Zhehao Zhou, Tuo Sun, Ping Tan, Li Zhang, Zhen Liang

Main category: cs.LG

TL;DR: The paper proposes PL-DCP, a deep learning framework for EEG emotion recognition, addressing challenges like domain dependence and label noise through feature disentanglement and prototype inference. It outperforms SOTA methods on unseen target domains.

Details

Motivation: Current deep transfer learning methods for EEG emotion recognition suffer from dual domain dependence and label noise, limiting model performance and generalization.

Method: PL-DCP uses feature disentanglement to separate domain and class features, calculates dual prototypes (domain and class), and employs pairwise learning to mitigate label noise.

Result: Achieves accuracies of 82.88%, 65.15%, and 61.29% on SEED, SEED-IV, and SEED-V datasets, outperforming SOTA methods on unseen target domains.

Conclusion: PL-DCP offers a robust solution for EEG emotion recognition, especially in unseen target conditions, with potential for broader affective computing applications.

Abstract: Electroencephalogram (EEG) signals serve as a powerful tool in affective Brain-Computer Interfaces (aBCIs) and play a crucial role in affective computing. In recent years, the introduction of deep learning techniques has significantly advanced the development of aBCIs. However, current emotion recognition methods based on deep transfer learning face the challenge of the dual dependence of the model on the source domain and target domain, and are also affected by label noise, which seriously degrades the performance and generalization ability of the model. To overcome this limitation, we propose a Pairwise Learning framework with Domain and Category Prototypes for EEG emotion recognition under unseen target conditions (PL-DCP), integrating concepts of feature disentanglement and prototype inference. Here, the feature disentanglement module extracts and decouples the emotional EEG features to form domain features and class features, and further calculates the dual prototype representation. The domain prototype captures the individual variations across subjects, while the class prototype captures the cross-individual commonality of emotion categories. In addition, the pairwise learning strategy effectively reduces the noise effect caused by wrong labels. The PL-DCP framework is systematically evaluated on the public datasets SEED, SEED-IV and SEED-V, where it reaches accuracies of 82.88%, 65.15% and 61.29%, respectively. The results show that, compared with other state-of-the-art (SOTA) methods, the PL-DCP model still achieves slightly better performance than the deep transfer learning method that requires both source and target data, although the target domain is completely unseen during training. This work provides an effective and robust potential solution for emotion recognition. The source code is available at https://github.com/WuCB-BCI/PL_DCP.

[430] Guided Random Forest and its application to data approximation

Prashant Gupta, Aashi Jindal, Jayadeva, Debarka Sengupta

Main category: cs.LG

TL;DR: GRAF introduces a new ensemble classifier using global partitioning to bridge decision trees and boosting, reducing generalization error and performing well on benchmarks.

Details

Motivation: To bridge the gap between decision trees and boosting algorithms by leveraging global partitioning.

Method: Extends oblique decision trees with localized partitioning to achieve global partitioning, approximating datasets within random forests.

Result: GRAF reduces generalization error and outperforms or matches benchmarks on 115 datasets.

Conclusion: GRAF is an effective ensemble classifier combining decision trees and boosting strengths.

Abstract: We present a new way of constructing an ensemble classifier, named the Guided Random Forest (GRAF) in the sequel. GRAF extends the idea of building oblique decision trees with localized partitioning to obtain a global partitioning. We show that global partitioning bridges the gap between decision trees and boosting algorithms. We empirically demonstrate that global partitioning reduces the generalization error bound. Results on 115 benchmark datasets show that GRAF yields comparable or better results on a majority of datasets. We also present a new way of approximating the datasets in the framework of random forests.

[431] Predicting the Lifespan of Industrial Printheads with Survival Analysis

Dan Parii, Evelyne Janssen, Guangzhi Tang, Charalampos Kouzinopoulos, Marcin Pietrasik

Main category: cs.LG

TL;DR: The paper explores survival analysis techniques for predicting the lifespan of Canon Production Printing printheads, showing superior performance over baseline methods.

Details

Motivation: Accurate lifespan prediction of device components is crucial for maintenance and production optimization.

Method: Five survival analysis techniques (Kaplan-Meier, Cox model, Weibull AFT, random survival forest, gradient boosting) are applied, refined with isotonic regression, and aggregated.

Result: Survival analysis outperforms industry-standard baselines in predicting printhead lifespan, validated with real-world data.

Conclusion: Survival analysis is effective for lifespan prediction of critical components, offering reliability and accuracy.

Abstract: Accurately predicting the lifespan of critical device components is essential for maintenance planning and production optimization, making it a topic of significant interest in both academia and industry. In this work, we investigate the use of survival analysis for predicting the lifespan of production printheads developed by Canon Production Printing. Specifically, we focus on the application of five techniques to estimate survival probabilities and failure rates: the Kaplan-Meier estimator, Cox proportional hazard model, Weibull accelerated failure time model, random survival forest, and gradient boosting. The resulting estimates are further refined using isotonic regression and subsequently aggregated to determine the expected number of failures. The predictions are then validated against real-world ground truth data across multiple time windows to assess model reliability. Our quantitative evaluation using three performance metrics demonstrates that survival analysis outperforms industry-standard baseline methods for printhead lifespan prediction.
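
As a small illustration of the first technique listed, the sketch below computes a Kaplan-Meier survival curve from scratch. The toy durations and censoring flags are invented for the example and have nothing to do with the paper's printhead data; the function name `kaplan_meier` is likewise hypothetical.

```python
import numpy as np

def kaplan_meier(durations, observed):
    # durations: time until failure or until observation stopped.
    # observed:  True if the failure was seen, False if censored.
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                 # still in service just before t
        failures = np.sum((durations == t) & observed)   # failures exactly at t
        s *= 1.0 - failures / at_risk                    # product-limit update
        survival.append(s)
    return event_times, np.array(survival)

# Five hypothetical components; two were still working when observation ended.
times, surv = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
print(times)  # event times: 1, 2, 3
print(surv)   # survival estimates: 0.8, 0.6, 0.3
```

Censored units still count toward the at-risk set until they drop out, which is what lets the estimator use components that have not failed yet instead of discarding them.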

[432] Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion

Ashok Cutkosky, Harsh Mehta, Francesco Orabona

Main category: cs.LG

TL;DR: New algorithms improve complexity for non-smooth, non-convex stochastic optimization, reducing queries from $O(\epsilon^{-4}\delta^{-1})$ to $O(\epsilon^{-3}\delta^{-1})$, proven optimal. Techniques include reduction to online learning and advanced optimistic methods for deterministic cases.

Details

Motivation: To address the inefficiency in current methods for optimizing non-smooth, non-convex stochastic objectives by reducing computational complexity.

Method: A reduction from non-smooth non-convex optimization to online learning, leveraging standard regret bounds and advanced optimistic techniques for deterministic cases.

Result: Achieved optimal complexity of $O(\epsilon^{-3}\delta^{-1})$ for stochastic settings and $O(\epsilon^{-1.5}\delta^{-0.5})$ for deterministic, second-order smooth objectives.

Conclusion: The proposed techniques not only improve complexity but also recover optimal or best-known results for various settings, demonstrating broad applicability.

Abstract: We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta,\epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$ stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.

[433] Eliciting Latent Predictions from Transformers with the Tuned Lens

Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, Jacob Steinhardt

Main category: cs.LG

TL;DR: The paper introduces the ‘tuned lens’ method, an improved version of the ‘logit lens,’ to analyze transformer predictions layer by layer, showing it’s more reliable and unbiased.

Details

Motivation: To understand how transformer model predictions evolve layer by layer and improve upon the brittle 'logit lens' technique.

Method: Train affine probes for each block in a frozen pretrained model to decode hidden states into vocabulary distributions.

Result: The tuned lens outperforms the logit lens in predictability, reliability, and bias reduction, and can detect malicious inputs.

Conclusion: The tuned lens provides deeper insights into transformer behavior and is a robust tool for analyzing model predictions.

Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier “logit lens” technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
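
The difference between the two lenses can be sketched with plain matrices. This is a simplified illustration rather than the tuned-lens library's API: it omits the final layer normalization a real model applies before unembedding, the names `logit_lens`, `tuned_lens`, `A`, and `b` are hypothetical, and using the KL divergence to the final-layer distribution as the training signal is stated here as an assumption.

```python
import numpy as np

def log_softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def logit_lens(hidden, unembed):
    # Decode an intermediate hidden state directly with the unembedding matrix.
    return log_softmax(hidden @ unembed)

def tuned_lens(hidden, unembed, A, b):
    # Apply a per-layer affine probe first, then decode as above.
    return log_softmax((hidden @ A + b) @ unembed)

def kl_divergence(log_p, log_q):
    # KL(p || q) per position, computed from log-probabilities.
    return np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)

rng = np.random.default_rng(0)
d_model, vocab = 4, 6
unembed = rng.normal(size=(d_model, vocab))
hidden_mid = rng.normal(size=(3, d_model))    # hidden states at some middle layer
hidden_final = rng.normal(size=(3, d_model))  # hidden states at the last layer

final_logp = logit_lens(hidden_final, unembed)
A, b = np.eye(d_model), np.zeros(d_model)     # identity probe = plain logit lens
loss = kl_divergence(final_logp, tuned_lens(hidden_mid, unembed, A, b)).mean()
print(loss)
```

With the identity probe the tuned lens reduces to the logit lens, so training `A` and `b` per layer (with the model itself frozen) can only match or lower this divergence on the training data.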

[434] Deep Learning Methods for Detecting Thermal Runaway Events in Battery Production Lines

Athanasios Athanasopoulos, Matúš Mihalák, Marcin Pietrasik

Main category: cs.LG

TL;DR: The paper explores deep learning for detecting thermal runaway in battery production lines, using data from VDL Nedcar. Three models were evaluated, showing viability for industrial safety.

Details

Motivation: Thermal runaway in battery manufacturing poses safety risks like fires and toxic emissions, necessitating automated detection systems.

Method: Data from baseline and simulated thermal runaway conditions (using heat/smoke) was collected as optical/thermal images, preprocessed, and fused. Three deep-learning models (CNN, ResNet, ViT) were evaluated with explainability methods.

Result: Deep learning proved effective for thermal runaway detection in battery production lines.

Conclusion: The study confirms deep learning’s potential for enhancing safety in battery manufacturing by detecting thermal runaway.

Abstract: One of the key safety considerations of battery manufacturing is thermal runaway, the uncontrolled increase in temperature which can lead to fires, explosions, and emissions of toxic gasses. As such, development of automated systems capable of detecting such events is of considerable importance in both academic and industrial contexts. In this work, we investigate the use of deep learning for detecting thermal runaway in the battery production line of VDL Nedcar, a Dutch automobile manufacturer. Specifically, we collect data from the production line to represent both baseline (non thermal runaway) and thermal runaway conditions. Thermal runaway was simulated through the use of external heat and smoke sources. The data consisted of both optical and thermal images which were then preprocessed and fused before serving as input to our models. In this regard, we evaluated three deep-learning models widely used in computer vision including shallow convolutional neural networks, residual neural networks, and vision transformers on two performance metrics. Furthermore, we evaluated these models using explainability methods to gain insight into their ability to capture the relevant feature information from their inputs. The obtained results indicate that the use of deep learning is a viable approach to thermal runaway detection in battery production lines.

[435] Calibrating Deep Neural Network using Euclidean Distance

Wenhao Liang, Chang Dong, Liangwei Zheng, Wei Zhang, Weitong Chen

Main category: cs.LG

TL;DR: The paper introduces Focal Calibration Loss (FCL) to improve probability calibration in models while retaining Focal Loss benefits for hard samples.

Details

Motivation: Addressing the misalignment between predicted probabilities and actual outcomes in models, which affects reliability, especially in real-world scenarios with uncertainty.

Method: Proposes FCL, a novel loss function that minimizes Euclidean norm via a strictly proper loss, penalizing instance-wise calibration error and constraining bounds.

Result: FCL achieves state-of-the-art performance in both calibration and accuracy metrics across various models and datasets.

Conclusion: FCL effectively improves model calibration without sacrificing accuracy, validated theoretically and empirically, with potential applications in healthcare systems.

Abstract: Uncertainty is a fundamental aspect of real-world scenarios, where perfect information is rarely available. Humans naturally develop complex internal models to navigate incomplete data and effectively respond to unforeseen or partially observed events. In machine learning, Focal Loss is commonly used to reduce misclassification rates by emphasizing hard-to-classify samples. However, it does not guarantee well-calibrated predicted probabilities and may result in models that are overconfident or underconfident. High calibration error indicates a misalignment between predicted probabilities and actual outcomes, affecting model reliability. This research introduces a novel loss function called Focal Calibration Loss (FCL), designed to improve probability calibration while retaining the advantages of Focal Loss in handling difficult samples. By minimizing the Euclidean norm through a strictly proper loss, FCL penalizes the instance-wise calibration error and constrains bounds. We provide theoretical validation for the proposed method and apply it to calibrate CheXNet for potential deployment in web-based health-care systems. Extensive evaluations on various models and datasets demonstrate that our method achieves SOTA performance in both calibration and accuracy metrics.
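
One plausible reading of this construction is a focal term plus a weighted squared Euclidean distance between the predicted probability vector and the one-hot label. The sketch below follows that reading only; the exact form, the weight `lam`, and the function name `focal_calibration_loss` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def focal_calibration_loss(probs, labels, gamma=2.0, lam=1.0, eps=1e-12):
    # probs:  (n, k) predicted class probabilities, rows sum to 1.
    # labels: (n,) integer class indices.
    n, k = probs.shape
    one_hot = np.eye(k)[labels]
    p_true = np.clip(probs[np.arange(n), labels], eps, 1.0)
    # Focal term: down-weights samples the model already classifies confidently.
    focal = -((1.0 - p_true) ** gamma) * np.log(p_true)
    # Assumed calibration term: squared Euclidean distance to the one-hot
    # label, which penalises each sample's miscalibrated probabilities.
    calibration = np.sum((probs - one_hot) ** 2, axis=1)
    return np.mean(focal + lam * calibration)

probs = np.array([[0.5, 0.5]])
labels = np.array([0])
print(focal_calibration_loss(probs, labels))  # 0.25 * ln(2) + 0.5
```

Setting `lam=0` recovers the plain focal loss, so the second term is the only part that pushes the predicted probabilities toward the observed outcomes.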

[436] Explainable Clustering Beyond Worst-Case Guarantees

Maximilian Fleissner, Maedeh Zarvandi, Debarghya Ghoshdastidar

Main category: cs.LG

TL;DR: The paper explores explainable clustering, focusing on the price of explainability in decision trees for well-clustered data under a statistical mixture model setting, improving over worst-case bounds.

Details

Motivation: To address whether tighter guarantees exist for well-clustered data and if decision trees can reliably recover underlying cluster structures, moving beyond worst-case analyses.

Method: The study uses a statistical mixture model framework to analyze explainable clustering, proposing an algorithm that constructs a tree in data-independent time and extends the analysis to kernel clustering.

Result: The research demonstrates that better guarantees are feasible for well-clustered data and provides improved bounds for kernel clustering compared to existing worst-case results.

Conclusion: The findings confirm that decision trees can effectively recover cluster structures in well-clustered data, offering tighter guarantees than worst-case scenarios.

Abstract: We study the explainable clustering problem first posed by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020). The goal of explainable clustering is to fit an axis-aligned decision tree with $K$ leaves and minimal clustering cost (where every leaf is a cluster). The fundamental theoretical question in this line of work is the price of explainability, defined as the ratio between the clustering cost of the tree and the optimal cost. Numerous papers have provided worst-case guarantees on this quantity. For $K$-medians, it has recently been shown that the worst-case price of explainability is $\Theta(\log K)$. While this settles the matter from a data-agnostic point of view, two important questions remain unanswered: Are tighter guarantees possible for well-clustered data? And can we trust decision trees to recover underlying cluster structures? In this paper, we place ourselves in a statistical setting of mixture models to answer both questions. We prove that better guarantees are indeed feasible for well-clustered data. Our algorithm takes as input a mixture model and constructs a tree in data-independent time. We then extend our analysis to kernel clustering, deriving new guarantees that significantly improve over existing worst-case bounds.

[437] Probabilistic Stability Guarantees for Feature Attributions

Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong

Main category: cs.LG

TL;DR: The paper introduces soft stability and a model-agnostic stability certification algorithm (SCA) to address limitations of existing methods, offering non-trivial guarantees for any attribution method.

Details

Motivation: Existing stability certification methods rely on heavily smoothed classifiers, producing conservative guarantees, which limits their practicality.

Method: Proposes soft stability and a sample-efficient, model-agnostic SCA algorithm. Uses Boolean function analysis to characterize stability under mild smoothing.

Result: Demonstrates that mild smoothing achieves a better accuracy-stability trade-off, avoiding aggressive compromises. SCA is effective in evaluating explanation robustness.

Conclusion: Soft stability and SCA provide interpretable, non-trivial guarantees for attribution methods, improving upon prior conservative approaches.

Abstract: Stability guarantees have emerged as a principled way to evaluate feature attributions, but existing certification methods rely on heavily smoothed classifiers and often produce conservative guarantees. To address these limitations, we introduce soft stability and propose a simple, model-agnostic, sample-efficient stability certification algorithm (SCA) that yields non-trivial and interpretable guarantees for any attribution method. Moreover, we show that mild smoothing achieves a more favorable trade-off between accuracy and stability, avoiding the aggressive compromises made in prior certification methods. To explain this behavior, we use Boolean function analysis to derive a novel characterization of stability under smoothing. We evaluate SCA on vision and language tasks and demonstrate the effectiveness of soft stability in measuring the robustness of explanation methods.
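
A sampling-based certificate of this kind can be sketched as follows. This is a hedged illustration rather than the authors' SCA: it assumes that soft stability is the probability that the prediction stays unchanged when up to `radius` features outside the attribution mask are additionally revealed, and it uses a standard Hoeffding bound to pick the sample count. The sampling scheme (uniform perturbation size, then a uniform subset) and all names are assumptions made for the example.

```python
import math
import random


def hoeffding_samples(eps, delta):
    """Samples needed so the estimate is within eps of the true rate w.p. >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_stability_rate(classifier, x, mask, radius, eps=0.1, delta=0.05, seed=0):
    """Monte Carlo estimate of how often the prediction survives revealing extra features.

    `classifier` maps a masked feature list to a label; `mask` is a 0/1 list that
    marks the features selected by some attribution method.
    """
    rng = random.Random(seed)
    n = hoeffding_samples(eps, delta)
    hidden = [i for i, m in enumerate(mask) if m == 0]
    base = classifier([xi * mi for xi, mi in zip(x, mask)])
    agree = 0
    for _ in range(n):
        # Assumption: pick the perturbation size uniformly, then a uniform subset.
        k = rng.randint(0, min(radius, len(hidden)))
        extra = set(rng.sample(hidden, k))
        perturbed = [1 if (m == 1 or i in extra) else 0 for i, m in enumerate(mask)]
        agree += classifier([xi * mi for xi, mi in zip(x, perturbed)]) == base
    return agree / n


# Toy classifier that only looks at feature 0, which the mask already reveals,
# so revealing further features can never change its prediction.
x = [1.0, 2.0, 3.0, 4.0]
mask = [1, 0, 0, 0]
rate = estimate_stability_rate(lambda z: int(z[0] > 0), x, mask, radius=2)
print(rate)
```

Because the estimator only queries the classifier, it works with any model and any attribution method, which is the sense in which such a certificate is model-agnostic.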

[438] Towards Scalable Newborn Screening: Automated General Movement Assessment in Uncontrolled Settings

Daphné Chopard, Sonia Laguna, Kieran Chin-Cheong, Annika Dietz, Anna Badura, Sven Wellmann, Julia E. Vogt

Main category: cs.LG

TL;DR: The paper proposes an automated tool for classifying infant general movements (GMs) from video recordings to address the shortage of trained clinicians for manual assessment.

Details

Motivation: Manual GM assessment is limited by the scarcity of trained clinicians, necessitating an automated solution for scalable newborn screening.

Method: The study introduces a feature extraction tool and evaluates machine learning techniques for classifying GMs from infant videos, addressing challenges like variable recording length and quality.

Result: The work explores automated classification methods, though specific results are not detailed in the abstract.

Conclusion: Automated GM classification could enhance newborn screening scalability, leveraging machine learning to overcome manual assessment limitations.

Abstract: General movements (GMs) are spontaneous, coordinated body movements in infants that offer valuable insights into the developing nervous system. Assessed through the Prechtl GM Assessment (GMA), GMs are reliable predictors for neurodevelopmental disorders. However, GMA requires specifically trained clinicians, who are limited in number. To scale up newborn screening, there is a need for an algorithm that can automatically classify GMs from infant video recordings. This data poses challenges, including variability in recording length, device type, and setting, with each video coarsely annotated for overall movement quality. In this work, we introduce a tool for extracting features from these recordings and explore various machine learning techniques for automated GM classification.

[439] NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang

Main category: cs.LG

TL;DR: NoWag is a framework for compressing LLMs efficiently, outperforming state-of-the-art methods in vector quantization and pruning.

Details

Motivation: LLMs have high computational and memory demands, limiting deployment in resource-constrained environments.

Method: Proposes NoWag, a unified framework for one-shot shape-preserving compression, applied to Llama-2 and Llama-3 models using vector quantization (NoWag-VQ) and pruning (NoWag-P).

Result: NoWag-VQ outperforms existing vector quantization methods, and NoWag-P competes with leading pruning techniques.

Conclusion: NoWag highlights commonalities between compression paradigms and suggests future research directions.

Abstract: Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for one-shot shape preserving compression algorithms. We apply NoWag to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models using two popular shape-preserving techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). Our results show that NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. These findings highlight underlying commonalities between these compression paradigms and suggest promising directions for future research. Our code is available at https://github.com/LawrenceRLiu/NoWag

[440] A solvable generative model with a linear, one-step denoiser

Indranil Halder

Main category: cs.LG

TL;DR: A single-step diffusion model with a linear denoiser is analyzed, showing the Kullback-Leibler divergence’s behavior with finite diffusion time and noise scale. Key findings include the divergence’s monotonic fall phase starting when dataset size matches data dimension, and why more diffusion steps improve quality in large-scale models.

Details

Motivation: To understand the behavior of Kullback-Leibler divergence in diffusion models and explore the impact of dataset size and diffusion steps on model performance.

Method: Developed a tractable single-step diffusion model using a linear denoiser and derived an explicit formula for Kullback-Leibler divergence between generated and sampling distributions.

Result: The divergence’s monotonic fall phase begins when dataset size equals data dimension. More diffusion steps enhance quality in large-scale models.

Conclusion: The study provides theoretical insights into diffusion models, linking dataset size, data dimension, and diffusion steps to performance.

Abstract: We develop an analytically tractable single-step diffusion model based on a linear denoiser and present an explicit formula for the Kullback-Leibler divergence between the generated and sampling distribution, taken to be isotropic Gaussian, showing the effect of finite diffusion time and noise scale. Our study further reveals that the monotonic fall phase of Kullback-Leibler divergence begins when the training dataset size reaches the dimension of the data points. Finally, for large-scale practical diffusion models, we explain why a higher number of diffusion steps enhances production quality based on the theoretical arguments presented before.

[441] Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation

Hengyuan Hu, Aniket Das, Dorsa Sadigh, Nima Anari

Main category: cs.LG

TL;DR: DDPMs face inference-time bottlenecks due to sequential computation. This work leverages their connection to Stochastic Localization to introduce Autospeculative Decoding (ASD), achieving parallel speedup without auxiliary models.

Details

Motivation: Address the inference-time bottlenecks in DDPMs by optimizing their sequential computation requirements.

Method: Utilize the connection between DDPMs and Stochastic Localization to prove exchangeability of increments, enabling adaptation of autoregressive optimization techniques. Introduce ASD for DDPMs.

Result: ASD achieves a theoretical $\tilde{O}(K^{\frac{1}{3}})$ parallel runtime speedup over sequential DDPM and practical acceleration in various domains.

Conclusion: ASD effectively addresses DDPM inference bottlenecks, offering significant speedup without additional models.

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce \emph{Autospeculative Decoding} (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a $\tilde{O} (K^{\frac{1}{3}})$ parallel runtime speedup over the $K$ step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.

[442] Contrastive Representation Modeling for Anomaly Detection

Willian T. Lunardi, Abdulrahman Banabila, Dania Herzalla, Martin Andreoni

Main category: cs.LG

TL;DR: The paper proposes a structured contrastive learning method for anomaly detection, addressing challenges in compact inlier clustering, inlier-anomaly separation, and outlier diversity preservation. It outperforms standard methods in performance and convergence.

Details

Motivation: Conventional contrastive learning struggles to balance compact inlier embeddings and anomaly separation, limiting anomaly detection effectiveness.

Method: A structured contrastive objective redefines positive and negative relationships during training, enhanced by patch-based learning for localized anomalies.

Result: The method achieves faster convergence and superior performance on semantic and industrial benchmarks, even surpassing methods with explicit anomaly labels.

Conclusion: The proposed approach effectively addresses key challenges in anomaly detection, offering a robust solution without requiring anomaly labels.

Abstract: Distance-based anomaly detection methods rely on compact in-distribution (ID) embeddings that are well separated from anomalies. However, conventional contrastive learning strategies often struggle to achieve this balance, either promoting excessive variance among inliers or failing to preserve the diversity of outliers. We begin by analyzing the challenges of representation learning for anomaly detection and identify three essential properties for the pretext task: (1) compact clustering of inliers, (2) strong separation between inliers and anomalies, and (3) preservation of diversity among synthetic outliers. Building on this, we propose a structured contrastive objective that redefines positive and negative relationships during training, promoting these properties without requiring explicit anomaly labels. We extend this framework with a patch-based learning and evaluation strategy specifically designed to improve the detection of localized anomalies in industrial settings. Our approach demonstrates significantly faster convergence and improved performance compared to standard contrastive methods. It matches or surpasses anomaly detection methods on both semantic and industrial benchmarks, including methods that rely on discriminative training or explicit anomaly labels.

[443] RLSR: Reinforcement Learning from Self Reward

Toby Simonds, Kevin Lopez, Akira Yoshiyama, Dominique Garmier

Main category: cs.LG

TL;DR: LLMs can self-improve by judging their own solutions without ground truth, enabling reinforcement learning in domains with impractical rewards.

Details

Motivation: Training LLMs with reinforcement learning often requires expensive verifiable rewards, which are not feasible for all domains.

Method: LLMs self-judge solutions without reference answers, leveraging the asymmetry between generating and verifying solutions. Experiments include Countdown puzzles and integration problems.

Result: Models provide reliable reward signals without ground truth, achieving performance comparable to formal verification. Qwen 2.5 7B DeepSeek Distilled, trained with self-rewards, qualifies for the MIT Integration Bee.

Conclusion: Self-judging enables autonomous AI improvement, unlocking reinforcement learning in domains with scarce training data or complex evaluation.

Abstract: Large language models can generate solutions to complex problems, but training them with reinforcement learning typically requires verifiable rewards that are expensive to create and not possible for all domains. We demonstrate that LLMs can effectively self-improve through self-judging without reference solutions, leveraging the inherent asymmetry between generating and verifying solutions. Our experiments show that models can provide reliable reward signals without ground truth answers, enabling reinforcement learning in domains where verifiable rewards are impractical. By implementing self-judging across Countdown puzzles and integration problems, we achieve performance comparable to formal verification without ground truth solutions. Most notably, Qwen 2.5 7B DeepSeek Distilled trained with self-rewards qualifies for the prestigious MIT Integration Bee competition, a level of performance reached through self-supervised improvement. When combined with synthetic question generation, we establish a complete self-improvement loop where models generate practice problems, solve them, and evaluate their own performance without any external validation. Our findings demonstrate that LLM judges can provide effective reward signals for training, unlocking reinforcement learning in countless domains previously limited by reward engineering challenges. This work represents a significant step toward autonomous AI systems that continuously improve through self-directed learning rather than human-guided training, potentially accelerating progress across domains where training data is scarce or evaluation is complex.

[444] Can Transformers Learn Full Bayesian Inference in Context?

Arik Reuter, Tim G. J. Rudner, Vincent Fortuin, David Rügamer

Main category: cs.LG

TL;DR: Transformers can perform full Bayesian inference in-context, matching the quality of traditional methods like MCMC or variational inference.

Details

Motivation: To advance understanding of in-context learning (ICL) in transformers and demonstrate their capability for Bayesian inference without additional training.

Method: A framework combining prior fitted networks and continuous normalizing flows to infer complex posterior distributions for statistical models.

Result: ICL approach produces posterior samples comparable to state-of-the-art non-contextual methods like MCMC or variational inference.

Conclusion: Transformers can effectively perform full Bayesian inference in-context, opening new possibilities for their application in statistical modeling.

Abstract: Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context – without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows and enables us to infer complex posterior distributions for models such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods that do not operate in context. The source code for this paper is available at https://github.com/ArikReuter/ICL_for_Full_Bayesian_Inference.

[445] LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion

Yunbo Long, Liming Xu, Alexandra Brintrup

Main category: cs.LG

TL;DR: LLM-TabLogic uses LLM reasoning to ensure logical consistency in synthetic tabular data, outperforming baselines in accuracy and preserving inter-column relationships.

Details

Motivation: Existing generative models often fail to maintain domain-specific logical consistency in synthetic tabular data, limiting real-world usability.

Method: LLM-TabLogic leverages LLM reasoning to capture inter-column relationships and integrates these constraints into a Score-based Diffusion model for data generation.

Result: Achieves over 90% accuracy in logical inference and outperforms baselines in preserving data fidelity, utility, and privacy.

Conclusion: LLM-TabLogic is the first method to effectively preserve inter-column relationships without domain knowledge, advancing synthetic data generation.

Abstract: Synthetic tabular data are increasingly being used to replace real data, serving as an effective solution that simultaneously protects privacy and addresses data scarcity. However, in addition to preserving global statistical properties, synthetic datasets must also maintain domain-specific logical consistency, especially in complex systems like supply chains, where fields such as shipment dates, locations, and product categories must remain logically consistent for real-world usability. Existing generative models often overlook these inter-column relationships, leading to unreliable synthetic tabular data in real-world applications. To address these challenges, we propose LLM-TabLogic, a novel approach that leverages Large Language Model reasoning to capture and compress the complex logical relationships among tabular columns, while these conditional constraints are passed into a Score-based Diffusion model for data generation in latent space. Through extensive experiments on real-world industrial datasets, we evaluate LLM-TabLogic for column reasoning and data generation, comparing it with five baselines including SMOTE and state-of-the-art generative models. Our results show that LLM-TabLogic demonstrates strong generalization in logical inference, achieving over 90% accuracy on unseen tables. Furthermore, our method outperforms all baselines in data generation by fully preserving inter-column relationships while maintaining the best balance between data fidelity, utility, and privacy. This study presents the first method to effectively preserve inter-column relationships in synthetic tabular data generation without requiring domain knowledge, offering new insights for creating logically consistent real-world tabular data.

[446] Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao

Main category: cs.LG

TL;DR: The paper addresses the Cross-Modality Competency Problem in Multimodal Large Language Models (MLLMs), where models struggle to distinguish task-relevant signals, leading to Modality Interference. A perturbation-based framework is proposed to improve robustness and performance.

Details

Motivation: MLLMs often fail to fairly evaluate all modalities, especially in tasks like VQA, leading to susceptibility to misleading inputs. This limits their reliability in unimodal and multimodal tasks.

Method: A perturbation-based causal diagnostic experiment is designed to quantify Modality Interference. The proposed framework includes perturbation-based data augmentations (heuristic and adversarial via PGD) and consistency regularization for model outputs.

Result: Experiments on benchmark datasets show significant improvements in robustness and cross-modality competency, enhancing both unimodal reasoning and multimodal task performance.

Conclusion: The proposed framework effectively mitigates Modality Interference, improving MLLMs’ ability to handle unimodal and multimodal tasks robustly.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model’s inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering, where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the cross-modality competency problem. We further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentations with both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), and a consistency regularization strategy applied to model outputs with original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method’s effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.

[447] Task Vector Quantization for Memory-Efficient Model Merging

Youngeun Kim, Seunghwan Lee, Aecheon Jung, Bogon Ryu, Sungeun Hong

Main category: cs.LG

TL;DR: The paper proposes quantizing task vectors (differences between pre-trained and fine-tuned checkpoints) to reduce memory usage in model merging, introducing Residual Task Vector Quantization for ultra-low bit precision.

Details

Motivation: Storing multiple task-specific checkpoints consumes significant memory, which limits scalability and hinders extending model merging to larger models and more diverse tasks.

Method: Quantize task vectors (narrow weight range allows low precision, e.g., 4 bit) and introduce Residual Task Vector Quantization for ultra-low bit precision (e.g., 2 bit), allocating bits based on sensitivity.

Result: Maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints.

Conclusion: Quantizing task vectors is efficient for memory reduction in model merging, with Residual Task Vector Quantization mitigating errors in ultra-low bit precision.

Abstract: Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to larger models and diverse tasks. In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. We observe that task vectors exhibit a narrow weight range, enabling low precision quantization (e.g., 4 bit) within existing task vector merging frameworks. To further mitigate quantization errors within ultra-low bit precision (e.g., 2 bit), we introduce Residual Task Vector Quantization, which decomposes the task vector into a base vector and offset component. We allocate bits based on quantization sensitivity, ensuring precision while minimizing error within a memory budget. Experiments on image classification and dense prediction show our method maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints.
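
The narrow-range observation can be illustrated with a small sketch. It assumes a plain uniform min-max quantizer and made-up weight values, and it does not cover the residual (base plus offset) variant; none of this is taken from the authors' code. It compares quantizing the fine-tuned weights directly against quantizing only the task vector and adding it back to the full-precision pre-trained weights.

```python
def quantize(values, bits):
    """Uniform min-max quantization followed by dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to quantize
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical weights: fine-tuning moves each weight only slightly.
pretrained = [-0.80, -0.35, 0.10, 0.42, 0.90]
finetuned = [-0.79, -0.37, 0.12, 0.41, 0.93]
task_vector = [f - p for f, p in zip(finetuned, pretrained)]

# Option A: quantize the fine-tuned checkpoint itself (wide value range).
direct = quantize(finetuned, bits=4)

# Option B: quantize the task vector (narrow range) and add it to the
# full-precision pre-trained weights, which are stored once and shared.
via_task = [p + q for p, q in zip(pretrained, quantize(task_vector, bits=4))]

err_direct = max_abs_error(direct, finetuned)
err_via_task = max_abs_error(via_task, finetuned)
print(err_direct, err_via_task)
```

With the same 4-bit budget, the quantization step size scales with the value range, so the task vector (range 0.05 here) is reconstructed far more precisely than the full checkpoint (range 1.72).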

[448] Energy Optimized Piecewise Polynomial Approximation Utilizing Modern Machine Learning Optimizers

Hannes Waclawek, Stefan Huber

Main category: cs.LG

TL;DR: Extends machine learning-optimized piecewise polynomial approximation by adding energy optimization, using TensorFlow for smoother cam profiles.

Details

Motivation: Traditional methods lack flexibility for complex optimization goals like energy efficiency.

Method: Uses gradient descent optimizers in TensorFlow to minimize elastic strain energy in cam profiles.

Result: Achieves smoother motion and Pareto-efficient trade-offs between approximation quality and energy consumption.

Conclusion: The framework effectively balances energy optimization with approximation quality, offering practical benefits.

Abstract: This work explores an extension of machine learning-optimized piecewise polynomial approximation by incorporating energy optimization as an additional objective. Traditional closed-form solutions enable continuity and approximation targets but lack flexibility in accommodating complex optimization goals. By leveraging modern gradient descent optimizers within TensorFlow, we introduce a framework that minimizes elastic strain energy in cam profiles, leading to smoother motion. Experimental results confirm the effectiveness of this approach, demonstrating its potential to Pareto-efficiently trade approximation quality against energy consumption.

[449] Towards Scalable Bayesian Optimization via Gradient-Informed Bayesian Neural Networks

Georgios Makrygiorgos, Joshua Hang Sai Ip, Ali Mesbah

Main category: cs.LG

TL;DR: The paper proposes integrating gradient information into Bayesian neural networks (BNNs) for Bayesian optimization (BO), improving surrogate models and accelerating convergence.

Details

Motivation: While Gaussian processes (GPs) with gradients enhance BO, BNNs' potential for gradient-informed optimization remains unexplored.

Method: A gradient-informed loss function for BNN training is introduced, combining function and gradient observations via automatic differentiation.

Result: The approach improves BNN predictions and speeds up BO convergence, especially with higher-dimensional problems.

Conclusion: Gradient-informed BNNs offer a scalable and flexible alternative to GPs in BO, enhancing performance with gradient data.

Abstract: Bayesian optimization (BO) is a widely used method for data-driven optimization that generally relies on zeroth-order data of the objective function to construct probabilistic surrogate models. These surrogates guide the exploration-exploitation process toward finding the global optimum. While Gaussian processes (GPs) are commonly employed as surrogates of the unknown objective function, recent studies have highlighted the potential of Bayesian neural networks (BNNs) as scalable and flexible alternatives. Moreover, incorporating gradient observations into GPs, when available, has been shown to improve BO performance. However, the use of gradients within BNN surrogates remains unexplored. By leveraging automatic differentiation, gradient information can be seamlessly integrated into BNN training, resulting in more informative surrogates for BO. We propose a gradient-informed loss function for BNN training, effectively augmenting function observations with local gradient information. The effectiveness of this approach is demonstrated on well-known benchmarks in terms of improved BNN predictions and faster BO convergence as the number of decision variables increases.
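
The idea of augmenting function observations with gradient observations can be sketched with a deliberately simple stand-in. The example below is an assumption-laden illustration, not the paper's loss: it uses a deterministic one-dimensional quadratic model instead of a BNN, an analytic input derivative instead of automatic differentiation, and a hypothetical `weight` parameter to balance the two terms.

```python
def predict(w, x):
    """Toy surrogate f(x) = w0 + w1 * x + w2 * x^2 (stand-in for a BNN mean)."""
    return w[0] + w[1] * x + w[2] * x * x


def predict_grad(w, x):
    """Derivative of the surrogate with respect to its input x."""
    return w[1] + 2.0 * w[2] * x


def gradient_informed_loss(w, xs, ys, gs, weight=1.0):
    """Mean squared error on function values plus weighted error on gradients."""
    n = len(xs)
    value_term = sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / n
    grad_term = sum((predict_grad(w, x) - g) ** 2 for x, g in zip(xs, gs)) / n
    return value_term + weight * grad_term


# Observations of the objective y = x^2 and its gradient dy/dx = 2x.
xs = [-1.0, 1.0]
ys = [1.0, 1.0]
gs = [-2.0, 2.0]

w_true = [0.0, 0.0, 1.0]   # the actual quadratic
w_const = [1.0, 0.0, 0.0]  # a constant that fits both observed values exactly

print(gradient_informed_loss(w_true, xs, ys, gs))
print(gradient_informed_loss(w_const, xs, ys, gs))
print(gradient_informed_loss(w_const, xs, ys, gs, weight=0.0))
```

With `weight=0.0` the constant model is indistinguishable from the true quadratic on these two points; the gradient term is what separates them, which is the sense in which gradient data makes the surrogate more informative.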

[450] A Structure-Preserving Framework for Solving Parabolic Partial Differential Equations with Neural Networks

Gaohang Chen, Lili Ju, Zhonghua Qiao

Main category: cs.LG

TL;DR: The paper introduces “Sidecar,” a framework to enhance physical consistency in neural network (NN) solvers for parabolic PDEs by preserving intrinsic properties like mass and momentum conservation.

Details

Motivation: Existing NN solvers for PDEs often neglect physical properties, leading to nonphysical or unstable solutions, especially in long-term simulations.

Method: The Sidecar framework uses a small copilot network to guide the primary NN solver, ensuring structure-preserving properties are respected.

Result: Experiments show improved accuracy and better preservation of physical properties in benchmark problems.

Conclusion: Sidecar effectively enhances NN solvers for parabolic PDEs by incorporating physical consistency, improving stability and accuracy.

Abstract: Solving partial differential equations (PDEs) with neural networks (NNs) has shown great potential in various scientific and engineering fields. However, most existing NN solvers mainly focus on satisfying the given PDE formulas in the strong or weak sense, without explicitly considering some intrinsic physical properties, such as mass and momentum conservation, or energy dissipation. This limitation may result in nonphysical or unstable numerical solutions, particularly in long-term simulations. To address this issue, we propose “Sidecar”, a novel framework that enhances the physical consistency of existing NN solvers for solving parabolic PDEs. Inspired by the time-dependent spectral renormalization approach, our Sidecar framework introduces a small network as a copilot, guiding the primary function-learning NN solver to respect the structure-preserving properties. Our framework is highly flexible, allowing the preservation of various physical quantities for different PDEs to be incorporated into a wide range of NN solvers. Experimental results on some benchmark problems demonstrate significant improvements brought by the proposed framework to both accuracy and structure preservation of existing NN solvers.

[451] How Effective are Large Time Series Models in Hydrology? A Study on Water Level Forecasting in Everglades

Rahuul Rangaraj, Jimeng Shi, Azam Shirali, Rajendra Paudel, Yanzhao Wu, Giri Narasimhan

Main category: cs.LG

TL;DR: The study explores large time series models for water level prediction in the Everglades, finding the foundation model Chronos outperforms others, while task-specific models vary in performance.

Details

Motivation: Traditional methods for water level prediction in the Everglades face computational and adaptability challenges, prompting exploration of advanced time series models.

Method: Twelve task-specific models and five time series foundation models across six categories were evaluated for water level prediction.

Result: Chronos significantly outperformed other models, while other foundation models performed poorly. Task-specific models’ performance varied by architecture.

Conclusion: The study highlights the potential of large time series models in hydrology and encourages further exploration in environmental applications.

Abstract: The Everglades play a crucial role in flood and drought regulation, water resource planning, and ecosystem management in the surrounding regions. However, traditional physics-based and statistical methods for predicting water levels often face significant challenges, including high computational costs and limited adaptability to diverse or unforeseen conditions. Recent advancements in large time series models have demonstrated the potential to address these limitations, with state-of-the-art deep learning and foundation models achieving remarkable success in time series forecasting across various domains. Despite this progress, their application to critical environmental systems, such as the Everglades, remains underexplored. In this study, we fill the gap by investigating twelve task-specific models and five time series foundation models across six categories for a real-world application focused on water level prediction in the Everglades. Our primary results show that the foundation model Chronos significantly outperforms all other models while the remaining foundation models exhibit relatively poor performance. We also noticed that the performance of task-specific models varies with the model architectures, and discussed the possible reasons. We hope our study and findings will inspire the community to explore the applicability of large time series models in hydrological applications. The code and data are available at https://github.com/rahuul2992000/Everglades-Benchmark.

[452] JULI: Jailbreak Large Language Models by Self-Introspection

Jesson Wang, Zhanhao Hu, David Wagner

Main category: cs.LG

TL;DR: JULI is a method to jailbreak LLMs by manipulating token log probabilities using BiasNet, requiring only top-5 token log probabilities, and outperforms existing SOTA approaches.

Details

Motivation: Existing attacks on safety-aligned LLMs often require model weights or generation process access, which proprietary API-calling models restrict. JULI addresses this limitation.

Method: JULI jailbreaks LLMs by manipulating token log probabilities using a plug-in block called BiasNet, relying solely on top-5 token log probabilities.

Result: JULI effectively jailbreaks API-calling LLMs in a black-box setting, outperforming existing methods across multiple metrics.

Conclusion: JULI provides a superior and practical approach to jailbreaking LLMs without needing model weights or generation process access.

Abstract: Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models accessed through API calls do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM’s predicted token log probabilities. It can effectively jailbreak API-calling LLMs in a black-box setting, knowing only the top-5 token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.

[453] Label Leakage in Federated Inertial-based Human Activity Recognition

Marius Bock, Maximilian Hopp, Kristof Van Laerhoven, Michael Moeller

Main category: cs.LG

TL;DR: The paper evaluates label reconstruction attacks in Federated Learning for Human Activity Recognition (HAR), revealing high leakage risks and limited protection from Local Differential Privacy techniques.

Details

Motivation: To assess the vulnerability of HAR systems to gradient-based label leakage attacks due to the sensitivity of activity labels.

Method: Evaluates state-of-the-art gradient-based label leakage attacks on HAR benchmark datasets, considering factors like class count, sampling, and imbalance.

Result: Reconstruction accuracies reach well above 90% on two benchmark datasets, even for trained models; Local Differential Privacy techniques (gradient noise and clipping) provide only limited protection.

Conclusion: Recommends privacy-aware deployment strategies for federated HAR systems and highlights open research challenges.

Abstract: While prior work has shown that Federated Learning updates can leak sensitive information, label reconstruction attacks, which aim to recover input labels from shared gradients, have not yet been examined in the context of Human Activity Recognition (HAR). Given the sensitive nature of activity labels, this study evaluates the effectiveness of state-of-the-art gradient-based label leakage attacks on HAR benchmark datasets. Our findings show that the number of activity classes, sampling strategy, and class imbalance are critical factors influencing the extent of label leakage, with reconstruction accuracies reaching well above 90% on two benchmark datasets, even for trained models. Moreover, we find that Local Differential Privacy techniques such as gradient noise and clipping offer only limited protection, as certain attacks still reliably infer both majority and minority class labels. We conclude by offering practical recommendations for the privacy-aware deployment of federated HAR systems and identify open challenges for future research. Code to reproduce our experiments is publicly available via github.com/mariusbock/leakage_har.
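
Illustrative sketch (not taken from the paper): one well-known reason labels can leak from shared gradients is that, for a single sample trained with softmax cross-entropy, the gradient with respect to the final-layer bias equals the predicted probabilities minus the one-hot label, so only the true class has a negative entry. The attacks evaluated in the paper handle batches and are more involved; the six classes, the random logits, and all function names below are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def last_layer_bias_gradient(logits, label):
    """Gradient of softmax cross-entropy w.r.t. the final-layer bias (one sample)."""
    grad = softmax(logits)
    grad[label] -= 1.0         # p - onehot(label)
    return grad

def infer_label(bias_grad):
    # Only the true class has a negative entry, since p_y - 1 < 0.
    return int(np.argmin(bias_grad))

rng = np.random.default_rng(0)
logits = rng.normal(size=6)    # six hypothetical activity classes
true_label = 4
shared_grad = last_layer_bias_gradient(logits, true_label)
recovered = infer_label(shared_grad)
```

With batched updates the per-sample gradients are averaged, which is where factors such as class count and class imbalance, highlighted in the abstract, start to matter.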

[454] AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling

Preslav Aleksandrov, Meghdad Kurmanji, Fernando Garcia Redondo, David O’Shea, William Shen, Alex Iacob, Lorenzo Sani, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane

Main category: cs.LG

TL;DR: AbbIE is a recursive Transformer variant that improves perplexity and allows compute to be scaled dynamically at test time, outperforming standard and iterative methods on zero-shot in-context learning tasks.

Details

Motivation: To enhance Transformer performance without relying on parameter/token scaling or specialized datasets.

Method: Recursive, block-based iterative encoder in latent space, trained with 2 iterations but generalizes to longer ones.

Result: Up to 12% improvement on zero-shot in-context learning tasks and up to 5% improvement in perplexity; compute scales dynamically at test time.

Conclusion: AbbIE offers a new way to scale Transformer performance beyond traditional methods.

Abstract: We introduce the Autoregressive Block-Based Iterative Encoder (AbbIE), a novel recursive generalization of the encoder-only Transformer architecture, which achieves better perplexity than a standard Transformer and allows for the dynamic scaling of compute resources at test time. This simple, recursive approach is a complement to scaling large language model (LLM) performance through parameter and token counts. AbbIE performs its iterations in latent space, but unlike latent reasoning models, does not require a specialized dataset or training protocol. We show that AbbIE upward generalizes (ability to generalize to arbitrary iteration lengths) at test time by only using 2 iterations during train time, far outperforming alternative iterative methods. AbbIE’s ability to scale its computational expenditure based on the complexity of the task gives it an up to 12% improvement in zero-shot in-context learning tasks versus other iterative and standard methods and up to 5% improvement in language perplexity. The results from this study open a new avenue to Transformer performance scaling. We perform all of our evaluations on model sizes up to 350M parameters.

[455] Explainable Evidential Clustering

Victor F. Lopes de Souza, Karima Bakhti, Sofiane Ramdani, Denis Mottet, Abdelhak Imoussaten

Main category: cs.LG

TL;DR: The paper addresses the challenge of explaining evidential clustering results, proposing the Iterative Evidential Mistake Minimization (IEMM) algorithm for interpretable decision tree explanations.

Details

Motivation: Real-world data imperfections like uncertainty and imprecision are poorly handled by traditional methods, necessitating explainable evidential clustering for high-stakes domains like healthcare.

Method: The paper introduces representativity as a key condition for decision trees to explain evidential clustering, generalizes it with utility functions for partial labeling, and proposes the IEMM algorithm.

Result: The IEMM algorithm provides satisfactory explanations up to 93% of the time, validated on synthetic and real-world data.

Conclusion: The study successfully bridges the gap in explaining evidential clustering, offering practical and interpretable solutions for real-world applications.

Abstract: Unsupervised classification is a fundamental machine learning problem. Real-world data often contain imperfections, characterized by uncertainty and imprecision, which are not well handled by traditional methods. Evidential clustering, based on Dempster-Shafer theory, addresses these challenges. This paper addresses the underexplored problem of explaining evidential clustering results, which is crucial for high-stakes domains such as healthcare. Our analysis shows that, in the general case, representativity is a necessary and sufficient condition for decision trees to serve as abductive explainers. Building on the concept of representativity, we generalize this idea to accommodate partial labeling through utility functions. These functions enable the representation of “tolerable” mistakes, leading to the definition of evidential mistakeness as explanation cost and the construction of explainers tailored to evidential classifiers. Finally, we propose the Iterative Evidential Mistake Minimization (IEMM) algorithm, which provides interpretable and cautious decision tree explanations for evidential clustering functions. We validate the proposed algorithm on synthetic and real-world data. Taking into account the decision-maker’s preferences, we were able to provide an explanation that was satisfactory up to 93% of the time.

[456] Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning

Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan

Main category: cs.LG

TL;DR: TASKPGM is a framework for optimizing training mixtures for LLMs by minimizing an energy function over an MRF, balancing task diversity and representativeness.

Details

Motivation: Current methods for selecting task mixtures in LLM finetuning are manual and heuristic, lacking principled optimization.

Method: Models task relationships with behavioral divergences (e.g., Jensen-Shannon divergence, pointwise mutual information) computed from single-task finetuned models, and derives a closed-form solution under simplex constraints.

Result: Empirical improvements on models like Llama 2 and Mistral across benchmarks (MMLU, BIGBench), with theoretical guarantees.

Conclusion: TASKPGM offers a scalable, interpretable, and robust approach for LLM finetuning.

Abstract: The performance of finetuned large language models (LLMs) hinges critically on the composition of the training mixture. However, selecting an optimal blend of task datasets remains a largely manual, heuristic-driven process, with practitioners often relying on uniform or size-based sampling strategies. We introduce TASKPGM, a principled and scalable framework for mixture optimization that selects continuous task proportions by minimizing an energy function over a Markov Random Field (MRF). Task relationships are modeled using behavioral divergences such as Jensen-Shannon Divergence and Pointwise Mutual Information computed from the predictive distributions of single-task finetuned models. Our method yields a closed-form solution under simplex constraints and provably balances representativeness and diversity among tasks. We provide theoretical guarantees, including weak submodularity for budgeted variants, and demonstrate consistent empirical improvements on Llama 2 and Mistral across evaluation suites such as MMLU and BIGBench. Beyond performance, TASKPGM offers interpretable insights into task influence and mixture composition, making it a powerful tool for efficient and robust LLM finetuning.

[457] Diffusion Beats Autoregressive in Data-Constrained Settings

Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak

Main category: cs.LG

TL;DR: Diffusion models outperform autoregressive (AR) models in data-scarce settings due to better data utilization and implicit augmentation.

Details

Motivation: To explore the advantages of diffusion-based language models over AR models, especially in data-constrained scenarios.

Method: Systematic study of masked diffusion models in data-constrained settings, comparing them with AR models.

Result: Diffusion models achieve lower validation loss and better downstream performance when compute is abundant but data is scarce.

Conclusion: Diffusion models are a compelling alternative to AR models when data is the bottleneck, not compute.

Abstract: Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR’s fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
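
Illustrative sketch (not the authors' code): the "implicit data augmentation" interpretation can be made concrete by counting the distinct prediction tasks one sequence yields over repeated epochs. The masking scheme below (draw a mask ratio uniformly, then mask each token independently) is a common simplification and is assumed here, as are the toy sentence and all function names.

```python
import random

def ar_tasks(tokens):
    """Fixed left-to-right factorization: predict token i from tokens[:i]."""
    return [(tuple(tokens[:i]), i) for i in range(len(tokens))]

def masked_diffusion_task(tokens, rng):
    """Sample a mask ratio, mask tokens independently, predict the masked positions."""
    ratio = rng.random()
    return frozenset(i for i in range(len(tokens)) if rng.random() < ratio)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)
epochs = 200                     # repeated passes over the same sequence

ar_seen, md_seen = set(), set()
for _ in range(epochs):
    ar_seen.update(ar_tasks(tokens))                # identical tasks every epoch
    md_seen.add(masked_diffusion_task(tokens, rng)) # a fresh mask pattern each epoch

print(len(ar_seen), "distinct AR tasks;", len(md_seen), "distinct mask patterns")
```

The AR set stops growing after the first epoch, while the set of mask patterns keeps growing toward the 2^n possible masks, which is one way to read the claim that diffusion makes better use of repeated data.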

[458] A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges

Xing Hu, Haodong Chen, Qianqian Duan, Choon Ki Ahn, Huiliang Shang, Dawei Zhang

Main category: cs.LG

TL;DR: Diffusion models show promise in agriculture for tasks like crop monitoring and pest detection, offering better stability and image quality than GANs, though computational costs and domain generalizability remain challenges.

Details

Motivation: Addressing limited arable land and the need for sustainable agriculture through AI-driven solutions like diffusion models.

Method: Review of diffusion models’ applications in agriculture, focusing on crop disease detection, remote sensing, and data augmentation.

Result: Diffusion models improve image generation, denoising, and data augmentation, aiding precision agriculture despite computational and generalizability issues.

Conclusion: Diffusion models hold potential for sustainable agriculture, with ongoing research needed to overcome current limitations.

Abstract: With the global population increasing and arable land resources becoming increasingly limited, smart and precision agriculture have emerged as essential directions for sustainable agricultural development. Artificial intelligence (AI), particularly deep learning models, has been widely adopted in applications such as crop monitoring, pest detection, and yield prediction. Among recent generative models, diffusion models have demonstrated considerable potential in agricultural image processing, data augmentation, and remote sensing analysis. Compared to traditional generative adversarial networks (GANs), diffusion models exhibit greater training stability and superior image generation quality, effectively addressing challenges such as limited annotated datasets and imbalanced sample distributions in agricultural scenarios. This paper reviews recent advancements in the application of diffusion models within agriculture, focusing on their roles in crop disease and pest detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Diffusion models have been found useful in improving tasks like image generation, denoising, and data augmentation in agriculture, especially when environmental noise or variability is present. While their high computational requirements and limited generalizability across domains remain concerns, the approach is gradually proving effective in real-world applications such as precision crop monitoring. As research progresses, these models may help support sustainable agriculture and address emerging challenges in food systems.

[459] Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition

Haris Khan, Sadia Asif, Shumaila Asif

Main category: cs.LG

TL;DR: MDM-OC is a framework for scalable, interference-free, and reversible model merging in continual learning, outperforming prior methods in accuracy and compliance.

Details

Motivation: Addressing task interference, catastrophic forgetting, and lack of reversibility in existing model merging and continual learning approaches.

Method: Encodes task-specific models as deltas from a shared base, projects them into orthogonal subspaces, and merges them via gradient-based optimization. Supports unmerging and stability techniques.

Result: Outperforms baselines in accuracy, backward transfer, and unmerge fidelity on vision and NLP benchmarks, while being memory-efficient.

Conclusion: MDM-OC provides a principled solution for modular and compliant AI system design.

Abstract: In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.
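
Illustrative sketch (a simplification, not the MDM-OC implementation): the core idea of encoding each task as a delta from a shared base, projecting deltas into mutually orthogonal directions, and later removing one task can be shown with flattened weight vectors. The abstract states that merging uses gradient-based optimization; plain summation is substituted here to keep the example short. All names, sizes, and the random weights are assumptions.

```python
import numpy as np

def orthogonalize(delta, basis):
    """Remove from delta its components along previously accepted deltas."""
    out = delta.astype(float).copy()
    for b in basis:              # basis vectors are mutually orthogonal and non-zero
        out -= (out @ b) / (b @ b) * b
    return out

rng = np.random.default_rng(0)
base = rng.normal(size=8)        # hypothetical shared base weights (flattened)
task_weights = [base + rng.normal(scale=0.1, size=8) for _ in range(3)]

projected = []
for w in task_weights:
    projected.append(orthogonalize(w - base, projected))

merged = base + sum(projected)   # simplified merge: sum of orthogonal deltas
unmerged = merged - projected[1] # "unmerge" task 1 by subtracting its delta
```

Because the projected deltas do not overlap, subtracting one of them leaves the contributions of the other tasks untouched, which is the property that makes structured unmerging possible in this toy setting.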

[460] BOASF: A Unified Framework for Speeding up Automatic Machine Learning via Adaptive Successive Filtering

Guanghui Zhu, Xin Fang, Feng Cheng, Lei Wang, Wenzhong Chen, Chunfeng Yuan, Yihua Huang

Main category: cs.LG

TL;DR: BOASF combines Bayesian Optimization and Adaptive Successive Filtering to automate model selection and hyperparameter optimization, outperforming existing methods in speed and performance.

Details

Motivation: Non-experts struggle with model selection and hyperparameter tuning due to lack of expertise. BOASF aims to automate this process efficiently.

Method: BOASF uses Bayesian Optimization for selecting promising configurations and ASF to discard poor-performing ones. A Softmax model allocates resources adaptively.

Result: BOASF speeds up optimization, achieves robust performance, and outperforms state-of-the-art methods under various time budgets.

Conclusion: BOASF is effective for automating machine learning tasks, offering better performance and efficiency than existing methods.

Abstract: Machine learning has achieved great success in many application areas. However, for non-expert practitioners, it is always very challenging to address a machine learning task successfully and efficiently. Finding the optimal machine learning model or the hyperparameter combination set from a large number of possible alternatives usually requires considerable expert knowledge and experience. To tackle this problem, we propose a combined Bayesian Optimization and Adaptive Successive Filtering algorithm (BOASF) under a unified multi-armed bandit framework to automate the model selection or the hyperparameter optimization. Specifically, BOASF consists of multiple evaluation rounds, in each of which we select promising configurations for each arm using Bayesian optimization. Then, ASF adaptively discards the poorly performing arms early using a Gaussian UCB-based probabilistic model. Furthermore, a Softmax model is employed to adaptively allocate available resources for each promising arm that advances to the next round. The arm with a higher probability of advancing will be allocated more resources. Experimental results show that BOASF is effective for speeding up the model selection and hyperparameter optimization processes while achieving robust and better prediction performance than the existing state-of-the-art automatic machine learning methods. Moreover, BOASF achieves better anytime performance under various time budgets.
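
Illustrative sketch (not the BOASF implementation): the Softmax resource allocation step can be pictured as splitting a budget across surviving arms in proportion to the softmax of a per-arm score. In the paper the score relates to each arm's probability of advancing under a Gaussian UCB-based model; the arm names, scores, temperature, and budget below are hypothetical placeholders.

```python
import math

def softmax_allocation(scores, total_budget, temperature=1.0):
    """Split total_budget across arms in proportion to softmax(score / temperature)."""
    m = max(scores.values())     # subtract the max for numerical stability
    exps = {arm: math.exp((s - m) / temperature) for arm, s in scores.items()}
    z = sum(exps.values())
    return {arm: total_budget * e / z for arm, e in exps.items()}

# Hypothetical arms (model families) with hypothetical scores.
arm_scores = {"random_forest": 0.91, "svm": 0.84, "knn": 0.78}
budget = softmax_allocation(arm_scores, total_budget=600.0, temperature=0.05)

for arm, seconds in budget.items():
    print(f"{arm}: {seconds:.1f} s")
```

A lower temperature concentrates the budget on the leading arm, while a higher one spreads it more evenly; how BOASF sets this trade-off is not specified in the abstract.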

[461] An MLI-Guided Framework for Subgroup-Aware Modeling in Electronic Health Records (AdaptHetero)

Ling Liao, Eva Aagaard

Main category: cs.LG

TL;DR: AdaptHetero is an MLI-driven framework that uses interpretability insights to tailor model training and evaluation for subpopulations, improving predictive performance and flagging risks.

Details

Motivation: To bridge the gap between MLI for trust/insights and actionable subgroup-specific modeling strategies in clinical settings.

Method: Integrates SHAP-based interpretation with unsupervised clustering to identify subgroup-specific characteristics.

Result: Uncovers heterogeneous model behaviors, improves predictive performance (up to 174.39%), and flags potential risks in subpopulations.

Conclusion: AdaptHetero enhances robustness, equity, and context-awareness in clinical ML deployment.

Abstract: Machine learning interpretation (MLI) has primarily been leveraged to foster clinician trust and extract insights from electronic health records (EHRs), rather than to guide subgroup-specific, operationalizable modeling strategies. To bridge this gap, we propose AdaptHetero, a novel MLI-driven framework that transforms interpretability insights into actionable guidance for tailoring model training and evaluation across subpopulations. Evaluated on three large-scale EHR datasets – GOSSIS-1-eICU, WiDS, and MIMIC-IV – AdaptHetero consistently uncovers heterogeneous model behaviors in predicting ICU mortality, in-hospital death, and hidden hypoxemia. Integrating SHAP-based interpretation with unsupervised clustering, AdaptHetero identifies clinically meaningful, subgroup-specific characteristics, improving predictive performance across many subpopulations (with gains up to 174.39 percent) while proactively flagging potential risks in others. These results highlight the framework’s promise for more robust, equitable, and context-aware clinical deployment.

[462] Systolic Array-based Accelerator for Structured State-Space Models

Shiva Raja, Cansu Demirkiran, Aakash Sarkar, Milos Popovic, Ajay Joshi

Main category: cs.LG

TL;DR: The paper introduces EpochCore, a hardware accelerator for State-Space Models (SSMs), achieving significant performance and energy efficiency improvements over GPUs and TPUs.

Details

Motivation: Traditional models like RNNs, CNNs, and Transformers struggle with long sequences due to memory limitations, while SSMs offer better efficiency but require intensive computation.

Method: EpochCore uses systolic arrays and a specialized processing element (LIMA-PE) with a novel dataflow (ProDF) to optimize SSM execution.

Result: On LRA datasets, EpochCore achieves on average a 2000x performance improvement over a GPU, and a 250x performance gain and 45x better energy efficiency over traditional systolic-array accelerators (TPU).

Conclusion: EpochCore is a highly efficient solution for accelerating SSM-based models, addressing the limitations of existing hardware.

Abstract: Sequence modeling is crucial for AI to understand temporal data and detect complex time-dependent patterns. While recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformers have advanced in capturing long-range dependencies, they struggle to achieve high accuracy on very long sequences due to limited memory retention (fixed context window). State-Space Models (SSMs) leverage exponentially decaying memory, which enables a lengthy context window, and so they process very long data sequences more efficiently than recurrent and Transformer-based models. Unlike traditional neural models like CNNs and RNNs, SSM-based models require solving differential equations through continuous integration, making training and inference both compute- and memory-intensive on conventional CPUs and GPUs. In this paper we introduce a specialized hardware accelerator, EpochCore, for accelerating SSMs. EpochCore is based on systolic arrays (SAs) and is designed to enhance the energy efficiency and throughput of inference of SSM-based models for long-range sequence tasks. Within the SA, we propose a versatile processing element (PE) called LIMA-PE to perform traditional and specialized MAC operations to support traditional DNNs and SSMs. To complement the EpochCore microarchitecture, we propose a novel dataflow, ProDF, which enables highly efficient execution of SSM-based models. By leveraging the LIMA-PE microarchitecture and ProDF, EpochCore achieves on average 2000x improvement in performance on LRA datasets compared to a GPU, and 250x gains in performance and 45x improvement in energy efficiency over traditional SA-based accelerators (TPU).

[463] SpectrumWorld: Artificial Intelligence Foundation for Spectroscopy

Zhuo Yang, Jiaqing Xie, Shuaike Shen, Daolang Wang, Yeyun Chen, Ben Gao, Shuzhou Sun, Biqing Qi, Dongzhan Zhou, Lei Bai, Linjiang Chen, Shufei Zhang, Jun Jiang, Tianfan Fu, Yuqiang Li

Main category: cs.LG

TL;DR: SpectrumLab is a unified platform for deep learning in spectroscopy, offering tools, benchmarks, and leaderboards to standardize research.

Details

Motivation: To address the lack of standardized formulations in deep learning for spectroscopy.

Method: Introduces SpectrumLab with three components: a Python library, SpectrumAnnotator for benchmarks, and SpectrumBench for diverse tasks.

Result: Empirical studies on SpectrumBench with 18 LLMs highlight current limitations.

Conclusion: SpectrumLab aims to be a foundational tool for future deep learning advancements in spectroscopy.

Abstract: Deep learning holds immense promise for spectroscopy, yet research and evaluation in this emerging field often lack standardized formulations. To address this issue, we introduce SpectrumLab, a pioneering unified platform designed to systematize and accelerate deep learning research in spectroscopy. SpectrumLab integrates three core components: a comprehensive Python library featuring essential data processing and evaluation tools, along with leaderboards; an innovative SpectrumAnnotator module that generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and over 10 spectrum types, featuring spectra curated from over 1.2 million distinct chemical substances. Thorough empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs reveal critical limitations of current approaches. We hope SpectrumLab will serve as a crucial foundation for future advancements in deep learning-driven spectroscopy.

[464] Data Leakage and Redundancy in the LIT-PCBA Benchmark

Amber Huang, Ian Scott Knight, Slava Naprienko

Main category: cs.LG

TL;DR: The LIT-PCBA benchmark is compromised due to data leakage, molecular redundancy, and low diversity, allowing models to succeed through memorization rather than generalization, undermining published results.

Details

Motivation: To audit the LIT-PCBA benchmark and expose its flaws, which compromise its validity for evaluating virtual screening models.

Method: Analyzed data leakage, molecular redundancy, and diversity issues across LIT-PCBA’s splits, including 2D-identical ligands and analog overlaps.

Result: Found flaws severe enough that a trivial memorization-based baseline with no learnable parameters matches or exceeds reported state-of-the-art performance, undermining nearly all published results.

Conclusion: LIT-PCBA in its current form fails to measure novel chemotype recovery or methodological progress and should not be trusted as a benchmark.

Abstract: LIT-PCBA is widely used to benchmark virtual screening models, but our audit reveals that it is fundamentally compromised. We find extensive data leakage and molecular redundancy across its splits, including 2D-identical ligands within and across partitions, pervasive analog overlap, and low-diversity query sets. In ALDH1 alone, for instance, 323 active training-validation analog pairs occur at ECFP4 Tanimoto similarity ≥ 0.6; across all targets, 2,491 2D-identical inactives appear in both training and validation, with very few corresponding actives. These overlaps allow models to succeed through scaffold memorization rather than generalization, inflating enrichment factors and AUROC scores. These flaws are not incidental: they are so severe that a trivial memorization-based baseline with no learnable parameters can exploit them to match or exceed the reported performance of state-of-the-art deep learning and 3D-similarity models. As a result, nearly all published results on LIT-PCBA are undermined. Even models evaluated in “zero-shot” mode are affected by analog leakage into the query set, weakening claims of generalization. In its current form, the benchmark does not measure a model’s ability to recover novel chemotypes and should not be taken as evidence of methodological progress. All code, data, and baseline implementations are available at: https://github.com/sievestack/LIT-PCBA-audit
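
Illustrative sketch (not the authors' audit code): the analog-overlap check described above amounts to computing Tanimoto similarity between fingerprints of training and validation molecules and flagging pairs at or above a threshold (0.6 in the abstract). Real ECFP4 fingerprints would come from a cheminformatics toolkit; here each fingerprint is a toy set of "on" bit indices, and the molecule names and helper functions are hypothetical.

```python
from itertools import product

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def leaked_pairs(train, valid, threshold=0.6):
    """Return (train_id, valid_id) pairs whose similarity meets the threshold."""
    return [
        (t, v)
        for (t, fa), (v, fb) in product(train.items(), valid.items())
        if tanimoto(fa, fb) >= threshold
    ]

# Toy fingerprints standing in for ECFP4 bit vectors.
train = {"train_mol_1": {1, 2, 3, 4, 5}, "train_mol_2": {10, 11, 12}}
valid = {"valid_mol_1": {1, 2, 3, 4, 6}, "valid_mol_2": {20, 21}}

pairs = leaked_pairs(train, valid)
print(pairs)  # [('train_mol_1', 'valid_mol_1')]
```

The single flagged pair shares four of six distinct bits (similarity about 0.67), the kind of near-duplicate that lets a model score well by memorizing scaffolds.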

[465] On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence

Lei Pang, Ruinan Jin

Main category: cs.LG

TL;DR: GRPO is a critic-free RL algorithm for fine-tuning LLMs, replacing PPO’s value function with group-normalized rewards. A new variant, TIC GRPO, improves it by using trajectory-level importance ratios for unbiased gradient estimates.

Details

Motivation: To simplify and improve reinforcement learning for fine-tuning large language models by removing the critic and addressing bias in policy gradient estimates.

Method: GRPO replaces PPO’s value function with group-normalized rewards while retaining token-level importance sampling. TIC GRPO replaces the token-level importance ratios with a single trajectory-level probability ratio, yielding unbiased gradient estimates.

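Result: An ablation that removes importance sampling entirely performs comparably to standard GRPO, indicating that the bias has limited practical impact. TIC GRPO yields an unbiased estimate of the current policy gradient while keeping the critic-free structure.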

Conclusion: GRPO and TIC GRPO offer efficient, critic-free alternatives for RL in LLMs, supported by the first theoretical convergence analysis for GRPO-style methods, which covers both algorithms.

Abstract: Group Relative Policy Optimization (GRPO), recently proposed by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. It replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards, while retaining PPO-style token-level importance sampling based on an old policy. We show that the GRPO update rule in fact estimates the policy gradient at the old policy rather than the current one. However, since the old policy is refreshed every few steps, the discrepancy between the two remains small, limiting the impact of this bias in practice. We validate this through an ablation study in which importance sampling is entirely removed, and updates are instead performed using the gradient estimated at a fixed old policy across multiple optimization steps. Remarkably, this simplification results in performance comparable to standard GRPO. Motivated by these findings, we propose a new algorithm: Trajectory-level Importance-Corrected GRPO (TIC GRPO). TIC GRPO replaces token-level importance ratios with a single trajectory-level probability ratio, yielding an unbiased estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first theoretical convergence analysis for GRPO-style methods, covering both the original GRPO and our proposed variant.
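
Illustrative sketch (not the authors' implementation): the two ingredients discussed above can be written in a few lines. Group-normalized advantages standardize rewards within a group of sampled responses, and the trajectory-level ratio is the product of the per-token ratios, computed here as the exponential of the summed log-probability differences. Clipping and other terms of the full objective are omitted; the reward values, log-probabilities, and function names are made up for illustration.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token (PPO/GRPO style)."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

def trajectory_level_ratio(logp_new, logp_old):
    """A single ratio for the whole response: the product of the token ratios."""
    return float(np.exp(np.sum(np.asarray(logp_new) - np.asarray(logp_old))))

rewards = [1.0, 0.0, 0.0, 1.0]   # hypothetical rewards for 4 responses to one prompt
adv = group_normalized_advantages(rewards)

logp_old = [-1.2, -0.7, -2.0]    # hypothetical per-token log-probs, old policy
logp_new = [-1.0, -0.8, -1.9]    # same tokens under the current policy
tok = token_level_ratios(logp_new, logp_old)
traj = trajectory_level_ratio(logp_new, logp_old)
```

In this toy setting the advantages come out as roughly +1 and -1, and `traj` equals the product of the entries of `tok`, which makes explicit that the trajectory-level variant weights the whole response by one number instead of reweighting each token separately.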

[466] GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

Arthur Cho

Main category: cs.LG

TL;DR: GrandJury introduces a dynamic evaluation protocol for generative ML models, addressing the limitations of static benchmarks by incorporating time-decayed aggregation, traceability, and multi-rater human judgment.

Details

Motivation: Standard evaluation methods for generative models rely on static benchmarks, which misalign with the context-dependent nature of real-world applications.

Method: GrandJury combines time-decayed aggregation, traceability, dynamic task rubric attribution, and multi-rater human judgment to enable pluralistic evaluation.

Result: The protocol provides a transparent, accountable framework for evaluating ML outputs without absolute ground truth, illustrated by an open-source implementation and public LLM outputs.

Conclusion: GrandJury offers a new paradigm for evaluating generative models, aligning with dynamic user needs and evolving realities.

Abstract: Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, an acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, dynamic and transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.
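As an illustration of what time-decayed aggregation of multi-rater judgments could look like, here is a small sketch. It is not the API of the grandjury package: the function name, the exponential decay, and the 30-day half-life are all assumptions made for this example.

```python
def time_decayed_score(votes, now, half_life_days=30.0):
    """Weighted mean of rater scores, where older votes count less.

    votes: list of (score, timestamp_in_days) pairs.
    Assumption: exponential decay with a fixed half-life (hypothetical choice).
    Returns None when there are no votes.
    """
    numerator = 0.0
    denominator = 0.0
    for score, timestamp in votes:
        weight = 0.5 ** ((now - timestamp) / half_life_days)
        numerator += weight * score
        denominator += weight
    return numerator / denominator if denominator else None


# One old approving vote (day 0) and one fresh rejecting vote (day 30).
votes = [(1.0, 0.0), (0.0, 30.0)]
score = time_decayed_score(votes, now=30.0)  # weights 0.5 and 1.0
```

Here the older vote carries half the weight of the recent one, so the aggregate leans toward the current consensus rather than a plain average of 0.5.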

[467] One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen

Main category: cs.LG

TL;DR: A two-stage pipeline (MIST encoder + MolForge decoder) improves molecular structure generation from mass spectra, achieving 28%/36% accuracy (top-1/top-10) with pretraining and thresholding.

Details

Motivation: To enhance de novo molecular generation from mass spectra by improving the encoder-decoder pipeline.

Method: Uses MIST to encode mass spectra into fingerprints and MolForge to decode fingerprints into structures, with pretraining and thresholding for better performance.

Result: Tenfold improvement over prior methods, achieving 28% (top-1) and 36% (top-10) accuracy.

Conclusion: The pipeline sets a strong baseline for future research in molecule elucidation from mass spectra.

Abstract: A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST as the encoder and MolForge as the decoder, leveraging pretraining to enhance performance. Notably, pretraining MolForge proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities as a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by MIST only moderately resemble the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, correctly generating 28% (top-1) and 36% (top-10) of molecular structures from mass spectra. We position this pipeline as a strong baseline for future research in de novo molecule elucidation from mass spectra.
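The thresholding step and the Tanimoto similarity mentioned above can be sketched in a few lines. This is an illustrative toy, not the MIST or MolForge code; the 0.5 cutoff and the function names are assumptions.

```python
def threshold_fingerprint(bit_probabilities, threshold=0.5):
    """Turn per-bit probabilities into a binary fingerprint (step function).

    Assumption: a single global cutoff of 0.5; the paper may choose differently.
    """
    return [1 if p >= threshold else 0 for p in bit_probabilities]


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints: |A and B| / |A or B|."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0


predicted_probs = [0.9, 0.2, 0.6, 0.4]  # toy encoder output
ground_truth = [1, 0, 1, 1]             # toy true fingerprint

predicted_bits = threshold_fingerprint(predicted_probs)
similarity = tanimoto(predicted_bits, ground_truth)
```

In this toy case the thresholded prediction shares two of the three set bits with the ground truth, which is the kind of moderate overlap the abstract says the decoder can still work from.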

[468] Symmetric Behavior Regularization via Taylor Expansion of Symmetry

Lingwei Zhu, Zheng Chen, Han Wang, Yukie Nagai

Main category: cs.LG

TL;DR: The paper introduces symmetric divergences to BRPO, addressing challenges like lack of analytic policies and numerical issues, proposing S$f$-AC as a competitive offline RL framework.

Details

Motivation: Existing methods rely on asymmetric divergences (e.g., KL), but symmetric divergences offer potential benefits, though they pose challenges like lack of analytic policies and numerical instability.

Method: Uses Taylor series of $f$-divergence to derive analytic policies and decompose symmetric divergences into asymmetry and conditional symmetry terms, mitigating numerical issues.

Result: Proposes S$f$-AC, the first practical BRPO algorithm with symmetric divergences, showing competitive performance in distribution approximation and MuJoCo tasks.

Conclusion: Symmetric divergences can be effectively integrated into BRPO, with S$f$-AC demonstrating practical viability and competitive results.

Abstract: This paper introduces symmetric divergences to behavior regularization policy optimization (BRPO) to establish a novel offline RL framework. Existing methods focus on asymmetric divergences such as KL to obtain analytic regularized policies and a practical minimization objective. We show that symmetric divergences do not permit an analytic policy as regularization and can incur numerical issues as loss. We tackle these challenges with the Taylor series of the $f$-divergence. Specifically, we prove that an analytic policy can be obtained with a finite series. For the loss, we observe that symmetric divergences can be decomposed into an asymmetry term and a conditional symmetry term; Taylor-expanding the latter alleviates numerical issues. Putting these together, we propose Symmetric $f$ Actor-Critic (S$f$-AC), the first practical BRPO algorithm with symmetric divergences. Experimental results on distribution approximation and MuJoCo verify that S$f$-AC performs competitively.

[469] Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

Ali Taheri Ghahrizjani, Alireza Taban, Qizhou Wang, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han

Main category: cs.LG

TL;DR: The paper proposes categorizing tokens in supervised fine-tuning (SFT) into positive and negative tokens to improve model performance by focusing on useful information and forgetting misleading or irrelevant data.

Details

Motivation: SFT's effectiveness depends heavily on data quality and volume, which can lead to performance issues if not managed properly. The goal is to reduce reliance on these factors by optimizing token usage.

Method: Tokens are classified as positive (useful) or negative (misleading/irrelevant). Positive tokens are trained normally, while negative tokens are explicitly forgotten to refine learning.

Result: Experiments show the forgetting mechanism enhances model performance and increases response diversity.

Conclusion: Token categorization and selective forgetting improve SFT by focusing learning on valuable information and setting knowledge boundaries.

Abstract: Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume; otherwise, it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing tokens within each corpus into two parts – positive and negative tokens – based on whether they are useful to improve model performance. Positive tokens can be trained in common ways, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization lets the model learn less from uninformative content, and the forgetting process shapes a knowledge boundary to guide the model on what information to learn more precisely. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
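One way to picture "train on positive tokens, forget negative tokens" is a per-token loss with opposite signs for the two groups. The sketch below is an assumption-laden illustration, not the paper's objective: it assumes the positive/negative mask is already given, uses a gradient-ascent style forgetting term, and introduces a hypothetical `forget_weight` hyperparameter.

```python
def sft_with_forgetting_loss(token_logps, is_positive, forget_weight=0.1):
    """Toy objective: minimize NLL on positive tokens, maximize it on negative ones.

    token_logps: log-probability the model assigns to each target token.
    is_positive: True for tokens to learn, False for tokens to forget.
    Assumption: the forgetting term is the negated mean NLL of negative tokens.
    """
    positive_nll = [-lp for lp, keep in zip(token_logps, is_positive) if keep]
    negative_nll = [-lp for lp, keep in zip(token_logps, is_positive) if not keep]
    learn = sum(positive_nll) / len(positive_nll) if positive_nll else 0.0
    forget = sum(negative_nll) / len(negative_nll) if negative_nll else 0.0
    return learn - forget_weight * forget


# Three target tokens; the middle one is marked as negative.
loss = sft_with_forgetting_loss(
    token_logps=[-0.5, -2.0, -1.0],
    is_positive=[True, False, True],
)
```

Minimizing this value raises the likelihood of positive tokens while lowering the likelihood of negative ones; in practice such a forgetting term usually needs a small weight or a bound to keep training stable.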

[470] Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning

Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta

Main category: cs.LG

TL;DR: A bio-inspired Spiking Neural Network (SNN) architecture for lifelong Network Intrusion Detection Systems (NIDS) combines static and dynamic SNNs, achieving 85.3% accuracy with low-power potential.

Details

Motivation: The brain's hierarchical processing and energy efficiency inspire the design of a lifelong learning system for intrusion detection.

Method: Uses a static SNN for initial intrusion detection and a dynamic SNN with GWR-inspired plasticity and Ad-STDP for attack classification.

Result: Achieves 85.3% accuracy on UNSW-NB15, with robust adaptation and reduced catastrophic forgetting.

Conclusion: The architecture is effective for lifelong NIDS and suitable for low-power neuromorphic hardware.

Abstract: Inspired by the brain’s hierarchical processing and energy efficiency, this paper presents a Spiking Neural Network (SNN) architecture for a lifelong Network Intrusion Detection System (NIDS). The proposed system first employs an efficient static SNN to identify potential intrusions, which then activates an adaptive dynamic SNN responsible for classifying the specific attack type. Mimicking biological adaptation, the dynamic classifier utilizes Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms enable the network to learn new threats incrementally while preserving existing knowledge. Tested on the UNSW-NB15 benchmark in a continual learning setting, the architecture demonstrates robust adaptation and reduced catastrophic forgetting, and achieves 85.3% overall accuracy. Furthermore, simulations using the Intel Lava framework confirm high operational sparsity, highlighting the potential for low-power deployment on neuromorphic hardware.

cs.MA

[471] BTPG-max: Achieving Local Maximal Bidirectional Pairs for Bidirectional Temporal Plan Graphs

Yifan Su, Rishi Veerapaneni, Jiaoyang Li

Main category: cs.MA

TL;DR: BTPG-max improves upon Bidirectional TPG (BTPG) by finding more bidirectional pairs, ensuring local optimality, and enhancing robustness to delays in Multi-Agent Path Finding (MAPF).

Details

Motivation: Addressing inefficiencies in MAPF caused by real-world delays, the paper aims to enhance the BTPG framework for better collision avoidance and execution robustness.

Method: The BTPG-max algorithm is designed to maximize bidirectional pairs in the Temporal Plan Graph (TPG), ensuring local optimality.

Result: BTPG-max produces BTPGs with more bidirectional edges, superior anytime behavior, and improved delay robustness.

Conclusion: BTPG-max advances MAPF solutions by optimizing bidirectional dependencies, enhancing practical execution under delays.

Abstract: Multi-Agent Path Finding (MAPF) requires computing collision-free paths for multiple agents in a shared environment. Most MAPF planners assume that each agent reaches a specific location at a specific timestep, but this is infeasible to follow directly on real systems, where delays often occur. To address collisions caused by agents deviating due to delays, the Temporal Plan Graph (TPG) was proposed, which converts a time-dependent MAPF solution into a time-independent set of inter-agent dependencies. Recently, a Bidirectional TPG (BTPG) was proposed, which relaxes some dependencies into "bidirectional pairs" and improves the efficiency of agents executing their MAPF solution with delays. Our work improves upon this prior work by designing an algorithm, BTPG-max, that finds more bidirectional pairs. Our main theoretical contribution is showing that the BTPG-max algorithm is locally optimal, i.e., it constructs a BTPG to which no additional bidirectional pairs can be added. We also show that in practice BTPG-max leads to BTPGs with significantly more bidirectional edges, superior anytime behavior, and improved robustness to delays.

Takuro Kato, Keisuke Okumura, Yoko Sasaki, Naoya Yokomachi

Main category: cs.MA

TL;DR: The paper introduces congestion mitigation path planning (CMPP) to address local congestion in multi-agent systems by embedding congestion into path costs, improving navigation efficiency.

Details

Motivation: To enhance navigation efficiency in high-density environments with autonomous agents by mitigating local congestion.

Method: Proposes CMPP, which uses flow-based penalties on sparse graphs, and develops two solvers: an exact solver for small instances and a scalable A-CMTS algorithm for large-scale cases.

Result: CMPP reduces local congestion and boosts system throughput in both discrete- and continuous-space scenarios.

Conclusion: CMPP effectively improves multi-agent system performance, with applications in logistics and autonomous vehicles.

Abstract: In high-density environments where numerous autonomous agents move simultaneously in a distributed manner, streamlining global flows to mitigate local congestion is crucial to maintain overall navigation efficiency. This paper introduces a novel path-planning problem, congestion mitigation path planning (CMPP), which embeds congestion directly into the cost function, defined by the usage of incoming edges along agents’ paths. CMPP assigns a flow-based multiplicative penalty to each vertex of a sparse graph, which grows steeply where frequently-traversed paths intersect, capturing the intuition that congestion intensifies where many agents enter the same area from different directions. Minimizing the total cost yields a set of coarse-level, time-independent routes that autonomous agents can follow while applying their own local collision avoidance. We formulate the problem and develop two solvers: (i) an exact mixed-integer nonlinear programming solver for small instances, and (ii) a scalable two-layer search algorithm, A-CMTS, which quickly finds suboptimal solutions for large-scale instances and iteratively refines them toward the optimum. Empirical studies show that augmenting state-of-the-art collision-avoidance planners with CMPP significantly reduces local congestion and enhances system throughput in both discrete- and continuous-space scenarios. These results indicate that CMPP improves the performance of multi-agent systems in real-world applications such as logistics and autonomous-vehicle operations.

[473] Towards Language-Augmented Multi-Agent Deep Reinforcement Learning

Maxime Toquebiau, Jae-Yun Jun, Faïz Benamar, Nicolas Bredeche

Main category: cs.MA

TL;DR: Language-augmented multi-agent reinforcement learning improves efficiency, interpretability, and generalization over emergent communication methods.

Details

Motivation: Prior works on emergent communication in multi-agent systems often lack efficiency and interpretability. This paper explores grounding agents in human-defined language to enhance learning and coordination.

Method: A framework where agents are trained to act, produce, and interpret natural language descriptions of observations, using language for communication and representation learning.

Result: Language-augmented agents outperform emergent communication baselines, showing better internal representations, generalization, and human-agent interaction.

Conclusion: Integrating structured language into multi-agent learning enhances interpretability and capability, offering promising directions for future systems.

Abstract: Most prior works on communication in multi-agent reinforcement learning have focused on emergent communication, which often results in inefficient and non-interpretable systems. Inspired by the role of language in natural intelligence, we investigate how grounding agents in a human-defined language can improve the learning and coordination of embodied agents. We propose a framework in which agents are trained not only to act but also to produce and interpret natural language descriptions of their observations. This language-augmented learning serves a dual role: enabling efficient and interpretable communication between agents, and guiding representation learning. We demonstrate that language-augmented agents outperform emergent communication baselines across various tasks. Our analysis reveals that language grounding leads to more informative internal representations, better generalization to new partners, and improved capability for human-agent interaction. These findings demonstrate the effectiveness of integrating structured language into multi-agent learning and open avenues for more interpretable and capable multi-agent systems.

[474] Position-Based Flocking for Robust Alignment

Hossein B. Jond

Main category: cs.MA

TL;DR: A position-based flocking model for agents achieves stable collective motion by balancing cohesion-separation and alignment, outperforming position-velocity models in simulations.

Details

Motivation: To develop a robust flocking model using positions (not velocities) for stable alignment and compact formations, applicable in robotics and collective dynamics.

Method: Modifies position-velocity approach by approximating velocity differences with positions, adding a threshold weight for sustained alignment. Tested with 50 agents in 2D simulations.

Result: Stronger alignment, more rigid/compact formations, and better separation metrics compared to position-velocity models.

Conclusion: The position-based model ensures robust flocking behavior, with potential applications in robotics and collective dynamics.

Abstract: This paper presents a position-based flocking model for interacting agents, balancing cohesion-separation and alignment to achieve stable collective motion. The model modifies a position-velocity-based approach by approximating velocity differences using initial and current positions, introducing a threshold weight to ensure sustained alignment. Simulations with 50 agents in 2D demonstrate that the position-based model produces stronger alignment and more rigid and compact formations compared to the position-velocity-based model. The alignment metric and separation distances highlight the efficacy of the proposed model in achieving robust flocking behavior. The model’s use of positions ensures robust alignment, with applications in robotics and collective dynamics.
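The idea of replacing a velocity difference with positions can be illustrated with average velocities over the elapsed time. This is only a sketch under that assumption; the paper's exact approximation and its threshold weight are not given here, and the function name is hypothetical.

```python
def approx_velocity_difference(p_i_start, p_i_now, p_j_start, p_j_now, elapsed):
    """Approximate v_i - v_j using only initial and current positions.

    Assumption: each agent's velocity is replaced by its average velocity,
    (current position - initial position) / elapsed time.
    """
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return tuple(
        ((a_now - a_start) - (b_now - b_start)) / elapsed
        for a_start, a_now, b_start, b_now in zip(p_i_start, p_i_now, p_j_start, p_j_now)
    )


# Agent i moved 2 units along x, agent j moved 1 unit along x, over 2 time units.
diff = approx_velocity_difference((0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (2.0, 1.0), 2.0)
```

An alignment term built on `diff` needs no velocity sensing or communication, only each agent's own displacement and that of its neighbors.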

cs.MM

[475] JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang

Main category: cs.MM

TL;DR: JPS introduces a method for jailbreaking MLLMs by combining visual perturbations and textual steering, achieving high attack success and malicious intent fulfillment.

Details

Motivation: Current jailbreak attacks focus on bypassing safety filters but often fail to produce harmful content, highlighting the need for a method that ensures malicious intent fulfillment.

Method: JPS uses adversarial image perturbations and optimized steering prompts, co-optimized iteratively, to guide MLLM responses.

Result: JPS achieves state-of-the-art performance in attack success rate (ASR) and malicious intent fulfillment rate (MIFR).

Conclusion: JPS effectively combines visual and textual components for high-quality jailbreak attacks, validated by experiments and a new MIFR metric.

Abstract: Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker’s malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via cooperation between a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a “steering prompt” optimized via a multi-agent system to specifically guide LLM responses toward fulfilling the attacker’s intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.

[476] Embedding Alignment in Code Generation for Audio

Sam Kouteili, Hiren Madhu, George Typaldos, Mark Santolucito

Main category: cs.MM

TL;DR: The paper explores improving LLM-powered code generation for creative coding by analyzing the relationship between code and audio embeddings, proposing a model to predict audio output from code.

Details

Motivation: To enhance creative coding (e.g., live-coding) by enabling users to focus on musical intentions rather than syntax, addressing the lack of diversity in LLM-generated code candidates and their audio output.

Method: Investigates the topology of code-audio embedding spaces, constructs a predictive model to learn an alignment map between code and audio embeddings.

Result: Finds no simple linear relationship between code and audio embeddings but demonstrates that an alignment map can be learned.

Conclusion: Proposes a model to predict audio embeddings from code, aiming to improve musical diversity in LLM-generated outputs.

Abstract: LLM-powered code generation has the potential to revolutionize creative coding endeavors, such as live-coding, by enabling users to focus on structural motifs over syntactic details. In such domains, when prompting an LLM, users may benefit from considering multiple varied code candidates to better realize their musical intentions. Code generation models, however, struggle to present unique and diverse code candidates, with no direct insight into the code’s audio output. To better establish a relationship between code candidates and produced audio, we investigate the topology of the mapping between code and audio embedding spaces. We find that code and audio embeddings do not exhibit a simple linear relationship, but supplement this with a constructed predictive model that shows an embedding alignment map could be learned. Supporting the aim of musically diverse output, we present a model that, given code, predicts the output audio embedding, constructing a code-audio embedding alignment map.

[477] SimLabel: Similarity-Weighted Iterative Framework for Multi-annotator Learning with Missing Annotations

Liyun Zhang, Zheng Lian, Hong Liu, Takanori Takebe, Yuta Nakashima

Main category: cs.MM

TL;DR: SimLabel is a novel framework for multi-annotator learning that uses similarity-weighted semi-supervised learning and confidence-based refinement to handle missing labels efficiently.

Details

Motivation: Existing methods inefficiently skip updating annotator-specific parameters for missing labels, risking overfitting and poor data utilization.

Method: SimLabel leverages inter-annotator similarities to generate weighted soft labels for missing annotations and uses a confidence-based iterative refinement mechanism.

Result: The approach improves data utilization and model performance, validated on the new AMER2 dataset with high and variable missing rates.

Conclusion: SimLabel effectively addresses the challenge of missing labels in crowdsourced datasets, enhancing model robustness and performance.

Abstract: Multi-annotator learning (MAL) aims to model annotator-specific labeling patterns. However, existing methods face a critical challenge: they simply skip updating annotator-specific model parameters when encountering missing labels, a common scenario in real-world crowdsourced datasets where each annotator labels only small subsets of samples. This leads to inefficient data utilization and overfitting risks. To this end, we propose a novel similarity-weighted semi-supervised learning framework (SimLabel) that leverages inter-annotator similarities to generate weighted soft labels for missing annotations, enabling the utilization of unannotated samples rather than skipping them entirely. We further introduce a confidence-based iterative refinement mechanism that combines maximum probability with entropy-based uncertainty to prioritize high-quality predicted pseudo-labels for imputing missing labels, jointly enhancing similarity estimation and model performance over time. For evaluation, we contribute a new multimodal multi-annotator dataset, AMER2, with high and more variable missing rates, reflecting real-world annotation sparsity and enabling evaluation across different sparsity levels.
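The similarity-weighted soft label idea can be sketched as follows. This is a simplified illustration, not the SimLabel implementation: it assumes similarity is the agreement rate on co-labeled samples, and the function names and toy data are hypothetical.

```python
def agreement_similarity(labels_a, labels_b):
    """Fraction of co-labeled samples on which two annotators agree.

    Missing annotations are represented as None and ignored.
    Assumption: plain agreement rate stands in for the paper's similarity.
    """
    shared = [(a, b) for a, b in zip(labels_a, labels_b) if a is not None and b is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)


def soft_label(target, sample_index, annotations, num_classes):
    """Impute a class distribution for a missing annotation of `target`.

    Each other annotator who labeled the sample votes for their class,
    weighted by similarity to the target. Returns None if nobody can vote.
    """
    distribution = [0.0] * num_classes
    for name, labels in annotations.items():
        if name == target or labels[sample_index] is None:
            continue
        weight = agreement_similarity(annotations[target], labels)
        distribution[labels[sample_index]] += weight
    total = sum(distribution)
    return [d / total for d in distribution] if total else None


# Annotator "a" did not label sample 2; "b" always agrees with "a", "c" half the time.
annotations = {
    "a": [0, 1, None],
    "b": [0, 1, 1],
    "c": [0, 0, 0],
}
imputed = soft_label("a", 2, annotations, num_classes=2)
```

The imputed distribution favors class 1 because the more similar annotator "b" chose it, so the sample can still contribute a training signal for annotator "a" instead of being skipped.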

[478] QuMAB: Query-based Multi-Annotator Behavior Modeling with Reliability under Sparse Labels

Liyun Zhang, Zheng Lian, Hong Liu, Takanori Takebe, Yuta Nakashima

Main category: cs.MM

TL;DR: QuMAB shifts from sample-wise aggregation to annotator-wise behavior modeling, treating disagreements as valuable information to reduce annotation costs and improve reliability.

Details

Motivation: Traditional aggregation treats annotator disagreements as noise, but subjective tasks lack absolute ground truth, and sparse annotations make aggregation unreliable.

Method: QuMAB uses light-weight queries to model individual annotators and captures inter-annotator correlations as implicit regularization, preventing overfitting.

Result: QuMAB outperforms in modeling annotator behavior, consensus prediction, and works well under sparse annotations, validated on large-scale datasets (STREET and AMER).

Conclusion: QuMAB offers a novel, explainable approach to multi-annotator learning, leveraging annotator behavior for better generalization and cost reduction.

Abstract: Multi-annotator learning traditionally aggregates diverse annotations to approximate a single ground truth, treating disagreements as noise. However, this paradigm faces fundamental challenges: subjective tasks often lack absolute ground truth, and sparse annotation coverage makes aggregation statistically unreliable. We introduce a paradigm shift from sample-wise aggregation to annotator-wise behavior modeling. By treating annotator disagreements as valuable information rather than noise, modeling annotator-specific behavior patterns can reconstruct unlabeled data to reduce annotation cost, enhance aggregation reliability, and explain annotator decision behavior. To this end, we propose QuMAB (Query-based Multi-Annotator Behavior Pattern Learning), which uses light-weight queries to model individual annotators while capturing inter-annotator correlations as implicit regularization, preventing overfitting to sparse individual data while maintaining individualization and improving generalization, with a visualization of annotator focus regions offering an explainable analysis of behavior understanding. We contribute two large-scale datasets with dense per-annotator labels: STREET (4,300 labels/annotator) and AMER (average 3,118 labels/annotator), the first multimodal multi-annotator dataset. Extensive experiments demonstrate the superiority of our QuMAB in modeling individual annotators’ behavior patterns, their utility for consensus prediction, and applicability under sparse annotations.

eess.AS

[479] Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices

Yael Segal-Feldman, Ann R. Bradlow, Matthew Goldrick, Joseph Keshet

Main category: eess.AS

TL;DR: A state-of-the-art open-vocabulary keyword spotting model for small devices, using a speech encoder, keyword encoder, and detection network, achieves high accuracy and generalization.

Details

Motivation: To detect keywords in speech recordings, even if not in training data, for small-footprint devices.

Method: Combines a speech encoder (tiny Whisper/Conformer), a keyword encoder (hyper-network generating matched-filter weights), and a detection network (Perceiver module with cross-attention).

Result: State-of-the-art detection performance, effective generalization to out-of-domain conditions (e.g., L2 speech), and efficiency (4.2M-parameter model matches larger models).

Conclusion: The model is efficient, robust, and generalizes well, making it suitable for practical applications.

Abstract: Open-vocabulary keyword spotting (KWS) refers to the task of detecting words or terms within speech recordings, regardless of whether they were included in the training data. This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network. The speech encoder is either a tiny Whisper or a tiny Conformer. The target keyword encoder is implemented as a hyper-network that takes the desired keyword as a character string and generates a unique set of weights for a convolutional layer, which can be considered as a keyword-specific matched filter. The detection network uses the matched-filter weights to perform a keyword-specific convolution, which guides the cross-attention mechanism of a Perceiver module in determining whether the target term appears in the recording. The results indicate that our system achieves state-of-the-art detection performance and generalizes effectively to out-of-domain conditions, including second-language (L2) speech. Notably, our smallest model, with just 4.2 million parameters, matches or outperforms models that are several times larger, demonstrating both efficiency and robustness.
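The core mechanism, a hyper-network that turns a keyword string into convolution weights, can be shown with a toy example. Everything below is hypothetical and untrained: the character embeddings, the projection, the sizes, and the mean pooling (which ignores character order) are assumptions chosen to keep the sketch short, not the paper's architecture.

```python
import random

KERNEL_SIZE = 3
EMBED_DIM = 4
_rng = random.Random(0)

# Hypothetical, randomly initialized parameters standing in for trained ones.
CHAR_EMBED = {
    c: [_rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for c in "abcdefghijklmnopqrstuvwxyz"
}
PROJECTION = [[_rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(KERNEL_SIZE)]


def hyper_network(keyword):
    """Map a keyword string to a keyword-specific 1-D convolution kernel."""
    chars = [c for c in keyword.lower() if c in CHAR_EMBED]
    if not chars:
        raise ValueError("keyword has no supported characters")
    # Mean-pool character embeddings, then project to kernel weights.
    pooled = [sum(CHAR_EMBED[c][d] for c in chars) / len(chars) for d in range(EMBED_DIM)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in PROJECTION]


def conv1d_valid(signal, kernel):
    """Slide the generated kernel over a 1-D feature sequence (no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


speech_features = [0.1, 0.4, -0.2, 0.9, 0.3, -0.5]  # toy encoder output
kernel = hyper_network("yes")
response = conv1d_valid(speech_features, kernel)
```

Because the kernel is generated at query time, a new keyword only requires another call to `hyper_network`, with no retraining of the detector. In the paper, the filter response then guides the cross-attention of a Perceiver module, which this sketch does not model.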

Henri Gode, Simon Doclo

Main category: eess.AS

TL;DR: The paper proposes three extensions to the Blind Oblique Projection (BOP) method for online RTF vector estimation of multiple sound sources in noisy, reverberant environments, addressing computational complexity, accuracy, and low SNR robustness.

Details

Motivation: The challenge lies in estimating RTF vectors of successive sound sources when multiple sources are active simultaneously, with existing BOP methods being computationally intensive, less accurate, and SNR-sensitive.

Method: The paper introduces a closed-form BOP solution, orthogonal additional vectors, and noise-handling techniques, alongside a spatial-coherence-based online source counting method.

Result: Simulations with real-world recordings show improved performance in estimating RTF vectors for successively activating speakers, even without prior source activity knowledge.

Conclusion: The proposed extensions enhance the BOP method’s efficiency, accuracy, and robustness, making it more practical for real-world applications.

Abstract: Relative transfer functions (RTFs) of sound sources play a crucial role in beamforming, enabling effective noise and interference suppression. This paper addresses the challenge of estimating online the RTF vectors of multiple sound sources in noisy and reverberant environments, for the specific scenario where sources activate successively. While the RTF vector of the first source can be estimated straightforwardly, the main challenge arises in estimating the RTF vectors of subsequent sources during segments where multiple sources are simultaneously active. The blind oblique projection (BOP) method has been proposed to estimate the RTF vector of a newly activating source by optimally blocking this source. However, this method faces several limitations: high computational complexity due to its reliance on iterative gradient descent optimization, the introduction of random additional vectors, which can negatively impact performance, and the assumption of a high signal-to-noise ratio (SNR). To overcome these limitations, in this paper we propose three extensions to the BOP method. First, we derive a closed-form solution for optimizing the BOP cost function, significantly reducing computational complexity. Second, we introduce orthogonal additional vectors instead of random vectors, enhancing RTF vector estimation accuracy. Third, we incorporate noise handling techniques inspired by covariance subtraction and whitening, increasing robustness in low-SNR conditions. To provide a frame-by-frame estimate of the source activity pattern, required by both the conventional BOP method and the proposed method, we propose a spatial-coherence-based online source counting method. Simulations are performed with real-world reverberant noisy recordings featuring 3 successively activating speakers, with and without a priori knowledge of the source activity pattern.

[481] REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Lei Xie, Zhonghua Fu

Main category: eess.AS

TL;DR: REF-VC is a noise-robust expressive voice conversion system that addresses challenges of environmental noise and expressive output demands. It introduces innovations like random erasing, implicit alignment, and Shortcut Models to outperform baselines in noisy and clean scenarios.

Details

Motivation: The paper addresses the limitations of traditional ASR-based methods (suppressing prosody) and SSL-based models (timbre leakage and noise sensitivity) in voice conversion, aiming to balance noise robustness and expressiveness.

Method: REF-VC employs a random erasing strategy for SSL features, implicit alignment inspired by E2TTS, and Shortcut Models to accelerate flow matching inference, reducing steps to 4.

Result: The model outperforms baselines like Seed-VC in zero-shot noisy scenarios and matches Seed-VC on clean sets. It also supports singing voice conversion within one model.

Conclusion: REF-VC successfully balances noise robustness and expressiveness, offering a versatile solution for voice conversion in both noisy and clean environments, including singing voice applications.

Abstract: In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosody, while SSL-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust expressive voice conversion system. Key innovations include: (1) A random erasing strategy to mitigate the information redundancy inherent in SSL features, enhancing noise robustness and expressiveness; (2) Implicit alignment inspired by E2TTS to suppress non-essential feature reconstruction; (3) Integration of Shortcut Models to accelerate flow matching inference, reducing inference to 4 steps. Experimental results demonstrate that our model outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy set, while also performing comparably to Seed-VC on the clean set. In addition, REF-VC supports singing voice conversion within the same model.

[482] MOVER: Combining Multiple Meeting Recognition Systems

Naoyuki Kamo, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani

Main category: eess.AS

TL;DR: MOVER is a novel system for combining meeting recognition outputs, improving accuracy over existing methods.

Details

Motivation: Existing methods like DOVER and ROVER can't combine outputs from systems with differing diarization and ASR. MOVER addresses this gap.

Method: MOVER uses a five-stage process (including speaker alignment, segment grouping, and word/timing combination) to merge meeting recognition outputs that differ in time intervals and speaker labels.

Result: MOVER achieved relative tcpWER improvements of 9.55% and 8.51% on CHiME-8 and NOTSOFAR-1 tasks.

Conclusion: MOVER effectively combines diverse meeting recognition systems, outperforming state-of-the-art methods.

Abstract: In this paper, we propose Meeting recognizer Output Voting Error Reduction (MOVER), a novel system combination method for meeting recognition tasks. Although there are methods to combine the output of diarization (e.g., DOVER) or automatic speech recognition (ASR) systems (e.g., ROVER), MOVER is the first approach that can combine the outputs of meeting recognition systems that differ in terms of both diarization and ASR. MOVER combines hypotheses with different time intervals and speaker labels through a five-stage process that includes speaker alignment, segment grouping, word and timing combination, etc. Experimental results on the CHiME-8 DASR task and the multi-channel track of the NOTSOFAR-1 task demonstrate that MOVER can successfully combine multiple meeting recognition systems with diverse diarization and recognition outputs, achieving relative tcpWER improvements of 9.55% and 8.51% over the state-of-the-art systems for both tasks.

[483] Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS

Anuprabha M, Krishna Gurugubelli, Anil Kumar Vuppala

Main category: eess.AS

TL;DR: The paper investigates F5-TTS for dysarthric speech synthesis, revealing biases toward intelligibility over speaker similarity and prosody, and suggests fairness-aware approaches for inclusive tech.

Details

Motivation: Addressing the challenge of limited dysarthric speech data and potential biases in synthetic speech generation for assistive technologies.

Method: Uses F5-TTS with the TORGO dataset, evaluating intelligibility, speaker similarity, and prosody, and analyzes biases via fairness metrics.

Result: F5-TTS shows bias toward intelligibility, neglecting speaker and prosody preservation in dysarthric speech synthesis.

Conclusion: Fairness-aware dysarthric speech synthesis can improve inclusivity in speech technologies.

Abstract: Dysarthric speech poses significant challenges in developing assistive technologies, primarily due to the limited availability of data. Recent advances in neural speech synthesis, especially zero-shot voice cloning, facilitate synthetic speech generation for data augmentation; however, they may introduce biases towards dysarthric speech. In this paper, we investigate the effectiveness of state-of-the-art F5-TTS in cloning dysarthric speech using the TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. We also analyze potential biases using fairness metrics like Disparate Impact and Parity Difference to assess disparities across dysarthric severity levels. Results show that F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation in dysarthric speech synthesis. Insights from this study can help integrate fairness-aware dysarthric speech synthesis, fostering the advancement of more inclusive speech technologies.
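
Example (illustrative, not from the paper): The two fairness metrics named in the abstract have standard textbook definitions, sketched below. Disparate Impact is the ratio of favorable-outcome rates between two groups, and Parity Difference is their difference. How the paper defines a "favorable outcome" and how it groups severity levels is not stated here, so the groups, the binary labels, and all function names are assumptions made for illustration.

```python
def favorable_rate(outcomes):
    """Fraction of favorable (1) outcomes in a group of binary labels."""
    return sum(outcomes) / len(outcomes)


def disparate_impact(unprivileged, privileged):
    """Ratio of favorable rates; 1.0 means parity, below 1.0 disadvantages the first group."""
    return favorable_rate(unprivileged) / favorable_rate(privileged)


def parity_difference(unprivileged, privileged):
    """Difference of favorable rates; 0.0 means parity."""
    return favorable_rate(unprivileged) - favorable_rate(privileged)


# Hypothetical data: 1 = a cloned utterance passed some quality threshold.
mild_severity = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]    # 8 of 10 favorable
high_severity = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]    # 4 of 10 favorable

di = disparate_impact(high_severity, mild_severity)
pd = parity_difference(high_severity, mild_severity)
print(f"Disparate Impact: {di:.2f}, Parity Difference: {pd:.2f}")
```

A Disparate Impact well below 1 together with a negative Parity Difference would indicate that the higher-severity group receives favorable outcomes less often.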

[484] Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages

Seraphina Fong, Marco Matassoni, Alessio Brutti

Main category: eess.AS

TL;DR: The paper explores using Speech LLMs for low-resource ASR with the SLAM-ASR framework, highlighting data challenges and benefits of pretrained projectors.

Details

Motivation: To address the under-explored use of LLMs in low-resource ASR and improve performance with limited data.

Method: Uses the SLAM-ASR framework with a trainable projector between a speech encoder and LLM, testing data volume and pretrained projectors.

Result: Pretrained projectors mitigate data scarcity, and multilingual LLMs improve performance on low-resource benchmarks.

Conclusion: The study provides insights for optimizing Speech LLMs in low-resource and multilingual settings.

Abstract: Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and an LLM. Firstly, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Secondly, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing insights for future research on optimizing Speech LLMs for low-resource languages and multilinguality.

[485] Privacy Disclosure of Similarity in Speech and Language Processing

Tom Bäckström, Mohammad Hassan Vali, My Nguyen, Silas Rech

Main category: eess.AS

TL;DR: The paper proposes a method to quantify privacy disclosure in biometric identification by analyzing similarity rank distributions, using entropy to measure information leakage.

Details

Motivation: Biometric identification systems often use noisy data and inaccurate similarity measures, which can still reveal private information through similarity ranks. The study aims to quantify this privacy risk.

Method: The methodology involves estimating the probability distribution of similarity ranks, using histograms or beta-binomial models for scarce data. Entropy (bits) is used to measure disclosure, allowing additive analysis of independent features.

Result: Experiments show that all tested biometric features (speaker, phone, linguistic embeddings, and fundamental frequency) contain PII, with speaker recognition embeddings leaking the most. Disclosure increases with sample length but is bounded by template length.

Conclusion: The similarity rank disclosure metric enables comparison of PII leakage across biometric features, aiding in holistic privacy threat evaluation for speech and other biometric technologies.

Abstract: Speaker, author, and other biometric identification applications often compare a sample’s similarity to a database of templates to determine the identity. Given that data may be noisy and similarity measures can be inaccurate, such a comparison may not reliably identify the true identity as the most similar. Still, even the similarity rank based on an inaccurate similarity measure can disclose private information about the true identity. We propose a methodology for quantifying the privacy disclosure of such a similarity rank by estimating its probability distribution. It is based on determining the histogram of the similarity rank of the true speaker, or when data is scarce, modeling the histogram with the beta-binomial distribution. We express the disclosure in terms of entropy (bits), such that the disclosure from independent features is additive. Our experiments demonstrate that all tested speaker and author characterizations contain personally identifying information (PII) that can aid in identification, with embeddings from speaker recognition algorithms containing the most information, followed by phone embeddings, linguistic embeddings, and fundamental frequency. Our initial experiments show that the disclosure of PII increases with the length of test samples, but it is bounded by the length of database templates. The provided metric, similarity rank disclosure, provides a way to compare the disclosure of PII between biometric features and merge them to aid identification. It can thus aid in the holistic evaluation of threats to privacy in speech and other biometric technologies.
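
Example (illustrative, not from the paper): The abstract describes estimating the distribution of the true speaker's similarity rank from a histogram and expressing disclosure in bits. The sketch below shows one plausible reading of that idea. It assumes disclosure is the drop in entropy relative to a uniform rank over the database, which is an assumption of this example and may differ from the paper's exact definition. The function name and the toy rank lists are hypothetical.

```python
import math
from collections import Counter


def rank_disclosure_bits(ranks, num_templates):
    """Assumed definition: log2(N) minus the entropy of the empirical rank histogram.

    ranks: observed rank of the true identity in each trial (1 = most similar).
    num_templates: number of templates N in the database.
    """
    counts = Counter(ranks)
    total = len(ranks)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(num_templates) - entropy


# Hypothetical trials against a database of 8 templates.
perfect_feature = [1] * 16                 # true identity always ranked first
useless_feature = [1, 2, 3, 4, 5, 6, 7, 8] * 2  # rank is uniform, nothing learned

print(rank_disclosure_bits(perfect_feature, 8))  # all log2(8) bits disclosed
print(rank_disclosure_bits(useless_feature, 8))  # no bits disclosed
```

Under this reading, the disclosures of two independent features can simply be summed, which matches the additivity property the abstract mentions.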

[486] Investigation of Speech and Noise Latent Representations in Single-channel VAE-based Speech Enhancement

Jiatong Li, Simon Doclo

Main category: eess.AS

TL;DR: The paper explores how modifying pretrained VAE loss terms affects speech enhancement performance, showing that clear separation of speech and noise representations improves results.

Details

Motivation: To investigate the impact of different latent representations (speech and noise) derived from pretrained VAEs on the performance of speech enhancement systems.

Method: Uses Bayesian permutation training with pretrained VAEs to generate latent representations for speech and noise, then evaluates performance with modified loss terms.

Result: Experiments on DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that clearly separated speech and noise representations outperform overlapping ones.

Conclusion: Clear separation of speech and noise latent representations in VAEs significantly enhances speech enhancement performance.

Abstract: Recently, a variational autoencoder (VAE)-based single-channel speech enhancement system using Bayesian permutation training has been proposed, which uses two pretrained VAEs to obtain latent representations for speech and noise. Based on these pretrained VAEs, a noisy VAE learns to generate speech and noise latent representations from noisy speech for speech enhancement. Modifying the pretrained VAE loss terms affects the pretrained speech and noise latent representations. In this paper, we investigate how these different representations affect speech enhancement performance. Experiments on the DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that a latent space where speech and noise representations are clearly separated significantly improves performance over standard VAEs, which produce overlapping speech and noise representations.

[487] PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective

Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Geoffroy Peeters

Main category: eess.AS

TL;DR: A lightweight Siamese neural network uses SSL with pitch transposition equivariance for accurate pitch estimation on monophonic audio, trained on small unlabeled datasets.

Details

Motivation: Address the challenge of pitch estimation with limited labeled data by leveraging self-supervised learning and equivariance to pitch transposition.

Method: Uses a Siamese neural network with a novel class-based transposition-equivariant objective and learnable Toeplitz matrices to prevent collapse in an encoder-only setting.

Result: Outperforms self-supervised baselines and narrows the gap with supervised methods, generalizing well across tasks (singing voice and instrument pitch estimation).

Conclusion: The proposed method is lightweight, effective, and suitable for low-resource, real-time applications.

Abstract: In this paper, we address the problem of pitch estimation using Self-Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight (< 30k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices. We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.
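
Example (illustrative, not from the paper): On the log-frequency axis of a Constant-Q Transform, a pitch transposition corresponds to a shift along frequency bins. A Toeplitz matrix has entries that depend only on the difference between row and column index, so multiplying by it commutes with such shifts as long as the signal stays away from the edges. The sketch below checks this property with a fixed banded kernel; the kernel values, the toy input, and the function names are assumptions for illustration, and the paper's matrices are learned rather than fixed.

```python
def toeplitz_apply(kernel, x):
    """Multiply x by the banded Toeplitz matrix T[i][j] = kernel[i - j + r]."""
    r = len(kernel) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(n):
            idx = i - j + r
            if 0 <= idx < len(kernel):   # entries outside the band are zero
                acc += kernel[idx] * x[j]
        out.append(acc)
    return out


def shift(x, k):
    """Shift x up by k bins with zero padding (a toy 'transposition')."""
    return [0.0] * k + x[:len(x) - k]


kernel = [0.25, 0.5, 0.25]                 # hypothetical fixed weights
spectrum = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

shift_then_apply = toeplitz_apply(kernel, shift(spectrum, 3))
apply_then_shift = shift(toeplitz_apply(kernel, spectrum), 3)
print(shift_then_apply == apply_then_shift)
```

Because the two orders of operations agree, a layer of this form passes a transposition of its input through to its output, which is the property the abstract calls transposition-preserving.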

[488] Overview of Automatic Speech Analysis and Technologies for Neurodegenerative Disorders: Diagnosis and Assistive Applications

Shakeel A. Sheikh, Md. Sahidullah, Ina Kodrasi

Main category: eess.AS

TL;DR: A review of state-of-the-art methods in neurodegenerative speech disorder technologies, covering detection, recognition, enhancement, assessment, and data augmentation, while addressing challenges and future directions.

Details

Motivation: To advance clinical and technological solutions for neurodegenerative speech disorders by reviewing current methods and identifying key challenges.

Method: Comprehensive review of existing techniques in pathological speech detection, recognition, enhancement, assessment, and data augmentation.

Result: Identifies challenges like robustness, privacy, and interpretability, and highlights future directions such as multimodal approaches and large language models.

Conclusion: The paper underscores the potential of emerging technologies to improve speech disorder solutions and calls for further research in multimodal and large language model integration.

Abstract: Advancements in spoken language technologies for neurodegenerative speech disorders are crucial for meeting both clinical and technological needs. This overview paper is vital for advancing the field, as it presents a comprehensive review of state-of-the-art methods in pathological speech detection, automatic speech recognition, pathological speech intelligibility enhancement, intelligibility and severity assessment, and data augmentation approaches for pathological speech. It also highlights key challenges, such as ensuring robustness, privacy, and interpretability. The paper concludes by exploring promising future directions, including the adoption of multimodal approaches and the integration of large language models to further advance speech technologies for neurodegenerative speech disorders.

[489] ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey

Main category: eess.AS

TL;DR: ZipVoice is a compact, fast zero-shot TTS model using flow-matching, achieving high quality while being smaller and faster than existing models.

Details

Motivation: Address slow inference speeds and large parameter sizes in existing zero-shot TTS models.

Method: Uses Zipformer-based vector field estimator, average upsampling for alignment, and flow distillation to reduce steps.

Result: Matches state-of-the-art speech quality while being 3x smaller and up to 30x faster than a DiT-based flow-matching baseline.

Conclusion: ZipVoice offers efficient, high-quality zero-shot TTS with compact size and speed.

Abstract: Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference speeds due to massive parameters. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) A Zipformer-based vector field estimator to maintain adequate modeling capabilities under constrained size; 2) Average upsampling-based initial speech-text alignment and a Zipformer-based text encoder to improve speech intelligibility; 3) A flow distillation method to reduce sampling steps and eliminate the inference overhead associated with classifier-free guidance. Experiments on 100k hours of multilingual data show that ZipVoice matches state-of-the-art models in speech quality, while being 3 times smaller and up to 30 times faster than a DiT-based flow-matching baseline. Code, model checkpoints and demo samples are publicly available at https://github.com/k2-fsa/ZipVoice.
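
Example (illustrative, not from the paper): The abstract mentions an average upsampling-based initial speech-text alignment without giving details. A minimal sketch of one way to read it is shown below: each text token is stretched over roughly the same number of speech frames so that the token sequence matches the frame length. This even-spreading rule, the function name, and the toy tokens are assumptions for illustration, not the released implementation.

```python
def average_upsample(tokens, num_frames):
    """Repeat tokens so each covers about num_frames / len(tokens) frames, in order."""
    n = len(tokens)
    return [tokens[min(i * n // num_frames, n - 1)] for i in range(num_frames)]


# Hypothetical 3-token text aligned to 7 speech frames.
frames = average_upsample(["h", "e", "y"], 7)
print(frames)
```

The appeal of such a scheme is that it needs no duration predictor or external aligner; the model is left to refine the coarse alignment itself.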

[490] UniTalker: Conversational Speech-Visual Synthesis

Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li

Main category: eess.AS

TL;DR: The paper introduces Conversational Speech-Visual Synthesis (CSVS) as an extension of traditional CSS, proposing UniTalker, a unified model for generating empathetic speech and natural talking-face animations using multimodal dialogue context.

Details

Motivation: Existing CSS research lacks multimodal perception (e.g., eye contact) and speech-only responses limit interactive experiences. CSVS aims to enhance expressiveness and empathy in user-agent interactions.

Method: UniTalker integrates multimodal perception (text, speech, talking-face animations) and rendering. It uses a large-scale language model for understanding and multi-task sequence prediction for generating speech and animations, with optimizations for consistency.

Result: The model synthesizes more empathetic speech and natural, emotionally consistent talking-face animations, validated through objective and subjective experiments.

Conclusion: CSVS and UniTalker address CSS limitations by leveraging multimodal context, improving the interactive experience with coherent audiovisual responses.

Abstract: Conversational Speech Synthesis (CSS) is a key task in the user-agent interaction area, aiming to generate more expressive and empathetic speech for users. However, it is well-known that “listening” and “eye contact” play crucial roles in conveying emotions during real-world interpersonal communication. Existing CSS research is limited to perceiving only text and speech within the dialogue context, which restricts its effectiveness. Moreover, speech-only responses further constrain the interactive experience. To address these limitations, we introduce a Conversational Speech-Visual Synthesis (CSVS) task as an extension of traditional CSS. By leveraging multimodal dialogue context, it provides users with coherent audiovisual responses. To this end, we develop a CSVS system named UniTalker, which is a unified model that seamlessly integrates multimodal perception and multimodal rendering capabilities. Specifically, it leverages a large-scale language model to comprehensively understand multimodal cues in the dialogue context, including speaker, text, speech, and the talking-face animations. After that, it employs multi-task sequence prediction to first infer the target utterance’s emotion and then generate empathetic speech and natural talking-face animations. To ensure that the generated speech-visual content remains consistent in terms of emotion, content, and duration, we introduce three key optimizations: 1) Designing a specialized neural landmark codec to tokenize and reconstruct facial expression sequences. 2) Proposing a bimodal speech-visual hard alignment decoding strategy. 3) Applying emotion-guided rendering during the generation stage. Comprehensive objective and subjective experiments demonstrate that our model synthesizes more empathetic speech and provides users with more natural and emotionally consistent talking-face animations.

eess.IV

[491] Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy

Shuo Chen, Yijin Li, Xi Zheng, Guofeng Zhang

Main category: eess.IV

TL;DR: NFH-SEM is a neural field-based hybrid method for 3D SEM reconstruction, eliminating manual calibration and handling shadows, validated on diverse samples.

Details

Motivation: Conventional 2D SEM images lack 3D topography, and existing methods struggle with complex microstructures due to discrete representations, calibration needs, and shadow errors.

Method: NFH-SEM uses multi-view, multi-detector 2D SEM images, fusing geometric and photometric data into a continuous neural field, with end-to-end self-calibration and shadow disentanglement.

Result: High-fidelity reconstructions of challenging samples (e.g., microstructures, pollen, particle surfaces) demonstrate precise detail and broad applicability.

Conclusion: NFH-SEM advances SEM 3D reconstruction by addressing key limitations, enabling accurate and automated reconstruction of intricate microstructures.

Abstract: The scanning electron microscope (SEM) is a widely used imaging device in scientific research and industrial applications. Conventional two-dimensional (2D) SEM images do not directly reveal the three-dimensional (3D) topography of micro samples, motivating the development of SEM 3D surface reconstruction methods. However, reconstruction of complex microstructures remains challenging for existing methods due to the limitations of discrete 3D representations, the need for calibration with reference samples, and shadow-induced gradient errors. Here, we introduce NFH-SEM, a neural field-based hybrid SEM 3D reconstruction method that takes multi-view, multi-detector 2D SEM images as input and fuses geometric and photometric information into a continuous neural field representation. NFH-SEM eliminates the manual calibration procedures through end-to-end self-calibration and automatically disentangles shadows from SEM images during training, enabling accurate reconstruction of intricate microstructures. We validate the effectiveness of NFH-SEM on real and simulated datasets. Our experiments show high-fidelity reconstructions of diverse, challenging samples, including two-photon lithography microstructures, peach pollen, and silicon carbide particle surfaces, demonstrating precise detail and broad applicability.

[492] Super-Resolution of Sentinel-2 Images Using a Geometry-Guided Back-Projection Network with Self-Attention

Ivan Pereira-Sánchez, Daniel Torres, Francesc Alcover, Bartomeu Garau, Julia Navarro, Catalina Sbert, Joan Duran

Main category: eess.IV

TL;DR: A geometry-guided super-resolution model fuses Sentinel-2’s 10m and 20m bands using cluster-based learning and multi-head attention, outperforming existing methods.

Details

Motivation: Sentinel-2's 10m bands provide fine structural detail, while 20m bands offer richer spectral information. Combining these can enhance image quality.

Method: Proposes a cluster-based learning procedure to create a geometry-rich guiding image from 10m bands, integrated into an unfolded back-projection architecture with multi-head attention.

Result: Outperforms classical and deep learning-based super-resolution and fusion techniques on urban, rural, and coastal test sets.

Conclusion: The proposed method effectively fuses Sentinel-2 bands, improving image resolution and spectral detail.

Abstract: The Sentinel-2 mission provides multispectral imagery with 13 bands at resolutions of 10m, 20m, and 60m. In particular, the 10m bands offer fine structural detail, while the 20m bands capture richer spectral information. In this paper, we propose a geometry-guided super-resolution model for fusing the 10m and 20m bands. Our approach introduces a cluster-based learning procedure to generate a geometry-rich guiding image from the 10m bands. This image is integrated into an unfolded back-projection architecture that leverages image self-similarities through a multi-head attention mechanism, which models nonlocal patch-based interactions across spatial and spectral dimensions. We also generate a dataset for evaluation, comprising three testing sets that include urban, rural, and coastal landscapes. Experimental results demonstrate that our method outperforms both classical and deep learning-based super-resolution and fusion techniques.

[493] Advanced Multi-Architecture Deep Learning Framework for BIRADS-Based Mammographic Image Retrieval: Comprehensive Performance Analysis with Super-Ensemble Optimization

MD Shaikh Rahman, Feiroz Humayara, Syed Maudud E Rabbi, Muhammad Mahbubur Rashid

Main category: eess.IV

TL;DR: The paper presents a framework for improving mammographic image retrieval by comparing CNN architectures with advanced training strategies, achieving significant performance improvements over baselines.

Details

Motivation: The complexity of BIRADS categorical matching and limitations in current medical image retrieval studies hinder clinical translation, necessitating a robust evaluation framework.

Method: The study compares CNN architectures (DenseNet121, ResNet50, VGG16) using advanced training strategies like fine-tuning, metric learning, and super-ensemble optimization, with rigorous data splitting and statistical validation.

Result: Advanced fine-tuning and super-ensemble optimization achieved substantial improvements, with DenseNet121 and ResNet50 showing ~20% gains, and the super-ensemble reaching 36.33% precision@10 (95% CI: [34.78%, 37.88%]).

Conclusion: The framework sets new benchmarks for mammographic image retrieval, offering evidence-based guidelines for clinical deployment in diagnostic support and quality assurance.

Abstract: Content-based mammographic image retrieval systems require exact BIRADS categorical matching across five distinct classes, presenting significantly greater complexity than binary classification tasks commonly addressed in literature. Current medical image retrieval studies suffer from methodological limitations including inadequate sample sizes, improper data splitting, and insufficient statistical validation that hinder clinical translation. We developed a comprehensive evaluation framework systematically comparing CNN architectures (DenseNet121, ResNet50, VGG16) with advanced training strategies including sophisticated fine-tuning, metric learning, and super-ensemble optimization. Our evaluation employed rigorous stratified data splitting (50%/20%/30% train/validation/test), 602 test queries, and systematic validation using bootstrap confidence intervals with 1,000 samples. Advanced fine-tuning with differential learning rates achieved substantial improvements: DenseNet121 (34.79% precision@10, 19.64% improvement) and ResNet50 (34.54%, 19.58% improvement). Super-ensemble optimization combining complementary architectures achieved 36.33% precision@10 (95% CI: [34.78%, 37.88%]), representing 24.93% improvement over baseline and providing 3.6 relevant cases per query. Statistical analysis revealed significant performance differences between optimization strategies (p<0.001) with large effect sizes (Cohen’s d>0.8), while maintaining practical search efficiency (2.8 milliseconds). Performance significantly exceeds realistic expectations for 5-class medical retrieval tasks, where literature suggests 20-25% precision@10 represents achievable performance for exact BIRADS matching. Our framework establishes new performance benchmarks while providing evidence-based architecture selection guidelines for clinical deployment in diagnostic support and quality assurance applications.
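
Example (illustrative, not from the paper): precision@10 is the fraction of the top 10 retrieved images whose BIRADS category exactly matches the query, so a mean of 36.33% corresponds to about 0.3633 x 10 = 3.6 relevant cases per query, as the abstract states. The sketch below computes precision@k and a percentile bootstrap confidence interval over per-query scores. The percentile method, the seed, the function names, and the toy data are assumptions; the abstract only states that 1,000 bootstrap samples were used.

```python
import random


def precision_at_k(query_label, retrieved_labels, k=10):
    """Fraction of the top-k retrieved labels that exactly match the query label."""
    return sum(1 for lab in retrieved_labels[:k] if lab == query_label) / k


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[round(alpha / 2 * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical query with BIRADS category 2 and its ranked retrieval results.
p10 = precision_at_k(2, [2, 2, 1, 3, 2, 5, 2, 4, 1, 2])
print(p10)

# Hypothetical per-query precision@10 scores.
per_query = [0.2, 0.4, 0.3, 0.5, 0.1, 0.4, 0.3, 0.6]
print(bootstrap_ci(per_query))
```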

[494] Deep Distillation Gradient Preconditioning for Inverse Problems

Romario Gualdrón-Hurtado, Roman Jacome, Leon Suarez, Laura Galvis, Henry Arguello

Main category: eess.IV

TL;DR: The paper proposes a nonlinear preconditioning operator using knowledge distillation to improve convergence and reconstruction quality in imaging inverse problems, outperforming traditional linear methods.

Details

Motivation: Advanced signal priors lose effectiveness with ill-conditioned sensing matrices, limiting reconstruction quality. Traditional preconditioners are inadequate due to their linearity and matrix dependence.

Method: A teacher-student framework is used: a teacher with a better-conditioned matrix guides a student with an ill-conditioned matrix via a preconditioning neural network.

Result: Validated on plug-and-play FISTA for imaging tasks, showing improved performance and convergence.

Conclusion: Nonlinear preconditioning via knowledge distillation enhances reconstruction quality and convergence in ill-conditioned imaging problems.

Abstract: Imaging inverse problems are commonly addressed by minimizing measurement consistency and signal prior terms. While huge attention has been paid to developing high-performance priors, even the most advanced signal prior may lose its effectiveness when paired with an ill-conditioned sensing matrix that hinders convergence and degrades reconstruction quality. In optimization theory, preconditioners allow improving the algorithm’s convergence by transforming the gradient update. Traditional linear preconditioning techniques enhance convergence, but their performance remains limited due to their dependence on the structure of the sensing matrix. Learning-based linear preconditioners have been proposed, but they are optimized only for data-fidelity optimization, which may lead to solutions in the null-space of the sensing matrix. This paper employs knowledge distillation to design a nonlinear preconditioning operator. In our method, a teacher algorithm using a better-conditioned (synthetic) sensing matrix guides the student algorithm with an ill-conditioned sensing matrix through gradient matching via a preconditioning neural network. We validate our nonlinear preconditioner for plug-and-play FISTA in single-pixel, magnetic resonance, and super-resolution imaging tasks, showing consistent performance improvements and better empirical convergence.
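
Example (illustrative, not from the paper): The abstract notes that preconditioners improve convergence by transforming the gradient update. The sketch below shows this effect with a classical linear (diagonal) preconditioner on a toy ill-conditioned quadratic. It is not the paper's method, which learns a nonlinear preconditioning network via distillation; the problem, step sizes, and function names are assumptions chosen to keep the example small.

```python
def gradient(a_diag, b, x):
    """Gradient of f(x) = 0.5 * sum(a_i * x_i^2) - sum(b_i * x_i)."""
    return [a * xi - bi for a, xi, bi in zip(a_diag, x, b)]


def descend(a_diag, b, precond, step, iters):
    """Gradient descent where each gradient entry is scaled by precond before the update."""
    x = [0.0] * len(b)
    for _ in range(iters):
        g = gradient(a_diag, b, x)
        x = [xi - step * p * gi for xi, p, gi in zip(x, precond, g)]
    return x


# Ill-conditioned toy problem (condition number 100); the minimizer is [1, 1].
a_diag = [1.0, 100.0]
b = [1.0, 100.0]

# Plain descent: the step is limited by the largest curvature, so x[0] crawls.
plain = descend(a_diag, b, precond=[1.0, 1.0], step=0.01, iters=50)

# Preconditioned descent: scaling by the inverse curvature equalizes both directions.
preconditioned = descend(a_diag, b, precond=[1.0, 0.01], step=1.0, iters=1)

print(plain, preconditioned)
```

Here the preconditioned run reaches the minimizer in a single step, while the plain run is still far from it in the low-curvature direction after 50 steps.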

[495] CryoGS: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction

Suyi Chen, Haibin Ling

Main category: eess.IV

TL;DR: cryoGS is a GMM-based method for cryo-EM reconstruction that integrates Gaussian splatting with cryo-EM physics, enabling stable and efficient reconstruction from raw images without external initialization.

Details

Motivation: Existing GMM-based methods for cryo-EM reconstruction rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines.

Method: cryoGS introduces orthogonal projection-aware Gaussian splatting, a normalization term, and an FFT-aligned coordinate system tailored for cryo-EM imaging.

Result: Experimental results show cryoGS is effective and robust compared to baselines, enabling reconstruction directly from raw particle images with random initialization.

Conclusion: cryoGS advances cryo-EM reconstruction by eliminating the need for external initialization, offering a more self-contained and efficient pipeline.

Abstract: As a critical modality for structural biology, cryogenic electron microscopy (cryo-EM) facilitates the determination of macromolecular structures at near-atomic resolution. The core computational task in single-particle cryo-EM is to reconstruct the 3D electrostatic potential of a molecule from a large collection of noisy 2D projections acquired at unknown orientations. Gaussian mixture models (GMMs) provide a continuous, compact, and physically interpretable representation for molecular density and have recently gained interest in cryo-EM reconstruction. However, existing methods rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. Addressing this issue, we introduce cryoGS, a GMM-based method that integrates Gaussian splatting with the physics of cryo-EM image formation. In particular, we develop an orthogonal projection-aware Gaussian splatting, with adaptations such as a normalization term and FFT-aligned coordinate system tailored for cryo-EM imaging. All these innovations enable stable and efficient homogeneous reconstruction directly from raw cryo-EM particle images using random initialization. Experimental results on real datasets validate the effectiveness and robustness of cryoGS over representative baselines. The code will be released upon publication.
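The idea behind orthogonal projection of a Gaussian can be illustrated with a textbook identity: integrating an isotropic 3D Gaussian along the viewing axis yields a 2D Gaussian with the same width, scaled by sqrt(2*pi)*sigma. The Python sketch below checks this numerically. It is not the cryoGS rasterizer and omits the normalization term, the FFT-aligned coordinate system, and any other imaging physics mentioned in the abstract. All names and parameter values are assumptions.

```python
import math


def gaussian_3d(x, y, z, amplitude, sigma):
    """Isotropic 3D Gaussian centered at the origin."""
    r2 = x * x + y * y + z * z
    return amplitude * math.exp(-r2 / (2.0 * sigma * sigma))


def projected_closed_form(x, y, amplitude, sigma):
    """Integral of the 3D Gaussian along z: a 2D Gaussian scaled by sqrt(2*pi)*sigma."""
    r2 = x * x + y * y
    scale = amplitude * math.sqrt(2.0 * math.pi) * sigma
    return scale * math.exp(-r2 / (2.0 * sigma * sigma))


def projected_numeric(x, y, amplitude, sigma, half_width=8.0, n=4000):
    """Midpoint-rule approximation of the line integral along z."""
    z_max = half_width * sigma
    dz = 2.0 * z_max / n
    total = 0.0
    for i in range(n):
        z = -z_max + (i + 0.5) * dz
        total += gaussian_3d(x, y, z, amplitude, sigma) * dz
    return total


# Hypothetical pixel position and Gaussian parameters.
closed = projected_closed_form(0.3, -0.2, amplitude=2.0, sigma=1.5)
numeric = projected_numeric(0.3, -0.2, amplitude=2.0, sigma=1.5)
print(closed, numeric)
```

Because the projection has a closed form, an image of a Gaussian mixture under parallel projection can be evaluated per pixel without sampling along the ray, which is one reason GMMs are convenient for this kind of image formation model.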

[496] MedMambaLite: Hardware-Aware Mamba for Medical Image Classification

Romina Aalishah, Mozhgan Navardi, Tinoosh Mohsenin

Main category: eess.IV

TL;DR: MedMambaLite is a lightweight, hardware-aware Mamba-based model for medical image classification, optimized via knowledge distillation, achieving high accuracy and energy efficiency on edge devices.

Details

Motivation: The need for real-time, on-device AI inference in medical applications, constrained by edge device limitations like model size and computational capacity.

Method: Developed MedMambaLite by modifying and reducing redundancies in the MedMamba architecture, then distilling knowledge into a smaller student model with reduced embedding dimensions.

Result: Achieved 94.5% accuracy on MedMNIST datasets, 22.8x fewer parameters than MedMamba, and 63% better energy efficiency on NVIDIA Jetson Orin Nano.

Conclusion: MedMambaLite is a highly efficient solution for edge deployment in medical image classification, balancing performance and resource constraints.

Abstract: AI-powered medical devices have driven the need for real-time, on-device inference such as biomedical image classification. Deployment of deep learning models at the edge is now used for applications such as anomaly detection and classification in medical images. However, achieving this level of performance on edge devices remains challenging due to limitations in model size and computational capacity. To address this, we present MedMambaLite, a hardware-aware Mamba-based model optimized through knowledge distillation for medical image classification. We start with a powerful MedMamba model, integrating a Mamba structure for efficient feature extraction in medical imaging. We make the model lighter and faster in training and inference by modifying and reducing the redundancies in the architecture. We then distill its knowledge into a smaller student model by reducing the embedding dimensions. The optimized model achieves 94.5% overall accuracy on 10 MedMNIST datasets. It also reduces the parameter count by 22.8x compared to MedMamba. Deployment on an NVIDIA Jetson Orin Nano achieves an energy efficiency of 35.6 GOPS/J, a 63% improvement over MedMamba in energy per inference.

[497] Beyond Pixels: Medical Image Quality Assessment with Implicit Neural Representations

Caner Özer, Patryk Rygiel, Bram de Wilde, İlkay Öksüz, Jelmer M. Wolterink

Main category: eess.IV

TL;DR: The paper proposes using implicit neural representations (INRs) for artifact detection in medical imaging, offering a compact and scalable solution compared to traditional methods.

Details

Motivation: Artifacts in medical imaging hinder diagnostic accuracy, and existing methods suffer from information loss and high memory demands, limiting scalability.

Method: The authors employ INRs for image quality assessment, using deep weight space networks, graph neural networks, and relational attention transformers.

Result: Tested on the ACDC dataset with synthetic artifacts, the method performs effectively with fewer parameters.

Conclusion: INRs provide a scalable and efficient approach for artifact detection in medical imaging.

Abstract: Artifacts pose a significant challenge in medical imaging, impacting diagnostic accuracy and downstream analysis. While image-based approaches for detecting artifacts can be effective, they often rely on preprocessing methods that can lead to information loss and high-memory-demand medical images, thereby limiting the scalability of classification models. In this work, we propose the use of implicit neural representations (INRs) for image quality assessment. INRs provide a compact and continuous representation of medical images, naturally handling variations in resolution and image size while reducing memory overhead. We develop deep weight space networks, graph neural networks, and relational attention transformers that operate on INRs to achieve image quality assessment. Our method is evaluated on the ACDC dataset with synthetically generated artifact patterns, demonstrating its effectiveness in assessing image quality while achieving similar performance with fewer parameters.

[498] Coarse-to-Fine Joint Registration of MR and Ultrasound Images via Imaging Style Transfer

Junyi Wang, Xi Zhu, Yikun Guo, Zixi Wang, Haichuan Gao, Le Zhang, Fan Zhang

Main category: eess.IV

TL;DR: A pipeline for registering pre-surgery MR and post-resection US images using 3D CycleGAN for synthetic T1 generation and coarse-to-fine registration improves consistency.

Details

Motivation: To enhance the registration performance between pre-surgery MR and post-resection US images, addressing the challenge of aligning these modalities.

Method: Uses unpaired style transfer via 3D CycleGAN to generate synthetic T1 images, followed by a coarse-to-fine registration combining affine and local deformable transformations.

Result: Improved consistency between MR and US image pairs in most cases.

Conclusion: The proposed pipeline effectively enhances registration performance for MR and US images.

Abstract: We developed a pipeline for registering pre-surgery Magnetic Resonance (MR) images and post-resection Ultrasound (US) images. Our approach leverages unpaired style transfer using 3D CycleGAN to generate synthetic T1 images, thereby enhancing registration performance. Additionally, our registration process employs both affine and local deformable transformations for a coarse-to-fine registration. The results demonstrate that our approach improves the consistency between MR and US image pairs in most cases.

[499] Artificial Intelligence-Based Classification of Spitz Tumors

Ruben T. Lucassen, Marjanna Romers, Chiel F. Ebbelaar, Aia N. Najem, Donal P. Hayes, Antien L. Mooyaart, Sara Roshani, Liliane C. D. Wynaendts, Nikolas Stathonikos, Gerben E. Breimer, Anne M. L. Jansen, Mitko Veta, Willeke A. M. Blokx

Main category: eess.IV

TL;DR: AI models outperform pathologists in distinguishing Spitz tumors from melanomas and predicting genetic aberrations and diagnostic categories, potentially reducing costs and turnaround times.

Details

Motivation: To address the diagnostic challenges of Spitz tumors due to overlapping features with melanomas and explore AI's potential in improving accuracy and efficiency.

Method: Developed and validated AI models using a dataset of 393 Spitz tumors and 379 melanomas, comparing performance with pathologists and simulating workflow impact.

Result: AI achieved AUROC 0.95 and accuracy 0.86 in distinguishing Spitz tumors from melanomas, outperforming pathologists in all tasks, though not always significantly.

Conclusion: AI models show strong predictive performance for Spitz tumors, offering potential workflow and cost benefits in pathology departments.

Abstract: Spitz tumors are diagnostically challenging due to overlap in atypical histological features with conventional melanomas. We investigated to what extent AI models, using histological and/or clinical features, can: (1) distinguish Spitz tumors from conventional melanomas; (2) predict the underlying genetic aberration of Spitz tumors; and (3) predict the diagnostic category of Spitz tumors. The AI models were developed and validated using a dataset of 393 Spitz tumors and 379 conventional melanomas. Predictive performance was measured using the AUROC and the accuracy. The performance of the AI models was compared with that of four experienced pathologists in a reader study. Moreover, a simulation experiment was conducted to investigate the impact of implementing AI-based recommendations for ancillary diagnostic testing on the workflow of the pathology department. The best AI model based on UNI features reached an AUROC of 0.95 and an accuracy of 0.86 in differentiating Spitz tumors from conventional melanomas. The genetic aberration was predicted with an accuracy of 0.55, compared to 0.25 for random guessing. The diagnostic category was predicted with an accuracy of 0.51, where chance-level accuracy was 0.33. On all three tasks, the AI models performed better than the four pathologists, although differences were not statistically significant for most individual comparisons. Based on the simulation experiment, implementing AI-based recommendations for ancillary diagnostic testing could reduce material costs, turnaround times, and examinations. In conclusion, the AI models achieved a strong predictive performance in distinguishing between Spitz tumors and conventional melanomas. On the more challenging tasks of predicting the genetic aberration and the diagnostic category of Spitz tumors, the AI models performed better than random chance.

Chaohui Gong, Zhiying Wu, Zisheng Huang, Gaofeng Meng, Zhen Lei, Hongbin Liu

Main category: eess.IV

TL;DR: The paper introduces MM2CT, a multimodal MR-to-CT translation method using T1- and T2-weighted MRI data, leveraging a Mamba-based framework for improved synthesis and overcoming CNN and Transformer limitations.

Details

Motivation: To address the lack of multimodal fusion in MR-to-CT translation and eliminate radiation exposure and motion artifacts associated with CT scans.

Method: Proposes MM2CT, a Mamba-based framework with dynamic local convolution and enhancement modules for multimodal MRI-to-CT synthesis.

Result: Achieves state-of-the-art performance in SSIM and PSNR on a public pelvis dataset.

Conclusion: MM2CT effectively integrates multimodal MRI data and outperforms existing methods, offering a promising solution for MR-to-CT translation.

Abstract: Magnetic resonance (MR)-to-computed tomography (CT) translation offers significant advantages, including the elimination of radiation exposure associated with CT scans and the mitigation of imaging artifacts caused by patient motion. The existing approaches are based on single-modality MR-to-CT translation, with limited research exploring multimodal fusion. To address this limitation, we introduce the Multi-modal MR to CT (MM2CT) translation method, an innovative Mamba-based framework for multi-modal medical image synthesis that leverages multimodal T1- and T2-weighted MRI data. Mamba effectively overcomes the limited local receptive field in CNNs and the high computational complexity issues in Transformers. MM2CT leverages this advantage to maintain long-range dependency modeling capabilities while achieving multi-modal MR feature integration. Additionally, we incorporate a dynamic local convolution module and a dynamic enhancement module to improve MRI-to-CT synthesis. The experiments on a public pelvis dataset demonstrate that MM2CT achieves state-of-the-art performance in terms of Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR). Our code is publicly available at https://github.com/Gots-ch/MM2CT.
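For readers unfamiliar with the reported metrics, the following Python sketch computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE). It is a generic illustration, not the evaluation code from the MM2CT repository. The pixel values and the 8-bit intensity range are assumptions, and CT images in Hounsfield units would need a different `max_value`. SSIM is omitted because it requires windowed local statistics.

```python
import math


def psnr(reference, test, max_value):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(test):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened 2x2 images with an assumed 8-bit intensity range.
ground_truth = [0, 64, 128, 255]
synthetic = [0, 60, 130, 245]
score = psnr(ground_truth, synthetic, max_value=255)
print(f"PSNR: {score:.2f} dB")
```

Higher PSNR means a smaller mean squared error relative to the intensity range, so the score depends on the `max_value` convention as well as on the images.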

[501] Beyond Subspace Isolation: Many-to-Many Transformer for Light Field Image Super-resolution

Zeke Zexi Hu, Xiaoming Chen, Vera Yuk Ying Chung, Yiran Shen

Main category: eess.IV

TL;DR: The paper introduces a Many-to-Many Transformer (M2MT) to overcome subspace isolation in light field image super-resolution (LFSR), enabling comprehensive spatial-angular feature extraction and achieving state-of-the-art performance.

Details

Motivation: Existing LFSR methods restrict self-attention to limited subsets of light field data due to subspace decomposition, hindering optimization of spatial and angular cues.

Method: Proposes M2MT, which aggregates angular information in spatial subspaces before self-attention, allowing full access to all sub-aperture images (SAIs) and capturing long-range dependencies.

Result: M2MT achieves state-of-the-art performance on public datasets, balancing efficiency and quality with lower memory and computation demands.

Conclusion: M2MT effectively mitigates subspace isolation, providing non-local context in spatial and angular subspaces for superior LFSR results, validated by visual interpretability analysis.

Abstract: The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers leads to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opt to decompose the data into a number of lower-dimensional subspaces and apply Transformers in each subspace individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, explicitly preventing comprehensive optimization on all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT can comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representation.

[502] A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation

Yin Li, Qi Chen, Kai Wang, Meige Li, Liping Si, Yingwei Guo, Yu Xiong, Qixing Wang, Yang Qin, Ling Xu, Patrick van der Smagt, Jun Tang, Nutan Chen

Main category: eess.IV

TL;DR: Introduction of the first comprehensive NPC MRI dataset to aid diagnosis, treatment, and machine learning development.

Details

Motivation: Lack of publicly available NPC MRI datasets limits advancements in diagnosis and treatment.

Method: Dataset includes MR axial imaging of 277 NPC patients with T1, T2, and contrast-enhanced T1 sequences, totaling 831 scans, plus clinical data and radiologist-annotated segmentations.

Result: Provides a high-quality resource for untreated primary NPC, facilitating research and algorithm development.

Conclusion: This dataset addresses a critical gap, enabling progress in NPC management and machine learning applications.

Abstract: Multi-modality magnetic resonance imaging (MRI) data facilitate the early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. In addition to the corresponding clinical data, manually annotated and labeled segmentations by experienced radiologists offer high-quality data resources from untreated primary NPC.

[503] STARFormer: A Novel Spatio-Temporal Aggregation Reorganization Transformer of FMRI for Brain Disorder Diagnosis

Wenhao Dong, Yueyang Li, Weiming Zeng, Lei Chen, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Main category: eess.IV

TL;DR: The paper introduces STARFormer, a method for classifying brain disorders like ASD and ADHD by integrating spatial and temporal features of fMRI BOLD signals, achieving state-of-the-art results.

Details

Motivation: Existing fMRI methods often neglect spatial and temporal dependencies in BOLD signals, leading to inaccurate classification of brain disorders.

Method: STARFormer uses three modules: ROI spatial structure analysis (with eigenvector centrality), temporal feature reorganization (with window tokens and attention), and spatio-temporal feature fusion (via parallel transformers).

Result: STARFormer outperforms existing methods on ASD and ADHD classification, as validated on public datasets.

Conclusion: STARFormer provides a more accurate and reliable tool for diagnosing brain disorders, with potential for biomedical research.

Abstract: Many existing methods that use functional magnetic resonance imaging (fMRI) to classify brain disorders, such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD), often overlook the integration of spatial and temporal dependencies of the blood oxygen level-dependent (BOLD) signals, which may lead to inaccurate or imprecise classification results. To solve this problem, we propose a Spatio-Temporal Aggregation Reorganization Transformer (STARFormer) that effectively captures both spatial and temporal features of BOLD signals by incorporating three key modules. The region of interest (ROI) spatial structure analysis module uses eigenvector centrality (EC) to reorganize brain regions based on effective connectivity, highlighting critical spatial relationships relevant to the brain disorder. The temporal feature reorganization module systematically segments the time series into equal-dimensional window tokens and captures multiscale features through variable window and cross-window attention. The spatio-temporal feature fusion module employs a parallel transformer architecture with dedicated temporal and spatial branches to extract integrated features. The proposed STARFormer has been rigorously evaluated on two publicly available datasets for the classification of ASD and ADHD. The experimental results confirm that the STARFormer achieves state-of-the-art performance across multiple evaluation metrics, providing a more accurate and reliable tool for the diagnosis of brain disorders and biomedical research. The codes are available at: https://github.com/NZWANG/STARFormer.
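As a rough illustration of ranking regions by eigenvector centrality, the Python sketch below runs power iteration on a small connectivity matrix and orders the ROIs from most to least central. It is not taken from the STARFormer code. The matrix is a made-up, symmetric, strictly positive example (effective connectivity in practice may be directed), and the function and variable names are assumptions.

```python
def eigenvector_centrality(matrix, iterations=200, tol=1e-12):
    """Power iteration for the dominant eigenvector of a positive matrix."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(w)
        w = [wi / norm for wi in w]          # keep entries summing to 1
        if max(abs(a - b) for a, b in zip(w, v)) < tol:
            return w
        v = w
    return v


# Hypothetical connectivity strengths between three ROIs (symmetric, positive).
connectivity = [
    [1.0, 0.1, 0.7],
    [0.1, 1.0, 0.8],
    [0.7, 0.8, 1.0],
]
centrality = eigenvector_centrality(connectivity)
# Reorder ROIs from most to least central.
roi_order = sorted(range(len(centrality)), key=lambda i: centrality[i], reverse=True)
print(roi_order)
```

In this example ROI 2 is strongly connected to both other regions, so it ranks first. A strictly positive matrix is used so that power iteration is guaranteed to converge to a unique positive eigenvector.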

[504] Brain Network Analysis Based on Fine-tuned Self-supervised Model for Brain Disease Diagnosis

Yifei Tang, Hongjie Jiang, Changhong Jing, Hieu Pham, Shuqiang Wang

Main category: eess.IV

TL;DR: The paper proposes a fine-tuned brain network model for brain disease diagnosis, enhancing generalizability by expanding brain region representations across multiple dimensions.

Details

Motivation: Current foundation models for brain networks are limited to single dimensions, restricting their broader neuroscience applications.

Method: The model includes an adapter module for multi-dimensional feature expansion and a fine-tuned foundation model using self-supervised learning and transformer blocks.

Result: The model achieves superior performance in brain disease diagnosis, demonstrating its effectiveness.

Conclusion: The proposed model offers a promising approach for advancing brain network analysis research.

Abstract: Functional brain network analysis has become an indispensable tool for brain disease analysis. It is profoundly impacted by deep learning methods, which can characterize complex connections between ROIs. However, the research on foundation models of brain networks is limited and constrained to a single dimension, which restricts their extensive application in neuroscience. In this study, we propose a fine-tuned brain network model for brain disease diagnosis. It expands brain region representations across multiple dimensions based on the original brain network model, thereby enhancing its generalizability. Our model consists of two key modules: (1) an adapter module that expands brain region features across different dimensions; and (2) a fine-tuned foundation brain network model, based on self-supervised learning and pre-trained on fMRI data from thousands of participants. Specifically, its transformer block is able to effectively extract brain region features and compute the inter-region associations. Moreover, we derive a compact latent representation of the brain network for brain disease diagnosis. Our downstream experiments in this study demonstrate that the proposed model achieves superior performance in brain disease diagnosis, which potentially offers a promising approach in brain network analysis research.

[505] Capsule-ConvKAN: A Hybrid Neural Approach to Medical Image Classification

Laura Pituková, Peter Sinčák, László József Kovács, Peng Wang

Main category: eess.IV

TL;DR: The paper compares four neural network architectures, introducing Capsule-ConvKAN, a hybrid model combining Capsule Network and Convolutional Kolmogorov-Arnold Network, achieving 91.21% accuracy in histopathological image classification.

Details

Motivation: To improve feature representation and classification accuracy in biomedical image data by combining the strengths of Capsule Networks and Convolutional Kolmogorov-Arnold Networks.

Method: Proposed Capsule-ConvKAN, evaluated alongside CNN, Capsule Network, and ConvKAN on a histopathological image dataset.

Result: Capsule-ConvKAN outperformed others with 91.21% accuracy, demonstrating superior spatial pattern capture and feature management.

Conclusion: Capsule-ConvKAN shows promise for medical image classification, addressing limitations of traditional convolutional models.

Abstract: This study conducts a comprehensive comparison of four neural network architectures: Convolutional Neural Network, Capsule Network, Convolutional Kolmogorov-Arnold Network, and the newly proposed Capsule-Convolutional Kolmogorov-Arnold Network. The proposed Capsule-ConvKAN architecture combines the dynamic routing and spatial hierarchy capabilities of Capsule Network with the flexible and interpretable function approximation of Convolutional Kolmogorov-Arnold Networks. This novel hybrid model was developed to improve feature representation and classification accuracy, particularly in challenging real-world biomedical image data. The architectures were evaluated on a histopathological image dataset, where Capsule-ConvKAN achieved the highest classification performance with an accuracy of 91.21%. The results demonstrate the potential of the newly introduced Capsule-ConvKAN in capturing spatial patterns, managing complex features, and addressing the limitations of traditional convolutional models in medical image classification.

[506] Constructed Realities? Technical and Contextual Anomalies in a High-Profile Image

Matthias Wjst

Main category: eess.IV

TL;DR: The study analyzes a controversial photo of Prince Andrew, Virginia Giuffre, and Ghislaine Maxwell, identifying inconsistencies suggesting digital manipulation, though definitive proof is lacking.

Details

Motivation: To assess the authenticity of a widely circulated photograph central to public and legal debates, given its significant implications.

Method: Forensic analysis of multiple published versions of the photo, examining lighting, posture, and physical interaction for signs of manipulation.

Result: Identified inconsistencies consistent with digital compositing, but lack of original negative prevents definitive conclusions.

Conclusion: The photo may be constructed, but remains unresolved without further evidence, symbolizing broader issues of truth and memory.

Abstract: This study offers a forensic assessment of a widely circulated photograph featuring Prince Andrew, Virginia Giuffre, and Ghislaine Maxwell - an image that has played a pivotal role in public discourse and legal narratives. Through analysis of multiple published versions, several inconsistencies are identified, including irregularities in lighting, posture, and physical interaction, which are more consistent with digital compositing than with an unaltered snapshot. While the absence of the original negative and a verifiable audit trail precludes definitive conclusions, the technical and contextual anomalies suggest that the image may have been deliberately constructed. Nevertheless, without additional evidence, the photograph remains an unresolved but symbolically charged fragment within a complex story of abuse, memory, and contested truth.

[507] A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI

Nicola Casali, Alessandro Brusaferri, Giuseppe Baselli, Stefano Fumagalli, Edoardo Micotti, Gianluigi Forloni, Riaz Hussein, Giovanna Rizzo, Alfonso Mastropietro

Main category: eess.IV

TL;DR: A probabilistic deep learning framework using Deep Ensembles of Mixture Density Networks is proposed for IVIM parameter estimation, offering uncertainty quantification and improved reliability.

Details

Motivation: Accurate IVIM parameter estimation is challenging due to noise sensitivity and ill-posed inverse problems, necessitating robust uncertainty-aware methods.

Method: The framework combines Deep Ensembles and Mixture Density Networks to estimate predictive uncertainty (aleatoric and epistemic) and is benchmarked against non-probabilistic and Bayesian methods.

Result: MDNs provided calibrated and sharper predictions for diffusion parameters, though slight overconfidence was noted in pseudo-diffusion. Elevated epistemic uncertainty in vivo indicated data mismatch.

Conclusion: The framework enables reliable IVIM fitting with uncertainty quantification, adaptable for other physical models, and highlights the importance of epistemic uncertainty in real-world applications.

Abstract: Accurate estimation of intravoxel incoherent motion (IVIM) parameters from diffusion-weighted MRI remains challenging due to the ill-posed nature of the inverse problem and high sensitivity to noise, particularly in the perfusion compartment. In this work, we propose a probabilistic deep learning framework based on Deep Ensembles (DE) of Mixture Density Networks (MDNs), enabling estimation of total predictive uncertainty and decomposition into aleatoric (AU) and epistemic (EU) components. The method was benchmarked against non-probabilistic neural networks, a Bayesian fitting approach, and a probabilistic network with single Gaussian parametrization. Supervised training was performed on synthetic data, and evaluation was conducted on both simulated data and an in vivo dataset. The reliability of the quantified uncertainties was assessed using calibration curves, output distribution sharpness, and the Continuous Ranked Probability Score (CRPS). MDNs produced more calibrated and sharper predictive distributions for the diffusion coefficient D and fraction f parameters, although slight overconfidence was observed in pseudo-diffusion coefficient D*. The Robust Coefficient of Variation (RCV) indicated smoother in vivo estimates for D* with MDNs compared to the Gaussian model. Despite the training data covering the expected physiological range, elevated EU in vivo suggests a mismatch with real acquisition conditions, highlighting the importance of incorporating EU, which was allowed by DE. Overall, we present a comprehensive framework for IVIM fitting with uncertainty quantification, which enables the identification and interpretation of unreliable estimates. The proposed approach can also be adopted for fitting other physical models through appropriate architectural and simulation adjustments.
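The split of predictive uncertainty into aleatoric and epistemic parts can be sketched with the law of total variance, a common choice for equally weighted ensembles: the aleatoric part is the average of the member variances and the epistemic part is the variance of the member means. The Python sketch below applies this to Gaussian-mixture outputs. It is an assumption that the paper uses exactly this decomposition, and the mixture parameters and function names are hypothetical.

```python
def mixture_moments(weights, means, variances):
    """Mean and variance of a 1D Gaussian mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second - mean * mean


def decompose_uncertainty(members):
    """Split ensemble predictive variance into aleatoric and epistemic parts.

    members: list of (weights, means, variances), one mixture per ensemble member.
    """
    moments = [mixture_moments(*m) for m in members]
    member_means = [mu for mu, _ in moments]
    member_vars = [var for _, var in moments]
    k = len(members)
    ensemble_mean = sum(member_means) / k
    # Aleatoric: average of the per-member predictive variances.
    aleatoric = sum(member_vars) / k
    # Epistemic: spread of the per-member means around the ensemble mean.
    epistemic = sum((mu - ensemble_mean) ** 2 for mu in member_means) / k
    return ensemble_mean, aleatoric, epistemic


# Two hypothetical ensemble members predicting one parameter for one voxel.
ensemble = [
    ([0.5, 0.5], [0.0, 2.0], [1.0, 1.0]),
    ([1.0], [3.0], [2.0]),
]
mean, au, eu = decompose_uncertainty(ensemble)
print(mean, au, eu, au + eu)
```

When the members disagree on the mean, as in this toy input, the epistemic term grows even if each member is individually confident, which matches the abstract's reading of elevated EU as a sign of mismatch between training and acquisition conditions.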

Table of Contents

cs.CL

[1] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM

[2] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

[3] Pitch Accent Detection improves Pretrained Automatic Speech Recognition

[4] Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History

[5] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory

[6] I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

[7] Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering

[8] A Multi-Stage Large Language Model Framework for Extracting Suicide-Related Social Determinants of Health

[9] Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning

[10] Evaluation of LLMs in AMR Parsing

[11] Align, Don’t Divide: Revisiting the LoRA Architecture in Multi-Task Learning

[12] Recent Advances in Speech Language Models: A Survey

[13] Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations

[14] SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

[15] BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation

[16] Attention Basin: Why Contextual Position Matters in Large Language Models

[17] Towards Assessing Medical Ethics from Knowledge to Practice

[18] ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering

[19] Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation

[20] Pruning Large Language Models by Identifying and Preserving Functional Networks

[21] CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL

[22] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs

[23] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue

[24] SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

[25] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression

[26] Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability

[27] Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025

[28] The TUB Sign Language Corpus Collection

[29] MyCulture: Exploring Malaysia’s Diverse Culture under Low-Resource Language Constraints

[30] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

[31] TASE: Token Awareness and Structured Evaluation for Multilingual Language Models

[32] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations

[33] LAG: Logic-Augmented Generation from a Cartesian Perspective

[34] The World According to LLMs: How Geographic Origin Influences LLMs’ Entity Deduction Capabilities

[35] CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation

[36] Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees

[37] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs

[38] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

[39] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

[40] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

[41] Learning to Reason for Factuality

[42] How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations

[43] H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages

[44] A Latent-Variable Model for Intrinsic Probing

[45] Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A

[46] Understanding Large Language Model Behaviors through Interactive Counterfactual Generation and Analysis

[47] CrisisSense-LLM: Instruction Fine-Tuned Large Language Model for Multi-label Social Media Text Classification in Disaster Informatics

[48] When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails

[49] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

[50] Medal Matters: Probing LLMs’ Failure Cases Through Olympic Rankings

[51] WhisperNER: Unified Open Named Entity and Speech Recognition

[52] MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models

[53] From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

[54] BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom’s-Taxonomy-Inspired Prompts

[55] Scaling Laws For Mixed Quantization

[56] Data Processing for the OpenGPT-X Model Family

[57] Large Language Models Still Exhibit Bias in Long Text

[58] GuARD: Effective Anomaly Detection through a Text-Rich and Graph-Informed Language Model

[59] Efficient Knowledge Injection in LLMs via Self-Distillation

[60] Rationale-guided Prompting for Knowledge-based Visual Question Answering

[61] Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering

[62] Can open source large language models be used for tumor documentation in Germany? – An evaluation on urological doctors’ notes

[63] RLTHF: Targeted Human Feedback for LLM Alignment

[64] Which Questions Improve Learning the Most? Utility Estimation of Questions with LM-based Simulations

[65] The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory

[66] Language Model Uncertainty Quantification with Attention Chain

[67] You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation

[68] PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

[69] DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding

[70] Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students’ (Mis)Understanding Is Hinted

[71] Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

[72] Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

[73] Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation

[74] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

[75] Can Vision Language Models Understand Mimed Actions?

[76] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

[77] Perception-Aware Policy Optimization for Multimodal Reasoning